Description
Describe the bug, including details regarding any error messages, version, and platform.
I've noticed that serializing an Arrow table (an ArrowTabular object) from R using arrow::write_to_raw can take about 10x as long as it took to first read the dataset from disk (just a regular NVMe SSD).
Not sure whether this counts as a bug report or a feature enhancement request, but in any case, this seems excessive and currently makes Arrow a no-go for inter-process communication in R, e.g. for parallel processing with the mirai package.
Here's a minimal example:
library("arrow")
library("profvis")
data <- data.frame(i = rep(1:10, times=1e5))
for (v in 1:100) {
  data[, paste0("v", v)] <- rnorm(1e6)
}
# make sure the output directory exists before writing
dir.create("sandbox", showWarnings = FALSE)
# 790 MiB on disk
write_parquet(data, "sandbox/random.parquet")
file.info("sandbox/random.parquet")$size / 1024 / 1024
profvis({
  query <- open_dataset("sandbox/random.parquet")
  atbl <- as_arrow_table(query) # 70 ms
  tbl <- collect(atbl) # 10 ms
  ser <- arrow::write_to_raw(atbl, format = "stream") # 810 ms
  # - as.raw.Buffer # (660 ms)
  # - write_ipc_stream # (120 ms)
  # - buffer # (20 ms)
  des <- read_ipc_stream(ser, as_data_frame = FALSE) # 10 ms
})

As you can see, it takes 80 ms (70+10) to read the data into R, but 810 ms to serialize it for IPC. as.raw.Buffer seems to be the major culprit, but even write_ipc_stream takes more time than a full read from disk.
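To separate the three steps that show up in the profile, here is a rough sketch that times them individually. It assumes that write_to_raw amounts to writing into a BufferOutputStream, finishing it into a Buffer, and converting that Buffer with as.raw, which is what the profile suggests but which I have not checked against the source. The small arrow_table() is only a stand-in so that the sketch runs quickly, and I believe (but have not verified) that the as.raw step copies into R-managed memory.

```r
library(arrow)

# Small stand-in table (assumption); the report above uses 1e6 rows x 101 columns.
atbl <- arrow_table(i = rep(1:10, times = 1e3), v1 = rnorm(1e4))

# Step 1: write the IPC stream into an in-memory Arrow output stream.
sink <- BufferOutputStream$create()
t_write <- system.time(write_ipc_stream(atbl, sink))[["elapsed"]]

# Step 2: finish the stream to obtain an Arrow Buffer.
t_finish <- system.time(buf <- sink$finish())[["elapsed"]]

# Step 3: convert the Buffer into an R raw vector (as.raw.Buffer in the profile).
t_raw <- system.time(ser <- as.raw(buf))[["elapsed"]]

# Elapsed seconds per step.
print(c(write = t_write, finish = t_finish, as_raw = t_raw))

# The Buffer can also be read back directly, without the raw vector.
des <- read_ipc_stream(BufferReader$create(buf), as_data_frame = FALSE)
```

If the as.raw step dominates here as well, keeping the data as a Buffer would avoid most of the cost, although a raw vector may still be needed to hand the bytes to another R process.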
I have observed this same behavior on a 2019 MacBook Air (macOS, Intel) as well as on a 2025 workstation (Linux, AMD Zen 5). The speed is also similar whether using an R arrow 22.0 binary or a compiled arrow 24... with the latter being maybe a tad faster (600-650 ms instead of 800-810 ms), but that could be noise in my benchmark.
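To check whether a difference like that is noise, the call could be repeated and the spread of timings compared between the two builds. This is a minimal sketch using only base R timing; time_runs is a hypothetical helper of my own, and the small table is again a stand-in for the real dataset.

```r
library(arrow)

# Small stand-in table (assumption); swap in the real table to reproduce the report.
atbl <- arrow_table(i = rep(1:10, times = 1e3), v1 = rnorm(1e4))

# Hypothetical helper: call f() n times and return the elapsed seconds of each run.
time_runs <- function(f, n = 20) {
  vapply(seq_len(n), function(k) system.time(f())[["elapsed"]], numeric(1))
}

elapsed <- time_runs(function() write_to_raw(atbl, format = "stream"))

# Median and range give a rough idea of run-to-run variation.
print(summary(elapsed))
```

Running the same script under both package versions and comparing the medians would show whether the 600-650 ms vs. 800-810 ms gap holds up.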
For completeness, here are the two package versions I tested with. The binary:
Arrow package version: 22.0.0.1
Capabilities:
acero TRUE
dataset TRUE
substrait FALSE
parquet TRUE
json TRUE
s3 TRUE
gcs TRUE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip TRUE
brotli TRUE
zstd TRUE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 TRUE
jemalloc TRUE
mimalloc TRUE
Memory:
Allocator mimalloc
Current 5 Gb
Max 5.76 Gb
Runtime:
SIMD Level avx512
Detected SIMD Level avx512
Build:
C++ Library Version 22.0.0
C++ Compiler GNU
C++ Compiler Version 8.3.1
... and a version compiled using install_arrow(nightly = TRUE)
Arrow package version: 23.0.0.100000000
Capabilities:
acero TRUE
dataset TRUE
substrait FALSE
parquet TRUE
json TRUE
s3 TRUE
gcs TRUE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip TRUE
brotli TRUE
zstd TRUE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 TRUE
jemalloc TRUE
mimalloc TRUE
Memory:
Allocator mimalloc
Current 1.76 Gb
Max 1.76 Gb
Runtime:
SIMD Level avx512
Detected SIMD Level avx512
Build:
C++ Library Version 24.0.0-SNAPSHOT
C++ Compiler GNU
C++ Compiler Version 11.4.0

As always, thanks for your help, I love arrow/parquet.
Component(s)
R