[R] write_to_raw is very slow #48908

@debrouwere

Description

Describe the bug, including details regarding any error messages, version, and platform.

I've noticed that serializing an arrow table (an ArrowTabular object) from R using arrow::write_to_raw can take about 10x as long as it took to read the dataset from disk in the first place (just a regular NVMe SSD).

I'm not sure whether this counts as a bug report or an enhancement request, but in any case the overhead seems excessive and currently makes Arrow a no-go for inter-process communication in R, e.g. for parallel processing with the mirai package.
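For reference, the kind of round trip I have in mind looks roughly like the sketch below. It is only a sketch: it assumes the mirai daemons()/mirai()/call_mirai() API and uses a small made-up table, since only the shape of the pattern matters here.

library("arrow")
library("mirai")

daemons(1)  # start one background daemon

# Small made-up table standing in for the real data.
atbl <- as_arrow_table(data.frame(x = rnorm(1e5)))

# Serialize on the parent process; this write_to_raw() call is the slow step.
ser <- write_to_raw(atbl, format = "stream")

# Ship the raw vector to the daemon and deserialize it there.
m <- mirai(
  arrow::read_ipc_stream(ser, as_data_frame = FALSE)$num_rows,
  ser = ser
)
call_mirai(m)$data  # wait for the daemon and return the row count

daemons(0)  # shut the daemon down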

Here's a minimal example:

library("arrow")
library("profvis")

data <- data.frame(i = rep(1:10, times=1e5))
for (v in 1:100) {
  data[, paste0("v", v)] <- rnorm(1e6)
}

# 790 MiB on disk
write_parquet(data, "sandbox/random.parquet")
file.info("sandbox/random.parquet")$size / 1024 / 1024

profvis({
  query <- open_dataset("sandbox/random.parquet")
  atbl <- as_arrow_table(query)                       # 70 ms
  tbl <- collect(atbl)                                # 10 ms
  ser <- arrow::write_to_raw(atbl, format = "stream") # 810 ms
  # - as.raw.Buffer                                   # (660 ms)  
  # - write_ipc_stream                                # (120 ms)
  # - buffer                                          # (20 ms)
  des <- read_ipc_stream(ser, as_data_frame = FALSE)  # 10 ms
})

As you can see, reading the data into R takes 80 ms (70 + 10), while serializing it for IPC takes 810 ms. as.raw.Buffer appears to be the main culprit, but even write_ipc_stream alone takes longer than a full read from disk.
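To pin down where the time goes, the steps inside write_to_raw() can be reproduced by hand. This is only a sketch: it assumes buffer() accepts a BufferOutputStream (the profile above suggests write_to_raw() itself relies on this) and that BufferReader$create() accepts the resulting Buffer.

# Untested sketch: redo write_to_raw() step by step to isolate the final copy.
sink <- BufferOutputStream$create()
write_ipc_stream(atbl, sink)   # ~120 ms above: encode the IPC stream into the sink
buf <- buffer(sink)            # ~20 ms above: expose the written bytes as a Buffer
raw_vec <- as.raw(buf)         # ~660 ms above: copy the bytes into an R raw vector

# Within the same process, the Buffer can be read back without that copy:
des <- read_ipc_stream(BufferReader$create(buf), as_data_frame = FALSE)

Within a single process the Buffer could be consumed directly, but shipping the bytes to another R process (e.g. via mirai) still requires the raw vector, which is why the as.raw.Buffer cost is the blocker here.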

I have observed the same behavior on a 2019 MacBook Air (macOS, Intel) as well as on a 2025 workstation (Linux, AMD Zen 5). The speed is also similar whether I use the R arrow 22.0 binary or a compiled nightly (C++ 24.0.0-SNAPSHOT, see below), with the latter maybe a tad faster (600-650 ms instead of 800-810 ms), but that could be noise in my benchmark.

For completeness, here are the two package versions I tested with. The binary:

Arrow package version: 22.0.0.1

Capabilities:
               
acero      TRUE
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc   TRUE
mimalloc   TRUE

Memory:
                  
Allocator mimalloc
Current       5 Gb
Max        5.76 Gb

Runtime:
                          
SIMD Level          avx512
Detected SIMD Level avx512

Build:
                           
C++ Library Version  22.0.0
C++ Compiler            GNU
C++ Compiler Version  8.3.1

... and a version compiled using install_arrow(nightly = TRUE):

Arrow package version: 23.0.0.100000000

Capabilities:
               
acero      TRUE
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc   TRUE
mimalloc   TRUE

Memory:
                  
Allocator mimalloc
Current    1.76 Gb
Max        1.76 Gb

Runtime:
                          
SIMD Level          avx512
Detected SIMD Level avx512

Build:
                                    
C++ Library Version  24.0.0-SNAPSHOT
C++ Compiler                     GNU
C++ Compiler Version          11.4.0

As always, thanks for your help, I love arrow/parquet.

Component(s)

R
