
feat(ipc): Avoid repeated heap allocations and buffer copies in IPC writer #9836

Draft

pchintar wants to merge 1 commit into apache:main from pchintar:ipc-writer-avoid-repetitive-allocs

Conversation

@pchintar
Contributor

@pchintar pchintar commented Apr 26, 2026

Which issue does this PR close?

Rationale for this change

The current IPC writer path performs repeated heap allocations and buffer copies for every record batch, even when writing batches with identical structure.

In arrow-ipc/src/writer.rs, the writer path is structured as:

RecordBatch
  → encode() → EncodedData
  → write_message()

The key issue is that EncodedData owns its buffers:

pub struct EncodedData {
    pub ipc_message: Vec<u8>,
    pub arrow_data: Vec<u8>,
}

This forces:

  • allocation of new Vec<u8> buffers per batch
  • copying of flatbuffer data via to_vec() into Vec<u8>
  • destruction of all intermediate buffers after each write

This leads to unnecessary latency overhead in common scenarios such as iterative or high-frequency batch writes. The issue is purely performance-related and does not affect correctness.
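
As a standalone illustration (not the actual encode() code), the cost difference between owning the finished flatbuffer bytes and writing the borrowed bytes straight to the output looks roughly like this; finished_data() is the real flatbuffers builder accessor, while the two helper functions are hypothetical and assume the builder has already been finished:

use std::io::Write;

use flatbuffers::FlatBufferBuilder;

// Hypothetical sketch, not arrow-ipc code: finished_data() only borrows the
// encoded flatbuffer, but storing it in an owning Vec<u8> (as
// EncodedData::ipc_message does) costs an allocation plus a memcpy per batch.
fn copy_finished_message(builder: &FlatBufferBuilder<'_>) -> Vec<u8> {
    builder.finished_data().to_vec()
}

// The allocation and copy disappear if the borrowed bytes are written
// directly to the output instead of being stored first.
fn write_finished_message_directly<W: Write>(
    builder: &FlatBufferBuilder<'_>,
    mut sink: W,
) -> std::io::Result<()> {
    sink.write_all(builder.finished_data())
}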

What changes are included in this PR?

This PR introduces a private, writer-specific fast path for non-dictionary batches in StreamWriter and FileWriter:

  • Reuses a writer-owned scratch structure across batches (see the sketch after this list):

    • FlatBufferBuilder
    • arrow_data buffer
    • metadata vectors (nodes, buffers, variadic_buffer_counts)
  • Avoids copying flatbuffer data by writing directly from finished_data()

  • Bypasses the EncodedData path for non-dictionary batches while preserving it for dictionary cases

  • Adds a helper to detect dictionary usage and safely fall back to the existing implementation when needed
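
A minimal sketch of what such a writer-owned scratch can look like (the struct and field names are hypothetical and need not match the PR; the important property is that the allocations live as long as the writer and are cleared, not dropped, between batches):

use flatbuffers::FlatBufferBuilder;

struct WriterScratch<'fbb> {
    fbb: FlatBufferBuilder<'fbb>,
    arrow_data: Vec<u8>,
    nodes: Vec<(i64, i64)>,        // (length, null_count) per field node
    buffers: Vec<(i64, i64)>,      // (offset, length) per buffer
    variadic_buffer_counts: Vec<i64>,
}

impl<'fbb> WriterScratch<'fbb> {
    // Called before encoding each batch: clearing reuses the existing
    // capacity instead of freeing and re-allocating it.
    fn reset(&mut self) {
        self.fbb.reset();
        self.arrow_data.clear();
        self.nodes.clear();
        self.buffers.clear();
        self.variadic_buffer_counts.clear();
    }
}

Writing the finished flatbuffer and the arrow_data body out directly then avoids constructing an EncodedData and the associated to_vec() copy.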

Are these changes tested?

Yes.

  • Benchmark results of ipc_writer (cargo bench -p arrow-ipc --bench ipc_writer --features zstd -- --sample-size 50) show improvements for non-compressed workloads
  • Existing IPC writer tests (cargo test -p arrow-ipc --lib writer) pass without modification, confirming overall correctness and compatibility.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the arrow (Changes to the arrow crate) label Apr 26, 2026
@pchintar
Contributor Author

pchintar commented Apr 26, 2026

@alamb and @etseidl the benchmark results of ipc_writer show 30-45% improvements for the non-zstd cases, where writer overhead is a larger portion of the runtime
[screenshot: local ipc_writer benchmark results]

@adriangb could you please run benchmark ipc_writer

@pchintar pchintar changed the title from "feat(ipc): Avoid repetitive heap allocations and buffer copies in IPC writer" to "feat(ipc): Avoid repeated heap allocations and buffer copies in IPC writer" Apr 26, 2026
@adriangb
Contributor

run benchmark ipc_writer

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4323772156-1847-d4rd4 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing ipc-writer-avoid-repetitive-allocs (8ce051e) to 4fa8d2f (merge-base) diff
BENCH_NAME=ipc_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench ipc_writer
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu): identical to the run above

group                                                 ipc-writer-avoid-repetitive-allocs     main
-----                                                 ----------------------------------     ----
arrow_ipc_stream_writer/FileWriter/write_10           1.00    159.7±1.40µs        ? ?/sec    1.17    186.2±1.61µs        ? ?/sec
arrow_ipc_stream_writer/StreamWriter/write_10         1.00    155.8±1.62µs        ? ?/sec    1.18    183.7±1.77µs        ? ?/sec
arrow_ipc_stream_writer/StreamWriter/write_10/zstd    1.00      7.2±0.07ms        ? ?/sec    1.03      7.4±0.03ms        ? ?/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 35.0s
Peak memory 2.7 GiB
Avg memory 2.6 GiB
CPU user 30.6s
CPU sys 0.7s
Peak spill 0 B

branch

Metric Value
Wall time 35.0s
Peak memory 2.6 GiB
Avg memory 2.6 GiB
CPU user 31.0s
CPU sys 0.1s
Peak spill 0 B

File an issue against this benchmark runner

@pchintar
Contributor Author

Ok, so the results are positive (~15% improvement for non-compressed paths), but they're different from what I got, probably because of the different processors. I ran mine on my personal Mac, which has an Intel Core i9 x86_64 processor.

@pchintar
Contributor Author

pchintar commented Apr 27, 2026

@alamb Also, the compressed path is clearly much slower than the non-compressed path overall, so I took a look at the zstd path and found that in arrow-ipc/src/compression.rs the current compress_zstd function was doing:

  1. compress() → allocates a new Vec
  2. extend_from_slice() → copies into output

That's one extra allocation + one extra copy per buffer. Zstd actually provides compress_to_buffer(), which writes directly into an existing buffer. So we can change the current implementation from:

Current code (alloc --> compress --> copy)

#[cfg(feature = "zstd")]
fn compress_zstd(
    input: &[u8],
    output: &mut Vec<u8>,
    context: &mut CompressionContext,
) -> Result<(), ArrowError> {
    let result = context.zstd_compressor().compress(input)?;
    output.extend_from_slice(&result);
    Ok(())
}

New approach (compress --> direct write)

#[cfg(feature = "zstd")]
fn compress_zstd(
    input: &[u8],
    output: &mut Vec<u8>,
    context: &mut CompressionContext,
) -> Result<(), ArrowError> {
    use zstd_safe::compress_bound;

    let compressor = context.zstd_compressor();

    // Compute maximum compressed size
    let bound = compress_bound(input.len());

    // Reserve space and extend buffer to allow in-place write
    let offset = output.len();
    output.resize(offset + bound, 0);

    // Compress directly into output buffer
    let written = compressor
        .compress_to_buffer(input, &mut output[offset..])
        .map_err(|e| ArrowError::ExternalError(Box::new(e)))?;

    // Truncate to actual compressed size
    output.truncate(offset + written);

    Ok(())
}

I'll make this modification later today and re-test

@alamb alamb marked this pull request as draft April 27, 2026 18:50
@alamb
Contributor

alamb commented Apr 27, 2026

I'll make this modification later today and re-test

Thanks @pchintar

I'll mark this PR as draft -- please let me know when it is ready for a look

@alamb
Contributor

alamb commented Apr 27, 2026

Or maybe I misunderstood your intent -- do you want me to review this PR now? Or shall I wait for your next proposal?

@pchintar
Contributor Author

pchintar commented Apr 27, 2026

Hi @alamb,

So, I took a closer look at the ipc_writer benchmark's zstd path and the main cost seems to come from repeated calls to:

compress_to_vec(buffer, ...)  // ~52k calls to compress_to_vec per `ipc_writer` run

Right now the flow is strictly serial:

write_array_data
  → write_buffer
      → compress_to_vec (zstd)

i.e.

buffer1 → compress → write
buffer2 → compress → write
...

Since buffers are independent, I’m considering restructuring this to:

collect buffers → compress in parallel → write in order

Conceptually:

[buffer1, buffer2, buffer3]
        ↓
parallel compress
        ↓
append results (same order)

This would keep the same IPC format and compression behavior, while avoiding any output-size tradeoff.

Implementation-wise, something like:

let parallelism = thread::available_parallelism()
    .map(|n| n.get())
    .unwrap_or(1)
    .min(4);

Then process bounded chunks:

for chunk in pending_buffers.chunks(parallelism) {
    let compressed = compress_chunk_in_parallel(chunk)?;
    append_in_original_order(compressed)?;
}

where each worker owns its compression context:

let mut ctx = CompressionContext::default();
codec.compress_to_vec(buffer.as_slice(), &mut out, &mut ctx)?;
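
Put together, a minimal self-contained sketch of that shape could look like the following (all names here are hypothetical; compress_one stands in for the real per-buffer compression call, and scoped threads keep the borrows simple):

use std::thread;

// Hypothetical sketch of bounded per-batch parallel compression.
fn compress_buffers_bounded(
    buffers: &[Vec<u8>],
    compress_one: impl Fn(&[u8]) -> Vec<u8> + Sync,
) -> Vec<Vec<u8>> {
    let parallelism = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
        .min(4);
    let compress_one = &compress_one;
    let mut out = Vec::with_capacity(buffers.len());
    // Process at most `parallelism` buffers at a time so the number of
    // worker threads stays bounded.
    for chunk in buffers.chunks(parallelism) {
        let compressed: Vec<Vec<u8>> = thread::scope(|scope| {
            let handles: Vec<_> = chunk
                .iter()
                .map(|buf| scope.spawn(move || compress_one(buf.as_slice())))
                .collect();
            // Joining in spawn order preserves the original buffer order.
            handles.into_iter().map(|h| h.join().unwrap()).collect()
        });
        out.extend(compressed);
    }
    out
}

Joining the handles in spawn order keeps the compressed buffers in their original order, and chunks(parallelism) bounds how many worker threads are alive at once.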

Would this kind of bounded per-batch parallelism be acceptable in arrow-ipc, or would it introduce new hidden costs? I'm unsure of the impact of parallelism on the write path here, so I'm curious to hear your thoughts.

Thanks!

@alamb
Contributor

alamb commented Apr 29, 2026

Would this kind of bounded per-batch parallelism be acceptable in arrow-ipc, or would it introduce any new hidden costs? I'm really unsure of the impacts of parallelism here for the write process and so I'm really curious

I don't think adding parallelism (e.g. threads) to the underlying library calls is acceptable, as it comes with threading and memory overhead (more buffering) that is not appropriate for all use cases

What would probably be useful is to find some way (maybe an example) to show people who want to make this tradeoff how it could be done

@alamb
Contributor

alamb commented Apr 29, 2026

It seems like the approach of this PR (avoiding allocations) is still a benefit -- do you think it is worth pursuing (i.e. should I find time to review this PR)?

Thank you for your help (as always)


Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Avoid repeated heap allocations and buffer copies in IPC writer

4 participants