
feat(ipc): Avoid repeated heap allocations and buffer copies in IPC writer #9836

Draft

pchintar wants to merge 1 commit into apache:main from pchintar:ipc-writer-avoid-repetitive-allocs

Conversation

@pchintar
Contributor

@pchintar pchintar commented Apr 26, 2026

Which issue does this PR close?

Rationale for this change

The current IPC writer path performs repeated heap allocations and buffer copies for every record batch, even when writing batches with identical structure.

In arrow-ipc/src/writer.rs, the writer path is structured as:

RecordBatch
  → encode() → EncodedData
  → write_message()

The key issue is that EncodedData owns its buffers:

pub struct EncodedData {
    pub ipc_message: Vec<u8>,
    pub arrow_data: Vec<u8>,
}

This forces:

  • allocation of new Vec<u8> buffers per batch
  • copying of flatbuffer data via to_vec() into Vec<u8>
  • destruction of all intermediate buffers after each write

This leads to unnecessary latency overhead in common scenarios such as iterative or high-frequency batch writes. The issue is purely performance-related and does not affect correctness.
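
As a standalone illustration (not the actual encode() code), the cost difference between owning the finished flatbuffer bytes and writing the borrowed bytes straight to the output looks roughly like this; finished_data() is the real flatbuffers builder accessor, while the two helper functions are hypothetical and assume the builder has already been finished:

use std::io::Write;

use flatbuffers::FlatBufferBuilder;

// Hypothetical sketch, not arrow-ipc code: finished_data() only borrows the
// encoded flatbuffer, but storing it in an owning Vec<u8> (as
// EncodedData::ipc_message does) costs an allocation plus a memcpy per batch.
fn copy_finished_message(builder: &FlatBufferBuilder<'_>) -> Vec<u8> {
    builder.finished_data().to_vec()
}

// The allocation and copy disappear if the borrowed bytes are written
// directly to the output instead of being stored first.
fn write_finished_message_directly<W: Write>(
    builder: &FlatBufferBuilder<'_>,
    mut sink: W,
) -> std::io::Result<()> {
    sink.write_all(builder.finished_data())
}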

What changes are included in this PR?

This PR introduces a private, writer-specific fast path for non-dictionary batches in StreamWriter and FileWriter:

  • Reuses a writer-owned scratch structure across batches (see the sketch after this list):

    • FlatBufferBuilder
    • arrow_data buffer
    • metadata vectors (nodes, buffers, variadic_buffer_counts)
  • Avoids copying flatbuffer data by writing directly from finished_data()

  • Bypasses the EncodedData path for non-dictionary batches while preserving it for dictionary cases

  • Adds a helper to detect dictionary usage and safely fall back to the existing implementation when needed
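
A minimal sketch of what such a writer-owned scratch can look like (the struct and field names are hypothetical and need not match the PR; the important property is that the allocations live as long as the writer and are cleared, not dropped, between batches):

use flatbuffers::FlatBufferBuilder;

struct WriterScratch<'fbb> {
    fbb: FlatBufferBuilder<'fbb>,
    arrow_data: Vec<u8>,
    nodes: Vec<(i64, i64)>,        // (length, null_count) per field node
    buffers: Vec<(i64, i64)>,      // (offset, length) per buffer
    variadic_buffer_counts: Vec<i64>,
}

impl<'fbb> WriterScratch<'fbb> {
    // Called before encoding each batch: clearing reuses the existing
    // capacity instead of freeing and re-allocating it.
    fn reset(&mut self) {
        self.fbb.reset();
        self.arrow_data.clear();
        self.nodes.clear();
        self.buffers.clear();
        self.variadic_buffer_counts.clear();
    }
}

Writing the finished flatbuffer and the arrow_data body out directly then avoids constructing an EncodedData and the associated to_vec() copy.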

Are these changes tested?

Yes.

  • Benchmark results of ipc_writer (cargo bench -p arrow-ipc --bench ipc_writer --features zstd -- --sample-size 50) show improvements for non-compressed workloads
  • Existing IPC writer tests (cargo test -p arrow-ipc --lib writer) pass without modification, confirming overall correctness and compatibility.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the arrow (Changes to the arrow crate) label Apr 26, 2026
@pchintar
Contributor Author

pchintar commented Apr 26, 2026

@alamb and @etseidl the benchmark results of ipc_writer show 30-45% improvements for the non-zstd cases, where writer overhead is a larger portion of the runtime
[screenshot: local ipc_writer benchmark results]

@adriangb could you please run benchmark ipc_writer

@pchintar pchintar changed the title from "feat(ipc): Avoid repetitive heap allocations and buffer copies in IPC writer" to "feat(ipc): Avoid repeated heap allocations and buffer copies in IPC writer" Apr 26, 2026
@adriangb
Contributor

run benchmark ipc_writer

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4323772156-1847-d4rd4 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing ipc-writer-avoid-repetitive-allocs (8ce051e) to 4fa8d2f (merge-base) diff
BENCH_NAME=ipc_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench ipc_writer
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu): identical to the run above

group                                                 ipc-writer-avoid-repetitive-allocs     main
-----                                                 ----------------------------------     ----
arrow_ipc_stream_writer/FileWriter/write_10           1.00    159.7±1.40µs        ? ?/sec    1.17    186.2±1.61µs        ? ?/sec
arrow_ipc_stream_writer/StreamWriter/write_10         1.00    155.8±1.62µs        ? ?/sec    1.18    183.7±1.77µs        ? ?/sec
arrow_ipc_stream_writer/StreamWriter/write_10/zstd    1.00      7.2±0.07ms        ? ?/sec    1.03      7.4±0.03ms        ? ?/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 35.0s
Peak memory 2.7 GiB
Avg memory 2.6 GiB
CPU user 30.6s
CPU sys 0.7s
Peak spill 0 B

branch

Metric Value
Wall time 35.0s
Peak memory 2.6 GiB
Avg memory 2.6 GiB
CPU user 31.0s
CPU sys 0.1s
Peak spill 0 B

File an issue against this benchmark runner

@pchintar
Contributor Author

Ok, so the results are positive (~15% improvement for non-compressed paths), but they're different from what I got, probably because of the different processors. I ran mine on my personal Mac, which has an Intel Core i9 x86_64 processor.

@pchintar
Contributor Author

pchintar commented Apr 27, 2026

@alamb Also, the compressed path is clearly much slower than the non-compressed path overall, so I took a look at the zstd path and found that in arrow-ipc/src/compression.rs the current compress_zstd function was doing:

  1. compress() → allocates a new Vec
  2. extend_from_slice() → copies into output

That's one extra allocation + one extra copy per buffer. Zstd actually provides compress_to_buffer(), which writes directly into an existing buffer. So we can change the current implementation from:

Current code (alloc --> compress --> copy)

#[cfg(feature = "zstd")]
fn compress_zstd(
    input: &[u8],
    output: &mut Vec<u8>,
    context: &mut CompressionContext,
) -> Result<(), ArrowError> {
    let result = context.zstd_compressor().compress(input)?;
    output.extend_from_slice(&result);
    Ok(())
}

New approach (compress --> direct write)

#[cfg(feature = "zstd")]
fn compress_zstd(
    input: &[u8],
    output: &mut Vec<u8>,
    context: &mut CompressionContext,
) -> Result<(), ArrowError> {
    use zstd_safe::compress_bound;

    let compressor = context.zstd_compressor();

    // Compute maximum compressed size
    let bound = compress_bound(input.len());

    // Reserve space and extend buffer to allow in-place write
    let offset = output.len();
    output.resize(offset + bound, 0);

    // Compress directly into output buffer
    let written = compressor
        .compress_to_buffer(input, &mut output[offset..])
        .map_err(|e| ArrowError::ExternalError(Box::new(e)))?;

    // Truncate to actual compressed size
    output.truncate(offset + written);

    Ok(())
}

I'll make this modification later today and re-test

@alamb alamb marked this pull request as draft April 27, 2026 18:50
@alamb
Contributor

alamb commented Apr 27, 2026

I'll make this modification later today and re-test

Thanks @pchintar

I'll mark this PR as draft -- please let me know when it is ready for a look

@alamb
Contributor

alamb commented Apr 27, 2026

Or maybe I misunderstood your intent -- do you want me to review this PR now? Or shall I wait for your next proposal?

@pchintar
Contributor Author

pchintar commented Apr 27, 2026

Hi @alamb,

So, I took a closer look at the ipc_writer benchmark's zstd path and the main cost seems to come from repeated calls to:

compress_to_vec(buffer, ...)  // ~52k calls to compress_to_vec per `ipc_writer` run

Right now the flow is strictly serial:

write_array_data
  → write_buffer
      → compress_to_vec (zstd)

i.e.

buffer1 → compress → write
buffer2 → compress → write
...

Since buffers are independent, I’m considering restructuring this to:

collect buffers → compress in parallel → write in order

Conceptually:

[buffer1, buffer2, buffer3]
        ↓
parallel compress
        ↓
append results (same order)

This would keep the same IPC format and compression behavior, while avoiding any output-size tradeoff.

Implementation-wise, something like:

let parallelism = thread::available_parallelism()
    .map(|n| n.get())
    .unwrap_or(1)
    .min(4);

Then process bounded chunks:

for chunk in pending_buffers.chunks(parallelism) {
    let compressed = compress_chunk_in_parallel(chunk)?;
    append_in_original_order(compressed)?;
}

where each worker owns its compression context:

let mut ctx = CompressionContext::default();
codec.compress_to_vec(buffer.as_slice(), &mut out, &mut ctx)?;
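
Put together, a minimal self-contained sketch of that shape could look like the following (all names here are hypothetical; compress_one stands in for the real per-buffer compression call, and scoped threads keep the borrows simple):

use std::thread;

// Hypothetical sketch of bounded per-batch parallel compression.
fn compress_buffers_bounded(
    buffers: &[Vec<u8>],
    compress_one: impl Fn(&[u8]) -> Vec<u8> + Sync,
) -> Vec<Vec<u8>> {
    let parallelism = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
        .min(4);
    let compress_one = &compress_one;
    let mut out = Vec::with_capacity(buffers.len());
    // Process at most `parallelism` buffers at a time so the number of
    // worker threads stays bounded.
    for chunk in buffers.chunks(parallelism) {
        let compressed: Vec<Vec<u8>> = thread::scope(|scope| {
            let handles: Vec<_> = chunk
                .iter()
                .map(|buf| scope.spawn(move || compress_one(buf.as_slice())))
                .collect();
            // Joining in spawn order preserves the original buffer order.
            handles.into_iter().map(|h| h.join().unwrap()).collect()
        });
        out.extend(compressed);
    }
    out
}

Joining the handles in spawn order keeps the compressed buffers in their original order, and chunks(parallelism) bounds how many worker threads are alive at once.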

Would this kind of bounded per-batch parallelism be acceptable in arrow-ipc, or would it introduce new hidden costs? I'm unsure of the impact of parallelism on the write path here, so I'm curious to hear your thoughts.

Thanks!

@alamb
Contributor

alamb commented Apr 29, 2026

Would this kind of bounded per-batch parallelism be acceptable in arrow-ipc, or would it introduce any new hidden costs? I'm really unsure of the impacts of parallelism here for the write process and so I'm really curious

I don't think adding parallelism (e.g. threads) to the underlying library calls is acceptable, as it comes with threading and memory overhead (more buffering) that is not appropriate for all use cases

What would probably be useful is to find some way (maybe an example) to show people who want to make this tradeoff how it could be done

@alamb
Contributor

alamb commented Apr 29, 2026

It seems like the approach of this PR (avoiding allocations) is still a benefit -- do you think it is worth pursuing (i.e. should I find time to review this PR)?

Thank you for your help (as always)


Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Avoid repeated heap allocations and buffer copies in IPC writer

4 participants