
Avoid repeated heap allocations and buffer copies in IPC writer #9835

@pchintar

Description


When writing IPC data with StreamWriter or FileWriter, the current implementation performs fresh heap allocations and full buffer copies for every record batch, even when consecutive batches share the same schema and structure.

This adds unnecessary allocator pressure and latency, especially for high-frequency batch writes and streaming pipelines.


Root Cause

Currently in arrow-ipc/src/writer.rs, the writer path is structured as:

RecordBatch
  → encode() → EncodedData
  → write_message()

The key issue is that EncodedData owns its buffers:

pub struct EncodedData {
    pub ipc_message: Vec<u8>,
    pub arrow_data: Vec<u8>,
}

This forces:

  • allocation of new buffers per batch
  • copying of flatbuffer data into Vec<u8>
  • destruction of all intermediate buffers after each write
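A minimal sketch of the copy forced by owned buffers (not the arrow-ipc source; `encode` here is a hypothetical stand-in for the step that calls `FlatBufferBuilder::finished_data()` and then `to_vec()`):

```rust
// The builder hands back a borrowed slice; because EncodedData owns Vec<u8>,
// the bytes must be copied into a freshly allocated Vec for every batch.
fn encode(finished_data: &[u8]) -> Vec<u8> {
    // to_vec() allocates new heap memory and memcpys the bytes.
    finished_data.to_vec()
}

fn main() {
    let src = vec![1u8, 2, 3];
    let copy = encode(&src);
    assert_eq!(copy, src);
    // Distinct pointer: a separate allocation, not a view into the builder.
    assert_ne!(copy.as_ptr(), src.as_ptr());
}
```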

Current Behavior

For every batch, the following occurs:

1. Build the FlatBuffer (fbb)
2. Copy it into ipc_message via to_vec() (full copy)
3. Allocate the arrow_data Vec
4. Allocate metadata vectors
5. Return EncodedData (owned)
6. write_message() writes the data
7. All buffers are dropped
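The steps above can be sketched with plain Vecs standing in for the real flatbuffer and Arrow payloads (hypothetical names; `encode_batch` and `write_batches` are not the actual arrow-ipc functions):

```rust
// Hypothetical model of the current per-batch pattern: every iteration
// allocates fresh owned buffers that are dropped at the end of the loop body.
struct EncodedData {
    ipc_message: Vec<u8>,
    arrow_data: Vec<u8>,
}

fn encode_batch(flatbuffer: &[u8], payload: &[u8]) -> EncodedData {
    EncodedData {
        ipc_message: flatbuffer.to_vec(), // step 2: full copy of the flatbuffer
        arrow_data: payload.to_vec(),     // step 3: fresh allocation per batch
    }
}

fn write_batches(batches: &[(&[u8], &[u8])], out: &mut Vec<u8>) {
    for (fb, payload) in batches {
        let encoded = encode_batch(fb, payload);     // steps 1-5: allocate + copy
        out.extend_from_slice(&encoded.ipc_message); // step 6: write
        out.extend_from_slice(&encoded.arrow_data);
    } // step 7: `encoded` dropped, both buffers freed, churn repeats
}
```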

Implications

  • repeated heap allocations
  • repeated memory growth/reallocation
  • full flatbuffer copy per batch
  • memory churn (alloc → free → alloc)

Proposed Solution

For repeated batch writes, the writer should ideally do the following, without relying on nightly APIs or unsafe code:

1. Reuse FlatBufferBuilder
2. Reuse arrow_data buffer
3. Reuse metadata vectors
4. Avoid copying flatbuffer data
5. Write directly from existing buffers
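One possible shape for this, sketched with plain Vecs (the `BatchEncoder` type and its methods are hypothetical, not a proposed API): a long-lived encoder owns the scratch buffers, and `Vec::clear()` retains capacity, so steady-state writes stop allocating once the buffers reach their working size.

```rust
// Hypothetical reuse pattern: buffers live across batches instead of being
// allocated and dropped per write.
struct BatchEncoder {
    ipc_message: Vec<u8>, // stands in for the reused FlatBufferBuilder output
    arrow_data: Vec<u8>,  // reused Arrow body buffer
}

impl BatchEncoder {
    fn new() -> Self {
        Self { ipc_message: Vec::new(), arrow_data: Vec::new() }
    }

    // Encode into the existing buffers and write straight from them,
    // never constructing an owned EncodedData.
    fn write_batch(&mut self, flatbuffer: &[u8], payload: &[u8], out: &mut Vec<u8>) {
        self.ipc_message.clear(); // keeps capacity: no free, no realloc
        self.arrow_data.clear();
        self.ipc_message.extend_from_slice(flatbuffer);
        self.arrow_data.extend_from_slice(payload);
        out.extend_from_slice(&self.ipc_message);
        out.extend_from_slice(&self.arrow_data);
    }
}
```

The same idea applies to the real writer: hold the FlatBufferBuilder and metadata vectors on the writer struct and reset them between batches, rather than rebuilding them inside encode().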

cc @alamb and @etseidl
