Description
When writing IPC data using StreamWriter or FileWriter, the current implementation performs repeated heap allocations and full buffer copies for every record batch, even when writing batches with identical schema and structure.
This leads to unnecessary latency overhead, especially in high-frequency batch writes and streaming pipelines.
Root Cause
Currently in arrow-ipc/src/writer.rs, the writer path is structured as:
RecordBatch
→ encode() → EncodedData
→ write_message()
The key issue is that EncodedData owns its buffers:
pub struct EncodedData {
    pub ipc_message: Vec<u8>,
    pub arrow_data: Vec<u8>,
}
This forces:
- allocation of new buffers per batch
- copying of flatbuffer data into Vec<u8>
- destruction of all intermediate buffers after each write
Current Behavior
For every batch, the following occurs:
1. Build FlatBuffer (fbb)
2. Copy it → ipc_message.to_vec() (Full Copy)
3. Allocate arrow_data Vec
4. Allocate metadata vectors
5. Return EncodedData (owned)
6. write_message() writes data
7. All buffers dropped
Implications
- repeated heap allocations
- repeated memory growth/reallocation
- full flatbuffer copy per batch
- memory churn (alloc → free → alloc)
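To illustrate the churn in isolation, here is a minimal std-only sketch. It does not use arrow-ipc at all; `encode_fresh`, `encode_reuse`, and `count_regrowths` are hypothetical names that only model the allocation pattern, with a plain byte slice standing in for an encoded batch.

```rust
/// Models the per-batch pattern: every call allocates a new Vec and copies
/// the whole payload into it. The Vec is freed when the caller drops it.
fn encode_fresh(payload: &[u8]) -> Vec<u8> {
    payload.to_vec()
}

/// Models the reuse pattern: `clear()` resets the length but keeps the
/// allocation, so same-sized batches do not need to reallocate.
fn encode_reuse(scratch: &mut Vec<u8>, payload: &[u8]) {
    scratch.clear();
    scratch.extend_from_slice(payload);
}

/// Counts how often the scratch buffer's capacity changes while writing
/// `batches` payloads of identical size.
fn count_regrowths(batches: usize, payload: &[u8]) -> usize {
    let mut scratch = Vec::new();
    let mut regrowths = 0;
    let mut last_cap = scratch.capacity();
    for _ in 0..batches {
        encode_reuse(&mut scratch, payload);
        if scratch.capacity() != last_cap {
            regrowths += 1;
            last_cap = scratch.capacity();
        }
    }
    regrowths
}
```

With identically sized batches, the reused buffer should only grow on the first write, whereas `encode_fresh` pays for an allocation and a free on every call.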
Proposed Solution
For repeated batch writes, the writer should ideally, without any nightly APIs or unsafe code:
1. Reuse FlatBufferBuilder
2. Reuse arrow_data buffer
3. Reuse metadata vectors
4. Avoid copying flatbuffer data
5. Write directly from existing buffers
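One possible shape for this, as a rough sketch in safe, stable Rust using only the standard library. `EncodeScratch`, `EncodedView`, and `write_view` are hypothetical names and are not part of arrow-ipc. The sketch takes pre-encoded bytes as input and deliberately leaves out IPC framing (length prefix, padding); in the real writer the metadata bytes would come from a reused FlatBufferBuilder, where the flatbuffers crate's `reset()` and `finished_data()` may be enough to avoid the `to_vec()` copy.

```rust
use std::io::{self, Write};

/// Hypothetical scratch space owned by the writer and reused across batches.
#[derive(Default)]
pub struct EncodeScratch {
    ipc_message: Vec<u8>,
    arrow_data: Vec<u8>,
}

/// Hypothetical borrowed counterpart to `EncodedData`: it points into the
/// scratch buffers instead of owning fresh ones.
pub struct EncodedView<'a> {
    pub ipc_message: &'a [u8],
    pub arrow_data: &'a [u8],
}

impl EncodeScratch {
    /// Stand-in for encoding a batch. The buffers are cleared, not dropped,
    /// so their capacity carries over to the next batch.
    pub fn encode<'a>(&'a mut self, metadata: &[u8], body: &[u8]) -> EncodedView<'a> {
        self.ipc_message.clear();
        self.ipc_message.extend_from_slice(metadata);
        self.arrow_data.clear();
        self.arrow_data.extend_from_slice(body);
        EncodedView {
            ipc_message: &self.ipc_message,
            arrow_data: &self.arrow_data,
        }
    }
}

/// Writes both parts directly from the borrowed slices and returns the
/// number of bytes written. IPC framing is intentionally omitted here.
pub fn write_view<W: Write>(out: &mut W, view: &EncodedView<'_>) -> io::Result<usize> {
    out.write_all(view.ipc_message)?;
    out.write_all(view.arrow_data)?;
    Ok(view.ipc_message.len() + view.arrow_data.len())
}
```

Because the view borrows from the scratch space, the borrow checker ensures a batch is fully written before the next `encode` call reuses the buffers. The existing owned `EncodedData` could presumably stay in place for callers that need to hold the encoded bytes.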
cc @alamb and @etseidl