When you slice a RecordBatch and serialize it with StreamWriter, it produces an incorrect result. I'm using arrow = "11.1.0".
To reproduce, one can use the following test:
#[cfg(test)]
mod tests {
    use std::sync::Arc;

    use arrow::array::{Int32Array, StringArray};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::ipc::writer::StreamWriter;
    use arrow::record_batch::RecordBatch;

    #[test]
    fn it_works() {
        // Serialize one RecordBatch as an Arrow IPC stream held in memory.
        pub fn serialize(record: &RecordBatch) -> Vec<u8> {
            let buffer: Vec<u8> = Vec::new();
            let mut stream_writer = StreamWriter::try_new(buffer, &record.schema()).unwrap();
            stream_writer.write(record).unwrap();
            stream_writer.finish().unwrap();
            let serialized_batch = stream_writer.into_inner().unwrap();
            serialized_batch
        }

        // Build a batch with `rows` rows: an Int32 column of 1s and a Utf8 column of "a"s.
        fn create_batch(rows: usize) -> RecordBatch {
            let schema = Schema::new(vec![
                Field::new("a", DataType::Int32, false),
                Field::new("b", DataType::Utf8, false),
            ]);
            let expected_schema = schema.clone();
            let a = Int32Array::from(vec![1; rows]);
            let b = StringArray::from(vec!["a"; rows]);
            let record_batch =
                RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a), Arc::new(b)]).unwrap();
            record_batch
        }

        let big_record_batch = create_batch(65536);
        println!(
            "big_record_batch with dimension ({}, {}) (rows x cols) serialized as Apache Arrow IPC in {} bytes",
            big_record_batch.num_rows(),
            big_record_batch.num_columns(),
            serialize(&big_record_batch).len()
        );

        let length = 5;
        let small_record_batch = create_batch(length);
        println!(
            "small_record_batch with dimension ({}, {}) (rows x cols) serialized as Apache Arrow IPC in {} bytes",
            small_record_batch.num_rows(),
            small_record_batch.num_columns(),
            serialize(&small_record_batch).len()
        );

        // Slice 5 rows out of the big batch; this is the case that serializes too large.
        let offset = 2;
        let record_batch_slice = big_record_batch.slice(offset, length);
        println!(
            "(Sliced): record_batch_slice with dimension ({}, {}) (rows x cols) serialized as Apache Arrow IPC in {} bytes",
            record_batch_slice.num_rows(),
            record_batch_slice.num_columns(),
            serialize(&record_batch_slice).len()
        );
    }
}
As you can see, the sliced one has almost the same size as big_record_batch, but I would expect it to be the same size as small_record_batch:
big_record_batch with dimension (65536, 2) (rows x cols) serialized as Apache Arrow IPC in 606608 bytes
small_record_batch with dimension (5, 2) (rows x cols) serialized as Apache Arrow IPC in 464 bytes
(Sliced): record_batch_slice with dimension (5, 2) (rows x cols) serialized as Apache Arrow IPC in 590240 bytes
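As a rough sanity check on these numbers, the raw buffer sizes for the two columns can be estimated by hand. The helper below is hypothetical (raw_buffer_bytes is not part of arrow) and assumes the standard Arrow layout: 4 bytes per Int32 value, rows + 1 i32 offsets for a Utf8 column, and one byte per "a" string. It ignores IPC metadata, padding, and any validity bitmaps, so it is only an approximation:

```rust
/// Approximate size in bytes of the raw Arrow buffers for a batch shaped like
/// the one built by `create_batch(rows)`: an Int32 column plus a Utf8 column
/// whose values are all the one-byte string "a".
/// Assumption: IPC metadata, padding, and validity bitmaps are ignored.
fn raw_buffer_bytes(rows: usize) -> usize {
    let int32_values = rows * 4; // 4 bytes per Int32 value
    let utf8_offsets = (rows + 1) * 4; // i32 offsets, one more entry than rows
    let utf8_values = rows; // each "a" occupies a single byte
    int32_values + utf8_offsets + utf8_values
}
```

Under these assumptions, raw_buffer_bytes(65536) is 589828 and raw_buffer_bytes(5) is 49. The 590240 bytes reported for the slice sit just above the full 65536-row estimate, which is what one would expect if the writer emitted the complete underlying buffers rather than only the 5 selected rows.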
Can you confirm that the issue is just the size of the written file, and not a correctness problem - i.e. the data is larger than it could be, but still round-trips correctly? If so, I think as you've suggested this might be a duplicate of #208.
Is there any plan to resolve this issue? For my use case, I care specifically that I can write multiple smaller IPC messages rather than a single large one. I hoped to achieve this by slicing the large RecordBatch and writing each slice separately. It seems like a similar issue existed in arrow2 but was resolved last year.
As stated above, this isn't a bug per se; rather, the IPC format faithfully sends the arrays' underlying representation over the wire, even if some portion of the values has been logically sliced away. I think a feature that truncates buffers, rewrites offsets, and so on is definitely possible, as described in #208.
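To make "truncating buffers and rewriting offsets" concrete, here is a minimal sketch on a simplified stand-in for a Utf8 array. Utf8Column, compact_slice, and value are hypothetical names for illustration only; this is not the arrow crate's API and not necessarily how #208 would be implemented. It only shows the idea: copy the bytes the slice actually references and rebase the offsets so they start at zero.

```rust
/// Simplified stand-in for a Utf8 array (NOT the arrow crate's type):
/// `offsets` has one more entry than there are rows, and row `i` is the
/// byte range `values[offsets[i]..offsets[i + 1]]`.
#[derive(Debug, Clone, PartialEq)]
struct Utf8Column {
    offsets: Vec<i32>,
    values: Vec<u8>,
}

/// Copy rows `[offset, offset + length)` into fresh, minimal buffers.
/// The value buffer is truncated to the referenced bytes and the offsets
/// are rebased so that the first one is zero.
fn compact_slice(col: &Utf8Column, offset: usize, length: usize) -> Utf8Column {
    // `length` rows need `length + 1` offsets.
    let window = &col.offsets[offset..=offset + length];
    let start = window[0];
    let end = window[length];
    Utf8Column {
        offsets: window.iter().map(|o| o - start).collect(),
        values: col.values[start as usize..end as usize].to_vec(),
    }
}

/// Read row `i` back as a string slice.
fn value(col: &Utf8Column, i: usize) -> &str {
    let start = col.offsets[i] as usize;
    let end = col.offsets[i + 1] as usize;
    std::str::from_utf8(&col.values[start..end]).unwrap()
}
```

A writer that applied this kind of compaction to each sliced column before emitting it would send only the bytes the slice refers to, at the cost of a copy. Fixed-width columns such as Int32 are simpler, since only the value buffer needs truncating.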
I personally have very limited time to spend on this, but perhaps @nevi-me or @viirya might have some spare cycles?
This may be related to "Add support for writing sliced arrays" and "flight_data_from_arrow_batch sends too much data".