[Parquet] ArrowWriter with CDC panics on nested ListArrays #9637
Description
Describe the bug
- Found while testing feat(parquet): stream-encode definition/repetition levels incrementally #9447
Writing nested list data with parquet::arrow::ArrowWriter and content-defined chunking (CDC) enabled can panic inside the parquet column writer with an out-of-bounds slice access.
This appears to be a regression from:
- bad: bc74c71 (feat(parquet): add content defined chunking for arrow writer #9450)
- good: 39dda22
To Reproduce
This currently fails on main:
nice cargo bench -p parquet --bench arrow_writer -- list_primitive/cdc
Fails like this:
Benchmarking list_primitive/cdc: Warming up for 3.0000 s
thread 'main' (11848300) panicked at parquet/src/column/writer/mod.rs:720:39:
range end index 59344 out of range for slice of length 58905
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
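For readers unfamiliar with this panic message: it is what Rust's standard slice indexing produces when a range end exceeds the slice length. The sketch below is only an illustration of that failure mode and of a bounds-checked alternative. It is not the code at parquet/src/column/writer/mod.rs:720, and `take_levels` is a hypothetical helper invented for this example.

```rust
/// Hypothetical helper (not part of the parquet crate): return the first `n`
/// levels, or an error instead of panicking when `n` exceeds the slice length.
fn take_levels(levels: &[i16], n: usize) -> Result<&[i16], String> {
    // `slice.get(..n)` returns `None` when `n > len`, whereas `&slice[..n]` panics.
    levels.get(..n).ok_or_else(|| {
        format!(
            "range end index {} out of range for slice of length {}",
            n,
            levels.len()
        )
    })
}

fn main() {
    // Same lengths as in the benchmark failure above, used purely as sample numbers.
    let levels: Vec<i16> = vec![0; 58905];

    // Indexing with `&levels[..59344]` would panic; the checked form reports an error.
    assert!(take_levels(&levels, 59344).is_err());

    // An in-range request succeeds.
    assert_eq!(take_levels(&levels, 10).unwrap().len(), 10);
}
```

Whether the right fix is a bounds check or correcting the offsets that the chunker hands to the column writer is not something this sketch can answer; it only shows where the message comes from.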
Here is a standalone reproducer (in parquet/tests/arrow_writer.rs) that codex made for me:
#[test]
fn test_arrow_writer_cdc_list_roundtrip_regression() {
    let schema = Arc::new(Schema::new(vec![
        Field::new(
            "_1",
            DataType::List(Arc::new(Field::new_list_field(DataType::Int32, true))),
            true,
        ),
        Field::new(
            "_2",
            DataType::List(Arc::new(Field::new_list_field(DataType::Boolean, true))),
            true,
        ),
        Field::new(
            "_3",
            DataType::LargeList(Arc::new(Field::new_list_field(DataType::Utf8, true))),
            true,
        ),
    ]));
    let props = WriterProperties::builder()
        .set_content_defined_chunking(Some(CdcOptions::default()))
        .build();
    let batch = create_random_batch(schema, 2, 0.25, 0.75).unwrap();
    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, batch.schema(), Some(props)).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
    let mut reader = ParquetRecordBatchReader::try_new(Bytes::from(buffer), 1024).unwrap();
    let read = reader.next().unwrap().unwrap();
    assert_eq!(batch, read);
}
Run like:
cargo test -p parquet test_arrow_writer_cdc_list_roundtrip_regression
Results:
running 1 test
test test_arrow_writer_cdc_list_roundtrip_regression ... FAILED
failures:
---- test_arrow_writer_cdc_list_roundtrip_regression stdout ----
thread 'test_arrow_writer_cdc_list_roundtrip_regression' (11845398) panicked at parquet/src/column/writer/mod.rs:720:39:
range end index 1 out of range for slice of length 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Expected behavior
No panics; the benchmark and the test above should pass.
Additional context