[Parquet] ArrowWriter with CDC panics on nested ListArrays #9637

@alamb

Describe the bug

Writing nested list data with parquet::arrow::ArrowWriter while content-defined chunking (CDC) is enabled can panic inside the parquet column writer with an out-of-bounds slice access.

This appears to be a regression from:

To Reproduce
This currently fails on main:

nice cargo bench -p parquet --bench arrow_writer -- list_primitive/cdc

It fails like this:

Benchmarking list_primitive/cdc: Warming up for 3.0000 s
thread 'main' (11848300) panicked at parquet/src/column/writer/mod.rs:720:39:
range end index 59344 out of range for slice of length 58905
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Here is a standalone reproducer (in parquet/tests/arrow_writer.rs) that codex made for me:

#[test]
fn test_arrow_writer_cdc_list_roundtrip_regression() {
    let schema = Arc::new(Schema::new(vec![
        Field::new(
            "_1",
            DataType::List(Arc::new(Field::new_list_field(DataType::Int32, true))),
            true,
        ),
        Field::new(
            "_2",
            DataType::List(Arc::new(Field::new_list_field(DataType::Boolean, true))),
            true,
        ),
        Field::new(
            "_3",
            DataType::LargeList(Arc::new(Field::new_list_field(DataType::Utf8, true))),
            true,
        ),
    ]));
    let props = WriterProperties::builder()
        .set_content_defined_chunking(Some(CdcOptions::default()))
        .build();
    let batch = create_random_batch(schema, 2, 0.25, 0.75).unwrap();

    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, batch.schema(), Some(props)).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();

    let mut reader = ParquetRecordBatchReader::try_new(Bytes::from(buffer), 1024).unwrap();
    let read = reader.next().unwrap().unwrap();
    assert_eq!(batch, read);
}

Run it like this:

cargo test -p parquet test_arrow_writer_cdc_list_roundtrip_regression

Results:

running 1 test
test test_arrow_writer_cdc_list_roundtrip_regression ... FAILED

failures:

---- test_arrow_writer_cdc_list_roundtrip_regression stdout ----

thread 'test_arrow_writer_cdc_list_roundtrip_regression' (11845398) panicked at parquet/src/column/writer/mod.rs:720:39:
range end index 1 out of range for slice of length 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
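For readers unfamiliar with this class of panic, here is a minimal sketch of how a range index past the end of a slice panics, next to the checked alternative. It is not the code at parquet/src/column/writer/mod.rs:720, which is not shown in this report; the helper names `checked_levels` and `unchecked_levels` and the `i16` element type are hypothetical and chosen only for illustration.

```rust
use std::panic;

/// Hypothetical helper: returns `levels[start..end]`, or `None` when the
/// range does not fit inside the slice (or when `start > end`).
fn checked_levels(levels: &[i16], start: usize, end: usize) -> Option<&[i16]> {
    levels.get(start..end)
}

/// Hypothetical helper: indexes directly, so it panics when
/// `end > levels.len()`, which is the class of failure reported above.
fn unchecked_levels(levels: &[i16], start: usize, end: usize) -> &[i16] {
    &levels[start..end]
}

fn main() {
    // An empty buffer, mirroring "slice of length 0" in the test output.
    let levels: Vec<i16> = Vec::new();

    // The checked variant reports the problem as a value instead of panicking.
    assert!(checked_levels(&levels, 0, 1).is_none());

    // The direct index panics; catch it here only to keep the sketch runnable.
    let result = panic::catch_unwind(|| unchecked_levels(&levels, 0, 1).len());
    assert!(result.is_err());
}
```

Whether the right fix in the writer is a bounds check or a correction to how the CDC chunk ranges are computed for nested lists is not something this sketch can answer.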

Expected behavior
No panics; both the benchmark and the test should pass.

Additional context
