Record reader panics with "index out of bounds" when row group num_rows exceeds actual column data #9992

@BoazC-MSFT

Describe the bug

When reading a Parquet file whose row group metadata reports more rows than a column chunk actually contains, the record reader (RowIter / get_row_iter) panics instead of returning an error. The panic message looks like "index out of bounds: the len is 0 but the index is 70093" and is raised in parquet/src/record/triplet.rs inside TypedTripletIter::current_def_level.

This happens because TypedTripletIter::read_next clears the internal def_levels, rep_levels, and values buffers when the underlying column reader is exhausted, but returns Ok(false) without resetting curr_triplet_index to 0. The higher-level Reader::read_field then calls current_def_level() unconditionally (in OptionReader, RepeatedReader, and KeyValueReader variants), which indexes into an empty vector with the stale index from the previous batch.
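To make the mechanism concrete, here is a minimal, self-contained sketch of the stale-index state. It is not the parquet crate's code: TripletModel, its pending field, and checked_def_level are hypothetical names invented for illustration, and the batching logic is simplified on the assumption that every batch is non-empty.

```rust
use std::collections::VecDeque;

/// Hypothetical, simplified model of one leaf column's triplet iterator.
/// It only mimics the buffer/index behaviour described in this issue.
struct TripletModel {
    /// Batches of definition levels still to be read (assumed non-empty).
    pending: VecDeque<Vec<i16>>,
    def_levels: Vec<i16>,
    curr_triplet_index: usize,
}

impl TripletModel {
    fn new(batches: Vec<Vec<i16>>) -> Self {
        TripletModel {
            pending: batches.into(),
            def_levels: Vec::new(),
            curr_triplet_index: 0,
        }
    }

    /// Advances to the next triplet. On exhaustion the buffer is cleared,
    /// but the index is left pointing into the previous batch.
    fn read_next(&mut self) -> bool {
        self.curr_triplet_index += 1;
        if self.curr_triplet_index >= self.def_levels.len() {
            match self.pending.pop_front() {
                Some(batch) => {
                    self.def_levels = batch;
                    self.curr_triplet_index = 0;
                }
                None => {
                    self.def_levels.clear();
                    return false; // curr_triplet_index stays stale
                }
            }
        }
        true
    }

    /// Unchecked accessor: panics if called after exhaustion.
    fn current_def_level(&self) -> i16 {
        self.def_levels[self.curr_triplet_index]
    }

    /// Bounds-checked alternative that reports an error instead.
    fn checked_def_level(&self) -> Result<i16, String> {
        self.def_levels
            .get(self.curr_triplet_index)
            .copied()
            .ok_or_else(|| "Unexpected end of column data".to_string())
    }
}
```

With a single batch of three levels, the fourth call to read_next returns false and leaves an empty buffer with an index of 3. Calling current_def_level() in that state hits the same kind of out-of-bounds panic reported above, while checked_def_level() returns Err.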

ReaderIter trusts row_group.metadata().num_rows() to drive iteration without cross-checking whether the leaf column readers have actually been exhausted, so a mismatch between metadata and actual data triggers the panic.

To Reproduce

Construct a ReaderIter via TreeBuilder with num_records set to one more than the actual number of values in the column, and iterate to completion. For example, using nulls.snappy.parquet (8 rows):

let reader = TreeBuilder::new().build(descr, &*row_group_reader).unwrap();
let iter = ReaderIter::new(reader, 9).unwrap(); // actual data has 8 rows
for row in iter {
    let _ = row.unwrap(); // panics on the 9th iteration
}

In production this is triggered by third-party Parquet files where the row group footer declares a num_rows value larger than the actual encoded column data.

Expected behavior

The iterator should return Err with a descriptive message like "Unexpected end of column data" instead of panicking.
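As a rough illustration of that behavior, the sketch below models an iterator that is driven by a declared row count but checks whether the underlying data has run out. RowIterModel and its fields are hypothetical and stand in for ReaderIter only conceptually. Stopping after the first error is one possible design choice, not something the issue prescribes.

```rust
/// Hypothetical model of a row iterator driven by the row count declared
/// in metadata rather than by the data that is actually present.
struct RowIterModel {
    values: std::vec::IntoIter<i32>,
    declared_rows: usize,
    emitted: usize,
    failed: bool,
}

impl RowIterModel {
    fn new(values: Vec<i32>, declared_rows: usize) -> Self {
        RowIterModel {
            values: values.into_iter(),
            declared_rows,
            emitted: 0,
            failed: false,
        }
    }
}

impl Iterator for RowIterModel {
    type Item = Result<i32, String>;

    fn next(&mut self) -> Option<Self::Item> {
        // Stop once all declared rows were produced or an error was reported.
        if self.failed || self.emitted >= self.declared_rows {
            return None;
        }
        match self.values.next() {
            Some(v) => {
                self.emitted += 1;
                Some(Ok(v))
            }
            None => {
                // Metadata promised more rows than the column holds:
                // report an error once instead of indexing an empty buffer.
                self.failed = true;
                Some(Err(format!(
                    "Unexpected end of column data: metadata declares {} rows but only {} were read",
                    self.declared_rows, self.emitted
                )))
            }
        }
    }
}
```

With 8 values and a declared count of 9, this model yields eight Ok items followed by a single Err, then ends. With matching counts it yields only Ok items.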

Additional context

The bug affects all Reader variants that call current_def_level() before checking has_next(): OptionReader, GroupReader (for optional children), RepeatedReader, and KeyValueReader. The PrimitiveReader variant is also affected for required columns, where current_value() indexes into the empty values buffer.
