Describe the bug
When reading a Parquet file whose row group metadata reports more rows than a column chunk actually contains, the record reader (RowIter / get_row_iter) panics instead of returning an error. The panic message is something like "index out of bounds: the len is 0 but the index is 70093", raised in parquet/src/record/triplet.rs inside TypedTripletIter::current_def_level.
This happens because TypedTripletIter::read_next clears the internal def_levels, rep_levels, and values buffers when the underlying column reader is exhausted, but returns Ok(false) without resetting curr_triplet_index to 0. The higher-level Reader::read_field then calls current_def_level() unconditionally (in OptionReader, RepeatedReader, and KeyValueReader variants), which indexes into an empty vector with the stale index from the previous batch.
ReaderIter trusts row_group.metadata().num_rows() to drive iteration without cross-checking whether the leaf column readers have actually been exhausted, so a mismatch between metadata and actual data triggers the panic.
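The stale-index mechanism described above can be illustrated with a minimal, self-contained sketch. This is a toy model, not the parquet crate's code: ToyTripletIter, its fields, and its methods are hypothetical names chosen to mirror the description, and only definition levels are modeled.

```rust
// Toy model of the failure mode; NOT the parquet crate's implementation.
// All names here are hypothetical and exist only for illustration.
struct ToyTripletIter {
    batches: Vec<Vec<i16>>,    // batches still to be read (stored reversed)
    def_levels: Vec<i16>,      // buffer for the current batch
    curr_triplet_index: usize, // position inside `def_levels`
}

impl ToyTripletIter {
    fn new(mut batches: Vec<Vec<i16>>) -> Self {
        batches.reverse(); // so that pop() yields batches in original order
        Self { batches, def_levels: Vec::new(), curr_triplet_index: 0 }
    }

    /// Advances to the next triplet. Returns false once the data is exhausted.
    fn read_next(&mut self) -> bool {
        self.curr_triplet_index += 1;
        if self.curr_triplet_index >= self.def_levels.len() {
            match self.batches.pop() {
                Some(batch) => {
                    self.def_levels = batch;
                    self.curr_triplet_index = 0;
                }
                None => {
                    // Mirrors the reported behavior: the buffer is cleared,
                    // but the index keeps its value from the previous batch.
                    self.def_levels.clear();
                    return false;
                }
            }
        }
        true
    }

    /// Unchecked access: panics if called after exhaustion.
    fn current_def_level(&self) -> i16 {
        self.def_levels[self.curr_triplet_index]
    }

    /// Checked access: returns None instead of panicking.
    fn current_def_level_checked(&self) -> Option<i16> {
        self.def_levels.get(self.curr_triplet_index).copied()
    }
}
```

In this model, a caller that ignores the false returned by read_next and then calls current_def_level indexes an empty vector with the leftover index, which is the same shape of panic as in the report.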
To Reproduce
Construct a ReaderIter via TreeBuilder with num_records set to one more than the actual number of values in the column, and iterate to completion. For example using nulls.snappy.parquet (8 rows):
let reader = TreeBuilder::new().build(descr, &*row_group_reader).unwrap();
let iter = ReaderIter::new(reader, 9).unwrap(); // actual data has 8 rows
for row in iter {
    let _ = row.unwrap(); // panics on the 9th iteration
}
In production this is triggered by third-party Parquet files where the row group footer declares a num_rows value larger than the actual encoded column data.
Expected behavior
The iterator should return Err with a descriptive message like "Unexpected end of column data" instead of panicking.
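As a rough illustration of the requested behavior, the sketch below replaces unchecked indexing with a bounds-checked lookup that surfaces an error. It is not a proposed patch for the parquet crate: EndOfColumn, def_level_at, and read_declared_rows are hypothetical names, and the real fix would presumably return the crate's own error type.

```rust
use std::fmt;

// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
struct EndOfColumn {
    index: usize, // index the reader tried to access
    len: usize,   // number of levels actually buffered
}

impl fmt::Display for EndOfColumn {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "Unexpected end of column data: index {} but only {} levels buffered",
            self.index, self.len
        )
    }
}

/// Bounds-checked lookup: returns Err instead of panicking.
fn def_level_at(def_levels: &[i16], index: usize) -> Result<i16, EndOfColumn> {
    def_levels
        .get(index)
        .copied()
        .ok_or(EndOfColumn { index, len: def_levels.len() })
}

/// Reads as many rows as the (possibly wrong) metadata declares,
/// stopping at the first error.
fn read_declared_rows(
    def_levels: &[i16],
    declared_rows: usize,
) -> Result<Vec<i16>, EndOfColumn> {
    (0..declared_rows)
        .map(|i| def_level_at(def_levels, i))
        .collect()
}
```

With three buffered levels and a declared row count of four, read_declared_rows returns an Err carrying the offending index rather than panicking, which is the outcome the issue asks for.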
Additional context
The bug affects all Reader variants that call current_def_level() before checking has_next(): OptionReader, GroupReader (for optional children), RepeatedReader, and KeyValueReader. The PrimitiveReader variant is also affected for required columns where current_value() indexes into the empty values buffer.