Record reader panics with "index out of bounds" when row group num_rows exceeds actual column data

**Describe the bug**

When reading a Parquet file whose row group metadata reports more rows than a column chunk actually contains, the record reader (`RowIter` / `get_row_iter`) panics instead of returning an error. The panic message resembles `index out of bounds: the len is 0 but the index is 70093`, raised in `parquet/src/record/triplet.rs` inside `TypedTripletIter::current_def_level`.

This happens because `TypedTripletIter::read_next` clears the internal `def_levels`, `rep_levels`, and `values` buffers when the underlying column reader is exhausted, but returns `Ok(false)` without resetting `curr_triplet_index` to 0. The higher-level `Reader::read_field` then calls `current_def_level()` unconditionally (in `OptionReader`, `RepeatedReader`, and `KeyValueReader` variants), which indexes into an empty vector with the stale index from the previous batch.

`ReaderIter` trusts `row_group.metadata().num_rows()` to drive iteration without cross-checking whether the leaf column readers have actually been exhausted, so a mismatch between metadata and actual data triggers the panic.
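The failure mode described above can be modeled without the parquet crate. The following is a minimal sketch, not the actual `TypedTripletIter` code: `ToyTripletIter` and `read_trusting_count` are hypothetical names, and the batching logic is simplified (batches are assumed to be non-empty). It only illustrates how a cleared buffer combined with a stale index and a trusted row count produces the out-of-bounds panic.

```rust
/// Toy stand-in for a triplet iterator (hypothetical, not the parquet type).
struct ToyTripletIter {
    batches: std::vec::IntoIter<Vec<i16>>, // remaining batches of definition levels
    def_levels: Vec<i16>,                  // buffer holding the current batch
    curr_triplet_index: usize,             // position inside `def_levels`
}

impl ToyTripletIter {
    fn new(batches: Vec<Vec<i16>>) -> Self {
        Self {
            batches: batches.into_iter(),
            def_levels: Vec::new(),
            curr_triplet_index: 0,
        }
    }

    /// Advances to the next level. On exhaustion the buffer is cleared and
    /// `false` is returned, but the index is left at its old value.
    fn read_next(&mut self) -> bool {
        self.curr_triplet_index += 1;
        if self.curr_triplet_index >= self.def_levels.len() {
            match self.batches.next() {
                Some(batch) => {
                    self.def_levels = batch;
                    self.curr_triplet_index = 0;
                }
                None => {
                    self.def_levels.clear();
                    return false; // stale index, empty buffer
                }
            }
        }
        true
    }

    /// Indexes the buffer without checking whether data is available.
    fn current_def_level(&self) -> i16 {
        self.def_levels[self.curr_triplet_index]
    }
}

/// Reads `declared_rows` levels, trusting the declared count and ignoring
/// the result of `read_next`, as a model of trusting `num_rows`.
fn read_trusting_count(iter: &mut ToyTripletIter, declared_rows: usize) -> Vec<i16> {
    let mut out = Vec::with_capacity(declared_rows);
    for _ in 0..declared_rows {
        iter.read_next();
        out.push(iter.current_def_level()); // panics once the data runs out
    }
    out
}
```

With a single batch of three levels, asking for three rows succeeds, while asking for four indexes an empty vector with the leftover index and panics.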

**To Reproduce**

Construct a `ReaderIter` via `TreeBuilder` with `num_records` set to one more than the actual number of values in the column, and iterate to completion. For example, using `nulls.snappy.parquet` (8 rows):

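```rust
// `descr` (the schema descriptor) and `row_group_reader` come from opening
// the test file's row group; that setup is omitted here.
let reader = TreeBuilder::new().build(descr, &*row_group_reader).unwrap();
let iter = ReaderIter::new(reader, 9).unwrap(); // actual data has 8 rows
for row in iter {
    let _ = row.unwrap(); // panics on the 9th iteration
}
```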

In production this is triggered by third-party Parquet files where the row group footer declares a `num_rows` value larger than the actual encoded column data.

**Expected behavior**

The iterator should return `Err` with a descriptive message like "Unexpected end of column data" instead of panicking.
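As an illustration of the expected shape only, here is a small standalone sketch of an iterator that reports the mismatch as an error. `CheckedRows` and its fields are hypothetical and do not correspond to the crate's `ReaderIter`; the sketch assumes a single column of `i32` values and that iteration should stop after the first error.

```rust
/// Hypothetical row iterator that yields `Err` instead of panicking when the
/// declared row count exceeds the data that is actually available.
struct CheckedRows {
    values: Vec<i32>,     // data actually present in the column
    declared_rows: usize, // row count claimed by the metadata
    next_row: usize,      // index of the next row to yield
    failed: bool,         // set after the first error so iteration stops
}

impl CheckedRows {
    fn new(values: Vec<i32>, declared_rows: usize) -> Self {
        Self { values, declared_rows, next_row: 0, failed: false }
    }
}

impl Iterator for CheckedRows {
    type Item = Result<i32, String>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.failed || self.next_row >= self.declared_rows {
            return None;
        }
        let row = self.next_row;
        self.next_row += 1;
        // `get` returns `None` past the end instead of panicking.
        match self.values.get(row) {
            Some(v) => Some(Ok(*v)),
            None => {
                self.failed = true;
                Some(Err(format!(
                    "Unexpected end of column data: {} rows declared, {} available",
                    self.declared_rows,
                    self.values.len()
                )))
            }
        }
    }
}
```

A caller iterating with 8 values and a declared count of 9 would see eight `Ok` items followed by one `Err`, and collecting into `Result<Vec<_>, _>` would surface the error without a panic.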

**Additional context**

The bug affects all Reader variants that call `current_def_level()` before checking `has_next()`: `OptionReader`, `GroupReader` (for optional children), `RepeatedReader`, and `KeyValueReader`. The `PrimitiveReader` variant is also affected for required columns where `current_value()` indexes into the empty `values` buffer.


Record reader panics with "index out of bounds" when row group num_rows exceeds actual column data #9992
