Expose row group index in Parquet readers #3411

@uros7251brick

Describe the enhancement requested

Description

Similar to how getCurrentRowIndex() was introduced to expose the current row's file-level index, this adds getCurrentRowGroupIndex() to expose the index of the row group currently being read.

New API

  • ParquetFileReader.getCurrentRowGroupIndex() — returns the 0-based index of the last row group read via readNextRowGroup() / readNextFilteredRowGroup(). Returns -1 before any row group has been read.
  • ParquetReader.getCurrentRowGroupIndex() — same semantics, for the high-level record reader.
  • ParquetRecordReader.getCurrentRowGroupIndex() — same, for the Hadoop MapReduce record reader.

The returned index is the row group's actual position in the file, not a count of row groups read so far. It therefore reflects gaps when empty row groups are skipped: if row group 1 is empty, the reported indices are 0, 2, ... rather than 0, 1, ...
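The semantics above can be illustrated with a small toy model. This is only a sketch, not parquet-java code: `ToyReader` and `reportedIndices` are hypothetical names, a row group is reduced to its row count, and `readNextRowGroup()` here returns that count (or -1 at end of file) instead of a page store. It assumes empty row groups are skipped, as described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the proposed index semantics; not the real reader. */
class RowGroupIndexSketch {

    /** Hypothetical stand-in for a file reader: each array entry is one row group's row count. */
    static final class ToyReader {
        private final long[] rowCounts;
        private int nextRowGroup = 0;           // next file-level row group to consider
        private int currentRowGroupIndex = -1;  // -1 until a row group has been read

        ToyReader(long[] rowCounts) {
            this.rowCounts = rowCounts.clone();
        }

        /** Advances to the next non-empty row group; returns its row count, or -1 at end of file. */
        long readNextRowGroup() {
            while (nextRowGroup < rowCounts.length) {
                int candidate = nextRowGroup++;
                if (rowCounts[candidate] > 0) {
                    // Record the file-level position, not the number of groups read so far.
                    currentRowGroupIndex = candidate;
                    return rowCounts[candidate];
                }
            }
            return -1;
        }

        int getCurrentRowGroupIndex() {
            return currentRowGroupIndex;
        }
    }

    /** Collects the index reported after each successful read. */
    static List<Integer> reportedIndices(long[] rowCounts) {
        ToyReader reader = new ToyReader(rowCounts);
        List<Integer> seen = new ArrayList<>();
        while (reader.readNextRowGroup() >= 0) {
            seen.add(reader.getCurrentRowGroupIndex());
        }
        return seen;
    }

    public static void main(String[] args) {
        // Row group 1 is empty, so the gap shows up in the reported indices.
        System.out.println(reportedIndices(new long[] {3, 0, 2})); // prints [0, 2]
    }
}
```

In this model the index keeps pointing at the last row group read once the end of the file is reached, which matches the "last row group read" wording in the API description.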

Motivation

Engines like Apache Spark need to know which row group a record belongs to — for example, to expose row group metadata as a hidden column, or to correlate records with row group-level statistics. Without this API, callers have no way to determine the current row group index during sequential reads.

Component(s)

No response
