Expose parquet row-number virtual column (RowNumber) on ParquetSource/ParquetOpener #22517

@schenksj

Is your feature request related to a problem or challenge?

ParquetSource / ParquetOpener (in datafusion-datasource-parquet) cannot emit the parquet reader's row-number virtual column, even though the underlying parquet crate (58.x) fully supports it:

let row_number_field = Field::new("row_number", DataType::Int64, false)
    .with_extension_type(parquet::arrow::...::RowNumber);
let builder = builder.with_virtual_columns(vec![row_number_field])?;

The row-number virtual column gives each row its true physical position within the file even under row-group / page / row-filter pruning. This is exactly what engines need to reconstruct stable per-row identity while still benefiting from predicate pushdown.

Concretely, this blocks Delta Lake row tracking (_metadata.row_id = baseRowId + physical_row_index) on top of DataFusion. To keep the synthesized row_id/row_index correct, an integrating engine must currently disable data-filter pushdown, so that the reader returns every row in physical order and a running counter stays aligned. That defeats row-group skipping whenever _metadata.row_id is projected alongside a selective filter.
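
To illustrate the misalignment, here is a minimal std-only sketch. It does not use the parquet or DataFusion APIs; the function names and the in-memory "file" are hypothetical stand-ins. It compares a row_id derived from a running counter over the returned rows with one derived from each row's physical index, once a filter has dropped rows:

```rust
/// Physical positions of the rows that survive a pushed-down filter.
/// Hypothetical stand-in for what a pruning reader would return.
fn surviving_positions(values: &[i64], predicate: impl Fn(i64) -> bool) -> Vec<usize> {
    values
        .iter()
        .enumerate()
        .filter(|(_, v)| predicate(**v))
        .map(|(i, _)| i)
        .collect()
}

/// row_id computed from a running counter over the rows actually returned.
/// Only correct when no rows were skipped.
fn row_ids_from_counter(base_row_id: i64, returned_rows: usize) -> Vec<i64> {
    (0..returned_rows as i64).map(|i| base_row_id + i).collect()
}

/// row_id computed as baseRowId + physical_row_index, which is what a
/// row-number virtual column would make possible.
fn row_ids_from_row_numbers(base_row_id: i64, row_numbers: &[usize]) -> Vec<i64> {
    row_numbers.iter().map(|&n| base_row_id + n as i64).collect()
}

fn main() {
    // Six rows in one file; the filter `value == 42` keeps physical rows 1, 3, 5.
    let values = [5, 42, 7, 42, 9, 42];
    let base_row_id = 1000;

    let kept = surviving_positions(&values, |v| v == 42);
    let from_counter = row_ids_from_counter(base_row_id, kept.len());
    let from_row_numbers = row_ids_from_row_numbers(base_row_id, &kept);

    println!("physical positions: {:?}", kept); // [1, 3, 5]
    println!("counter-based ids:  {:?}", from_counter); // [1000, 1001, 1002]
    println!("row-number ids:     {:?}", from_row_numbers); // [1001, 1003, 1005]
}
```

The counter-based ids are only right when the reader returns every row, which is why pushdown has to be disabled today.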

There is no hook to inject this today:

  • ParquetOpener never calls with_virtual_columns, and its expr_adapter_factory field is pub(crate), so the opener can't be reused/extended from outside the crate.
  • ParquetSource exposes no builder-customization hook.
  • The ParquetFileReaderFactory provides only the AsyncFileReader, not builder configuration.

So the only workaround is to re-implement a custom FileOpener (duplicating projection / row-filter / pruning plumbing), which is what we're doing downstream in Apache DataFusion Comet (apache/datafusion-comet — Delta contrib).

Describe the solution you'd like

Expose virtual columns on ParquetSource / ParquetOpener, e.g.:

let source = ParquetSource::new(schema)
    .with_virtual_columns(vec![row_number_field]); // RowNumber-extension field(s)

…and have ParquetOpener forward them to ParquetRecordBatchStreamBuilder::with_virtual_columns(...) and include them in the projected output schema, so the rest of the existing pruning/row-filter/projection logic is reused unchanged.
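
As a rough model of the "include them in the projected output schema" step, the sketch below uses plain stand-in types rather than the real arrow/DataFusion schema types. The `Column` struct, the `output_schema` function, and the choice to append virtual columns after the projected file columns are all assumptions for illustration, not the existing or proposed implementation:

```rust
/// Hypothetical stand-in for a schema field.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    name: String,
    /// True for reader-synthesized columns such as the row number.
    is_virtual: bool,
}

/// Simplified model: apply the projection to the file columns, then append
/// the requested virtual columns (ordering is an assumption of this sketch).
fn output_schema(
    file_columns: &[&str],
    projection: &[usize],
    virtual_columns: &[&str],
) -> Vec<Column> {
    let mut out: Vec<Column> = projection
        .iter()
        .map(|&i| Column { name: file_columns[i].to_string(), is_virtual: false })
        .collect();
    out.extend(
        virtual_columns
            .iter()
            .map(|&name| Column { name: name.to_string(), is_virtual: true }),
    );
    out
}

fn main() {
    let schema = output_schema(&["id", "name", "value"], &[0, 2], &["row_number"]);
    for col in &schema {
        println!("{} (virtual: {})", col.name, col.is_virtual);
    }
}
```

In this model the projection and pruning inputs are untouched; the virtual column only widens the output, which is the property the proposal relies on.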

Describe alternatives you've considered

  • Re-implementing a custom FileOpener that builds the stream with with_virtual_columns. This is our current downstream approach; it works, but it duplicates a lot of well-tested opener logic and is a maintenance burden.
  • A reader-factory hook — insufficient, since virtual columns are configured on the stream builder, not the reader.

Additional context

Downstream consumer: Apache DataFusion Comet's native Delta Lake scan (apache/datafusion-comet#4366). We'd be happy to contribute a PR if the API shape above is agreeable.
