Is your feature request related to a problem or challenge?
ParquetSource / ParquetOpener (in datafusion-datasource-parquet) cannot emit the parquet reader's row-number virtual column, even though the underlying parquet crate (58.x) fully supports it:
let row_number_field = Field::new("row_number", DataType::Int64, false)
    .with_extension_type(parquet::arrow::...::RowNumber);
let builder = builder.with_virtual_columns(vec![row_number_field])?;
The row-number virtual column gives each row its true physical position within the file even under row-group / page / row-filter pruning. This is exactly what engines need to reconstruct stable per-row identity while still benefiting from predicate pushdown.
Concretely, this blocks Delta Lake row tracking (_metadata.row_id = baseRowId + physical_row_index) on top of DataFusion: to keep the synthesized row_id/row_index correct, an integrating engine must currently disable data-filter pushdown (so the reader returns every row in physical order and a running counter stays aligned). That defeats row-group skipping whenever _metadata.row_id is projected alongside a selective filter.
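To make the drift concrete, here is a minimal std-only sketch. The sample values, the base row id, and all helper names are hypothetical illustrations, not DataFusion or parquet APIs. It compares row ids computed from a running counter over the rows a filtered reader returns with row ids computed from each row's physical index, which is what the row-number virtual column would supply.

```rust
/// Row ids synthesized from a running counter over the rows the reader returned.
/// Only correct if the reader returns every row in physical order.
fn row_ids_from_counter(base_row_id: i64, returned_rows: usize) -> Vec<i64> {
    (0..returned_rows as i64).map(|i| base_row_id + i).collect()
}

/// Row ids synthesized as baseRowId + physical_row_index.
fn row_ids_from_physical_index(base_row_id: i64, physical_indices: &[i64]) -> Vec<i64> {
    physical_indices.iter().map(|idx| base_row_id + idx).collect()
}

/// Physical indices of the rows that survive a pushed-down predicate.
fn surviving_indices(values: &[i32], keep: impl Fn(i32) -> bool) -> Vec<i64> {
    values
        .iter()
        .enumerate()
        .filter(|(_, v)| keep(**v))
        .map(|(i, _)| i as i64)
        .collect()
}

fn main() {
    // Hypothetical file contents and base row id.
    let values = [5, 42, 7, 99, 13, 64];
    let base_row_id = 1000;

    // With the filter pushed down, only the rows at physical indices 1, 3, 5 come back.
    let survivors = surviving_indices(&values, |v| v > 40);

    let from_index = row_ids_from_physical_index(base_row_id, &survivors);
    let from_counter = row_ids_from_counter(base_row_id, survivors.len());

    println!("physical index: {:?}", from_index);
    println!("running counter: {:?}", from_counter);
}
```

With pushdown enabled, the counter-based ids no longer match the physical positions of the surviving rows, which is why the counter approach forces pushdown to be disabled.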
There is no hook to inject this today:
- ParquetOpener never calls with_virtual_columns, and its expr_adapter_factory field is pub(crate), so the opener can't be reused/extended from outside the crate.
- ParquetSource exposes no builder-customization hook.
- The ParquetFileReaderFactory provides only the AsyncFileReader, not builder configuration.
So the only workaround is to re-implement a custom FileOpener (duplicating projection / row-filter / pruning plumbing), which is what we're doing downstream in Apache DataFusion Comet (apache/datafusion-comet — Delta contrib).
Describe the solution you'd like
Expose virtual columns on ParquetSource / ParquetOpener, e.g.:
let source = ParquetSource::new(schema)
    .with_virtual_columns(vec![row_number_field]); // RowNumber-extension field(s)
…and have ParquetOpener forward them to ParquetRecordBatchStreamBuilder::with_virtual_columns(...) and include them in the projected output schema, so the rest of the existing pruning/row-filter/projection logic is reused unchanged.
Describe alternatives you've considered
- Re-implementing a custom FileOpener that builds the stream with with_virtual_columns (our current downstream approach: it works, but duplicates a lot of well-tested opener logic and is a maintenance burden).
- A reader-factory hook — insufficient, since virtual columns are configured on the stream builder, not the reader.
Additional context
Downstream consumer: Apache DataFusion Comet's native Delta Lake scan (apache/datafusion-comet#4366). We'd be happy to contribute a PR if the API shape above is agreeable.