feat: Plumb Parquet virtual columns (row_number) through TableSchema and ParquetOpener #22026

mbutrovich wants to merge 7 commits into apache:main
Conversation
…and ParquetOpener, gated behind a tested-only extension-type allowlist, to unblock Comet's native-DataFusion support for Spark's _tmp_metadata_row_index.
My main concern is #22026 (comment). The various schemas in

Thanks for the review @adriangb! Agreed it could make things more complicated, but if DataFusion is ever going to support these virtual columns it might be unavoidable. I think it's good to hash this stuff out in the smallest possible PR at the opener level. I'll push an update later today.
Thanks again for the review @adriangb! Hopefully I addressed all of the feedback, but happy to keep chatting about it.

- Mixed virtual/file predicates with
- Confirmed the silent-drop bug with failing tests. Root cause: Arrow-rs can't accept virtual-column refs in a
- Fix: added
- Defense-in-depth in the opener for callers who bypass the optimizer (e.g. manual plan builders):
- Tests:
- Ordering doc on
- Struct field doc now spells out the
- Enum +
- Added
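The defense-in-depth check described here can be sketched as a toy, with names that are illustrative rather than DataFusion's actual opener internals: if pushdown is enabled and the predicate references any virtual column, fail fast with an actionable error instead of silently producing wrong results.

```rust
/// Toy guard (illustrative, not DataFusion's API): reject a pushed-down
/// predicate that references any virtual column by name.
fn check_predicate_columns(
    predicate_columns: &[&str],
    virtual_columns: &[&str],
    pushdown_filters: bool,
) -> Result<(), String> {
    if !pushdown_filters {
        // Nothing is pushed into the parquet reader, so there is no risk
        // of the reader seeing a virtual-column reference.
        return Ok(());
    }
    let offending: Vec<&str> = predicate_columns
        .iter()
        .copied()
        .filter(|c| virtual_columns.contains(c))
        .collect();
    if offending.is_empty() {
        Ok(())
    } else {
        Err(format!(
            "predicate references virtual column(s) {offending:?} with \
             pushdown_filters=true; disable pushdown or keep these \
             conjuncts in a FilterExec above the scan"
        ))
    }
}
```

The remediation message mirrors the PR's intent: callers that bypass the optimizer get told what to change rather than getting silently wrong row numbers.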
I think this would then have a negative interaction with the goal of turning filter pushdown on by default. Maybe we'll always have to apply some filters as a

Comet conservatively never removes

Wouldn't this only prevent filter pushdown for filters that reference virtual columns?

Yeah, but it means we'll have to keep the split forever. Which might have been the case anyway, and maybe a non-issue. And any filter that does reference virtual columns cannot be pushed down, even if a part of it would benefit from doing so, e.g.
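The split under discussion can be sketched with toy expression types (everything here is illustrative, not DataFusion's actual `Expr` or planner API): a conjunctive predicate is flattened into conjuncts, the conjuncts that touch only file columns get pushed down, and anything referencing a virtual column stays in a `FilterExec` above the scan.

```rust
/// Toy expression tree (illustrative only).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Eq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// Collect every column name referenced by this expression.
    fn columns(&self, out: &mut Vec<String>) {
        match self {
            Expr::Column(name) => out.push(name.clone()),
            Expr::Literal(_) => {}
            Expr::Eq(l, r) | Expr::And(l, r) => {
                l.columns(out);
                r.columns(out);
            }
        }
    }
}

/// Split a predicate into (pushable, retained) conjuncts: a conjunct is
/// pushable only if it references no virtual column.
fn split_conjuncts(expr: &Expr, virtual_cols: &[&str]) -> (Vec<Expr>, Vec<Expr>) {
    // Flatten nested ANDs into a flat list of conjuncts.
    fn flatten(expr: &Expr, out: &mut Vec<Expr>) {
        if let Expr::And(l, r) = expr {
            flatten(l, out);
            flatten(r, out);
        } else {
            out.push(expr.clone());
        }
    }
    let mut conjuncts = Vec::new();
    flatten(expr, &mut conjuncts);

    let (mut pushable, mut retained) = (Vec::new(), Vec::new());
    for c in conjuncts {
        let mut cols = Vec::new();
        c.columns(&mut cols);
        if cols.iter().any(|c| virtual_cols.contains(&c.as_str())) {
            retained.push(c);
        } else {
            pushable.push(c);
        }
    }
    (pushable, retained)
}
```

Note this split only helps `AND`s; a single `OR` mixing file and virtual columns must be retained whole, which is the unfavorable case raised above.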
I plan to give this another review tomorrow.
run benchmark tpch tpcds |
@mbutrovich from a high-level perspective, how
🤖 Benchmark running (GKE): comparing virtual-columns-table-schema (bd513ec) to 2c7af17 (merge-base) using tpcds.

🤖 Benchmark running (GKE): comparing virtual-columns-table-schema (bd513ec) to 2c7af17 (merge-base) using tpch.

🤖 Benchmark completed (GKE): tpch, base (merge-base) vs. branch.

🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs. branch.
Which issue does this PR close?

`_tmp_metadata_row_index`).

Rationale for this change

arrow-rs 57.1.0+ supports Parquet virtual columns (`row_number`, `row_group_index`) via `ArrowReaderOptions::with_virtual_columns`, and DataFusion pins a new-enough arrow-rs for the API to be available. DataFusion does not yet plumb the option through `ParquetOpener`, so consumers (notably Comet) cannot project Spark's `_tmp_metadata_row_index` through the `native_datafusion` scan path.

This PR adds the minimal opener-boundary plumbing so `TableSchema` can carry virtual columns and the Parquet reader produces them. UX / SQL-layer surface for virtual columns stays deferred to the epic in #20135; this follows the same framing alamb blessed for #20071 (the `input_file_name()` UDF).

What changes are included in this PR?
- `TableSchema::with_virtual_columns(...)` builder + `virtual_columns()` getter. Layout: `[file, partition, virtual]`. Composable with `with_table_partition_cols` in either order.
- `TableSchema::schema_without_virtual_columns()`: the file + partition schema used by pushdown-planning paths that can't evaluate virtual-col refs.
- `ParquetOpener` forwards the fields to `ArrowReaderOptions::with_virtual_columns`; augments the schemas passed to the expr-adapter / simplifier with virtual fields so virtual-col refs identity-rewrite; strips them from the projection fed to `ProjectionMask::roots` (which only understands file columns) and appends them to `stream_schema` so `reassign_expr_columns` resolves them by name.
- `ParquetVirtualColumn` enum with `TryFrom<&FieldRef>` (in `datasource-parquet::virtual_column`) gates which arrow-rs virtual extension types are accepted. Currently only `RowNumber`; adding a variant (e.g. `RowGroupIndex`) is a compile-time obligation. Replaces the earlier runtime string-allowlist so the contract lives in the type system.
- `ParquetSource::try_pushdown_filters` classifies filters against the file+partition schema (not the full table schema) so predicates referencing virtual columns are reported as `PushedDown::No` and the `FilterExec` stays above the scan: arrow-rs's `RowFilter` addresses parquet leaves only and can't evaluate virtual-column refs, so silently pushing them would produce wrong results.
- `prepare_open_file` rejects `pushdown_filters=true` + a predicate that references a virtual column, with a clear remediation message. This catches callers that bypass the optimizer and set the predicate on `ParquetSource` directly.
- `arrow-schema` added as a direct dep (previously transitive via `arrow`) so the enum references `RowNumber::NAME` from arrow-rs instead of hardcoding the string.
- Deferred: `ListingTable` / SQL-layer surface, a three-arg constructor on `TableSchema`, `ParquetSource::with_virtual_columns`, and `RowGroupIndex` support.

Are these changes tested?
Yes. New unit tests in `opener.rs`:

- `test_row_index_basic`: single row group, select data + row_number.
- `test_row_index_projection_only`: select only row_number.
- `test_row_index_multi_row_group`: 3 × 100 rows, verify absolute 0..300 across boundaries.
- `test_row_index_with_row_group_skip`: predicate stats-prunes the middle row group; verify row numbers stay absolute (0..100 ++ 200..300). Critical correctness gate for Spark (and for "Fix `RowNumberReader` when not all row groups are selected", arrow-rs#8863).
- `test_row_index_with_partition_cols`: partition + virtual + data columns compose correctly.
- `test_row_index_nullable_int64`: nullability flag flows through unchanged (matches Spark's `_tmp_metadata_row_index` declaration).
- `test_unsupported_virtual_extension_type_rejected`: using `RowGroupIndex` (a real arrow-rs type deliberately not in the enum yet) errors with `NotImplemented` instead of silently forwarding.
- `test_row_index_predicate_pushdown_mixed_or_errors` / `_virtual_only_errors` / `_allowed_when_pushdown_disabled`: exercise the opener's defensive check for virtual-col predicate refs with `pushdown_filters=true`, and confirm the `pushdown_filters=false` path is unaffected.

In `source.rs`: `test_try_pushdown_filters_rejects_virtual_column_refs` pins the planner-boundary contract: file-col filters are `PushedDown::Yes`, virtual-only and mixed filters are `PushedDown::No`.

In `virtual_column.rs`: unit tests covering `TryFrom<&FieldRef>` for valid, missing-extension-type, and unsupported-extension-type inputs.

Plus a `TableSchema` unit test verifying the `[file, partition, virtual]` layout is stable regardless of builder-call order.

Are there any user-facing changes?
Public API additions: `TableSchema::with_virtual_columns(...)`, `TableSchema::virtual_columns()`, `TableSchema::schema_without_virtual_columns()`, and `ParquetVirtualColumn` (re-exported from `datafusion-datasource-parquet`). No existing API changed; no breaking changes.
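The compile-time-allowlist pattern behind `ParquetVirtualColumn` can be sketched with toy types. The extension-type string below is a placeholder chosen for illustration, not necessarily arrow-rs's actual `RowNumber::NAME`, and the `TryFrom<&str>` signature stands in for the PR's `TryFrom<&FieldRef>`.

```rust
/// Illustrative allowlist enum (not DataFusion's actual type): each
/// accepted virtual extension type is a variant, so support is encoded
/// in the type system rather than a runtime string set.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ParquetVirtualColumn {
    RowNumber,
    // Adding e.g. RowGroupIndex here becomes a compile-time obligation:
    // every exhaustive `match` on this enum must then handle it.
}

impl TryFrom<&str> for ParquetVirtualColumn {
    type Error = String;

    fn try_from(extension_name: &str) -> Result<Self, Self::Error> {
        match extension_name {
            // Placeholder name; the PR reads the real one from
            // arrow-rs's RowNumber::NAME instead of hardcoding it.
            "arrow.virtual.row_number" => Ok(ParquetVirtualColumn::RowNumber),
            other => Err(format!("unsupported virtual extension type: {other}")),
        }
    }
}
```

An unrecognized extension type errors instead of being silently forwarded to the reader, which is the same contract the `test_unsupported_virtual_extension_type_rejected` test pins down.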