feat: Plumb Parquet virtual columns (row_number) through TableSchema and ParquetOpener #22026

mbutrovich wants to merge 7 commits into apache:main
Conversation
…and ParquetOpener, gated behind a tested-only extension-type allowlist, to unblock Comet's native-DataFusion support for Spark's _tmp_metadata_row_index.
My main concern is #22026 (comment). The various schemas in

Thanks for the review @adriangb! Agreed it could make things more complicated, but if DataFusion is ever going to support these virtual columns it might be unavoidable. I think it's good to hash this stuff out in the smallest possible PR at the opener level. I'll push an update later today.
Thanks again for the review @adriangb! Hopefully I addressed all of the feedback, but happy to keep chatting about it.

- Mixed virtual/file predicates with
- Confirmed the silent-drop bug with failing tests. Root cause: Arrow-rs can't accept virtual-column refs in a
- Fix: added
- Defense-in-depth in the opener for callers who bypass the optimizer (e.g. manual plan builders):
- Tests:
- Ordering doc on
- Struct field doc now spells out the
- Enum +
- Added
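The defense-in-depth check described here can be sketched as a toy, with names that are illustrative rather than DataFusion's actual opener internals: if pushdown is enabled and the predicate references any virtual column, fail fast with an actionable error instead of silently producing wrong results.

```rust
/// Toy guard (illustrative, not DataFusion's API): reject a pushed-down
/// predicate that references any virtual column by name.
fn check_predicate_columns(
    predicate_columns: &[&str],
    virtual_columns: &[&str],
    pushdown_filters: bool,
) -> Result<(), String> {
    if !pushdown_filters {
        // Nothing is pushed into the parquet reader, so there is no risk
        // of the reader seeing a virtual-column reference.
        return Ok(());
    }
    let offending: Vec<&str> = predicate_columns
        .iter()
        .copied()
        .filter(|c| virtual_columns.contains(c))
        .collect();
    if offending.is_empty() {
        Ok(())
    } else {
        Err(format!(
            "predicate references virtual column(s) {offending:?} with \
             pushdown_filters=true; disable pushdown or keep these \
             conjuncts in a FilterExec above the scan"
        ))
    }
}
```

The remediation message mirrors the PR's intent: callers that bypass the optimizer get told what to change rather than getting silently wrong row numbers.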
I think this would then have a negative interaction with the goal of turning filter pushdown on by default. Maybe we'll always have to apply some filters as a

Comet conservatively never removes

Wouldn't this only prevent filter pushdown for filters that reference virtual columns?

Yeah, but it means we'll have to keep the split forever. Which might have been the case anyway, and maybe a non-issue. And any filter that does reference virtual columns cannot be pushed down, even if a part of it would benefit from doing so, e.g.
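The split under discussion can be sketched with toy expression types (everything here is illustrative, not DataFusion's actual `Expr` or planner API): a conjunctive predicate is flattened into conjuncts, the conjuncts that touch only file columns get pushed down, and anything referencing a virtual column stays in a `FilterExec` above the scan.

```rust
/// Toy expression tree (illustrative only).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Eq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// Collect every column name referenced by this expression.
    fn columns(&self, out: &mut Vec<String>) {
        match self {
            Expr::Column(name) => out.push(name.clone()),
            Expr::Literal(_) => {}
            Expr::Eq(l, r) | Expr::And(l, r) => {
                l.columns(out);
                r.columns(out);
            }
        }
    }
}

/// Split a predicate into (pushable, retained) conjuncts: a conjunct is
/// pushable only if it references no virtual column.
fn split_conjuncts(expr: &Expr, virtual_cols: &[&str]) -> (Vec<Expr>, Vec<Expr>) {
    // Flatten nested ANDs into a flat list of conjuncts.
    fn flatten(expr: &Expr, out: &mut Vec<Expr>) {
        if let Expr::And(l, r) = expr {
            flatten(l, out);
            flatten(r, out);
        } else {
            out.push(expr.clone());
        }
    }
    let mut conjuncts = Vec::new();
    flatten(expr, &mut conjuncts);

    let (mut pushable, mut retained) = (Vec::new(), Vec::new());
    for c in conjuncts {
        let mut cols = Vec::new();
        c.columns(&mut cols);
        if cols.iter().any(|c| virtual_cols.contains(&c.as_str())) {
            retained.push(c);
        } else {
            pushable.push(c);
        }
    }
    (pushable, retained)
}
```

Note this split only helps `AND`s; a single `OR` mixing file and virtual columns must be retained whole, which is the unfavorable case raised above.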
I plan to give this another review tomorrow.
run benchmark tpch tpcds |
@mbutrovich from a high-level perspective, how
🤖 Benchmark running (GKE): comparing virtual-columns-table-schema (bd513ec) to 2c7af17 (merge-base) using tpcds.

🤖 Benchmark running (GKE): comparing virtual-columns-table-schema (bd513ec) to 2c7af17 (merge-base) using tpch.

🤖 Benchmark completed (GKE): tpch, base (merge-base) vs. branch.

🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs. branch.
Which issue does this PR close?

`_tmp_metadata_row_index`).

Rationale for this change

arrow-rs 57.1.0+ supports Parquet virtual columns (`row_number`, `row_group_index`) via `ArrowReaderOptions::with_virtual_columns`, and DataFusion pins a new-enough arrow-rs for the API to be available. DataFusion does not yet plumb the option through `ParquetOpener`, so consumers (notably Comet) cannot project Spark's `_tmp_metadata_row_index` through the `native_datafusion` scan path.

This PR adds the minimal opener-boundary plumbing so `TableSchema` can carry virtual columns and the Parquet reader produces them. UX / SQL-layer surface for virtual columns stays deferred to the epic in #20135; this follows the same framing alamb blessed for #20071 (the `input_file_name()` UDF).

What changes are included in this PR?
- `TableSchema::with_virtual_columns(...)` builder + `virtual_columns()` getter. Layout: `[file, partition, virtual]`. Composable with `with_table_partition_cols` in either order.
- `TableSchema::schema_without_virtual_columns()`: the file + partition schema used by pushdown-planning paths that can't evaluate virtual-col refs.
- `ParquetOpener` forwards the fields to `ArrowReaderOptions::with_virtual_columns`; augments the schemas passed to the expr-adapter / simplifier with virtual fields so virtual-col refs identity-rewrite; strips them from the projection fed to `ProjectionMask::roots` (which only understands file columns) and appends them to `stream_schema` so `reassign_expr_columns` resolves them by name.
- `ParquetVirtualColumn` enum with `TryFrom<&FieldRef>` (in `datasource-parquet::virtual_column`) gates which arrow-rs virtual extension types are accepted. Currently only `RowNumber`; adding a variant (e.g. `RowGroupIndex`) is a compile-time obligation. Replaces the earlier runtime string-allowlist so the contract lives in the type system.
- `ParquetSource::try_pushdown_filters` classifies filters against the file+partition schema (not the full table schema) so predicates referencing virtual columns are reported as `PushedDown::No` and the `FilterExec` stays above the scan: arrow-rs's `RowFilter` addresses parquet leaves only and can't evaluate virtual-column refs, so silently pushing them would produce wrong results.
- `prepare_open_file` rejects `pushdown_filters=true` + a predicate that references a virtual column, with a clear remediation message. This catches callers that bypass the optimizer and set the predicate on `ParquetSource` directly.
- `arrow-schema` added as a direct dep (previously transitive via `arrow`) so the enum references `RowNumber::NAME` from arrow-rs instead of hardcoding the string.
- Deferred: `ListingTable` / SQL-layer surface, a three-arg constructor on `TableSchema`, `ParquetSource::with_virtual_columns`, and `RowGroupIndex` support.

Are these changes tested?
Yes. New unit tests in `opener.rs`:

- `test_row_index_basic`: single row group, select data + row_number.
- `test_row_index_projection_only`: select only row_number.
- `test_row_index_multi_row_group`: 3 × 100 rows, verify absolute 0..300 across boundaries.
- `test_row_index_with_row_group_skip`: predicate stats-prunes the middle row group; verify row numbers stay absolute (0..100 ++ 200..300). Critical correctness gate for Spark (and for "Fix `RowNumberReader` when not all row groups are selected", arrow-rs#8863).
- `test_row_index_with_partition_cols`: partition + virtual + data columns compose correctly.
- `test_row_index_nullable_int64`: nullability flag flows through unchanged (matches Spark's `_tmp_metadata_row_index` declaration).
- `test_unsupported_virtual_extension_type_rejected`: using `RowGroupIndex` (a real arrow-rs type deliberately not in the enum yet) errors with `NotImplemented` instead of silently forwarding.
- `test_row_index_predicate_pushdown_mixed_or_errors` / `_virtual_only_errors` / `_allowed_when_pushdown_disabled`: exercise the opener's defensive check for virtual-col predicate refs with `pushdown_filters=true`, and confirm the `pushdown_filters=false` path is unaffected.

In `source.rs`: `test_try_pushdown_filters_rejects_virtual_column_refs` pins the planner-boundary contract: file-col filters are `PushedDown::Yes`, virtual-only and mixed filters are `PushedDown::No`.

In `virtual_column.rs`: unit tests covering `TryFrom<&FieldRef>` for valid, missing-extension-type, and unsupported-extension-type inputs.

Plus a `TableSchema` unit test verifying the `[file, partition, virtual]` layout is stable regardless of builder-call order.

Are there any user-facing changes?
Public API additions: `TableSchema::with_virtual_columns(...)`, `TableSchema::virtual_columns()`, `TableSchema::schema_without_virtual_columns()`, and `ParquetVirtualColumn` (re-exported from `datafusion-datasource-parquet`). No existing API changed; no breaking changes.
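The compile-time-allowlist pattern behind `ParquetVirtualColumn` can be sketched with toy types. The extension-type string below is a placeholder chosen for illustration, not necessarily arrow-rs's actual `RowNumber::NAME`, and the `TryFrom<&str>` signature stands in for the PR's `TryFrom<&FieldRef>`.

```rust
/// Illustrative allowlist enum (not DataFusion's actual type): each
/// accepted virtual extension type is a variant, so support is encoded
/// in the type system rather than a runtime string set.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ParquetVirtualColumn {
    RowNumber,
    // Adding e.g. RowGroupIndex here becomes a compile-time obligation:
    // every exhaustive `match` on this enum must then handle it.
}

impl TryFrom<&str> for ParquetVirtualColumn {
    type Error = String;

    fn try_from(extension_name: &str) -> Result<Self, Self::Error> {
        match extension_name {
            // Placeholder name; the PR reads the real one from
            // arrow-rs's RowNumber::NAME instead of hardcoding it.
            "arrow.virtual.row_number" => Ok(ParquetVirtualColumn::RowNumber),
            other => Err(format!("unsupported virtual extension type: {other}")),
        }
    }
}
```

An unrecognized extension type errors instead of being silently forwarded to the reader, which is the same contract the `test_unsupported_virtual_extension_type_rejected` test pins down.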