fix: fall back for shredded Variant scans on Spark 4.0 by andygrove · Pull Request #4084 · apache/datafusion-comet

andygrove · 2026-04-25T15:24:59Z

Which issue does this PR close?

Closes #2209.

Rationale for this change

Spark 4.0's PushVariantIntoScan optimizer rewrites a VariantType column into a StructType whose fields each carry __VARIANT_METADATA_KEY metadata, then pushes variant_get paths down as ordinary struct field accesses. By the time CometScanRule runs, the requiredSchema looks like a normal struct of primitives, so Comet scans natively but does not honor the on-disk Parquet variant shredding layout. The result is silent data corruption: typed paths read back as nulls (or, with some shapes, a hard FAILED_READ_FILE).

This is a data correctness issue, so we need to fall back to Spark for these reads rather than continuing to ignore the suites.

What changes are included in this PR?

CometTypeShim (Spark 4.0): add isVariantStruct that delegates to Spark's VariantMetadata.isVariantStruct, which checks for the __VARIANT_METADATA_KEY marker on every field.
CometTypeShim (Spark 3.x): stub returning false; variant shredding does not exist pre-4.0.
CometScanTypeChecker.isTypeSupported: add a case s: StructType if isVariantStruct(s) => false arm with a fallback reason, so both auto/native_datafusion and native_iceberg_compat scan paths fall back to Spark on shredded Variant reads.
dev/diffs/4.0.1.diff: stop ignoring VariantShreddingSuite and ParquetVariantShreddingSuite.
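
The marker check and the fallback arm described above can be modelled in a few lines. The following is a minimal, language-neutral sketch in Python, not the Scala code from this PR: the Field and StructType classes, the unsupported_reason helper, the reason string, and the metadata values are all hypothetical stand-ins. The non-empty guard and the recursion into nested structs are assumptions about how the real check and type checker behave.

```python
from dataclasses import dataclass, field
from typing import Optional

# Marker key named in the PR description.
VARIANT_METADATA_KEY = "__VARIANT_METADATA_KEY"


@dataclass
class Field:
    """Hypothetical stand-in for a Spark StructField."""
    name: str
    data_type: object  # a primitive type name (str) or a StructType
    metadata: dict = field(default_factory=dict)


@dataclass
class StructType:
    """Hypothetical stand-in for a Spark StructType."""
    fields: list


def is_variant_struct(s: StructType) -> bool:
    # True only when every field carries the marker key.
    # Assumption: an empty struct is not treated as a variant struct.
    return len(s.fields) > 0 and all(
        VARIANT_METADATA_KEY in f.metadata for f in s.fields
    )


def unsupported_reason(data_type: object) -> Optional[str]:
    """Return a fallback reason, or None if this model would scan natively."""
    if isinstance(data_type, StructType):
        if is_variant_struct(data_type):
            # Models the `case s: StructType if isVariantStruct(s) => false` arm.
            return "shredded Variant struct; falling back to Spark"
        # Assumption: nested fields are checked recursively.
        for f in data_type.fields:
            reason = unsupported_reason(f.data_type)
            if reason is not None:
                return reason
    return None


# A struct as it might look after the rewrite (metadata values are placeholders).
shredded = StructType([
    Field("f0", "int", {VARIANT_METADATA_KEY: "$.a"}),
    Field("f1", "string", {VARIANT_METADATA_KEY: "$.b"}),
])
# An ordinary struct of primitives, which should still scan natively.
plain = StructType([Field("a", "int"), Field("b", "string")])
# Only some fields carry the marker, so this is not a variant struct.
mixed = StructType([
    Field("a", "int", {VARIANT_METADATA_KEY: "$.a"}),
    Field("b", "string"),
])
```

In this model, unsupported_reason(shredded) returns a reason string while unsupported_reason(plain) returns None, so only the rewritten struct triggers the fallback. The check relies on field metadata because, as the rationale notes, the data types alone look like a normal struct of primitives.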

How are these changes tested?

Ran sql/testOnly org.apache.spark.sql.VariantShreddingSuite org.apache.spark.sql.execution.datasources.parquet.ParquetVariantShreddingSuite against patched Spark v4.0.1 with ENABLE_COMET=true ENABLE_COMET_ONHEAP=true:

COMET_PARQUET_SCAN_IMPL=auto: 13/13 pass (was 5/13).
COMET_PARQUET_SCAN_IMPL=native_iceberg_compat: 13/13 pass (was 5/13).

The Spark SQL test workflows already cover both scan impls on every PR, so the unignored suites give us ongoing protection against regressions.
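
To repeat the two local runs above, the environment and sbt task for each scan implementation can be assembled as follows. This is a sketch only: it prints the commands rather than executing them, the build_runs helper is hypothetical, and the build/sbt launcher path is an assumption about the patched Spark checkout. The suite names and environment variables are the ones quoted in this description.

```python
import shlex

# Suites un-ignored by this PR.
SUITES = [
    "org.apache.spark.sql.VariantShreddingSuite",
    "org.apache.spark.sql.execution.datasources.parquet.ParquetVariantShreddingSuite",
]

# Scan implementations exercised in the description.
SCAN_IMPLS = ["auto", "native_iceberg_compat"]


def build_runs(scan_impls=SCAN_IMPLS):
    """Return one (env, command) pair per scan implementation."""
    runs = []
    task = "sql/testOnly " + " ".join(SUITES)
    for impl in scan_impls:
        env = {
            "ENABLE_COMET": "true",
            "ENABLE_COMET_ONHEAP": "true",
            "COMET_PARQUET_SCAN_IMPL": impl,
        }
        # Assumption: sbt is started through the checkout's build/sbt launcher.
        runs.append((env, ["build/sbt", task]))
    return runs


if __name__ == "__main__":
    for env, cmd in build_runs():
        prefix = " ".join(f"{key}={value}" for key, value in env.items())
        print(prefix, shlex.join(cmd))
```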

Spark 4.0's PushVariantIntoScan rewrites VariantType columns into a StructType whose fields carry __VARIANT_METADATA_KEY metadata, then pushes variant_get paths down as ordinary struct field accesses. By the time CometScanRule runs, the requiredSchema looks like a normal struct of primitives, so Comet scans natively but does not honor the on-disk variant shredding layout, returning nulls for typed paths.

Detect the marker via VariantMetadata.isVariantStruct in the Spark 4.0 type shim and reject those structs in CometScanTypeChecker so the scan falls back to Spark.

Stop ignoring VariantShreddingSuite and ParquetVariantShreddingSuite in the 4.0.1 diff.

Closes apache#2209.

mbutrovich

LGTM, thanks @andygrove!

Mirrors the spark-4.0 CometTypeShim helper that apache#4084 added to the scan rule. VariantMetadata.isVariantStruct exists in Spark 4.1.1 (in PushVariantIntoScan.scala) so the implementation is identical to spark-4.0.

andygrove added the spark 4, correctness, and bug labels Apr 25, 2026

mbutrovich approved these changes Apr 26, 2026

mbutrovich merged commit fdf00d4 into apache:main Apr 26, 2026
170 of 171 checks passed

andygrove deleted the worktree-unignore-variant-shredding branch April 26, 2026 14:37

andygrove mentioned this pull request Apr 26, 2026

build: add spark-4.1 Maven profile and shim sources #4097

Open

mbutrovich merged 1 commit into apache:main from andygrove:worktree-unignore-variant-shredding
