fix: fall back for shredded Variant scans on Spark 4.0 #4084

Merged
mbutrovich merged 1 commit into apache:main from andygrove:worktree-unignore-variant-shredding on Apr 26, 2026

@andygrove
Member

Which issue does this PR close?

Closes #2209.

Rationale for this change

Spark 4.0's PushVariantIntoScan optimizer rewrites a VariantType column into a StructType whose fields each carry __VARIANT_METADATA_KEY metadata, then pushes variant_get paths down as ordinary struct field accesses. By the time CometScanRule runs, the requiredSchema looks like a normal struct of primitives, so Comet scans natively but does not honor the on-disk Parquet variant shredding layout. The result is silent data corruption: typed paths read back as nulls (or, with some shapes, a hard FAILED_READ_FILE).

This is a data correctness issue, so Comet needs to fall back to Spark for these reads instead of leaving the affected suites ignored.
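The failure mode described above can be illustrated with a minimal sketch. It uses plain Scala stand-ins (`Field`, `Struct`, `looksLikePlainStruct`), all hypothetical, rather than Spark's real schema classes; only the `__VARIANT_METADATA_KEY` name comes from this PR. The non-empty requirement and the metadata values are assumptions made for illustration.

```scala
// Hypothetical stand-ins for Spark's schema classes (not the real API).
sealed trait DataType
case object IntType extends DataType
case object StringType extends DataType
final case class Field(
    name: String,
    dataType: DataType,
    metadata: Map[String, String] = Map.empty)
final case class Struct(fields: Seq[Field]) extends DataType

object VariantMarkerDemo {
  // Marker key name as given in the PR description.
  val VariantMetadataKey = "__VARIANT_METADATA_KEY"

  // A check that inspects only data types and ignores field metadata.
  def looksLikePlainStruct(s: Struct): Boolean =
    s.fields.forall(f => f.dataType == IntType || f.dataType == StringType)

  // Every field must carry the marker. Requiring at least one field is an
  // assumption of this sketch.
  def isVariantStruct(s: Struct): Boolean =
    s.fields.nonEmpty && s.fields.forall(_.metadata.contains(VariantMetadataKey))

  // An ordinary struct column with no variant marker.
  val ordinary: Struct = Struct(Seq(Field("a", IntType), Field("b", StringType)))

  // A struct shaped like the output of the variant rewrite. The metadata
  // values are placeholders; the real encoding is not modeled here.
  val rewritten: Struct = Struct(Seq(
    Field("0", IntType, Map(VariantMetadataKey -> "placeholder")),
    Field("1", StringType, Map(VariantMetadataKey -> "placeholder"))))
}
```

Both structs pass the type-only check, so a rule that looks only at data types cannot tell them apart; only the metadata check distinguishes the rewritten struct.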

What changes are included in this PR?

  • CometTypeShim (Spark 4.0): add isVariantStruct that delegates to Spark's VariantMetadata.isVariantStruct, which checks for the __VARIANT_METADATA_KEY marker on every field.
  • CometTypeShim (Spark 3.x): stub returning false; variant shredding does not exist pre-4.0.
  • CometScanTypeChecker.isTypeSupported: add a case s: StructType if isVariantStruct(s) => false arm with a fallback reason, so both auto/native_datafusion and native_iceberg_compat scan paths fall back to Spark on shredded Variant reads.
  • dev/diffs/4.0.1.diff: stop ignoring VariantShreddingSuite and ParquetVariantShreddingSuite.
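The shim and type-checker changes listed above might be sketched roughly as follows. This is not the code from the PR: `TypeShim`, `Spark3Shim`, `Spark4Shim`, `ScanTypeChecker`, the stand-in schema classes, the method signature, and the fallback message text are all hypothetical, and the Spark 4.0 shim reimplements the marker check locally instead of delegating to `VariantMetadata.isVariantStruct`.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical stand-ins for Spark's schema classes (not the real API).
sealed trait DataType
case object IntType extends DataType
final case class Field(
    name: String,
    dataType: DataType,
    metadata: Map[String, String] = Map.empty)
final case class Struct(fields: Seq[Field]) extends DataType

// Version-specific shim: each Spark profile supplies its own implementation.
trait TypeShim {
  def isVariantStruct(s: Struct): Boolean
}

// Spark 3.x: variant shredding does not exist, so the stub always says no.
object Spark3Shim extends TypeShim {
  def isVariantStruct(s: Struct): Boolean = false
}

// Spark 4.0: the PR delegates to Spark; this sketch checks the marker itself.
object Spark4Shim extends TypeShim {
  private val Key = "__VARIANT_METADATA_KEY"
  def isVariantStruct(s: Struct): Boolean =
    s.fields.nonEmpty && s.fields.forall(_.metadata.contains(Key))
}

final class ScanTypeChecker(shim: TypeShim) {
  // Returns false and records a reason when the native scan must not be used.
  def isTypeSupported(dt: DataType, fallbackReasons: ListBuffer[String]): Boolean =
    dt match {
      case s: Struct if shim.isVariantStruct(s) =>
        fallbackReasons += "shredded Variant struct is not supported by the native scan"
        false
      case s: Struct =>
        s.fields.forall(f => isTypeSupported(f.dataType, fallbackReasons))
      case _ => true
    }
}
```

Placing the variant arm before the general struct arm matters: otherwise the struct would be accepted field by field, because each field on its own looks like a supported primitive.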

How are these changes tested?

Ran sql/testOnly org.apache.spark.sql.VariantShreddingSuite org.apache.spark.sql.execution.datasources.parquet.ParquetVariantShreddingSuite against patched Spark v4.0.1 with ENABLE_COMET=true ENABLE_COMET_ONHEAP=true:

  • COMET_PARQUET_SCAN_IMPL=auto: 13/13 pass (was 5/13).
  • COMET_PARQUET_SCAN_IMPL=native_iceberg_compat: 13/13 pass (was 5/13).

The Spark SQL test workflows already cover both scan impls on every PR, so the unignored suites give us ongoing protection against regressions.

Spark 4.0's PushVariantIntoScan rewrites VariantType columns into a
StructType whose fields carry __VARIANT_METADATA_KEY metadata, then
pushes variant_get paths down as ordinary struct field accesses. By the
time CometScanRule runs, the requiredSchema looks like a normal struct
of primitives, so Comet scans natively but does not honor the on-disk
variant shredding layout, returning nulls for typed paths.

Detect the marker via VariantMetadata.isVariantStruct in the Spark 4.0
type shim and reject those structs in CometScanTypeChecker so the scan
falls back to Spark. Stop ignoring VariantShreddingSuite and
ParquetVariantShreddingSuite in the 4.0.1 diff.

Closes apache#2209.
@andygrove added the spark 4, correctness, and bug labels on Apr 25, 2026
Contributor

@mbutrovich left a comment

LGTM, thanks @andygrove!

@mbutrovich merged commit fdf00d4 into apache:main on Apr 26, 2026
170 of 171 checks passed
@andygrove deleted the worktree-unignore-variant-shredding branch on April 26, 2026 14:37
andygrove added a commit to andygrove/datafusion-comet that referenced this pull request Apr 26, 2026
Mirrors the spark-4.0 CometTypeShim helper that apache#4084 added to the scan rule. VariantMetadata.isVariantStruct exists in Spark 4.1.1 (in PushVariantIntoScan.scala) so the implementation is identical to spark-4.0.

Linked issue: [native_iceberg_compat] VariantShreddingSuite test failures with Spark 4.0.0