[SPARK-54454][SQL] Enable variant shredding and variant logical type annotation configs by default #53164

harshmotw-db · 2025-11-21T23:16:20Z

What changes were proposed in this pull request?

This PR enables, by default, annotation of the Parquet variant logical type as well as shredded variant writes and reads.

Why are the changes needed?

Annotating variant data with the variant logical type is required by the Parquet variant spec (source), so enabling the annotation by default is necessary to adhere to the spec.
Variant shredding brings significant performance optimizations over regular unshredded variants, and should be the default mode.

Does this PR introduce any user-facing change?

Yes. Variant data written by Spark is now annotated with the variant logical type, and variant shredding is enabled by default.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No

dongjoon-hyun

+1 for Apache Spark 4.2.0. Thank you, @harshmotw-db . Please make the CI happy.

CI is still failing.

cloud-fan · 2025-11-24T06:31:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

      .version("4.1.0")
      .booleanConf
-      .createWithDefault(false)
+      .createWithDefault(true)
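
For readers unfamiliar with what a default flip implies, here is a minimal sketch in plain Scala. It is a toy model, not Spark's actual `ConfigBuilder` or `SQLConf`: `BoolConf`, `Session`, and the key `example.variant.someFlag` are hypothetical names used only for illustration. It shows that sessions which never set the key pick up the new default, while an explicit setting still wins.

```scala
// Toy model of a boolean config entry with a default value.
// These are NOT Spark classes; all names here are hypothetical.
final case class BoolConf(key: String, default: Boolean)

final class Session(overrides: Map[String, Boolean] = Map.empty) {
  // An explicit setting wins; otherwise the entry's default applies.
  def get(entry: BoolConf): Boolean =
    overrides.getOrElse(entry.key, entry.default)
}

object DefaultFlipDemo {
  // Hypothetical key, standing in for the flag changed in the diff.
  val before: BoolConf = BoolConf("example.variant.someFlag", default = false)
  val after: BoolConf = before.copy(default = true)

  def main(args: Array[String]): Unit = {
    val untouched = new Session()
    val pinnedOff = new Session(Map(after.key -> false))

    println(untouched.get(before)) // false: old default
    println(untouched.get(after))  // true: new default applies silently
    println(pinnedOff.get(after))  // false: explicit setting overrides the default
  }
}
```

The practical consequence is that only users who relied on the old default without setting the flag see a behavior change.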


I think we should at least enable this config in 4.1, to conform to the Parquet spec. If many people start to use variant type with Spark 4.1, then the entire ecosystem may be forced to support the Spark-specific variant type in Parquet.

@harshmotw-db can we open a separate PR for this config? And also cc @dongjoon-hyun

dongjoon-hyun

Yes. To re-evaluate the scope, please make working PRs that pass all the CIs successfully, @harshmotw-db and @cloud-fan .

A release manager cannot make any decision (not only for branch-4.1, but also for master) on a broken proposal.

harshmotw-db · 2025-11-25T06:21:02Z

@dongjoon-hyun @cloud-fan CI should be happy now. Only one test failure was a real issue: "SPARK-48067: default variant columns works" in VariantSuite. The other 31 failures came from tests that assumed the old config defaults.
The fix for the real issue is in these commits. When pushing variant into the scan for shredded reads, we transform it into a struct and apply GetStructField on it to extract the relevant fields. However, this didn't fare well when variants had a default value, because default resolution expects the variant to be in its native VariantType representation rather than in a struct.
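
To make the failure mode described above concrete, here is a minimal sketch in plain Scala. It is a toy model under stated assumptions, not the code in `ResolveDefaultColumnsUtil.scala` or `Cast.scala`, and not the fix in these commits: `ColType`, `VariantT`, `StructT`, `Column`, `pushIntoScan`, and `resolveDefault` are all hypothetical names.

```scala
// Toy types; these are NOT Spark's DataType classes.
sealed trait ColType
case object VariantT extends ColType
final case class StructT(fields: List[String]) extends ColType

final case class Column(name: String, colType: ColType, hasDefault: Boolean)

object PushVariantDemo {
  // Models the scan rewrite: a variant column becomes a struct
  // holding only the requested fields.
  def pushIntoScan(c: Column, requested: List[String]): Column =
    c.colType match {
      case VariantT => c.copy(colType = StructT(requested))
      case _        => c
    }

  // Models default resolution that only understands the native
  // variant representation.
  def resolveDefault(c: Column): Either[String, String] =
    if (!c.hasDefault) Right(s"${c.name}: no default")
    else c.colType match {
      case VariantT   => Right(s"${c.name}: default resolved as variant")
      case StructT(_) => Left(s"${c.name}: default expects variant, found struct")
    }

  def main(args: Array[String]): Unit = {
    val v = Column("v", VariantT, hasDefault = true)
    println(resolveDefault(v))                          // Right(...)
    println(resolveDefault(pushIntoScan(v, List("a")))) // Left(...)
  }
}
```

In this toy model the mismatch disappears if defaults are resolved against the original column type before the rewrite; whether the actual fix takes that route is not stated in this thread.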

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala

dongjoon-hyun

@harshmotw-db . Please spin off all other code changes from this PR because it's very misleading. Technically, the config change PR should focus only on the flag switch.

In short,

1. Make this PR have only the SQLConf.scala change.
2. Make a new PR containing all other changes (Cast.scala, ResolveDefaultColumnsUtil.scala, *Suite.scala).

We had better proceed with (2) first.

harshmotw-db · 2025-11-26T00:20:23Z

@dongjoon-hyun @cloud-fan This PR has the non-config changes.

### What changes were proposed in this pull request?

[This PR](#53164) enables shredding and variant logical type annotation configs by default. However, some test suites assume the old behavior. This PR fixes those tests to also work with the new default configs. This PR also fixes a bug we discovered in the previous PR where variant default resolution would fail when pushVariantIntoScan was enabled.

### Why are the changes needed?

To fix the bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53224 from harshmotw-db/harshmotw-db/shredding_fixes.

Lead-authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

### What changes were proposed in this pull request?

[This PR](#53164) enables shredding and variant logical type annotation configs by default. However, some test suites assume the old behavior. This PR fixes those tests to also work with the new default configs. This PR also fixes a bug we discovered in the previous PR where variant default resolution would fail when pushVariantIntoScan was enabled.

### Why are the changes needed?

To fix the bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53224 from harshmotw-db/harshmotw-db/shredding_fixes.

Lead-authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d36bd62)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

cloud-fan · 2025-11-26T13:15:20Z

@harshmotw-db let's update this PR to be config flipping only.

dongjoon-hyun · 2025-11-26T23:37:37Z

+1 for @cloud-fan 's advice.

harshmotw-db · 2025-11-27T00:05:59Z

@dongjoon-hyun I had missed two changes in the other PR, so I have put them in this PR. Once that PR is merged, this PR can simply contain the flag flips. Currently this PR also contains those minor test changes. If you are okay with merging this PR with those changes, you can go ahead.

sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala

.../scala/org/apache/spark/sql/execution/datasources/parquet/ParquetVariantShreddingSuite.scala

enable variant shredding configs by default

0b07a84

github-actions bot added the SQL label Nov 21, 2025

dongjoon-hyun previously approved these changes Nov 22, 2025

HyukjinKwon changed the title ~~[SPARK-54454] Enable variant shredding and variant logical type annotation configs by default~~ [SPARK-54454][SQL] Enable variant shredding and variant logical type annotation configs by default Nov 23, 2025

cloud-fan reviewed Nov 24, 2025

dongjoon-hyun reviewed Nov 24, 2025

harshmotw-db added 3 commits November 25, 2025 04:00

resolve unit test failures

3a601f8

fix default resolution for push_variant_into_scan

02ab583

fix

b65b85d

harshmotw-db requested review from cloud-fan and dongjoon-hyun November 25, 2025 06:21

fix

29a87b4