@harshmotw-db
Contributor

What changes were proposed in this pull request?

This PR enables the annotation of the variant parquet logical type and shredded writes and reads by default.

Why are the changes needed?

  1. The Parquet variant spec (source) requires variant data to be annotated with the variant logical type, so Spark must write this annotation to adhere to the spec.
  2. Variant shredding offers significant performance improvements over unshredded variants and should be the default mode.

Does this PR introduce any user-facing change?

Yes. Variant data written by Spark is now annotated with the variant logical type, and variant shredding is enabled by default.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Nov 21, 2025
dongjoon-hyun previously approved these changes Nov 22, 2025
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1 for Apache Spark 4.2.0. Thank you, @harshmotw-db. Please make the CI happy.

@dongjoon-hyun dongjoon-hyun dismissed their stale review November 22, 2025 17:11

CI is still failing.

@HyukjinKwon HyukjinKwon changed the title from "[SPARK-54454] Enable variant shredding and variant logical type annotation configs by default" to "[SPARK-54454][SQL] Enable variant shredding and variant logical type annotation configs by default" on Nov 23, 2025
  .version("4.1.0")
  .booleanConf
- .createWithDefault(false)
+ .createWithDefault(true)
Contributor

I think we should at least enable this config in 4.1, to conform to the Parquet spec. If many people start to use variant type with Spark 4.1, then the entire ecosystem may be forced to support the Spark-specific variant type in Parquet.

@harshmotw-db can we open a separate PR for this config? And also cc @dongjoon-hyun

Member

@dongjoon-hyun dongjoon-hyun left a comment

Yes. To re-evaluate the scope, please make working PRs that pass all CI checks, @harshmotw-db and @cloud-fan.

A release manager cannot make any decision on a broken proposal, whether for branch-4.1 or for master.

@harshmotw-db
Contributor Author

@dongjoon-hyun @cloud-fan CI should be happy now. Only one test failure was a real issue: "SPARK-48067: default variant columns works" in VariantSuite. The other 31 failures came from tests that assumed specific config values.
The fix for the real issue is in these commits. When pushing variant into the scan for shredded reads, we transform it into a struct and apply GetStructField on it to extract the relevant fields. However, this didn't fare well when variants had a default value, because default resolution expects the variant to be in its native VariantType representation rather than in a struct.
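
The mismatch described above can be illustrated with a toy model. This is only a sketch: it does not use Spark's real classes, and every name in it (`VariantDefaultSketch`, the simplified `DataType` hierarchy, `pushVariantIntoScan`, `resolveDefault`) is a hypothetical stand-in chosen for illustration. It shows only why default resolution that expects the native variant type breaks once the scan has rewritten the column into a struct.

```scala
// Toy model of the type mismatch. None of these definitions are Spark's
// actual classes or methods; they are hypothetical stand-ins.
object VariantDefaultSketch {
  sealed trait DataType
  case object VariantType extends DataType
  final case class StructType(fieldNames: List[String]) extends DataType

  // Stand-in for the scan rewrite: a variant column becomes a struct
  // holding only the requested fields.
  def pushVariantIntoScan(dt: DataType, requestedFields: List[String]): DataType =
    dt match {
      case VariantType => StructType(requestedFields)
      case other       => other
    }

  // Stand-in for default-value resolution that only understands the
  // native variant representation.
  def resolveDefault(columnType: DataType): Either[String, String] =
    columnType match {
      case VariantType =>
        Right("default variant value")
      case StructType(fields) =>
        Left(s"expected VariantType, found struct(${fields.mkString(", ")})")
    }
}
```

In this model, `resolveDefault(VariantType)` succeeds, while `resolveDefault(pushVariantIntoScan(VariantType, List("a", "b")))` returns a `Left`, mirroring the failure seen in the test. The actual fix is the one in the linked commits, not anything shown here.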

Member

@dongjoon-hyun dongjoon-hyun left a comment

@harshmotw-db, please spin off all other code changes from this PR, because mixing them in is very misleading. Technically, a config change PR should contain only the flag switch.

In short,

  1. Restrict this PR to the SQLConf.scala change.
  2. Make a new PR containing all other changes (Cast.scala, ResolveDefaultColumnsUtil.scala, *Suite.scala).

We had better proceed with (2) first.

@harshmotw-db
Contributor Author

@dongjoon-hyun @cloud-fan This PR has the non-config changes.

cloud-fan added a commit that referenced this pull request Nov 26, 2025
### What changes were proposed in this pull request?

[This PR](#53164) enables shredding and variant logical type annotation configs by default. However, some test suites assume the old behavior. This PR fixes those tests to also work with the new default configs.

This PR also fixes a bug we discovered in the previous PR where variant default resolution would fail when pushVariantIntoScan was enabled.

### Why are the changes needed?

To fix the bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53224 from harshmotw-db/harshmotw-db/shredding_fixes.

Lead-authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Nov 26, 2025
(cherry picked from commit d36bd62)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Contributor

@harshmotw-db let's update this PR to be config flipping only.

@dongjoon-hyun
Member

+1 for @cloud-fan's advice.

@harshmotw-db harshmotw-db force-pushed the harshmotw-db/enable_variant_shredding branch from 1c8d539 to 23e6a1d Compare November 26, 2025 23:42
@harshmotw-db
Contributor Author

@dongjoon-hyun I had missed two changes in the other PR, so I have put them in this PR. Once that PR is merged, this PR can contain only the flag flips. For now, this PR also contains those minor test changes. If you are okay with merging this PR with those changes, you can go ahead.
