@harshmotw-db
Contributor

What changes were proposed in this pull request?

This PR changes the Parquet writer so that it annotates variant columns with the Parquet Variant logical type annotation.

Why are the changes needed?

The Parquet spec has formally adopted the Variant logical type. Spark 4.1.0 depends on parquet-java 1.16.0, which contains the Variant logical type annotation, so Variant columns must be properly annotated.

This change is hidden behind a flag that is disabled by default until read support can be properly implemented.
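
As a rough illustration of the flag-gated behavior described above, here is a minimal, self-contained sketch. It does not use Spark's or parquet-java's real classes: `ColType`, `VariantAnnotationSketch`, `render`, and the `annotate` parameter are all hypothetical names standing in for the writer and its flag, and the schema text it prints (including the `value`/`metadata` field order) is only meant to resemble a Parquet schema.

```scala
// Hypothetical, simplified model of column types; not Spark's real classes.
sealed trait ColType
case object VariantCol extends ColType
case object LongCol extends ColType
final case class StructCol(fields: Seq[(String, ColType)]) extends ColType

object VariantAnnotationSketch {
  // Render a column as Parquet-schema-like text. `annotate` stands in for the
  // writer flag mentioned above (assumption: the real config key is not shown here).
  def render(name: String, tpe: ColType, annotate: Boolean, indent: String = ""): String =
    tpe match {
      case LongCol =>
        s"${indent}optional int64 $name;"
      case VariantCol =>
        // Only add the annotation when the flag is on; spec version 1 is assumed.
        val annotation = if (annotate) " (VARIANT(1))" else ""
        Seq(
          s"${indent}optional group $name$annotation {",
          s"$indent  required binary value;",
          s"$indent  required binary metadata;",
          s"$indent}"
        ).mkString("\n")
      case StructCol(fields) =>
        // Recurse so that nested variants are annotated as well as top-level ones.
        val body = fields.map { case (n, t) => render(n, t, annotate, indent + "  ") }
        (s"${indent}optional group $name {" +: body :+ s"$indent}").mkString("\n")
    }
}
```

With `annotate = false` the variant group is rendered as a plain group of two binary fields, which is the shape that readers without Variant support can still treat as an ordinary struct.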

Does this PR introduce any user-facing change?

Yes. Parquet files written by Spark 4.1.0 with the flag enabled (which it eventually will be by default) could contain the variant logical type annotation, which readers without support for the type will not be able to read.

How was this patch tested?

A unit test checks that nested as well as top-level variants are properly annotated and that the data is written correctly.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Nov 11, 2025
convertInternal(groupColumn, sparkReadType.map(_.asInstanceOf[StructType]))) {
// Temporary workaround to read Shredded variant data
case v: VariantLogicalTypeAnnotation if v.getSpecVersion == 1 && sparkReadType.isEmpty =>
convertInternal(groupColumn, None)
Contributor Author

@chenhao-db I have added this temporary workaround on the reader side so that tests can scan a table containing Variant data when the VARIANT_ALLOW_READING_SHREDDED config is set to true.

@harshmotw-db
Contributor Author

@cashmand @chenhao-db @cloud-fan Can you please look at this PR? Thanks!

Contributor

@cashmand cashmand left a comment

Thanks for making this change!

.booleanConf
.createWithDefault(false)

val PARQUET_WRITE_VARIANT_SPEC_VERSION =
Contributor

It seems a bit strange for this to be a conf for now. I don't think we should allow writing a version that Spark doesn't know how to write, and currently the only valid spec version is 1, so if we have a conf, it should only accept "1" as a valid setting. Is there some use for this conf that I'm missing?

Contributor Author

Actually, yeah, I agree. It shouldn't be based on a config. I'll fix it.
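
For reference, the constraint raised in this thread (only spec version 1 is valid) could be sketched roughly as follows. This is a standalone illustration, not Spark's config API: `VariantSpecVersion`, `Supported`, and `validate` are hypothetical names, and the thread's actual resolution was to drop the conf rather than validate it.

```scala
object VariantSpecVersion {
  // The only spec version mentioned as valid in the review thread.
  val Supported: Int = 1

  // Hypothetical validator: accept a raw setting only if it parses to the supported version.
  def validate(raw: String): Either[String, Int] =
    scala.util.Try(raw.trim.toInt).toOption match {
      case Some(v) if v == Supported => Right(v)
      case Some(v) => Left(s"Unsupported variant spec version: $v (only $Supported is valid)")
      case None => Left(s"Not an integer: '$raw'")
    }
}
```

A conf that can only ever hold one value adds little over the `Supported` constant itself, which is the point made in the review.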

@cloud-fan
Contributor

thanks, merging to master/4.1!

@cloud-fan cloud-fan closed this in 5270c99 Nov 12, 2025
cloud-fan pushed a commit that referenced this pull request Nov 12, 2025
…tation

Closes #53005 from harshmotw-db/harshmotw-db/variant_annotation_write.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 5270c99)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
dongjoon-hyun pushed a commit that referenced this pull request Nov 21, 2025
…tation

### What changes were proposed in this pull request?

[This PR](#53005) introduced a fix where the Spark parquet writer would annotate variant columns with the parquet variant logical type. The PR had an ad-hoc fix on the reader side for validation. This PR formally allows Spark to read parquet files with the Variant logical type.

The PR also introduces an unrelated fix in ParquetRowConverter to allow Spark to read variant columns regardless of which order the value and metadata fields are stored in.
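
The order-independence idea can be sketched as a lookup by field name instead of by position. The snippet below is a simplified, self-contained illustration and not the actual `ParquetRowConverter` code: `GroupField`, `VariantFieldLookup`, and `extract` are hypothetical names, and byte arrays stand in for the real column values.

```scala
// Hypothetical stand-in for one child field of a variant group.
final case class GroupField(name: String, bytes: Array[Byte])

object VariantFieldLookup {
  // Resolve `value` and `metadata` by name, so the physical field order does not matter.
  // Returns (value, metadata) on success, or an error message if either field is missing.
  def extract(fields: Seq[GroupField]): Either[String, (Array[Byte], Array[Byte])] = {
    val byName = fields.map(f => f.name -> f.bytes).toMap
    (byName.get("value"), byName.get("metadata")) match {
      case (Some(v), Some(m)) => Right((v, m))
      case _ =>
        Left(s"Expected fields 'value' and 'metadata', found: ${fields.map(_.name).mkString(", ")}")
    }
  }
}
```

Calling `extract` with the two fields in either order yields the same `(value, metadata)` pair, whereas a positional lookup would swap them.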

### Why are the changes needed?

The variant logical type annotation has formally been adopted as part of the Parquet spec and is part of the parquet-java 1.16.0 library. Therefore, Spark should be able to read files containing data annotated as such.

### Does this PR introduce _any_ user-facing change?

Yes, it allows users to read parquet files with the variant logical type annotation.

### How was this patch tested?

Existing test from [this PR](#53005) where we wrote data of the variant logical type and tested read using an ad-hoc solution.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #53120 from harshmotw-db/harshmotw-db/variant_annotation_write.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Nov 21, 2025
…tation

Closes #53120 from harshmotw-db/harshmotw-db/variant_annotation_write.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit da7389b)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
…tation

Closes apache#53005 from harshmotw-db/harshmotw-db/variant_annotation_write.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 25, 2025
…tation

Closes apache#53120 from harshmotw-db/harshmotw-db/variant_annotation_write.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
…tation

Closes apache#53005 from harshmotw-db/harshmotw-db/variant_annotation_write.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
…tation

Closes apache#53120 from harshmotw-db/harshmotw-db/variant_annotation_write.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>