[SPARK-54306] Annotate Variant columns with Variant logical type annotation #53005
Conversation
        convertInternal(groupColumn, sparkReadType.map(_.asInstanceOf[StructType]))) {
      // Temporary workaround to read Shredded variant data
      case v: VariantLogicalTypeAnnotation if v.getSpecVersion == 1 && sparkReadType.isEmpty =>
        convertInternal(groupColumn, None)
@chenhao-db I have added this temporary workaround on the reader side so that tests can scan a table containing Variant data when the VARIANT_ALLOW_READING_SHREDDED config is set to true.
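The guard in the pattern match above can be sketched in plain Python. This is an illustrative sketch only, not Spark's implementation; the names `spec_version`, `should_read_as_plain_group`, and `allow_shredded_read` are hypothetical stand-ins for the Scala pattern-match guard shown in the diff:

```python
def should_read_as_plain_group(annotation, spark_read_type, allow_shredded_read):
    """Mirror of the reader-side workaround's condition (names hypothetical).

    Fall back to reading the raw Parquet group only when:
    - the workaround config is enabled,
    - the column carries a Variant logical type annotation with spec version 1,
    - and no explicit Spark read type was requested for the column.
    """
    return (
        allow_shredded_read
        and annotation is not None
        and annotation.spec_version == 1
        and spark_read_type is None
    )
```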
@cashmand @chenhao-db @cloud-fan Can you please look at this PR? Thanks!
cashmand left a comment
Thanks for making this change!
      .booleanConf
      .createWithDefault(false)

    val PARQUET_WRITE_VARIANT_SPEC_VERSION =
It seems a bit strange for this to be a conf for now. I don't think we should allow writing a version that Spark doesn't know how to write, and currently the only valid spec version is 1, so if we have a conf, it should only accept "1" as a valid setting. Is there some use for this conf that I'm missing?
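The reviewer's point is that if the conf exists at all, it should reject every value except "1", the only spec version Spark knows how to write. A minimal sketch of that validation, in illustrative Python (the function name and error message are hypothetical; Spark's actual conf builder uses `checkValue`-style validation in Scala):

```python
def validate_variant_spec_version(value: str) -> int:
    """Accept only spec versions Spark can actually write (currently just "1")."""
    allowed = {"1"}
    if value not in allowed:
        raise ValueError(
            f"Invalid Variant spec version {value!r}; "
            f"only {sorted(allowed)} are supported")
    return int(value)
```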
Actually, yeah, I agree. It shouldn't be based on a config. I'll fix it.
thanks, merging to master/4.1!
…tation

### What changes were proposed in this pull request?
This PR makes changes to the parquet writer to make it annotate variant columns with the parquet variant logical type annotation.

### Why are the changes needed?
The Parquet spec has formally adopted the Variant logical type, and therefore, Variant columns must be properly annotated in Spark 4.1.0, which depends on Parquet-java 1.16.0, which contains the variant logical type annotation. This change is hidden behind a flag that is disabled by default until read support can be properly implemented.

### Does this PR introduce _any_ user-facing change?
Yes, Parquet files written by Spark 4.1.0 with the flag enabled (which it eventually will be by default) could contain the variant logical type annotation, which readers without support for the type will not be able to read.

### How was this patch tested?
Unit test to check if nested as well as top-level variants are properly annotated, and the data is being written correctly.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #53005 from harshmotw-db/harshmotw-db/variant_annotation_write.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 5270c99)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…tation

### What changes were proposed in this pull request?
[This PR](#53005) introduced a fix where the Spark parquet writer would annotate variant columns with the parquet variant logical type. That PR had an ad-hoc fix on the reader side for validation. This PR formally allows Spark to read parquet files with the Variant logical type. It also introduces an unrelated fix in ParquetRowConverter to allow Spark to read variant columns regardless of the order in which the value and metadata fields are stored.

### Why are the changes needed?
The variant logical type annotation has formally been adopted as part of the parquet spec and is part of the parquet-java 1.16.0 library. Therefore, Spark should be able to read files containing data annotated as such.

### Does this PR introduce _any_ user-facing change?
Yes, it allows users to read parquet files with the variant logical type annotation.

### How was this patch tested?
Existing test from [this PR](#53005), where we wrote data of the variant logical type and tested reads using an ad-hoc solution.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #53120 from harshmotw-db/harshmotw-db/variant_annotation_write.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
This PR changes the parquet writer so that it annotates variant columns with the parquet variant logical type annotation.
Why are the changes needed?
The Parquet spec has formally adopted the Variant logical type, and therefore, Variant columns must be properly annotated in Spark 4.1.0, which depends on Parquet-java 1.16.0, which contains the variant logical type annotation.
This change is hidden behind a flag that is disabled by default until read support can be properly implemented.
Does this PR introduce any user-facing change?
Yes. Parquet files written by Spark 4.1.0 with the flag enabled (which it eventually will be by default) could contain the variant logical type annotation, which readers without support for the type will not be able to read.
How was this patch tested?
A unit test checks that nested as well as top-level variants are properly annotated and that the data is written correctly.
Was this patch authored or co-authored using generative AI tooling?
No.
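For orientation, a Variant column in Parquet is physically a group containing binary `metadata` and `value` fields, and this PR attaches the Variant logical type annotation (with its specification version) to that group. The sketch below models that shape as a plain dictionary. It is purely illustrative: the dict layout and the `variant_group` helper are hypothetical, not a real Parquet or Spark API, and per the follow-up PR the order of the `metadata` and `value` fields within the group may vary in practice.

```python
def variant_group(name: str) -> dict:
    """Illustrative model of an annotated Variant column's schema shape.

    A Variant column is a group of two binary fields; the Variant logical
    type annotation (with its spec version) is attached to the group itself.
    """
    return {
        "name": name,
        "logical_type": {"VARIANT": {"specification_version": 1}},
        "fields": [
            {"name": "metadata", "type": "BINARY"},
            {"name": "value", "type": "BINARY"},
        ],
    }
```

The same shape applies whether the Variant column is top-level or nested inside a struct, array, or map, which is what the unit test described above exercises.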