[SPARK-54306] Annotate Variant columns with Variant logical type annotation #53005
Conversation
        convertInternal(groupColumn, sparkReadType.map(_.asInstanceOf[StructType]))) {
      // Temporary workaround to read Shredded variant data
      case v: VariantLogicalTypeAnnotation if v.getSpecVersion == 1 && sparkReadType.isEmpty =>
        convertInternal(groupColumn, None)
@chenhao-db I have added this temporary workaround on the reader side so that tests can scan a table containing Variant data when the VARIANT_ALLOW_READING_SHREDDED config is set to true.
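The guard in the pattern match above can be sketched in plain Python. This is an illustrative sketch only, not Spark's implementation; the names `spec_version`, `should_read_as_plain_group`, and `allow_shredded_read` are hypothetical stand-ins for the Scala pattern-match guard shown in the diff:

```python
def should_read_as_plain_group(annotation, spark_read_type, allow_shredded_read):
    """Mirror of the reader-side workaround's condition (names hypothetical).

    Fall back to reading the raw Parquet group only when:
    - the workaround config is enabled,
    - the column carries a Variant logical type annotation with spec version 1,
    - and no explicit Spark read type was requested for the column.
    """
    return (
        allow_shredded_read
        and annotation is not None
        and annotation.spec_version == 1
        and spark_read_type is None
    )
```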
@cashmand @chenhao-db @cloud-fan Can you please look at this PR? Thanks!
cashmand left a comment
Thanks for making this change!
      .booleanConf
      .createWithDefault(false)

    val PARQUET_WRITE_VARIANT_SPEC_VERSION =
It seems a bit strange for this to be a conf for now. I don't think we should allow writing a version that Spark doesn't know how to write, and currently the only valid spec version is 1, so if we have a conf, it should only accept "1" as a valid setting. Is there some use for this conf that I'm missing?
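The reviewer's point is that if the conf exists at all, it should reject every value except "1", the only spec version Spark knows how to write. A minimal sketch of that validation, in illustrative Python (the function name and error message are hypothetical; Spark's actual conf builder uses `checkValue`-style validation in Scala):

```python
def validate_variant_spec_version(value: str) -> int:
    """Accept only spec versions Spark can actually write (currently just "1")."""
    allowed = {"1"}
    if value not in allowed:
        raise ValueError(
            f"Invalid Variant spec version {value!r}; "
            f"only {sorted(allowed)} are supported")
    return int(value)
```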
Actually, yeah, I agree. It shouldn't be based on a config. I'll fix it.
thanks, merging to master/4.1!
…tation

### What changes were proposed in this pull request?
This PR makes changes to the parquet writer to make it annotate variant columns with the parquet variant logical type annotation.

### Why are the changes needed?
The Parquet spec has formally adopted the Variant logical type, and therefore, Variant columns must be properly annotated in Spark 4.1.0, which depends on Parquet-java 1.16.0, which contains the variant logical type annotation. This change is hidden behind a flag that is disabled by default until read support can be properly implemented.

### Does this PR introduce _any_ user-facing change?
Yes, Parquet files written by Spark 4.1.0 with the flag enabled (which it eventually will be by default) could contain the variant logical type annotation, which readers without support for the type will not be able to read.

### How was this patch tested?
Unit test to check if nested as well as top-level variants are properly annotated, and the data is being written correctly.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #53005 from harshmotw-db/harshmotw-db/variant_annotation_write.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 5270c99)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…tation

### What changes were proposed in this pull request?
[This PR](#53005) introduced a fix where the Spark parquet writer would annotate variant columns with the parquet variant logical type. That PR had an ad-hoc fix on the reader side for validation. This PR formally allows Spark to read parquet files with the Variant logical type. It also introduces an unrelated fix in ParquetRowConverter to allow Spark to read variant columns regardless of the order in which the value and metadata fields are stored.

### Why are the changes needed?
The variant logical type annotation has formally been adopted as part of the parquet spec and is part of the parquet-java 1.16.0 library. Therefore, Spark should be able to read files containing data annotated as such.

### Does this PR introduce _any_ user-facing change?
Yes, it allows users to read parquet files with the variant logical type annotation.

### How was this patch tested?
Existing test from [this PR](#53005), where we wrote data of the variant logical type and tested reads using an ad-hoc solution.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #53120 from harshmotw-db/harshmotw-db/variant_annotation_write.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
This PR changes the parquet writer so that it annotates variant columns with the parquet variant logical type annotation.
Why are the changes needed?
The Parquet spec has formally adopted the Variant logical type, and therefore, Variant columns must be properly annotated in Spark 4.1.0, which depends on Parquet-java 1.16.0, which contains the variant logical type annotation.
This change is hidden behind a flag that is disabled by default until read support can be properly implemented.
Does this PR introduce any user-facing change?
Yes. Parquet files written by Spark 4.1.0 with the flag enabled (which it eventually will be by default) could contain the variant logical type annotation, which readers without support for the type will not be able to read.
How was this patch tested?
A unit test checks that nested as well as top-level variants are properly annotated and that the data is written correctly.
Was this patch authored or co-authored using generative AI tooling?
No.
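For orientation, a Variant column in Parquet is physically a group containing binary `metadata` and `value` fields, and this PR attaches the Variant logical type annotation (with its specification version) to that group. The sketch below models that shape as a plain dictionary. It is purely illustrative: the dict layout and the `variant_group` helper are hypothetical, not a real Parquet or Spark API, and per the follow-up PR the order of the `metadata` and `value` fields within the group may vary in practice.

```python
def variant_group(name: str) -> dict:
    """Illustrative model of an annotated Variant column's schema shape.

    A Variant column is a group of two binary fields; the Variant logical
    type annotation (with its spec version) is attached to the group itself.
    """
    return {
        "name": name,
        "logical_type": {"VARIANT": {"specification_version": 1}},
        "fields": [
            {"name": "metadata", "type": "BINARY"},
            {"name": "value", "type": "BINARY"},
        ],
    }
```

The same shape applies whether the Variant column is top-level or nested inside a struct, array, or map, which is what the unit test described above exercises.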