[SPARK-56045][SQL] Add flag for ignoring Parquet UNKNOWN type annotation and revert to old behavior #54870
ZiyaZa wants to merge 2 commits into apache:master
LGTM if CI is green. Please create a new JIRA ticket, as the original commit is already released.

CI is green, linked the new ticket in the title.
…ion and revert to old behavior

### What changes were proposed in this pull request?

This PR introduces a new flag `spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled` for Parquet reader to control the behavior when it reads an external file with `UNKNOWN` logical type annotation:

- (Default) When false, we infer the Spark type based on the physical type used in the Parquet file, as we did before Spark 4.1.
- When true, we use NullType as the Spark type.

### Why are the changes needed?

To fix the regression introduced by #52922, as we have been reading files differently since then.

### Does this PR introduce _any_ user-facing change?

Yes. With default flag value, when we read a Parquet file written by an external engine:

- Before, we inferred NullType
- Now, we'll infer a type based on the physical type (e.g. IntegerType)

### How was this patch tested?

Added tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #54870 from ZiyaZa/unknown-type-flag.

Authored-by: Ziya Mukhtarov <ziya5muxtarov@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 50514c5)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
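The two flag settings described in this commit message can be modeled as a small decision function. The sketch below is illustrative only: the types and the names `inferType`, `fromPhysical`, and `UnknownAnnotationSketch` are hypothetical stand-ins and are not Spark's internal classes or the code in this patch. It assumes the usual mapping of Parquet `INT32` to `IntegerType` and `INT64` to `LongType` when no annotation is applied.

```scala
// Illustrative model only: every type and name here is hypothetical,
// not Spark's internal Parquet schema converter.
sealed trait SparkType
case object NullType extends SparkType
case object IntegerType extends SparkType
case object LongType extends SparkType

sealed trait PhysicalType
case object Int32 extends PhysicalType
case object Int64 extends PhysicalType

object UnknownAnnotationSketch {
  // Type chosen from the physical type alone (the pre-4.1 behavior).
  def fromPhysical(physical: PhysicalType): SparkType = physical match {
    case Int32 => IntegerType
    case Int64 => LongType
  }

  // respectUnknown stands in for the value of
  // spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled.
  def inferType(
      physical: PhysicalType,
      hasUnknownAnnotation: Boolean,
      respectUnknown: Boolean): SparkType =
    if (hasUnknownAnnotation && respectUnknown) NullType
    else fromPhysical(physical)

  def main(args: Array[String]): Unit = {
    // Default (flag false): the UNKNOWN annotation is ignored.
    assert(inferType(Int32, hasUnknownAnnotation = true, respectUnknown = false) == IntegerType)
    // Flag true: the UNKNOWN annotation yields NullType.
    assert(inferType(Int32, hasUnknownAnnotation = true, respectUnknown = true) == NullType)
    // Columns without the annotation are unaffected by the flag.
    assert(inferType(Int64, hasUnknownAnnotation = false, respectUnknown = true) == LongType)
  }
}
```

The point of the model is that the flag only matters for columns that carry the `UNKNOWN` annotation; all other columns resolve the same way under either setting.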
Thanks, merging to master/4.1!
      "inference and infers NullType. When disabled, ignores the UNKNOWN annotation " +
      "and uses the physical type instead.")
    .version("4.1.2")
    .withBindingPolicy(ConfigBindingPolicy.SESSION)
Hi, @ZiyaZa and @cloud-fan.
This broke branch-4.1. Let me revert it from branch-4.1 only for now.
[error] /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:1627:8: value withBindingPolicy is not a member of org.apache.spark.internal.config.ConfigBuilder
[error] possible cause: maybe a semicolon is missing before `value withBindingPolicy`?
[error] .withBindingPolicy(ConfigBindingPolicy.SESSION)
[error] ^
[error] /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:1627:26: not found: value ConfigBindingPolicy
[error] .withBindingPolicy(ConfigBindingPolicy.SESSION)
[error] ^
[error] two errors found
Hi, it seems `withBindingPolicy` doesn't exist in 4.1, so deleting that line should fix the build. I can create a PR that deletes that line, or re-apply this PR as a whole after your revert. Either way works.
Yes, it's already reverted. Please make a new backporting PR to branch-4.1 now to make sure that CI passes, @ZiyaZa.

BTW, thank you for the fix, @ZiyaZa.

Created a new PR here: #54885. Thanks for letting me know.
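### What changes were proposed in this pull request?

This PR introduces a new flag `spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled` for the Parquet reader to control the behavior when it reads an external file with the `UNKNOWN` logical type annotation:

- (Default) When false, we infer the Spark type based on the physical type used in the Parquet file, as we did before Spark 4.1.
- When true, we use NullType as the Spark type.

### Why are the changes needed?

To fix the regression introduced by #52922, as we have been reading files differently since then.

### Does this PR introduce _any_ user-facing change?

Yes. With the default flag value, when we read a Parquet file written by an external engine:

- Before, we inferred NullType.
- Now, we'll infer a type based on the physical type (e.g. IntegerType).

### How was this patch tested?

Added tests.

### Was this patch authored or co-authored using generative AI tooling?

No.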
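For readers who want to compare the two behaviors, a minimal sketch of toggling the flag in a session might look like the following. It assumes a Spark build that already includes this change; the file path and the object name are hypothetical, and the file is assumed to have been written by an external engine with a column carrying the `UNKNOWN` annotation.

```scala
import org.apache.spark.sql.SparkSession

object RespectUnknownAnnotationExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("respect-unknown-annotation-example")
      .master("local[*]")
      .getOrCreate()

    // Flag name taken from this PR; it only has an effect on Spark
    // builds that include the change.
    val flag = "spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled"

    // Hypothetical path to a Parquet file written by an external engine.
    val path = "/tmp/external_unknown.parquet"

    // Default behavior: the annotation is ignored and the Spark type
    // is derived from the physical type.
    spark.conf.set(flag, "false")
    spark.read.parquet(path).printSchema()

    // Opt in to respecting the annotation: the column is read as NullType.
    spark.conf.set(flag, "true")
    spark.read.parquet(path).printSchema()

    spark.stop()
  }
}
```

Printing the schema under both settings is a quick way to check whether a given file is affected by the behavior change before deciding which value to use.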