[SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode #34044
Conversation
ok to test
@@ -96,6 +97,58 @@ class ParquetInteroperabilitySuite extends ParquetCompatibilityTest with SharedS
    }
  }

  test("parquet files with legacy mode and schema evolution") {
test("parquet files with legacy mode and schema evolution") { | |
test("SPARK-36803: parquet files with legacy mode and schema evolution") { |
Kubernetes integration test starting
Kubernetes integration test status failure
@@ -602,8 +602,10 @@ private[parquet] class ParquetRowConverter(
    // matches the Catalyst array element type. If it doesn't match, then it's case 1; otherwise,
    // it's case 2.
    val guessedElementType = schemaConverter.convertField(repeatedType)
    // We also need to check if the list element follows the backward compatible pattern.
shall we also update the long code comment above?
Yes, we can. I will update, thanks.
cc @sunchao since this is Parquet.
Test build #143441 has finished for PR 34044 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
@cloud-fan Would you be able to help me figure out why CI fails in my PR? I checked the logs but no failed tests were reported, and I got the following log:
The newly added test passes:
@sadikovi mind retriggering https://github.com/sadikovi/spark/runs/3649026062?
Re-triggered the build
Test build #143451 has finished for PR 34044 at commit
    val guessedElementType = schemaConverter.convertField(repeatedType)
    val isLegacy = schemaConverter.isElementType(repeatedType, parquetSchema.getName())
Interesting - does it mean that in the parquet-mr read path Spark was not able to handle the legacy list format? Also, do we need to do something similar for the legacy map format?
BTW: you can remove the () in parquetSchema.getName() since it is an accessor method.
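For context on the two list layouts being discussed, a minimal sketch (made-up local paths, not part of the PR) that writes the same array-of-structs column with spark.sql.parquet.writeLegacyFormat off and on; the schema shapes in the comments follow the parquet-format backward-compatibility rules, and the exact group and field names Spark emits may differ:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, lit, struct}

val spark = SparkSession.builder().master("local[1]").getOrCreate()
val df = spark.range(1).select(array(struct(lit(1).as("a"))).as("arr"))

// Standard 3-level layout, roughly:
//   optional group arr (LIST) {
//     repeated group list {
//       required group element { required int32 a; }
//     }
//   }
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "false")
df.write.mode("overwrite").parquet("/tmp/standard-list")

// Legacy 2-level layout for non-null elements, roughly:
//   optional group arr (LIST) {
//     repeated group array { required int32 a; }
//   }
// The repeated group itself is the element, which the converter has to recognise
// even when the element struct later gains new fields through schema evolution.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
df.write.mode("overwrite").parquet("/tmp/legacy-list")
```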
I see, the existing schemaConverter.convertField(repeatedType) already covered the legacy-format lists, but this particular issue is about schema evolution with newly added struct fields. I wonder whether it's better to just expand equalsIgnoreCompatibleNullability and allow element to contain guessedElementType.
Yes, that is correct: the legacy format would still be read by Spark; it was schema evolution of a list element that could trigger this issue. If all of the files have the same schema, everything should work just fine.
I considered having something like "contains" instead of "equals", but I had a concern that this might introduce issues where the schema "contains" the element yet should still be treated as a 3-level LIST. Also, I could not find a "contains" method for DataType in the codebase. IMHO, it is better to check Parquet compatibility issues using the Parquet schema rather than the Catalyst schema, which was meant to reconcile those types anyway.
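To make the Parquet-side check concrete, here is a rough sketch of the backward-compatibility rules from the parquet-format spec that a helper like isElementType has to encode; the method name and exact conditions below are illustrative, not the actual Spark implementation:

```scala
import org.apache.parquet.schema.Type

// Decide whether the repeated field nested inside a LIST-annotated group is itself the
// element (legacy 2-level layout) or only a wrapper around the element (3-level layout).
// `parentName` is the name of the surrounding LIST-annotated field.
def repeatedTypeIsElement(repeatedType: Type, parentName: String): Boolean = {
  if (repeatedType.isPrimitive) {
    // e.g. "repeated int32 element": the repeated field is the element itself.
    true
  } else {
    val group = repeatedType.asGroupType()
    // Legacy struct elements carry their fields directly in the repeated group, which is
    // conventionally named "array" or "<parentName>_tuple"; a group with several fields
    // can only be the element itself.
    group.getFieldCount > 1 ||
      group.getName == "array" ||
      group.getName == s"${parentName}_tuple"
  }
}
```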
I know, making the method non-private is not ideal, but neither is adding a new DataType "contains" method or duplicating code in another function. Let me know what you think could be a better (or the least intrusive) approach. Thanks.
Got it. Yea it's just a nit from me and this looks OK to me too.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #143494 has finished for PR 34044 at commit
thanks, merging to master/3.2/3.1/3.0!
Thank you @cloud-fan @HyukjinKwon @sunchao @dongjoon-hyun for the reviews! |
What changes were proposed in this pull request?
This PR fixes an issue where reading a Parquet file written in legacy mode could fail due to an incorrect Parquet LIST to ArrayType conversion.
The issue arises when schema evolution is used with the parquet-mr reader: 2-level LIST-annotated types could be parsed incorrectly as 3-level LIST-annotated types because their underlying element type does not match the full inferred Catalyst schema.
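To illustrate the failure mode, a hypothetical repro sketch (the column names and local path are made up, not taken from the PR's test): write an array of structs in legacy mode, then read it back while declaring an evolved element struct with an extra field:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, lit, struct}

val spark = SparkSession.builder().master("local[1]").getOrCreate()
val path = "/tmp/legacy-evolved-list"

// Write the file with the legacy (2-level) LIST layout; each element is a struct
// with a single field `a`.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
spark.range(1)
  .select(array(struct(lit(1).as("a"))).as("arr"))
  .write.mode("overwrite").parquet(path)

// Read it back declaring an evolved element struct that also has a field `b`.
// Before this fix, the element type guessed from the file no longer equalled the
// declared Catalyst element type, so the converter could mistake the 2-level list
// for a 3-level one; with the fix, the missing field should simply read back as null.
spark.read
  .schema("arr ARRAY<STRUCT<a: INT, b: STRING>>")
  .parquet(path)
  .show()
```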
Why are the changes needed?
This appears to be a long-standing issue with the legacy mode, caused by the imprecise check in ParquetRowConverter that tried to determine Parquet backward compatibility using the Catalyst schema:
DataType.equalsIgnoreCompatibleNullability(guessedElementType, elementType)
in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala#L606.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a new test case in ParquetInteroperabilitySuite.scala.