
[SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode #34044

Closed

Conversation

sadikovi
Contributor

@sadikovi sadikovi commented Sep 20, 2021

What changes were proposed in this pull request?

This PR fixes an issue where reading a Parquet file written in legacy mode could fail due to an incorrect Parquet LIST to Catalyst ArrayType conversion.

The issue arises when schema evolution is combined with the parquet-mr reader: 2-level LIST annotated types could be incorrectly parsed as 3-level LIST annotated types, because the element type in the file does not match the full inferred Catalyst schema.
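For illustration, the two layouts can be modeled with a small sketch. The `Field` type and `isLegacyElement` helper below are hypothetical, not Spark's or parquet-mr's API; they encode the backward-compatibility rules from the Parquet LIST specification: a repeated group is itself the list element (legacy 2-level layout) when it has more than one child, or is named `array` or `<listName>_tuple`; otherwise the element sits one level deeper (standard 3-level layout).

```scala
// Hypothetical model of a Parquet group field (illustration only).
case class Field(name: String, repeated: Boolean, children: List[Field])

// Backward-compatibility rules from the Parquet LIST spec: the repeated
// node is itself the element (legacy 2-level layout) when it is a group
// with more than one child, or is named "array", or "<listName>_tuple".
// Otherwise the element is nested one level deeper (standard 3-level).
def isLegacyElement(repeatedField: Field, listName: String): Boolean =
  repeatedField.children.size > 1 ||
  repeatedField.name == "array" ||
  repeatedField.name == s"${listName}_tuple"

// Legacy 2-level: optional group f (LIST) { repeated group array { a; b } }
val legacy = Field("array", repeated = true,
  List(Field("a", repeated = false, Nil), Field("b", repeated = false, Nil)))

// Standard 3-level: optional group f (LIST) { repeated group list { element } }
val standard = Field("list", repeated = true,
  List(Field("element", repeated = false, Nil)))

println(isLegacyElement(legacy, "f"))   // true
println(isLegacyElement(standard, "f")) // false
```

The point of the rules is that the layout can be decided from the Parquet schema alone, without consulting the Catalyst schema at all.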

Why are the changes needed?

It appears to be a long-standing issue with the legacy mode, caused by an imprecise check in ParquetRowConverter that tried to determine Parquet backward compatibility using the Catalyst schema: DataType.equalsIgnoreCompatibleNullability(guessedElementType, elementType) in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala#L606.
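To see why an equality check on Catalyst types breaks under schema evolution, here is a minimal sketch (the `DType` model and `looksLikeLegacyElement` helper are hypothetical stand-ins, not Catalyst's DataType): the element type guessed from the old file lacks the newly added column, so strict equality with the evolved table schema fails, and the converter then wrongly treats the legacy layout as 3-level.

```scala
// Hypothetical stand-in for Catalyst data types (illustration only).
sealed trait DType
case object IntT extends DType
case class StructT(fields: Map[String, DType]) extends DType

// Simplified version of the old check: treat the layout as legacy 2-level
// only when the type guessed from the file equals the Catalyst element type.
def looksLikeLegacyElement(guessed: DType, catalystElement: DType): Boolean =
  guessed == catalystElement

val fileElement    = StructT(Map("a" -> IntT))               // schema in the old file
val evolvedElement = StructT(Map("a" -> IntT, "b" -> IntT))  // table schema after adding "b"

println(looksLikeLegacyElement(fileElement, fileElement))    // true: schemas match, read works
println(looksLikeLegacyElement(fileElement, evolvedElement)) // false: evolution breaks the guess
```

This is why all files with identical schemas read fine, and the failure only shows up once a struct field is added to the list element.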

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new test case in ParquetInteroperabilitySuite.scala.

@HyukjinKwon
Member

ok to test

@@ -96,6 +97,58 @@ class ParquetInteroperabilitySuite extends ParquetCompatibilityTest with SharedS
}
}

test("parquet files with legacy mode and schema evolution") {
Member

Suggested change
test("parquet files with legacy mode and schema evolution") {
test("SPARK-36803: parquet files with legacy mode and schema evolution") {

@SparkQA

SparkQA commented Sep 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47949/

@SparkQA

SparkQA commented Sep 20, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47949/

@@ -602,8 +602,10 @@ private[parquet] class ParquetRowConverter(
// matches the Catalyst array element type. If it doesn't match, then it's case 1; otherwise,
// it's case 2.
val guessedElementType = schemaConverter.convertField(repeatedType)
// We also need to check if the list element follows the backward compatible pattern.
Contributor

shall we also update the long code comment above?

Contributor Author

Yes, we can. I will update, thanks.

@dongjoon-hyun
Member

cc @sunchao since this is Parquet.

@SparkQA

SparkQA commented Sep 20, 2021

Test build #143441 has finished for PR 34044 at commit d3d47f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47960/

@SparkQA

SparkQA commented Sep 20, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47960/

@sadikovi
Contributor Author

@cloud-fan Would you be able to help me figure out why CI fails in my PR? I checked the logs, but no failed tests were reported and I got the following output:

2021-09-20T08:48:08.2253254Z [info] Tests: succeeded 7729, failed 0, canceled 0, ignored 26, pending 0
2021-09-20T08:48:08.2254305Z [info] All tests passed.
2021-09-20T08:48:08.2379778Z [error] Error: Total 0, Failed 0, Errors 0, Passed 0
2021-09-20T08:48:08.2503788Z [error] Error during tests:
2021-09-20T08:48:08.7685175Z [error] 	Running java with options 
...
sbt.ForkMain 36107 failed with exit code 134

The newly added test passes: 2021-09-20T08:42:40.7932026Z [info] - SPARK-36803: parquet files with legacy mode and schema evolution (846 milliseconds)

@HyukjinKwon
Member

@HyukjinKwon HyukjinKwon changed the title [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode Sep 20, 2021
@sadikovi
Contributor Author

Re-triggered the build

@SparkQA

SparkQA commented Sep 20, 2021

Test build #143451 has finished for PR 34044 at commit 33fa362.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val guessedElementType = schemaConverter.convertField(repeatedType)
val isLegacy = schemaConverter.isElementType(repeatedType, parquetSchema.getName())
Member

Interesting - does it mean that in the parquet-mr read path Spark was not able to handle the legacy list format? Also, do we need to do something similar for the legacy map format?

BTW: you can remove the () in parquetSchema.getName() since this is an accessor method.

Member

I see, the existing schemaConverter.convertField(repeatedType) already covered the legacy format lists but this particular issue is about schema evolution with added new struct fields. I wonder whether it's better to just expand equalsIgnoreCompatibleNullability and allow element to contain guessedElementType.

Contributor Author

Yes, that is correct: the legacy format can still be read by Spark; it is schema evolution of a list element that triggers this issue. If all of the files have the same schema, everything works just fine.

I considered using something like "contains" instead of "equals", but I was concerned that this might introduce issues when the schema "contains" the guessed type yet should still be treated as a 3-level LIST. Also, I could not find a "contains" method for DataType in the codebase. IMHO, it is better to check Parquet compatibility using the Parquet schema rather than the Catalyst schema, which was meant to reconcile those types anyway.

Contributor Author

I know, making the method non-private is not ideal, but neither is adding a new DataType "contains" method or duplicating the code in another function. Let me know what you think would be a better (or the least intrusive) approach. Thanks.

Member

Got it. Yea it's just a nit from me and this looks OK to me too.

@SparkQA

SparkQA commented Sep 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48005/

@SparkQA

SparkQA commented Sep 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48005/

@SparkQA

SparkQA commented Sep 22, 2021

Test build #143494 has finished for PR 34044 at commit 8a21e05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan closed this in ec26d94 Sep 22, 2021
cloud-fan pushed a commit that referenced this pull request Sep 22, 2021
…s written in legacy mode

### What changes were proposed in this pull request?

This PR fixes an issue when reading of a Parquet file written with legacy mode would fail due to incorrect Parquet LIST to ArrayType conversion.

The issue arises when using schema evolution and utilising the parquet-mr reader. 2-level LIST annotated types could be parsed incorrectly as 3-level LIST annotated types because their underlying element type does not match the full inferred Catalyst schema.

### Why are the changes needed?

It appears to be a long-standing issue with the legacy mode due to the imprecise check in ParquetRowConverter that was trying to determine Parquet backward compatibility using Catalyst schema: `DataType.equalsIgnoreCompatibleNullability(guessedElementType, elementType)` in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala#L606.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Added a new test case in ParquetInteroperabilitySuite.scala.

Closes #34044 from sadikovi/parquet-legacy-write-mode-list-issue.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ec26d94)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Sep 22, 2021
…s written in legacy mode

(cherry picked from commit ec26d94)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Sep 22, 2021
…s written in legacy mode

(cherry picked from commit ec26d94)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Contributor

thanks, merging to master/3.2/3.1/3.0!

@sadikovi
Contributor Author

Thank you @cloud-fan @HyukjinKwon @sunchao @dongjoon-hyun for the reviews!

fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
…s written in legacy mode

(cherry picked from commit ec26d94)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
6 participants