
[SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode #34044

Closed

Conversation

sadikovi
Contributor

@sadikovi sadikovi commented Sep 20, 2021

What changes were proposed in this pull request?

This PR fixes an issue where reading a Parquet file written in legacy mode could fail due to an incorrect Parquet LIST to Catalyst ArrayType conversion.

The issue arises when schema evolution is combined with the parquet-mr reader: 2-level LIST annotated types could be incorrectly parsed as 3-level LIST annotated types, because the element type in the file does not match the full inferred Catalyst schema.
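For illustration, the two layouts can be modeled with a small sketch. The `Field` type and `isLegacyElement` helper below are hypothetical, not Spark's or parquet-mr's API; they encode the backward-compatibility rules from the Parquet LIST specification: a repeated group is itself the list element (legacy 2-level layout) when it has more than one child, or is named `array` or `<listName>_tuple`; otherwise the element sits one level deeper (standard 3-level layout).

```scala
// Hypothetical model of a Parquet group field (illustration only).
case class Field(name: String, repeated: Boolean, children: List[Field])

// Backward-compatibility rules from the Parquet LIST spec: the repeated
// node is itself the element (legacy 2-level layout) when it is a group
// with more than one child, or is named "array", or "<listName>_tuple".
// Otherwise the element is nested one level deeper (standard 3-level).
def isLegacyElement(repeatedField: Field, listName: String): Boolean =
  repeatedField.children.size > 1 ||
  repeatedField.name == "array" ||
  repeatedField.name == s"${listName}_tuple"

// Legacy 2-level: optional group f (LIST) { repeated group array { a; b } }
val legacy = Field("array", repeated = true,
  List(Field("a", repeated = false, Nil), Field("b", repeated = false, Nil)))

// Standard 3-level: optional group f (LIST) { repeated group list { element } }
val standard = Field("list", repeated = true,
  List(Field("element", repeated = false, Nil)))

println(isLegacyElement(legacy, "f"))   // true
println(isLegacyElement(standard, "f")) // false
```

The point of the rules is that the layout can be decided from the Parquet schema alone, without consulting the Catalyst schema at all.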

Why are the changes needed?

It appears to be a long-standing issue with the legacy mode, caused by an imprecise check in ParquetRowConverter that tried to determine Parquet backward compatibility using the Catalyst schema: DataType.equalsIgnoreCompatibleNullability(guessedElementType, elementType) in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala#L606.
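To see why an equality check on Catalyst types breaks under schema evolution, here is a minimal sketch (the `DType` model and `looksLikeLegacyElement` helper are hypothetical stand-ins, not Catalyst's DataType): the element type guessed from the old file lacks the newly added column, so strict equality with the evolved table schema fails, and the converter then wrongly treats the legacy layout as 3-level.

```scala
// Hypothetical stand-in for Catalyst data types (illustration only).
sealed trait DType
case object IntT extends DType
case class StructT(fields: Map[String, DType]) extends DType

// Simplified version of the old check: treat the layout as legacy 2-level
// only when the type guessed from the file equals the Catalyst element type.
def looksLikeLegacyElement(guessed: DType, catalystElement: DType): Boolean =
  guessed == catalystElement

val fileElement    = StructT(Map("a" -> IntT))               // schema in the old file
val evolvedElement = StructT(Map("a" -> IntT, "b" -> IntT))  // table schema after adding "b"

println(looksLikeLegacyElement(fileElement, fileElement))    // true: schemas match, read works
println(looksLikeLegacyElement(fileElement, evolvedElement)) // false: evolution breaks the guess
```

This is why all files with identical schemas read fine, and the failure only shows up once a struct field is added to the list element.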

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new test case in ParquetInteroperabilitySuite.scala.

@HyukjinKwon
Member

ok to test

@@ -96,6 +97,58 @@ class ParquetInteroperabilitySuite extends ParquetCompatibilityTest with SharedS
}
}

test("parquet files with legacy mode and schema evolution") {
Member

Suggested change
test("parquet files with legacy mode and schema evolution") {
test("SPARK-36803: parquet files with legacy mode and schema evolution") {

@SparkQA

SparkQA commented Sep 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47949/

@SparkQA

SparkQA commented Sep 20, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47949/

@@ -602,8 +602,10 @@ private[parquet] class ParquetRowConverter(
// matches the Catalyst array element type. If it doesn't match, then it's case 1; otherwise,
// it's case 2.
val guessedElementType = schemaConverter.convertField(repeatedType)
// We also need to check if the list element follows the backward compatible pattern.
Contributor

shall we also update the long code comment above?

Contributor Author

Yes, we can. I will update, thanks.

@dongjoon-hyun
Member

cc @sunchao since this is Parquet.

@SparkQA

SparkQA commented Sep 20, 2021

Test build #143441 has finished for PR 34044 at commit d3d47f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47960/

@SparkQA

SparkQA commented Sep 20, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47960/

@sadikovi
Contributor Author

@cloud-fan Would you be able to help me figure out why CI fails in my PR? I checked the logs, but no failed tests were reported and I got the following output:

2021-09-20T08:48:08.2253254Z [info] Tests: succeeded 7729, failed 0, canceled 0, ignored 26, pending 0
2021-09-20T08:48:08.2254305Z [info] All tests passed.
2021-09-20T08:48:08.2379778Z [error] Error: Total 0, Failed 0, Errors 0, Passed 0
2021-09-20T08:48:08.2503788Z [error] Error during tests:
2021-09-20T08:48:08.7685175Z [error] 	Running java with options 
...
sbt.ForkMain 36107 failed with exit code 134

The newly added test passes: 2021-09-20T08:42:40.7932026Z [info] - SPARK-36803: parquet files with legacy mode and schema evolution (846 milliseconds)

@HyukjinKwon
Member

@HyukjinKwon HyukjinKwon changed the title [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode Sep 20, 2021
@sadikovi
Contributor Author

Re-triggered the build

@SparkQA

SparkQA commented Sep 20, 2021

Test build #143451 has finished for PR 34044 at commit 33fa362.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val guessedElementType = schemaConverter.convertField(repeatedType)
val isLegacy = schemaConverter.isElementType(repeatedType, parquetSchema.getName())
Member

Interesting - does it mean that in the parquet-mr read path Spark was not able to handle the legacy list format? Also, do we need to do something similar for the legacy map format?

BTW: you can remove the () in parquetSchema.getName() since this is an accessor method.

Member

I see, the existing schemaConverter.convertField(repeatedType) already covered the legacy format lists but this particular issue is about schema evolution with added new struct fields. I wonder whether it's better to just expand equalsIgnoreCompatibleNullability and allow element to contain guessedElementType.

Contributor Author

Yes, that is correct: the legacy format can still be read by Spark; it is schema evolution of a list element that triggers this issue. If all of the files have the same schema, everything works just fine.

I considered using something like "contains" instead of "equals", but I was concerned that this might introduce issues when the schema "contains" the guessed type yet should still be treated as a 3-level LIST. Also, I could not find a "contains" method for DataType in the codebase. IMHO, it is better to check Parquet compatibility using the Parquet schema rather than the Catalyst schema, which was meant to reconcile those types anyway.

Contributor Author

I know, making the method non-private is not ideal, but neither is adding a new DataType "contains" method or duplicating the code in another function. Let me know what you think would be a better (or the least intrusive) approach. Thanks.

Member

Got it. Yea it's just a nit from me and this looks OK to me too.

@SparkQA

SparkQA commented Sep 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48005/

@SparkQA

SparkQA commented Sep 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48005/

@SparkQA

SparkQA commented Sep 22, 2021

Test build #143494 has finished for PR 34044 at commit 8a21e05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan closed this in ec26d94 Sep 22, 2021
cloud-fan pushed a commit that referenced this pull request Sep 22, 2021
…s written in legacy mode

### What changes were proposed in this pull request?

This PR fixes an issue when reading of a Parquet file written with legacy mode would fail due to incorrect Parquet LIST to ArrayType conversion.

The issue arises when using schema evolution and utilising the parquet-mr reader. 2-level LIST annotated types could be parsed incorrectly as 3-level LIST annotated types because their underlying element type does not match the full inferred Catalyst schema.

### Why are the changes needed?

It appears to be a long-standing issue with the legacy mode due to the imprecise check in ParquetRowConverter that was trying to determine Parquet backward compatibility using Catalyst schema: `DataType.equalsIgnoreCompatibleNullability(guessedElementType, elementType)` in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala#L606.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Added a new test case in ParquetInteroperabilitySuite.scala.

Closes #34044 from sadikovi/parquet-legacy-write-mode-list-issue.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ec26d94)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Sep 22, 2021
…s written in legacy mode

(cherry picked from commit ec26d94)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Sep 22, 2021
…s written in legacy mode

(cherry picked from commit ec26d94)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Contributor

thanks, merging to master/3.2/3.1/3.0!

@sadikovi
Contributor Author

Thank you @cloud-fan @HyukjinKwon @sunchao @dongjoon-hyun for the reviews!

fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
…s written in legacy mode

(cherry picked from commit ec26d94)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
6 participants