[SPARK-42317][SQL] Assign name to _LEGACY_ERROR_TEMP_2247: CANNOT_MERGE_SCHEMAS by kori73 · Pull Request #40810 · apache/spark

kori73 · 2023-04-16T19:14:23Z

What changes were proposed in this pull request?

This PR proposes to assign the name "CANNOT_MERGE_SCHEMAS" to _LEGACY_ERROR_TEMP_2247.

It also proposes displaying both the left and right schemas in the exception so that they can be compared. Please let me know if you prefer the old error message with a single schema.

This is the stack trace after the changes:

scala> spark.read.option("mergeSchema", "true").parquet(path)
org.apache.spark.SparkException: [CANNOT_MERGE_SCHEMAS] Failed merging schemas:
Initial schema:
"STRUCT<id: BIGINT>"
Schema that cannot be merged with the initial schema:
"STRUCT<id: INT>".
  at org.apache.spark.sql.errors.QueryExecutionErrors$.failedMergingSchemaError(QueryExecutionErrors.scala:2355)
  at org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:104)
  at org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5$adapted(SchemaMergeUtils.scala:100)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.mergeSchemasInParallel(SchemaMergeUtils.scala:100)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:496)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.inferSchema(ParquetUtils.scala:132)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.inferSchema(ParquetFileFormat.scala:78)
  at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$11(DataSource.scala:208)
  at scala.Option.orElse(Option.scala:447)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:205)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:407)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:229)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:211)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:563)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:548)
  ... 49 elided
Caused by: org.apache.spark.SparkException: [CANNOT_MERGE_INCOMPATIBLE_DATA_TYPE] Failed to merge incompatible data types "BIGINT" and "INT".
  at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotMergeIncompatibleDataTypesError(QueryExecutionErrors.scala:1326)
  at org.apache.spark.sql.types.StructType$.$anonfun$merge$3(StructType.scala:610)
  at scala.Option.map(Option.scala:230)
  at org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:602)
  at org.apache.spark.sql.types.StructType$.$anonfun$merge$2$adapted(StructType.scala:599)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:599)
  at org.apache.spark.sql.types.StructType$.mergeInternal(StructType.scala:647)
  at org.apache.spark.sql.types.StructType$.merge(StructType.scala:593)
  at org.apache.spark.sql.types.StructType.merge(StructType.scala:498)
  at org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:102)
  ... 67 more
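
A minimal reproduction sketch for the failure above, meant for spark-shell or a Scala script. The temporary directory, the `p=1` / `p=2` directory names, and the app name are illustrative assumptions, not taken from this PR. Which schema is reported as the initial one may depend on file listing order, and the error may surface wrapped in another exception depending on where the merge fails.

```scala
import java.nio.file.Files
import scala.util.Try
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cannot-merge-schemas-repro")
  .getOrCreate()

// Hypothetical scratch location for the two Parquet directories.
val path = Files.createTempDirectory("merge-schemas").toString

// First directory: spark.range yields a single BIGINT column named id.
spark.range(3).write.parquet(s"$path/p=1")
// Second directory: the same column cast to INT, which does not merge with BIGINT.
spark.range(3).selectExpr("CAST(id AS INT) AS id").write.parquet(s"$path/p=2")

// Schema inference runs eagerly inside parquet(...), so the failure surfaces here.
val result = Try(spark.read.option("mergeSchema", "true").parquet(path))
result.failed.foreach(e => println(e.getMessage))
```

Wrapping the read in `Try` keeps the script running so the message can be inspected instead of aborting the session.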

Why are the changes needed?

We should assign proper names to the _LEGACY_ERROR_TEMP_* error classes.

Does this PR introduce any user-facing change?

Yes, users will see an improved error message.

How was this patch tested?

Changed an existing test case to check the new error class with the checkError utility.
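
checkError belongs to Spark's own test suites. Outside them, a similar check can be sketched with the public `SparkThrowable` interface. The helper name `errorClasses` is hypothetical, and the sketch assumes the Spark 3.4-era `getErrorClass` accessor. It walks the cause chain in case the named error is wrapped in another exception.

```scala
import org.apache.spark.SparkThrowable

// Hypothetical helper: collect the error classes found along a cause chain.
def errorClasses(t: Throwable): List[String] =
  Iterator.iterate(t)(_.getCause)
    .takeWhile(_ != null) // stop at the end of the chain
    .collect { case st: SparkThrowable if st.getErrorClass != null => st.getErrorClass }
    .toList
```

A caller that has caught an exception `e` from a failed read could then check `errorClasses(e).contains("CANNOT_MERGE_SCHEMAS")`.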

Also improve the error message by adding both the left and right schemas

kori73 · 2023-04-17T19:42:03Z

@itholic @MaxGekk

MaxGekk · 2023-04-22T14:16:38Z

-      messageParameters = Map(
-        "schema" -> schema.treeString),
+      errorClass = "CANNOT_MERGE_SCHEMAS",
+      messageParameters = Map("left" -> leftSchema.treeString, "right" -> rightSchema.treeString),


Could you wrap the schemas with toSQLType() instead of rightSchema, please? BTW, since the error can occur in PySpark, SQL, and R, we show it in a common form, as a SQL type.

Thanks for the explanation. I have wrapped the schemas with toSQLType
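
To illustrate the difference the reviewer is pointing at, here is a small sketch comparing the two renderings of a schema. toSQLType is a helper inside Spark's error-handling code, so the sketch uses the public `treeString` and `sql` methods on `StructType` instead. It assumes the helper's output is based on the SQL type string, as the quoted "STRUCT<id: BIGINT>" in the example output suggests.

```scala
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// The schema from the example: a single BIGINT column named id.
val schema = StructType(Seq(StructField("id", LongType)))

// Multi-line, Spark-specific tree rendering (starts with "root").
println(schema.treeString)

// Single-line SQL type string, a form shared across the SQL, Python, and R front ends.
println(schema.sql)
```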

MaxGekk · 2023-04-22T14:18:30Z

+          "left" -> df1.schema.treeString,
+          "right" -> df2.schema.treeString))


Just embed the SQL types, please.

wrapped with toSQLType here as well

MaxGekk · 2023-04-24T06:20:39Z

@kori73 Could you update the example (output) according to the recent commit, please.

kori73 · 2023-04-24T08:23:22Z

> @kori73 Could you update the example (output) according to the recent commit, please.

updated the example according to the recent commit

MaxGekk · 2023-04-24T08:29:00Z

+1, LGTM. Merging to master.
Thank you, @kori73.

MaxGekk · 2023-04-24T08:31:13Z

@kori73 Congratulations on your first contribution to Apache Spark!

Koray Beyaz added 3 commits April 16, 2023 00:47

Assign name to _LEGACY_ERROR_TEMP_2247: CANNOT_MERGE_SCHEMAS

b32a4dc

Also improve the error message by adding both the left and right schemas

adapt test to error class and add sqlState

5682ce0

fix error class formatting

d969537

github-actions Bot added CORE SQL labels Apr 16, 2023

kori73 added 2 commits April 18, 2023 16:48

Merge branch 'apache:master' into assign-name-2247

75bb6d7

Merge branch 'apache:master' into assign-name-2247

d9d9b71

MaxGekk requested changes Apr 22, 2023

Koray Beyaz and others added 2 commits April 22, 2023 18:27

wrap schemas with toSQLType()

e0f830b

Merge branch 'apache:master' into assign-name-2247

82ab007

MaxGekk approved these changes Apr 24, 2023

MaxGekk closed this in 69946bb Apr 24, 2023

kori73 wants to merge 7 commits into apache:master from kori73:assign-name-2247
