[SPARK-42317][SQL] Assign name to _LEGACY_ERROR_TEMP_2247: CANNOT_MERGE_SCHEMAS#40810

Closed
kori73 wants to merge 7 commits into apache:master from kori73:assign-name-2247


@kori73 kori73 commented Apr 16, 2023

Contributor

What changes were proposed in this pull request?

This PR proposes to assign the name "CANNOT_MERGE_SCHEMAS" to the error class _LEGACY_ERROR_TEMP_2247.

It also proposes to display both the left and right schemas in the exception so that they can be compared. Please let me know if you prefer the old error message with a single schema.

This is the stack trace after the changes:

scala> spark.read.option("mergeSchema", "true").parquet(path)
org.apache.spark.SparkException: [CANNOT_MERGE_SCHEMAS] Failed merging schemas:
Initial schema:
"STRUCT<id: BIGINT>"
Schema that cannot be merged with the initial schema:
"STRUCT<id: INT>".
  at org.apache.spark.sql.errors.QueryExecutionErrors$.failedMergingSchemaError(QueryExecutionErrors.scala:2355)
  at org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:104)
  at org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5$adapted(SchemaMergeUtils.scala:100)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.mergeSchemasInParallel(SchemaMergeUtils.scala:100)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:496)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.inferSchema(ParquetUtils.scala:132)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.inferSchema(ParquetFileFormat.scala:78)
  at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$11(DataSource.scala:208)
  at scala.Option.orElse(Option.scala:447)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:205)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:407)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:229)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:211)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:563)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:548)
  ... 49 elided
Caused by: org.apache.spark.SparkException: [CANNOT_MERGE_INCOMPATIBLE_DATA_TYPE] Failed to merge incompatible data types "BIGINT" and "INT".
  at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotMergeIncompatibleDataTypesError(QueryExecutionErrors.scala:1326)
  at org.apache.spark.sql.types.StructType$.$anonfun$merge$3(StructType.scala:610)
  at scala.Option.map(Option.scala:230)
  at org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:602)
  at org.apache.spark.sql.types.StructType$.$anonfun$merge$2$adapted(StructType.scala:599)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:599)
  at org.apache.spark.sql.types.StructType$.mergeInternal(StructType.scala:647)
  at org.apache.spark.sql.types.StructType$.merge(StructType.scala:593)
  at org.apache.spark.sql.types.StructType.merge(StructType.scala:498)
  at org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:102)
  ... 67 more
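
The two-level shape of this trace (an outer CANNOT_MERGE_SCHEMAS error that reports both schemas, caused by an inner CANNOT_MERGE_INCOMPATIBLE_DATA_TYPE error) can be illustrated with a small standalone sketch. This is a toy model in plain Scala, not Spark's implementation: `ToyError`, `mergeFields`, `mergeSchemas`, and the `Seq[(String, String)]` schema representation are hypothetical names introduced only for illustration, and the inner error's parameter names are assumed.

```scala
import scala.util.Try

object SchemaMergeSketch {
  // Toy schema: ordered (field name, SQL type name) pairs. NOT Spark's StructType.
  type Schema = Seq[(String, String)]

  // Hypothetical stand-in for an exception that carries an error class and parameters.
  final class ToyError(
      val errorClass: String,
      val messageParameters: Map[String, String],
      message: String,
      cause: Throwable = null)
    extends RuntimeException("[" + errorClass + "] " + message, cause)

  private def quote(s: String): String = "\"" + s + "\""

  // Render a schema in the quoted single-line form seen in the message above.
  def toSql(schema: Schema): String =
    quote(schema.map { case (name, tpe) => name + ": " + tpe }.mkString("STRUCT<", ", ", ">"))

  // Inner step: fields with the same name must have the same type in this toy model.
  def mergeFields(left: Schema, right: Schema): Schema = {
    val leftTypes = left.toMap
    right.foreach { case (name, rightType) =>
      leftTypes.get(name).filter(_ != rightType).foreach { leftType =>
        throw new ToyError(
          "CANNOT_MERGE_INCOMPATIBLE_DATA_TYPE",
          Map("left" -> quote(leftType), "right" -> quote(rightType)), // assumed names
          "Failed to merge incompatible data types " +
            quote(leftType) + " and " + quote(rightType) + ".")
      }
    }
    // Keep the left fields and append fields that only exist on the right.
    left ++ right.filterNot { case (name, _) => leftTypes.contains(name) }
  }

  // Outer step: wrap the inner failure, report both schemas, and keep the cause.
  def mergeSchemas(initial: Schema, other: Schema): Schema =
    try mergeFields(initial, other)
    catch {
      case cause: ToyError =>
        throw new ToyError(
          "CANNOT_MERGE_SCHEMAS",
          Map("left" -> toSql(initial), "right" -> toSql(other)),
          "Failed merging schemas:\nInitial schema:\n" + toSql(initial) +
            "\nSchema that cannot be merged with the initial schema:\n" + toSql(other) + ".",
          cause)
    }

  def main(args: Array[String]): Unit = {
    val e = Try(mergeSchemas(Seq("id" -> "BIGINT"), Seq("id" -> "INT"))).failed.get
    println(e.getMessage)
    println("Caused by: " + e.getCause.getMessage)
  }
}
```

Keeping the inner exception as the cause preserves the field-level detail (which two types clashed), while the outer message gives the full context of both schemas.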

Why are the changes needed?

We should assign proper names to the _LEGACY_ERROR_TEMP_* error classes.

Does this PR introduce any user-facing change?

Yes, users will see an improved error message that shows both schemas.

How was this patch tested?

Changed an existing test case to verify the new error class with the checkError utility.

Koray Beyaz added 3 commits April 16, 2023 00:47

kori73 commented Apr 17, 2023

Contributor Author

@itholic @MaxGekk

messageParameters = Map(
"schema" -> schema.treeString),
errorClass = "CANNOT_MERGE_SCHEMAS",
messageParameters = Map("left" -> leftSchema.treeString, "right" -> rightSchema.treeString),

Member


Could you wrap the schemas with toSQLType() instead of passing rightSchema.treeString, please? By the way, since the error can occur in PySpark, SQL, and R, we show schemas in a common form, as SQL types.

Contributor Author


Thanks for the explanation. I have wrapped the schemas with toSQLType.
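
To illustrate the difference the reviewer is pointing at, the sketch below renders the same schema in a multi-line tree form with engine-specific type names and in a single-line SQL type form. It is a toy model in plain Scala: `ToyType`, `treeForm`, and `toSqlForm` are hypothetical names, the tree layout is only loosely modeled on a tree-string style and omits details such as nullability, and none of this is Spark's actual `treeString` or `toSQLType` code.

```scala
object SchemaRenderSketch {
  // Toy type model. NOT Spark's DataType hierarchy.
  sealed trait ToyType
  case object ToyInt extends ToyType
  case object ToyLong extends ToyType
  final case class ToyStruct(fields: Seq[(String, ToyType)]) extends ToyType

  // SQL-style names, shared across language front ends.
  private def sqlName(t: ToyType): String = t match {
    case ToyInt => "INT"
    case ToyLong => "BIGINT"
    case ToyStruct(fields) =>
      fields.map { case (name, ft) => name + ": " + sqlName(ft) }.mkString("STRUCT<", ", ", ">")
  }

  // Single-line, quoted SQL type form, as in "STRUCT<id: BIGINT>".
  def toSqlForm(t: ToyType): String = "\"" + sqlName(t) + "\""

  // Engine-specific names used by the tree rendering in this toy model.
  private def internalName(t: ToyType): String = t match {
    case ToyInt => "integer"
    case ToyLong => "long"
    case ToyStruct(_) => "struct"
  }

  private def treeLines(s: ToyStruct, prefix: String): Seq[String] =
    s.fields.flatMap {
      case (name, nested: ToyStruct) =>
        (prefix + "-- " + name + ": struct") +: treeLines(nested, prefix + "    |")
      case (name, t) =>
        Seq(prefix + "-- " + name + ": " + internalName(t))
    }

  // Multi-line tree form, one line per field.
  def treeForm(s: ToyStruct): String = ("root" +: treeLines(s, " |")).mkString("\n")

  def main(args: Array[String]): Unit = {
    val schema = ToyStruct(Seq("id" -> ToyLong))
    println(treeForm(schema))
    println(toSqlForm(schema))
  }
}
```

The single-line form embeds cleanly in an error message and uses type names that read the same way regardless of which language API raised the error.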

Comment on lines +999 to +1000
"left" -> df1.schema.treeString,
"right" -> df2.schema.treeString))

Member


Just embed the SQL types, please.

Contributor Author


Wrapped with toSQLType here as well.


MaxGekk commented Apr 24, 2023

Member

@kori73 Could you update the example (output) according to the recent commit, please.


kori73 commented Apr 24, 2023

Contributor Author

@kori73 Could you update the example (output) according to the recent commit, please.

Updated the example according to the recent commit.


MaxGekk commented Apr 24, 2023

Member

+1, LGTM. Merging to master.
Thank you, @kori73.

@MaxGekk MaxGekk closed this in 69946bb Apr 24, 2023

MaxGekk commented Apr 24, 2023

Member

@kori73 Congratulations on your first contribution to Apache Spark!
