Description
On Spark 3.x, Comet's native-error → JVM-exception shim
(spark/src/main/spark-3.{4,5}/org/apache/spark/sql/comet/shims/ShimSparkErrorConverter.scala)
translates a native ParquetSchemaConvert error into a SparkException whose
cause is SchemaColumnConvertNotSupportedException:
val cause = new SchemaColumnConvertNotSupportedException(column, physicalType, logicalType)
QueryExecutionErrors.unsupportedSchemaColumnConvertError(filePath, column, logicalType,
  physicalType, cause)
// returns: new SparkException(errorClass = "_LEGACY_ERROR_TEMP_2063", ..., cause = e)
Spark 3.x's executor / task error handling then re-wraps this SparkException
once more on the way back to the driver, producing a two-level chain:
SparkException (driver-side wrapping)
  cause -> SparkException (shim-generated, errorClass "_LEGACY_ERROR_TEMP_2063")
    cause -> SchemaColumnConvertNotSupportedException
Spark's own vectorized reader produces a one-level chain because
ParquetVectorUpdaterFactory.getUpdater throws
SchemaColumnConvertNotSupportedException directly; the file-scan code catches
it once and wraps it in a SparkException. Spark 4.0+ also produces a one-level
chain for Comet because the 4.x shim's parquetColumnDataTypeMismatchError path
appears not to be re-wrapped by the executor.
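For comparison, the one-level chain in both of those cases is:
SparkException
  cause -> SchemaColumnConvertNotSupportedException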
Why it matters
Spark's own test SPARK-34212 Parquet should read decimals correctly (and similar
tests) asserts the cause directly:
val e = intercept[SparkException] { readParquet(schema, path).collect() }.getCause
assert(e.isInstanceOf[SchemaColumnConvertNotSupportedException])
On Comet 3.x, e (the intercepted SparkException's cause) is the inner
SparkException, not the SchemaColumnConvertNotSupportedException, so the
assertion fails. Tests that walk the full cause chain instead (e.g. our
regression test in ParquetReadSuite) pass.
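For reference, a minimal sketch of a cause-chain walk of the kind that regression
test relies on (the findCause helper below is illustrative, not the actual test code):

import org.apache.spark.SparkException
import org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException

// Illustrative helper: follow getCause links until the wanted type is found
// or the chain ends.
def findCause[T <: Throwable](t: Throwable, cls: Class[T]): Option[T] =
  Iterator.iterate(t)(_.getCause)
    .takeWhile(_ != null)
    .collectFirst { case c if cls.isInstance(c) => cls.cast(c) }

val e = intercept[SparkException] { readParquet(schema, path).collect() }
// Passes for both the one-level (Spark) and two-level (Comet 3.x) chains.
assert(findCause(e, classOf[SchemaColumnConvertNotSupportedException]).isDefined)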
Affected tests (currently kept ignored)
dev/diffs/3.4.3.diff — SPARK-34212 Parquet should read decimals correctly
(ParquetQuerySuite).
dev/diffs/3.5.8.diff — same.
These would be unignored in 4.0.2.diff / 4.1.1.diff (where the chain is
one-level and the schema-adapter rejection from #4351 is in place).
Suggested fix
Change the 3.x shim to throw SchemaColumnConvertNotSupportedException
directly rather than wrapping it in unsupportedSchemaColumnConvertError's
SparkException. Spark's task error handling will wrap it once on the way
back to the driver, yielding the same one-level chain as Spark's own vectorized
reader. The error message format (Parquet column cannot be converted in file …)
needs to be preserved, since some Spark SQL tests assert on it.
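A minimal sketch of that change, assuming a shim helper that receives the column
and type names as strings (the method name and signature below are illustrative,
not the actual ShimSparkErrorConverter API):

import org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException

// Sketch only: return the typed exception directly instead of wrapping it in
// unsupportedSchemaColumnConvertError's SparkException; Spark's task error
// handling then wraps it once on the way back to the driver.
def parquetSchemaConvertError(
    column: String,
    physicalType: String,
    logicalType: String): Throwable =
  new SchemaColumnConvertNotSupportedException(column, physicalType, logicalType)

How the Parquet column cannot be converted in file … message text stays visible
to the tests that assert on it is left open in this sketch.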
Related