native_datafusion (Spark 3.x): shim's ParquetSchemaConvert translation produces an extra SparkException cause-chain layer #4354

@andygrove

Description

On Spark 3.x, Comet's native-error → JVM-exception shim
(spark/src/main/spark-3.{4,5}/org/apache/spark/sql/comet/shims/ShimSparkErrorConverter.scala)
translates a native ParquetSchemaConvert error into a SparkException whose
cause is SchemaColumnConvertNotSupportedException:

val cause = new SchemaColumnConvertNotSupportedException(column, physicalType, logicalType)
QueryExecutionErrors.unsupportedSchemaColumnConvertError(filePath, column, logicalType,
  physicalType, cause)
// returns: new SparkException(errorClass = "_LEGACY_ERROR_TEMP_2063", ..., cause = e)

Spark 3.x's executor / task error handling then re-wraps this SparkException
once more on the way back to the driver, producing a two-level chain:

SparkException (driver-side wrapping)
  cause -> SparkException (shim-generated, errorClass "_LEGACY_ERROR_TEMP_2063")
    cause -> SchemaColumnConvertNotSupportedException

Spark's own vectorized reader produces a one-level chain because
ParquetVectorUpdaterFactory.getUpdater throws
SchemaColumnConvertNotSupportedException directly; the file-scan code catches
it once and wraps it in a SparkException. Spark 4.0+ also produces a one-level
chain for Comet because the 4.x shim's parquetColumnDataTypeMismatchError path
appears not to be re-wrapped by the executor.

Why it matters

Spark's own test "SPARK-34212 Parquet should read decimals correctly" (and
similar tests) asserts on the cause directly:

val e = intercept[SparkException] { readParquet(schema, path).collect() }.getCause
assert(e.isInstanceOf[SchemaColumnConvertNotSupportedException])

On Comet 3.x, e.getCause is the inner SparkException, not the
SchemaColumnConvertNotSupportedException, so the assertion fails. Tests that
walk the cause chain (e.g. our regression test in ParquetReadSuite) pass.
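The difference between the two assertion styles can be reproduced with plain exceptions. A minimal, self-contained sketch (the class names below are stand-ins for Spark's real types, defined locally for illustration only):

```scala
// Stand-ins for Spark's exception types (hypothetical, for illustration only).
class SparkException(msg: String, cause: Throwable) extends Exception(msg, cause)
class SchemaColumnConvertNotSupportedException(msg: String) extends RuntimeException(msg)

object CauseChainDemo {
  def main(args: Array[String]): Unit = {
    val leaf = new SchemaColumnConvertNotSupportedException("column a: INT32 -> decimal")

    // Spark's own reader path: a single wrapping layer.
    val oneLevel = new SparkException("Task failed", leaf)
    assert(oneLevel.getCause.isInstanceOf[SchemaColumnConvertNotSupportedException])

    // Comet 3.x shim path: a shim-generated SparkException sits inside the
    // driver-side SparkException, so the direct-cause assertion fails.
    val twoLevel = new SparkException("Task failed",
      new SparkException("_LEGACY_ERROR_TEMP_2063", leaf))
    assert(!twoLevel.getCause.isInstanceOf[SchemaColumnConvertNotSupportedException])

    // Walking the chain still finds the leaf, which is why
    // chain-walking regression tests pass on both paths.
    def findCause(t: Throwable): Boolean = t match {
      case null => false
      case _: SchemaColumnConvertNotSupportedException => true
      case other => findCause(other.getCause)
    }
    assert(findCause(oneLevel) && findCause(twoLevel))
    println("ok")
  }
}
```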

Affected tests (currently kept ignored)

  • dev/diffs/3.4.3.diff — SPARK-34212 Parquet should read decimals correctly
    (ParquetQuerySuite).
  • dev/diffs/3.5.8.diff — same.

These would be unignored in 4.0.2.diff / 4.1.1.diff (where the chain is
one-level and the schema-adapter rejection from #4351 is in place).

Suggested fix

Change the 3.x shim to throw SchemaColumnConvertNotSupportedException
directly rather than wrapping it in unsupportedSchemaColumnConvertError's
SparkException. Spark's task error handling will wrap it once on the way
back to the driver, producing the same one-level chain as Spark's own
vectorized reader. The error message format ("Parquet column cannot be
converted in file …") must be preserved, since some Spark SQL tests assert
on it.
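A minimal, self-contained sketch of the proposed behavior, using stand-in classes (hypothetical; the real shim would throw Spark's SchemaColumnConvertNotSupportedException, and the exact message template and shim signature may differ):

```scala
// Stand-ins for Spark types (hypothetical, for illustration only).
class SparkException(msg: String, cause: Throwable) extends Exception(msg, cause)
class SchemaColumnConvertNotSupportedException(msg: String) extends RuntimeException(msg)

object ShimFixSketch {
  // Proposed 3.x shim behavior: throw the conversion exception directly,
  // keeping the "Parquet column cannot be converted in file ..." text that
  // some Spark SQL tests assert on. The executor's task error handling is
  // then the only layer that adds a SparkException wrapper.
  def translateParquetSchemaConvert(
      filePath: String, column: String,
      physicalType: String, logicalType: String): Nothing =
    throw new SchemaColumnConvertNotSupportedException(
      s"Parquet column cannot be converted in file $filePath. " +
        s"Column: [$column], Expected: $logicalType, Found: $physicalType")

  def main(args: Array[String]): Unit = {
    // Simulate the executor adding its single SparkException wrapper.
    val wrapped =
      try translateParquetSchemaConvert("/tmp/f.parquet", "a", "INT32", "decimal(5,2)")
      catch { case t: Throwable => new SparkException("Task failed", t) }
    // Result is the one-level chain the SPARK-34212-style assertion expects.
    assert(wrapped.getCause.isInstanceOf[SchemaColumnConvertNotSupportedException])
    assert(wrapped.getCause.getMessage.startsWith("Parquet column cannot be converted in file"))
    println("ok")
  }
}
```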
