native_datafusion: silent wrong-answer paths for integer-to-decimal Parquet conversions Spark rejects #4344

@andygrove

Description
native_datafusion silently accepts integer-to-decimal Parquet reads where the requested decimal type cannot represent the integer values in the file. Spark's vectorized reader rejects these conversions with SchemaColumnConvertNotSupportedException (per ParquetVectorUpdaterFactory.getUpdater) because reading e.g. an INT64 column into a DECIMAL(p,s) whose precision is below the integer's required precision is unsafe. native_datafusion instead returns wrong (truncated/overflowed) values.

This is the integer-to-decimal counterpart to #4297 (primitive-to-primitive numeric/date conversions) and #4343 (decimal-to-decimal narrowing).

Affected tests (Spark 4.1.1, dev/diffs/4.1.1.diff)

Currently tagged IgnoreCometNativeDataFusion pointing at the umbrella #3720:

  • ParquetTypeWideningSuite: "unsupported parquet conversion $fromType -> $toType"
    (the second occurrence in the suite, the integer→decimal block at line ~264). Iterates over pairs such as:
    • ByteType -> DECIMAL(1, 0)
    • ShortType -> DECIMAL(ByteDecimal.precision, 0) / DECIMAL(ByteDecimal.precision + 1, 1) etc.
    • IntegerType -> ShortDecimal / DECIMAL(IntDecimal.precision - 1, 0) etc.
    • LongType -> IntDecimal / DECIMAL(LongDecimal.precision - 1, 0) etc.
      Expects SchemaColumnConvertNotSupportedException when the vectorized reader is enabled and the target decimal precision is too small to hold the integer.

The same tests exist in the 3.4 / 3.5 / 4.0 diffs and are ignored under #3720 there as well.

Reproduction

// Run inside a Spark/Comet test suite: withSQLConf, withTempPath, and the
// toDF implicits come from the surrounding test-utility mixins.
import org.apache.comet.CometConf
import org.apache.spark.sql.internal.SQLConf

withSQLConf(
  CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION,
  SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    Seq(123456L).toDF("c")
      .selectExpr("cast(c as bigint) as c")
      .write.parquet(path)
    // LongType is INT64 in Parquet; a target DECIMAL(p, 0) with p < 19 cannot
    // represent every Long, so Spark rejects it. native_datafusion accepts it.
    spark.read.schema("c decimal(5, 0)").parquet(path).show()
  }
}

Suggested approach

Same direction as #4297 / #4343: extend the integer→decimal branch of the schema adapter / replace_with_spark_cast to mirror Spark's allowlist — only accept conversions where the target decimal precision is large enough to hold the integer's range (and scale is 0, or handled per Spark's rules). Reject everything else with SparkError::ParquetSchemaConvert.

Parent issue

Split from umbrella #3720.
