Description
native_datafusion silently accepts integer-to-decimal Parquet reads where the requested decimal type cannot represent the integer values in the file. Spark's vectorized reader rejects these conversions with SchemaColumnConvertNotSupportedException (per ParquetVectorUpdaterFactory.getUpdater), because reading, for example, an INT64 column into a DECIMAL(p, s) whose precision is smaller than what the integer type requires is unsafe. native_datafusion instead returns wrong (truncated/overflowed) values.
This is the integer-to-decimal counterpart to #4297 (primitive-to-primitive numeric/date conversions) and #4343 (decimal-to-decimal narrowing).
Affected tests (Spark 4.1.1, dev/diffs/4.1.1.diff)
Currently tagged IgnoreCometNativeDataFusion, pointing at the umbrella issue #3720:
ParquetTypeWideningSuite: "unsupported parquet conversion $fromType -> $toType" (the second occurrence of this test name in the suite, i.e. the integer→decimal block around line 264). It iterates over pairs such as:
ByteType -> DECIMAL(1, 0)
ShortType -> DECIMAL(ByteDecimal.precision, 0) / DECIMAL(ByteDecimal.precision + 1, 1) etc.
IntegerType -> ShortDecimal / DECIMAL(IntDecimal.precision - 1, 0) etc.
LongType -> IntDecimal / DECIMAL(LongDecimal.precision - 1, 0) etc.
The test expects SchemaColumnConvertNotSupportedException when the vectorized reader is enabled and the target decimal precision is too small to hold the source integer type.
The same tests exist in the 3.4 / 3.5 / 4.0 diffs and are ignored under #3720 there as well.
Reproduction
import org.apache.comet.CometConf
import org.apache.spark.sql.internal.SQLConf

// Assumes the helpers/implicits of a Spark/Comet test suite
// (withSQLConf, withTempPath, toDF).
withSQLConf(
  CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION,
  SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    Seq(123456L).toDF("c")
      .selectExpr("cast(c as bigint) as c")
      .write.parquet(path)
    // LongType is INT64 in Parquet; a target DECIMAL(p, 0) with p < 19 cannot
    // represent every Long, so Spark rejects it. native_datafusion accepts it.
    spark.read.schema("c decimal(5, 0)").parquet(path).show()
  }
}
Suggested approach
Same direction as #4297 / #4343: extend the integer→decimal branch of the schema adapter / replace_with_spark_cast to mirror Spark's allowlist, accepting only conversions where the target decimal precision is large enough to hold the integer type's range (and the scale is 0, or is otherwise handled per Spark's rules), and rejecting everything else with SparkError::ParquetSchemaConvert.
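A minimal sketch of what such an allowlist check could look like on the native side. This is not Comet's actual API: the function names (min_precision_for_int, int_to_decimal_read_is_supported), their placement in the schema adapter, and the restriction to scale 0 are assumptions for illustration only.

use arrow::datatypes::DataType;

// Minimum decimal precision (in digits) needed to hold every value of the
// given integer type; None for non-integer types.
fn min_precision_for_int(from: &DataType) -> Option<u8> {
    match from {
        DataType::Int8 => Some(3),   // -128..=127
        DataType::Int16 => Some(5),  // -32768..=32767
        DataType::Int32 => Some(10), // -2147483648..=2147483647
        DataType::Int64 => Some(19), // i64::MIN..=i64::MAX has 19 digits
        _ => None,
    }
}

// Accept the read only when the target decimal is wide enough for the source
// integer (scale 0 only here; Spark's full scale rules would need to be
// mirrored in the real fix).
fn int_to_decimal_read_is_supported(from: &DataType, to: &DataType) -> bool {
    match (min_precision_for_int(from), to) {
        (Some(min_p), DataType::Decimal128(p, 0)) => *p >= min_p,
        _ => false,
    }
}

fn main() {
    // Reading an INT64 column as DECIMAL(5, 0) must be rejected (the
    // reproduction above), while DECIMAL(20, 0) is acceptable.
    assert!(!int_to_decimal_read_is_supported(
        &DataType::Int64,
        &DataType::Decimal128(5, 0)
    ));
    assert!(int_to_decimal_read_is_supported(
        &DataType::Int64,
        &DataType::Decimal128(20, 0)
    ));
}

Unsupported combinations would then fall through to the existing error path (SparkError::ParquetSchemaConvert). The exact threshold for Int64 should be taken from Spark's behavior rather than the raw digit count: the suite above treats DECIMAL(LongDecimal.precision - 1, 0) as unsupported, so the table in min_precision_for_int is only a starting point.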
Parent issue
Split from umbrella #3720.