
fix: reject string/binary read as numeric in native_datafusion scan #4091

Open

andygrove wants to merge 1 commit into apache:main from andygrove:fix-issue-4088-string-as-int
Conversation

@andygrove
Member

Which issue does this PR close?

Closes #4088.

Rationale for this change

When the native_datafusion scan reads a Parquet BINARY (UTF8) column under a numeric read schema, the existing schema adapter creates a Spark Cast with is_adapting_schema=true. In that mode Cast delegates to DataFusion's cast, which parses the bytes (returning null on non-numeric strings, or in some paths reinterpreting the raw bytes). Spark's vectorized reader rejects this kind of mismatch with SchemaColumnConvertNotSupportedException on every supported version, and native_iceberg_compat already does the same via TypeUtil.checkParquetType. The native scan should match.
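To make the failure mode concrete: DataFusion's cast in "safe" mode parses the string bytes and silently produces null when parsing fails, rather than raising an error. The following is a minimal pure-Rust analogue of that behavior (it does not use the actual DataFusion cast kernel; `lossy_cast_to_i32` is an illustrative stand-in):

```rust
// Illustrative stand-in for DataFusion's safe cast semantics on a
// Utf8 -> Int32 cast: parse the string, yield None (null) on failure
// instead of raising an error.
fn lossy_cast_to_i32(s: &str) -> Option<i32> {
    s.trim().parse().ok()
}

fn main() {
    // Numeric strings parse as expected.
    assert_eq!(lossy_cast_to_i32("42"), Some(42));
    // Non-numeric strings silently become null -- the mismatch Spark's
    // vectorized reader would instead reject with an exception.
    assert_eq!(lossy_cast_to_i32("hello"), None);
    println!("safe cast: non-numeric input became null, not an error");
}
```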

What changes are included in this PR?

native/core/src/parquet/schema_adapter.rs: in replace_with_spark_cast, add a guard before the existing branches that returns DataFusionError::Plan when the source type is Utf8, LargeUtf8, Binary, or LargeBinary and the target type is any integer (Int8/Int16/Int32/Int64/UInt*) or floating-point type (Float32/Float64). The rule mirrors TypeUtil.checkParquetType's BINARY case (lines 208-221), which only allows reading BINARY as StringType, BinaryType, or a binary-encoded decimal.

The check is intentionally narrow: it fires only for string/binary -> numeric mismatches and leaves every other type path unchanged.
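The shape of the guard can be sketched as follows. This is a hedged illustration, not the actual patch: `DataType` here is a simplified stand-in for arrow's `arrow_schema::DataType`, and `check_cast` returning an error string stands in for the `DataFusionError::Plan` path inside `replace_with_spark_cast`.

```rust
// Simplified stand-in for arrow_schema::DataType (illustrative only).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8, LargeUtf8, Binary, LargeBinary,
    Int8, Int16, Int32, Int64,
    UInt8, UInt16, UInt32, UInt64,
    Float32, Float64,
}

fn is_string_or_binary(dt: &DataType) -> bool {
    matches!(
        dt,
        DataType::Utf8 | DataType::LargeUtf8 | DataType::Binary | DataType::LargeBinary
    )
}

fn is_numeric(dt: &DataType) -> bool {
    matches!(
        dt,
        DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64
            | DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64
            | DataType::Float32 | DataType::Float64
    )
}

/// Guard sketch: reject reading a string/binary Parquet column under a
/// numeric read schema; every other (from, to) pair passes through.
fn check_cast(from: &DataType, to: &DataType) -> Result<(), String> {
    if is_string_or_binary(from) && is_numeric(to) {
        return Err(format!(
            "cannot read Parquet {from:?} column as {to:?}"
        ));
    }
    Ok(())
}

fn main() {
    // string -> numeric is rejected...
    assert!(check_cast(&DataType::Utf8, &DataType::Int32).is_err());
    assert!(check_cast(&DataType::Binary, &DataType::Float64).is_err());
    // ...while other paths are untouched.
    assert!(check_cast(&DataType::Utf8, &DataType::LargeUtf8).is_ok());
    assert!(check_cast(&DataType::Int32, &DataType::Int64).is_ok());
    println!("guard rejects only string/binary -> numeric");
}
```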

How are these changes tested?

Added a focused test to ParquetReadSuite: native_datafusion rejects string read as numeric. It writes string data, reads it back under the schema c int, forces spark.comet.scan.impl=native_datafusion and spark.sql.sources.useV1SourceList=parquet, and asserts that collect() raises SparkException. Verified against ParquetReadV1Suite (44 succeeded, no regressions; 1 pre-existing test ignored).

The behavior is also covered by the per-impl matrix added in #4087 (string read as int: native_datafusion), whose assertion will need flipping from "succeeds with garbage" to "throws" once that PR merges.

The native_datafusion Spark physical expression adapter fell through to
DataFusion's cast for Utf8/Binary -> numeric type changes (because
SparkCastOptions.is_adapting_schema delegates to DataFusion's cast),
which silently parses the bytes (returning nulls or, on some paths,
reinterpreting raw bytes) where Spark's vectorized reader and the
native_iceberg_compat scan throw SchemaColumnConvertNotSupportedException.

Add a guard in replace_with_spark_cast that rejects when the source
type is Utf8/LargeUtf8/Binary/LargeBinary and the target type is any
integer or floating-point type, mirroring TypeUtil.checkParquetType
on the JVM side.

Closes apache#4088.
@andygrove added the correctness and bug labels Apr 25, 2026


Successfully merging this pull request may close these issues.

native_datafusion: STRING column read as INT silently returns garbage values

1 participant