
fix: reject string/binary read as numeric in native_datafusion scan #4091

Open

andygrove wants to merge 1 commit into apache:main from andygrove:fix-issue-4088-string-as-int
Conversation

@andygrove
Member

Which issue does this PR close?

Closes #4088.

Rationale for this change

When the native_datafusion scan reads a Parquet BINARY (UTF8) column under a numeric read schema, the existing schema adapter creates a Spark Cast with is_adapting_schema=true. In that mode Cast delegates to DataFusion's cast, which parses the bytes (returning null on non-numeric strings, or in some paths reinterpreting the raw bytes). Spark's vectorized reader rejects this kind of mismatch with SchemaColumnConvertNotSupportedException on every supported version, and native_iceberg_compat already does the same via TypeUtil.checkParquetType. The native scan should match.
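To make the failure mode concrete: DataFusion's cast in "safe" mode parses the string bytes and silently produces null when parsing fails, rather than raising an error. The following is a minimal pure-Rust analogue of that behavior (it does not use the actual DataFusion cast kernel; `lossy_cast_to_i32` is an illustrative stand-in):

```rust
// Illustrative stand-in for DataFusion's safe cast semantics on a
// Utf8 -> Int32 cast: parse the string, yield None (null) on failure
// instead of raising an error.
fn lossy_cast_to_i32(s: &str) -> Option<i32> {
    s.trim().parse().ok()
}

fn main() {
    // Numeric strings parse as expected.
    assert_eq!(lossy_cast_to_i32("42"), Some(42));
    // Non-numeric strings silently become null -- the mismatch Spark's
    // vectorized reader would instead reject with an exception.
    assert_eq!(lossy_cast_to_i32("hello"), None);
    println!("safe cast: non-numeric input became null, not an error");
}
```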

What changes are included in this PR?

native/core/src/parquet/schema_adapter.rs: in replace_with_spark_cast, add a guard before the existing branches that returns DataFusionError::Plan when the source type is Utf8, LargeUtf8, Binary, or LargeBinary and the target type is any integer (Int8/Int16/Int32/Int64/UInt*) or floating-point type (Float32/Float64). The rule mirrors TypeUtil.checkParquetType's BINARY case (lines 208-221), which only allows reading BINARY as StringType, BinaryType, or a binary-encoded decimal.

The check is intentionally narrow: it fires only for string/binary -> numeric mismatches and leaves every other type path unchanged.
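The shape of the guard can be sketched as follows. This is a hedged illustration, not the actual patch: `DataType` here is a simplified stand-in for arrow's `arrow_schema::DataType`, and `check_cast` returning an error string stands in for the `DataFusionError::Plan` path inside `replace_with_spark_cast`.

```rust
// Simplified stand-in for arrow_schema::DataType (illustrative only).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8, LargeUtf8, Binary, LargeBinary,
    Int8, Int16, Int32, Int64,
    UInt8, UInt16, UInt32, UInt64,
    Float32, Float64,
}

fn is_string_or_binary(dt: &DataType) -> bool {
    matches!(
        dt,
        DataType::Utf8 | DataType::LargeUtf8 | DataType::Binary | DataType::LargeBinary
    )
}

fn is_numeric(dt: &DataType) -> bool {
    matches!(
        dt,
        DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64
            | DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64
            | DataType::Float32 | DataType::Float64
    )
}

/// Guard sketch: reject reading a string/binary Parquet column under a
/// numeric read schema; every other (from, to) pair passes through.
fn check_cast(from: &DataType, to: &DataType) -> Result<(), String> {
    if is_string_or_binary(from) && is_numeric(to) {
        return Err(format!(
            "cannot read Parquet {from:?} column as {to:?}"
        ));
    }
    Ok(())
}

fn main() {
    // string -> numeric is rejected...
    assert!(check_cast(&DataType::Utf8, &DataType::Int32).is_err());
    assert!(check_cast(&DataType::Binary, &DataType::Float64).is_err());
    // ...while other paths are untouched.
    assert!(check_cast(&DataType::Utf8, &DataType::LargeUtf8).is_ok());
    assert!(check_cast(&DataType::Int32, &DataType::Int64).is_ok());
    println!("guard rejects only string/binary -> numeric");
}
```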

How are these changes tested?

Added a focused test to ParquetReadSuite: native_datafusion rejects string read as numeric. It writes string data, reads it back under the schema c int, forces spark.comet.scan.impl=native_datafusion and spark.sql.sources.useV1SourceList=parquet, and asserts that collect() raises SparkException. Verified against ParquetReadV1Suite (44 succeeded, no regressions; 1 pre-existing test ignored).

The behavior is also covered by the per-impl matrix added in #4087 (string read as int: native_datafusion), whose assertion will need flipping from "succeeds with garbage" to "throws" once that PR merges.

The native_datafusion Spark physical expression adapter fell through to
DataFusion's cast for Utf8/Binary -> numeric type changes (because
SparkCastOptions.is_adapting_schema delegates to DataFusion's cast),
which silently parses the bytes (returning nulls or, on some paths,
reinterpreting raw bytes) where Spark's vectorized reader and the
native_iceberg_compat scan throw SchemaColumnConvertNotSupportedException.

Add a guard in replace_with_spark_cast that rejects when the source
type is Utf8/LargeUtf8/Binary/LargeBinary and the target type is any
integer or floating-point type, mirroring TypeUtil.checkParquetType
on the JVM side.

Closes apache#4088.
@andygrove added the correctness and bug labels Apr 25, 2026


Successfully merging this pull request may close these issues.

native_datafusion: STRING column read as INT silently returns garbage values

1 participant