fix: native_datafusion: case-insensitive mode doesn't detect duplicate/ambiguous Parquet fields by vaibhawvipul · Pull Request #3808 · apache/datafusion-comet

vaibhawvipul · 2026-03-27T14:06:47Z

Which issue does this PR close?

Closes #3760.

Rationale for this change

When running Spark SQL tests with native_datafusion scan, tests expecting errors for duplicate/ambiguous fields in case-insensitive mode fail because DataFusion's Parquet reader doesn't enforce Spark's case-sensitivity validation. Instead of detecting duplicates and raising the proper Spark error, the native reader silently returns wrong results or falls back to Spark.

What changes are included in this PR?

Native duplicate field detection (Rust):

Added per-column duplicate detection in schema_adapter.rs via check_column_duplicate(), which checks each Column expression in the physical plan for ambiguous case-insensitive matches against the original physical schema.
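To illustrate the idea, here is a minimal sketch of a per-column ambiguity check. It is not the code from schema_adapter.rs: the function name `check_column_duplicate_sketch`, the `DuplicateFieldError` type, the use of plain string slices in place of Arrow schema fields, and the error wording are all assumptions made for the example.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; the real PR raises a
/// SparkRuntimeException with error class _LEGACY_ERROR_TEMP_2093.
#[derive(Debug, PartialEq)]
struct DuplicateFieldError {
    /// The column name requested by the query.
    required: String,
    /// Every physical field that matches it case-insensitively.
    matched: Vec<String>,
}

impl fmt::Display for DuplicateFieldError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Illustrative wording only, not copied from Comet or Spark.
        write!(
            f,
            "Found duplicate field(s) \"{}\": [{}] in case-insensitive mode",
            self.required,
            self.matched.join(", ")
        )
    }
}

/// Resolves `column_name` against the physical (file) field names.
/// Returns the index of the single matching field, `None` if nothing
/// matches, or an error if more than one field matches when
/// `case_sensitive` is false.
fn check_column_duplicate_sketch(
    column_name: &str,
    physical_fields: &[&str],
    case_sensitive: bool,
) -> Result<Option<usize>, DuplicateFieldError> {
    if case_sensitive {
        // Exact match only; differently-cased fields are distinct columns.
        return Ok(physical_fields.iter().position(|f| *f == column_name));
    }

    let wanted = column_name.to_lowercase();
    // Collect the index of every field whose lowercased name matches.
    let matches: Vec<usize> = physical_fields
        .iter()
        .enumerate()
        .filter(|(_, f)| f.to_lowercase() == wanted)
        .map(|(i, _)| i)
        .collect();

    match matches.as_slice() {
        [] => Ok(None),
        [only] => Ok(Some(*only)),
        // Two or more candidates: the reference is ambiguous.
        _ => Err(DuplicateFieldError {
            required: column_name.to_string(),
            matched: matches
                .iter()
                .map(|&i| physical_fields[i].to_string())
                .collect(),
        }),
    }
}
```

With physical fields `["A", "b", "B"]`, looking up `b` in case-insensitive mode returns an error listing `b` and `B`, while looking up `A` succeeds. Checking per referenced column rather than across the whole schema means a file with duplicate names only fails when a query actually touches one of the ambiguous columns.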

Removed plan-time fallback (Scala):

Removed the fallback block in CometScanRule.scala that detected duplicate field names at plan time and fell back to Spark. Duplicates are now detected at read time in the native reader.

Spark SQL test diffs (3.4.3, 3.5.8, 4.0.1):

- Removed IgnoreCometNativeDataFusion annotations for issue-3760 from FileBasedDataSourceSuite and ParquetFilterSuite
- Adapted error interception in tests to handle both Spark's SparkException(FAILED_READ_FILE) wrapper and Comet's direct SparkRuntimeException

How are these changes tested?

Rust and Scala tests
Spark SQL tests verified:
- Spark native readers should respect spark.sql.caseSensitive - parquet
- SPARK-25207: exception when duplicate fields in case-insensitive mode

…usion scan

Instead of falling back to Spark when duplicate field names are found in case-insensitive mode, the native DataFusion reader now detects ambiguous columns per-expression and raises SparkRuntimeException with error class _LEGACY_ERROR_TEMP_2093, matching Spark's behavior.

This enables the previously ignored Spark SQL tests:
- FileBasedDataSourceSuite: caseSensitive test
- ParquetFilterSuite V1/V2: SPARK-25207 duplicate fields test

Closes apache#3760

…sensitive mode

Remove IgnoreCometNativeDataFusion annotations for issue apache#3760 from FileBasedDataSourceSuite and ParquetFilterSuite. Adapt tests to handle both Spark's SparkException wrapper and Comet's direct SparkRuntimeException.

… in case-insensitive mode

Remove IgnoreCometNativeDataFusion annotations for issue apache#3760 and adapt tests to handle both Spark's SparkException wrapper and Comet's direct RuntimeException/SparkRuntimeException.

vaibhawvipul · 2026-03-27T16:40:26Z

Many of the CI failures look like this:

/usr/bin/docker pull amd64/rust
  Using default tag: latest
  Error response from daemon: Head "https://registry-1.docker.io/v2/amd64/rust/manifests/latest": toomanyrequests: too many failed login attempts for username or IP address
  Warning: Docker pull failed with exit code 1, back off 1.615 seconds before retry.
  /usr/bin/docker pull amd64/rust
  Using default tag: latest
  Error response from daemon: Head "https://registry-1.docker.io/v2/amd64/rust/manifests/latest": toomanyrequests: too many failed login attempts for username or IP address
  Warning: Docker pull failed with exit code 1, back off 8.993 seconds before retry.
  /usr/bin/docker pull amd64/rust
  Using default tag: latest
  Error response from daemon: Head "https://registry-1.docker.io/v2/amd64/rust/manifests/latest": toomanyrequests: too many failed login attempts for username or IP address
  Error: Docker pull failed with exit code 1

Any ideas on how to fix them?

… into issue-3760

vaibhawvipul · 2026-03-28T01:41:58Z

Only a clippy issue remained in CI; fixed.

vaibhawvipul added 4 commits March 27, 2026 16:57

fix lint errors

b2546e5

vaibhawvipul added 3 commits March 28, 2026 07:09

fix: use explicit Arc clone to satisfy clippy::clone_on_ref_ptr lint

0f3190d

Merge branch 'main' into issue-3760

446069e

Merge branch 'issue-3760' of github.com:vaibhawvipul/datafusion-comet…

fa73d04

… into issue-3760
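For context on the commit "fix: use explicit Arc clone to satisfy clippy::clone_on_ref_ptr lint": that lint flags `.clone()` called on a reference-counted pointer and asks for the explicit `Arc::clone(&x)` form, which makes it obvious that only the pointer is copied. A minimal sketch follows; the function name `share_schema` and the `Vec<String>` payload are placeholders for the example, not the code touched by the commit.

```rust
use std::sync::Arc;

/// Hypothetical helper: hands out another handle to a shared value.
fn share_schema(schema: &Arc<Vec<String>>) -> Arc<Vec<String>> {
    // `schema.clone()` would compile and behave identically, but
    // clippy::clone_on_ref_ptr rejects it when the lint is enabled.
    // `Arc::clone` makes clear that only the reference count is bumped
    // and the underlying Vec is not copied.
    Arc::clone(schema)
}
```

Both spellings produce a second handle to the same allocation, so the change is purely stylistic.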

vaibhawvipul wants to merge 7 commits into apache:main from vaibhawvipul:issue-3760