
fix: native_datafusion: case-insensitive mode doesn't detect duplicate/ambiguous Parquet fields #3808

Open
vaibhawvipul wants to merge 7 commits into apache:main from vaibhawvipul:issue-3760

Conversation

@vaibhawvipul
Contributor

Which issue does this PR close?

Closes #3760.

Rationale for this change

When running Spark SQL tests with native_datafusion scan, tests expecting errors for duplicate/ambiguous fields in case-insensitive mode fail because DataFusion's Parquet reader doesn't enforce Spark's case-sensitivity validation. Instead of detecting duplicates and raising the proper Spark error, the native reader silently returns wrong results or falls back to Spark.

What changes are included in this PR?

Native duplicate field detection (Rust):

  • Added per-column duplicate detection in schema_adapter.rs via check_column_duplicate() - checks each Column expression in the physical plan for ambiguous case-insensitive matches against the original physical schema
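The per-column check described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in schema_adapter.rs: only the function name check_column_duplicate comes from this PR, while its signature, the AmbiguousFieldError type, and the error wording are assumptions made for the example. The real reader works on DataFusion Column expressions and surfaces a Spark-side exception rather than a plain Rust error.

```rust
use std::fmt;

/// Hypothetical error type for this sketch. The actual reader reports the
/// ambiguity to Spark as an exception; this struct only stands in for that.
#[derive(Debug, PartialEq)]
pub struct AmbiguousFieldError {
    pub requested: String,
    pub matches: Vec<String>,
}

impl fmt::Display for AmbiguousFieldError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Illustrative wording only, not the exact Spark message.
        write!(
            f,
            "duplicate field(s) for \"{}\": [{}] in case-insensitive mode",
            self.requested,
            self.matches.join(", ")
        )
    }
}

impl std::error::Error for AmbiguousFieldError {}

/// Resolve one requested column name against the physical (file) field names.
///
/// Returns `Ok(Some(index))` for a unique match, `Ok(None)` when nothing
/// matches, and `Err(..)` when case-insensitive matching finds more than one
/// physical field. The signature is an assumption made for this sketch.
pub fn check_column_duplicate(
    requested: &str,
    physical_fields: &[&str],
    case_sensitive: bool,
) -> Result<Option<usize>, AmbiguousFieldError> {
    if case_sensitive {
        // Exact matching cannot be ambiguous in the way this PR targets.
        return Ok(physical_fields.iter().position(|f| *f == requested));
    }

    // Collect every physical field that matches when case is ignored.
    let wanted = requested.to_lowercase();
    let hits: Vec<usize> = physical_fields
        .iter()
        .enumerate()
        .filter(|(_, f)| f.to_lowercase() == wanted)
        .map(|(i, _)| i)
        .collect();

    match hits.as_slice() {
        [] => Ok(None),
        [only] => Ok(Some(*only)),
        _ => Err(AmbiguousFieldError {
            requested: requested.to_string(),
            matches: hits
                .iter()
                .map(|&i| physical_fields[i].to_string())
                .collect(),
        }),
    }
}
```

In this sketch, a file whose fields are A, a, and b fails only when the query references a (or A) in case-insensitive mode. A query that touches only b still resolves cleanly, because the check runs per referenced column and not over the whole file schema.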

Removed plan-time fallback (Scala):

  • Removed the fallback block in CometScanRule.scala that detected duplicate field names at plan time and fell back to Spark - duplicates are now detected at read time in the native reader

Spark SQL test diffs (3.4.3, 3.5.8, 4.0.1):

  • Removed IgnoreCometNativeDataFusion annotations for issue-3760 from FileBasedDataSourceSuite and ParquetFilterSuite
  • Adapted error interception in tests to handle both Spark's SparkException(FAILED_READ_FILE) wrapper and Comet's direct SparkRuntimeException

How are these changes tested?

  • Rust and Scala tests
  • Spark SQL tests verified:
    • Spark native readers should respect spark.sql.caseSensitive - parquet
    • SPARK-25207: exception when duplicate fields in case-insensitive mode

…usion scan

Instead of falling back to Spark when duplicate field names are found in
case-insensitive mode, the native DataFusion reader now detects ambiguous
columns per-expression and raises SparkRuntimeException with error class
_LEGACY_ERROR_TEMP_2093, matching Spark's behavior.

This enables the previously ignored Spark SQL tests:
- FileBasedDataSourceSuite: caseSensitive test
- ParquetFilterSuite V1/V2: SPARK-25207 duplicate fields test

Closes apache#3760
…sensitive mode

Remove IgnoreCometNativeDataFusion annotations for issue apache#3760 from
FileBasedDataSourceSuite and ParquetFilterSuite. Adapt tests to handle
both Spark's SparkException wrapper and Comet's direct SparkRuntimeException.
… in case-insensitive mode

Remove IgnoreCometNativeDataFusion annotations for issue apache#3760 and adapt
tests to handle both Spark's SparkException wrapper and Comet's direct
RuntimeException/SparkRuntimeException.
@vaibhawvipul
Contributor Author

Many of the CI failures look like this:

/usr/bin/docker pull amd64/rust
  Using default tag: latest
  Error response from daemon: Head "https://registry-1.docker.io/v2/amd64/rust/manifests/latest": toomanyrequests: too many failed login attempts for username or IP address
  Warning: Docker pull failed with exit code 1, back off 1.615 seconds before retry.
  /usr/bin/docker pull amd64/rust
  Using default tag: latest
  Error response from daemon: Head "https://registry-1.docker.io/v2/amd64/rust/manifests/latest": toomanyrequests: too many failed login attempts for username or IP address
  Warning: Docker pull failed with exit code 1, back off 8.993 seconds before retry.
  /usr/bin/docker pull amd64/rust
  Using default tag: latest
  Error response from daemon: Head "https://registry-1.docker.io/v2/amd64/rust/manifests/latest": toomanyrequests: too many failed login attempts for username or IP address
  Error: Docker pull failed with exit code 1

Any ideas on how to fix them?

@vaibhawvipul
Contributor Author

The only remaining CI issue was a clippy warning, which is now fixed.
