fix: native_datafusion: case-insensitive mode doesn't detect duplicate/ambiguous Parquet fields #3808
Open
vaibhawvipul wants to merge 7 commits into apache:main
…usion scan

Instead of falling back to Spark when duplicate field names are found in case-insensitive mode, the native DataFusion reader now detects ambiguous columns per-expression and raises SparkRuntimeException with error class _LEGACY_ERROR_TEMP_2093, matching Spark's behavior.

This enables the previously ignored Spark SQL tests:
- FileBasedDataSourceSuite: caseSensitive test
- ParquetFilterSuite V1/V2: SPARK-25207 duplicate fields test

Closes apache#3760
…sensitive mode Remove IgnoreCometNativeDataFusion annotations for issue apache#3760 from FileBasedDataSourceSuite and ParquetFilterSuite. Adapt tests to handle both Spark's SparkException wrapper and Comet's direct SparkRuntimeException.
… in case-insensitive mode Remove IgnoreCometNativeDataFusion annotations for issue apache#3760 and adapt tests to handle both Spark's SparkException wrapper and Comet's direct RuntimeException/SparkRuntimeException.
A lot of the CI failures are the following - any ideas how to fix them?
Only a clippy issue was left in CI; fixed.
Which issue does this PR close?
Closes #3760.
Rationale for this change
When running Spark SQL tests with native_datafusion scan, tests expecting errors for duplicate/ambiguous fields in case-insensitive mode fail because DataFusion's Parquet reader doesn't enforce Spark's case-sensitivity validation. Instead of detecting duplicates and raising the proper Spark error, the native reader silently returns wrong results or falls back to Spark.
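For illustration only, the kind of check described above could look roughly like the following. This is a minimal standalone sketch, not the code in this PR: `resolve_field` and `ResolveError` are hypothetical names, the function works on plain string slices rather than Arrow/Parquet schema types, and the error text is only modeled on the message Spark reports for `_LEGACY_ERROR_TEMP_2093`.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; the real change raises a
/// SparkRuntimeException with error class _LEGACY_ERROR_TEMP_2093.
#[derive(Debug, PartialEq)]
enum ResolveError {
    Ambiguous {
        requested: String,
        matched: Vec<String>,
    },
}

impl fmt::Display for ResolveError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Wording is an approximation of Spark's message, not a guaranteed match.
            ResolveError::Ambiguous { requested, matched } => write!(
                f,
                "Found duplicate field(s) \"{}\": [{}] in case-insensitive mode",
                requested,
                matched.join(", ")
            ),
        }
    }
}

/// Resolve `requested` against the field names found in a file.
/// Returns the index of the matching field, `None` if nothing matches,
/// or an error if several fields match in case-insensitive mode.
fn resolve_field(
    requested: &str,
    file_fields: &[&str],
    case_sensitive: bool,
) -> Result<Option<usize>, ResolveError> {
    if case_sensitive {
        // Exact match only; fields differing by case are distinct columns.
        return Ok(file_fields.iter().position(|f| *f == requested));
    }

    let wanted = requested.to_lowercase();
    // Collect every field whose lowercased name equals the requested name.
    let matches: Vec<usize> = file_fields
        .iter()
        .enumerate()
        .filter(|(_, f)| f.to_lowercase() == wanted)
        .map(|(i, _)| i)
        .collect();

    match matches.as_slice() {
        [] => Ok(None),
        [idx] => Ok(Some(*idx)),
        // More than one candidate: refuse to pick one silently.
        _ => Err(ResolveError::Ambiguous {
            requested: requested.to_string(),
            matched: matches.iter().map(|&i| file_fields[i].to_string()).collect(),
        }),
    }
}
```

With file fields `a`, `B`, `b`, a case-insensitive lookup of `b` returns the `Ambiguous` error instead of silently choosing one column, while the same lookup in case-sensitive mode resolves to the exact match.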
What changes are included in this PR?
Native duplicate field detection (Rust):
Removed plan-time fallback (Scala):
Spark SQL test diffs (3.4.3, 3.5.8, 4.0.1):
How are these changes tested?