test: add ParquetSchemaMismatchSuite documenting Comet vs Spark schema-mismatch behavior [WIP]#4087
Draft
andygrove wants to merge 12 commits into apache:main from
Conversation
Both native_datafusion and native_iceberg_compat throw SparkException (matching Spark's reference behavior). The withMismatchedSchema helper was redesigned to accept a separate check lambda so collect() executes while the temp directory is still present.
On Spark 4.0, COMET_SCHEMA_EVOLUTION_ENABLED defaults to true and TypeUtil.checkParquetType has an isSpark40Plus guard, so four native_iceberg_compat tests that previously expected SparkException now succeed with widened values. Make each assertion version-conditional using CometSparkSessionExtensions.isSpark40Plus and update the behavior matrix accordingly.
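A minimal sketch of the two changes described above (the names, signatures, and data here are illustrative assumptions, not code copied from the suite):

```scala
// Hypothetical sketch: the helper takes a separate `check` lambda so that
// collect() executes inside withTempPath, while the temp dir still exists.
def withMismatchedSchema(
    writeDf: DataFrame,
    readSchema: StructType)(check: Try[Array[Row]] => Unit): Unit = {
  withTempPath { dir =>
    writeDf.write.parquet(dir.toString)
    val result = Try(spark.read.schema(readSchema).parquet(dir.toString).collect())
    check(result) // runs before the temp directory is deleted
  }
}

// Version-conditional assertion: on Spark 4.0 the read succeeds with
// widened values; on 3.x it is expected to throw.
withMismatchedSchema(intDf, longReadSchema) { result =>
  if (CometSparkSessionExtensions.isSpark40Plus) {
    assert(result.isSuccess)
  } else {
    intercept[SparkException](result.get) // result.get rethrows the failure
  }
}
```

Passing the check as a lambda (rather than returning the rows) keeps the assertion inside the `withTempPath` scope, which is what allows the eager `collect()` to see the files.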
This was referenced Apr 25, 2026
Which issue does this PR close?
Part of #3720 (does not close it; this PR encodes current behavior rather than fixing divergences).
Rationale for this change
Issue #3720 lists several Parquet schema-mismatch cases where Comet diverges from Spark, but no permanent test captures the per-case, per-scan-impl, per-Spark-version behavior. The divergences are tracked only in the issue text and a couple of ad-hoc tests in ParquetReadSuite. Adding a focused suite gives us a single place showing how native_datafusion and native_iceberg_compat differ from Spark on each Spark version.
What changes are included in this PR?
A new test suite at spark/src/test/scala/org/apache/comet/parquet/ParquetSchemaMismatchSuite.scala with 18 tests covering 9 cases (7 from #3720 plus 2 harmless control widenings) under the two Comet scan implementations (native_datafusion, native_iceberg_compat). Each test asserts Comet's actual current behavior. Spark's reference behavior is documented in per-case comments and in a behavior matrix at the top of the file. Where Comet's behavior depends on the Spark version (Spark 4.0 enables COMET_SCHEMA_EVOLUTION_ENABLED by default, and TypeUtil.checkParquetType has an isSpark40Plus guard for INT96), the relevant tests use if (CometSparkSessionExtensions.isSpark40Plus) ... to keep assertions accurate.
Behavior matrix (excerpt):
Notable findings the suite captures:

- native_iceberg_compat is consistently strict on Spark 3.x (it uses Comet's TypeUtil.checkParquetType); on Spark 4.0 it relaxes for value-preserving widenings because COMET_SCHEMA_EVOLUTION_ENABLED defaults to true.
- native_datafusion rejects some structural mismatches (binary as timestamp, timestamp_ntz as array) but silently accepts others (decimal precision narrowing, string read as int producing garbage values). The string-as-int case is a real correctness gap worth a separate follow-up.
- native_datafusion: filters on widened columns do not overflow.

How are these changes tested?
The new suite is the test. Verified locally under Spark 3.4, 3.5, and 4.0 profiles (all 18 tests pass under each).
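The string-read-as-int divergence called out above can be reproduced in a few lines. This is a sketch rather than code from the suite, and it assumes the scan implementation is selected via the spark.comet.scan.impl config:

```scala
// Sketch (not suite code): write strings, then read the same files back
// with an int read schema. Vanilla Spark fails the read; with
// spark.comet.scan.impl=native_datafusion the collect() succeeds and
// returns meaningless int values instead of throwing.
withTempPath { dir =>
  Seq("1", "2", "3").toDF("c").write.parquet(dir.toString)
  val readSchema = StructType(Seq(StructField("c", IntegerType)))
  val rows = spark.read.schema(readSchema).parquet(dir.toString).collect()
  // Under native_datafusion: no exception, but `rows` holds garbage ints.
}
```

This is the kind of per-case behavior the suite pins down so that a future fix to the divergence shows up as a deliberate test change rather than a silent behavior shift.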