
test: add ParquetSchemaMismatchSuite documenting Comet vs Spark schema-mismatch behavior [WIP]#4087

Draft
andygrove wants to merge 12 commits into apache:main from andygrove:tests/issue-3720-schema-mismatch

Conversation

@andygrove
Member

Which issue does this PR close?

Part of #3720 (does not close it; this PR encodes current behavior rather than fixing divergences).

Rationale for this change

Issue #3720 lists several Parquet schema-mismatch cases where Comet diverges from Spark, but no permanent test captures the per-case, per-scan-impl, per-Spark-version behavior. The divergences are tracked only in the issue text and a couple of ad-hoc tests in ParquetReadSuite.

Adding a focused suite gives us a permanent, per-case record of this behavior across both scan implementations and all supported Spark versions.

What changes are included in this PR?

A new test suite at spark/src/test/scala/org/apache/comet/parquet/ParquetSchemaMismatchSuite.scala with 18 tests covering 9 cases (7 from #3720 plus 2 harmless control widenings) under the two Comet scan implementations (native_datafusion, native_iceberg_compat).

Each test asserts Comet's actual current behavior. Spark's reference behavior is documented in per-case comments and in a behavior matrix at the top of the file. Where Comet's behavior depends on the Spark version (Spark 4.0 enables COMET_SCHEMA_EVOLUTION_ENABLED by default and TypeUtil.checkParquetType has an isSpark40Plus guard for INT96), the relevant tests use if (CometSparkSessionExtensions.isSpark40Plus) ... to keep assertions accurate.
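The version-conditional assertion pattern can be sketched without Spark; `ExpectedOutcome` and `forWidening` below are hypothetical stand-ins, with `isSpark40Plus` playing the role of `CometSparkSessionExtensions.isSpark40Plus`:

```scala
// Hypothetical, Spark-free sketch of the version-gating pattern the suite uses.
object ExpectedOutcome {
  sealed trait Outcome
  case object Throws extends Outcome
  case object Succeeds extends Outcome

  // Value-preserving widenings (e.g. INT32 -> INT64) succeed on Spark 4.0,
  // where COMET_SCHEMA_EVOLUTION_ENABLED defaults to true, but throw on 3.x.
  def forWidening(isSpark40Plus: Boolean): Outcome =
    if (isSpark40Plus) Succeeds else Throws
}
```

Each test then branches on the same flag, so the assertion always matches the behavior of the Spark profile it runs under.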

Behavior matrix (excerpt):

| Case | Spark 3.4 | 3.5 | 4.0 | Comet native_datafusion | Comet native_iceberg_compat |
|---|---|---|---|---|---|
| 1. BINARY -> TIMESTAMP | throw | throw | throw | throw | throw |
| 2. INT32 -> INT64 | throw | throw | OK | OK (widened values) | throw on 3.x / OK on 4.0 |
| 3. INT96 LTZ -> TIMESTAMP_NTZ | throw | throw | throw | OK (silent, possible wall-clock diff) | throw on 3.x / OK on 4.0 |
| 4. Decimal(10,2) -> Decimal(5,0) | throw | throw | throw | OK (reads, values unverified) | throw |
| 5. INT32 -> INT64 w/ rowgroup filter | throw | throw | OK | OK (1 row, no overflow) | throw on 3.x / OK on 4.0 |
| 6. STRING -> INT | throw | throw | throw | OK (garbage values) | throw |
| 7. TIMESTAMP_NTZ -> ARRAY<...> | throw | throw | throw | throw | throw |
| C1. INT8 -> INT32 | OK | OK | OK | OK (widened values) | OK (widened values) |
| C2. FLOAT -> DOUBLE | OK | OK | OK | OK (widened values) | throw on 3.x / OK on 4.0 |
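The native_iceberg_compat column of the matrix follows a compact rule that the suite's assertions effectively encode. A hypothetical, Spark-free sketch (case IDs are the row labels above; the object and method names are illustrative only):

```scala
// Hypothetical encoding of the native_iceberg_compat column: throw on every
// mismatch on Spark 3.x, and on 4.0 accept only the value-preserving
// widenings (cases 2, 3, 5, C2), with C1 (INT8 -> INT32) OK everywhere.
object IcebergCompatMatrix {
  private val relaxedOn40 = Set("2", "3", "5", "C1", "C2")
  private val alwaysOk    = Set("C1")

  def throws(caseId: String, isSpark40Plus: Boolean): Boolean =
    if (alwaysOk(caseId)) false
    else if (isSpark40Plus) !relaxedOn40(caseId)
    else true
}
```

Spot-checking it against the matrix: case 2 throws on 3.x but not 4.0, case 1 throws on both, and C1 never throws.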

Notable findings the suite captures:

  • native_iceberg_compat is consistently strict on Spark 3.x (uses Comet's TypeUtil.checkParquetType); on Spark 4.0 it relaxes for value-preserving widenings because COMET_SCHEMA_EVOLUTION_ENABLED defaults to true.
  • native_datafusion rejects some structural mismatches (binary as timestamp, timestamp_ntz as array) but silently accepts others (decimal precision narrowing, string read as int producing garbage values). The string-as-int case is a real correctness gap worth a separate follow-up.
  • The Case 5 row group skipping regression test passes on native_datafusion: filters on widened columns do not overflow.
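The property behind the Case 5 finding can be sketched in isolation; `Int32RowGroupStats` and `rowGroupMayMatch` are hypothetical names, not Comet internals:

```scala
// Hypothetical sketch of the invariant the Case 5 test checks: row group
// min/max statistics written as INT32 are widened to Long before a
// Long-typed filter is evaluated, so the comparison cannot overflow.
final case class Int32RowGroupStats(min: Int, max: Int)

def rowGroupMayMatch(stats: Int32RowGroupStats, greaterThan: Long): Boolean =
  stats.max.toLong > greaterThan  // widen first, then compare
```

A group whose max is `Int.MaxValue` is correctly skipped for a threshold above the Int range, which naive Int arithmetic could get wrong by wrapping.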

How are these changes tested?

The new suite is the test. Verified locally under Spark 3.4, 3.5, and 4.0 profiles (all 18 tests pass under each).

Both native_datafusion and native_iceberg_compat throw SparkException (matching Spark's reference behavior). The withMismatchedSchema helper was redesigned to accept a separate check lambda so that collect() executes while the temp directory is still present.
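The helper redesign can be sketched without Spark; the signature below is a hypothetical stand-in for the suite's withMismatchedSchema, using plain file I/O in place of Parquet writes:

```scala
import java.nio.file.{Files, Path}

// Hypothetical, Spark-free sketch of the redesigned helper: the `check`
// lambda runs while the temp directory still exists, so an eager action
// such as collect() inside `check` can still read the files.
def withMismatchedSchema[T](write: Path => Unit)(check: Path => T): T = {
  val dir = Files.createTempDirectory("schema-mismatch-")
  try {
    write(dir) // write files with the "old" schema
    check(dir) // read with the mismatched schema and assert; dir still present
  } finally {
    // delete the directory tree after check completes
    val walked = Files.walk(dir)
    try {
      walked
        .sorted(java.util.Comparator.reverseOrder[Path]())
        .forEach(p => Files.deleteIfExists(p))
    } finally walked.close()
  }
}
```

Passing `check` as a second argument list (rather than returning a lazy DataFrame from a single lambda) is what guarantees the eager read happens inside the helper's lifetime.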
On Spark 4.0, COMET_SCHEMA_EVOLUTION_ENABLED defaults to true and TypeUtil.checkParquetType has an isSpark40Plus guard, so four native_iceberg_compat tests that previously expected SparkException now succeed with widened values. Each affected assertion is made version-conditional via CometSparkSessionExtensions.isSpark40Plus, and the behavior matrix is updated accordingly.