
test: add ParquetSchemaMismatchSuite documenting Comet vs Spark schema-mismatch behavior [WIP]#4087

Draft
andygrove wants to merge 12 commits into apache:main from andygrove:tests/issue-3720-schema-mismatch

Conversation

@andygrove
Member

Which issue does this PR close?

Part of #3720 (does not close it; this PR encodes current behavior rather than fixing divergences).

Rationale for this change

Issue #3720 lists several Parquet schema-mismatch cases where Comet diverges from Spark, but no permanent test captures the per-case, per-scan-impl, per-Spark-version behavior. The divergences are tracked only in the issue text and a couple of ad-hoc tests in ParquetReadSuite.

Adding a focused suite gives us a permanent, per-case record of this behavior across both scan implementations and all supported Spark versions.

What changes are included in this PR?

A new test suite at spark/src/test/scala/org/apache/comet/parquet/ParquetSchemaMismatchSuite.scala with 18 tests covering 9 cases (7 from #3720 plus 2 harmless control widenings) under the two Comet scan implementations (native_datafusion, native_iceberg_compat).

Each test asserts Comet's actual current behavior. Spark's reference behavior is documented in per-case comments and in a behavior matrix at the top of the file. Where Comet's behavior depends on the Spark version (Spark 4.0 enables COMET_SCHEMA_EVOLUTION_ENABLED by default and TypeUtil.checkParquetType has an isSpark40Plus guard for INT96), the relevant tests use if (CometSparkSessionExtensions.isSpark40Plus) ... to keep assertions accurate.
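The version-conditional assertion pattern can be sketched without Spark; `ExpectedOutcome` and `forWidening` below are hypothetical stand-ins, with `isSpark40Plus` playing the role of `CometSparkSessionExtensions.isSpark40Plus`:

```scala
// Hypothetical, Spark-free sketch of the version-gating pattern the suite uses.
object ExpectedOutcome {
  sealed trait Outcome
  case object Throws extends Outcome
  case object Succeeds extends Outcome

  // Value-preserving widenings (e.g. INT32 -> INT64) succeed on Spark 4.0,
  // where COMET_SCHEMA_EVOLUTION_ENABLED defaults to true, but throw on 3.x.
  def forWidening(isSpark40Plus: Boolean): Outcome =
    if (isSpark40Plus) Succeeds else Throws
}
```

Each test then branches on the same flag, so the assertion always matches the behavior of the Spark profile it runs under.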

Behavior matrix (excerpt):

| Case | Spark 3.4 | 3.5 | 4.0 | Comet native_datafusion | Comet native_iceberg_compat |
|---|---|---|---|---|---|
| 1. BINARY -> TIMESTAMP | throw | throw | throw | throw | throw |
| 2. INT32 -> INT64 | throw | throw | OK | OK (widened values) | throw on 3.x / OK on 4.0 |
| 3. INT96 LTZ -> TIMESTAMP_NTZ | throw | throw | throw | OK (silent, possible wall-clock diff) | throw on 3.x / OK on 4.0 |
| 4. Decimal(10,2) -> Decimal(5,0) | throw | throw | throw | OK (reads, values unverified) | throw |
| 5. INT32 -> INT64 w/ rowgroup filter | throw | throw | OK | OK (1 row, no overflow) | throw on 3.x / OK on 4.0 |
| 6. STRING -> INT | throw | throw | throw | OK (garbage values) | throw |
| 7. TIMESTAMP_NTZ -> ARRAY<...> | throw | throw | throw | throw | throw |
| C1. INT8 -> INT32 | OK | OK | OK | OK (widened values) | OK (widened values) |
| C2. FLOAT -> DOUBLE | OK | OK | OK | OK (widened values) | throw on 3.x / OK on 4.0 |
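The native_iceberg_compat column of the matrix follows a compact rule that the suite's assertions effectively encode. A hypothetical, Spark-free sketch (case IDs are the row labels above; the object and method names are illustrative only):

```scala
// Hypothetical encoding of the native_iceberg_compat column: throw on every
// mismatch on Spark 3.x, and on 4.0 accept only the value-preserving
// widenings (cases 2, 3, 5, C2), with C1 (INT8 -> INT32) OK everywhere.
object IcebergCompatMatrix {
  private val relaxedOn40 = Set("2", "3", "5", "C1", "C2")
  private val alwaysOk    = Set("C1")

  def throws(caseId: String, isSpark40Plus: Boolean): Boolean =
    if (alwaysOk(caseId)) false
    else if (isSpark40Plus) !relaxedOn40(caseId)
    else true
}
```

Spot-checking it against the matrix: case 2 throws on 3.x but not 4.0, case 1 throws on both, and C1 never throws.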

Notable findings the suite captures:

  • native_iceberg_compat is consistently strict on Spark 3.x (uses Comet's TypeUtil.checkParquetType); on Spark 4.0 it relaxes for value-preserving widenings because COMET_SCHEMA_EVOLUTION_ENABLED defaults to true.
  • native_datafusion rejects some structural mismatches (binary as timestamp, timestamp_ntz as array) but silently accepts others (decimal precision narrowing, string read as int producing garbage values). The string-as-int case is a real correctness gap worth a separate follow-up.
  • The Case 5 row group skipping regression test passes on native_datafusion: filters on widened columns do not overflow.
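The property behind the Case 5 finding can be sketched in isolation; `Int32RowGroupStats` and `rowGroupMayMatch` are hypothetical names, not Comet internals:

```scala
// Hypothetical sketch of the invariant the Case 5 test checks: row group
// min/max statistics written as INT32 are widened to Long before a
// Long-typed filter is evaluated, so the comparison cannot overflow.
final case class Int32RowGroupStats(min: Int, max: Int)

def rowGroupMayMatch(stats: Int32RowGroupStats, greaterThan: Long): Boolean =
  stats.max.toLong > greaterThan  // widen first, then compare
```

A group whose max is `Int.MaxValue` is correctly skipped for a threshold above the Int range, which naive Int arithmetic could get wrong by wrapping.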

How are these changes tested?

The new suite is the test. Verified locally under Spark 3.4, 3.5, and 4.0 profiles (all 18 tests pass under each).

Both native_datafusion and native_iceberg_compat throw SparkException (matching Spark's reference behavior). The withMismatchedSchema helper was redesigned to accept a separate check lambda so that collect() executes while the temp directory is still present.
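The helper redesign can be sketched without Spark; the signature below is a hypothetical stand-in for the suite's withMismatchedSchema, using plain file I/O in place of Parquet writes:

```scala
import java.nio.file.{Files, Path}

// Hypothetical, Spark-free sketch of the redesigned helper: the `check`
// lambda runs while the temp directory still exists, so an eager action
// such as collect() inside `check` can still read the files.
def withMismatchedSchema[T](write: Path => Unit)(check: Path => T): T = {
  val dir = Files.createTempDirectory("schema-mismatch-")
  try {
    write(dir) // write files with the "old" schema
    check(dir) // read with the mismatched schema and assert; dir still present
  } finally {
    // delete the directory tree after check completes
    val walked = Files.walk(dir)
    try {
      walked
        .sorted(java.util.Comparator.reverseOrder[Path]())
        .forEach(p => Files.deleteIfExists(p))
    } finally walked.close()
  }
}
```

Passing `check` as a second argument list (rather than returning a lazy DataFrame from a single lambda) is what guarantees the eager read happens inside the helper's lifetime.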
On Spark 4.0, COMET_SCHEMA_EVOLUTION_ENABLED defaults to true and TypeUtil.checkParquetType has an isSpark40Plus guard, so four native_iceberg_compat tests that previously expected SparkException now succeed with widened values. Each affected assertion is made version-conditional via CometSparkSessionExtensions.isSpark40Plus, and the behavior matrix is updated accordingly.