
chore: Run Spark SQL tests with native_datafusion in CI [WIP] #3393

Draft
andygrove wants to merge 8 commits into apache:main from andygrove:spark-sql-native-datafusion

Conversation

@andygrove (Member) commented Feb 4, 2026

Which issue does this PR close?

N/A. This PR enables running the Spark SQL tests with the native_datafusion scan in CI.

Rationale for this change

Running Spark SQL tests with native_datafusion scan helps ensure compatibility and catch regressions. This PR enables these tests in CI while ignoring known failing tests that are tracked in separate issues.

What changes are included in this PR?

  1. CI workflow changes: Added native_datafusion scan mode to the Spark SQL test matrix

  2. Test annotations: Added IgnoreCometNativeDataFusion annotations to the failing tests, linked to tracking issues (a sketch of the annotation pattern follows the table):

| Issue | Category | Tests |
| ----- | -------- | ----- |
| #3311 | Schema mismatch / type coercion | ParquetQuerySuite, ParquetIOSuite, ParquetSchemaSuite, ParquetFilterSuite, FileBasedDataSourceSuite |
| #3312 | `input_file_name()` not supported | UDFSuite, ExtractPythonUDFsSuite |
| #3313 | Static scan metrics | DynamicPartitionPruningSuite |
| #3315 | Parquet V2 / streaming sources | FileDataSourceV2FallBackSuite, StreamingQuerySuite |
| #3317 | Row index metadata | ParquetFileMetadataStructRowIndexSuite |
| #3319 | Bucketed scan | BucketedReadSuite, DisableUnnecessaryBucketedScanSuite |
| #3320 | Predicate pushdown | ParquetFilterSuite |

How are these changes tested?

The changes are exercised by the CI workflow itself: the affected suites should pass with the known failures ignored.

andygrove and others added 8 commits February 4, 2026 08:18
…sion tests

Added annotations for the following tests that fail with native_datafusion scan:

DynamicPartitionPruningSuite:
- static scan metrics → apache#3313

ParquetQuerySuite, ParquetIOSuite, ParquetSchemaSuite, ParquetFilterSuite:
- SPARK-36182: can't read TimestampLTZ as TimestampNTZ → apache#3311
- SPARK-34212 Parquet should read decimals correctly → apache#3311
- row group skipping doesn't overflow when reading into larger type → apache#3311
- SPARK-35640 tests → apache#3311
- schema mismatch failure error message tests → apache#3311
- SPARK-25207: duplicate fields case-insensitive → apache#3311
- SPARK-31026: fields with dots in names → apache#3320
- Filters should be pushed down at row group level → apache#3320

FileBasedDataSourceSuite:
- Spark native readers should respect spark.sql.caseSensitive → apache#3311

BucketedReadSuite, DisableUnnecessaryBucketedScanSuite:
- disable bucketing when output doesn't contain all bucketing columns → apache#3319
- bucket coalescing tests → apache#3319
- SPARK-32859: disable unnecessary bucketed table scan tests → apache#3319
- Aggregates with no groupby over tables having 1 BUCKET → apache#3319

ParquetFileMetadataStructRowIndexSuite:
- reading _tmp_metadata_row_index tests → apache#3317

FileDataSourceV2FallBackSuite:
- Fallback Parquet V2 to V1 → apache#3315

UDFSuite:
- SPARK-8005 input_file_name → apache#3312

ExtractPythonUDFsSuite:
- Python UDF should not break column pruning/filter pushdown -- Parquet V1 → apache#3312

StreamingQuerySuite:
- SPARK-41198: input row calculation with CTE → apache#3315
- SPARK-41199: input row calculation with mixed DSv1 and DSv2 sources → apache#3315

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added the import statement to the test files that were missing it (a likely shape is sketched after this commit message):
- FileDataSourceV2FallBackSuite.scala
- ParquetFileMetadataStructRowIndexSuite.scala
- ExtractPythonUDFsSuite.scala
- DisableUnnecessaryBucketedScanSuite.scala
- StreamingQuerySuite.scala
- UDFSuite.scala

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
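
For reference, the import added by the commit above presumably has a shape like the following; the package is an assumption, chosen because Comet's existing test tags live in the patched Spark test tree, and only the class name comes from this PR:

```scala
// Assumed package; only the class name is taken from this PR.
import org.apache.spark.sql.IgnoreCometNativeDataFusion
```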
The method signature in IgnoreComet.scala was not properly formatted
according to scalafmt rules. This fixes the formatting to match
Spark's scalafmt configuration.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
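
As a generic illustration of the kind of reformatting involved (the method below is entirely hypothetical, not the actual signature in IgnoreComet.scala), scalafmt typically rewraps an over-long parameter list onto indented continuation lines:

```scala
import org.scalatest.Tag

object FormattingExample {
  // Hypothetical signature, before: a single line exceeding the max column width.
  //   def shouldIgnore(testName: String, tags: Seq[Tag], scanMode: String, reason: Option[String]): Boolean
  //
  // After scalafmt: each parameter on its own continuation line.
  def shouldIgnore(
      testName: String,
      tags: Seq[Tag],
      scanMode: String,
      reason: Option[String]): Boolean = ???
}
```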
Set NOLINT_ON_COMPILE=true to skip scalastyle validation during
SBT compilation, reducing CI time for Spark SQL test runs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>