ci: Enable Comet PR test matrix and TPCDS plan-stability for Spark 4.2 [WIP] #4126
Draft
andygrove wants to merge 4 commits into apache:main

Conversation
Adds a build-only spark-4.2 profile so we can start tracking Spark 4.2 compatibility ahead of the GA release. Tests, plan-stability fixtures, and CI matrix entries are not wired up yet.

Profile properties:
- spark.version 4.2.0-preview4
- scala.version 2.13.18
- parquet.version 1.17.0
- shims.majorVerSrc spark-4.x
- shims.minorVerSrc spark-4.2

Two Spark API changes between 4.1 and 4.2 require new shims; because these diverge between 4.0/4.1 and 4.2, they must live in the per-minor source roots rather than spark-4.x:

- DataSourceV2ScanExecBase.partitions changed from Seq[Seq[InputPartition]] to Seq[Option[InputPartition]]. The override was pulled out of CometBatchScanExec into a new ShimCometBatchScanExec trait that exposes the value as `shimPartitions`; the concrete override in CometBatchScanExec just delegates.
- DataSourceRDDPartition.inputPartitions: Seq[InputPartition] became inputPartition: Option[InputPartition]. Reflective access in CometIcebergNativeScan now goes through ShimDataSourceRDDPartition, which normalizes both shapes to Seq[InputPartition].

The spark-4.2 profile copies CometExprShim.scala from spark-4.1 unchanged for now; that will be revisited when 4.2 expression APIs diverge.
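For illustration, a minimal sketch of the shim split described above, assuming the API shapes named in this commit message. The member bodies are illustrative, not the actual Comet source, and the real CometIcebergNativeScan path uses reflection; direct accessors are shown here for brevity.

```scala
import org.apache.spark.sql.connector.read.InputPartition
import org.apache.spark.sql.execution.datasources.v2.DataSourceRDDPartition

// spark-4.0 / spark-4.1 source roots (pre-4.2 shapes):
trait ShimCometBatchScanExec {
  // Carries the partition logic pulled out of CometBatchScanExec; the
  // concrete override there just delegates to this member.
  def shimPartitions: Seq[Seq[InputPartition]]
}

object ShimDataSourceRDDPartition {
  def inputPartitions(p: DataSourceRDDPartition): Seq[InputPartition] =
    p.inputPartitions // Seq[InputPartition] before Spark 4.2
}

// spark-4.2 source root: the same names, compiled against the new shapes.
//
//   trait ShimCometBatchScanExec {
//     def shimPartitions: Seq[Option[InputPartition]]
//   }
//   object ShimDataSourceRDDPartition {
//     def inputPartitions(p: DataSourceRDDPartition): Seq[InputPartition] =
//       p.inputPartition.toSeq // Option[InputPartition] in Spark 4.2
//   }
```

Because the cross-version main tree only ever references the shim names, each Maven profile picks up the variant matching its Spark version at compile time.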
…tion

The 4.2 prep introduced these shims only under the spark-4.0/4.1/4.2 source roots, breaking compilation against spark-3.4 and spark-3.5 because both shims are referenced unconditionally in the cross-version main tree (CometBatchScanExec, CometIcebergNativeScan). Spark 3.4 and 3.5 expose the same underlying APIs as 4.0/4.1 (partitions: Seq[Seq[InputPartition]], DataSourceRDDPartition.inputPartitions: Seq[InputPartition]), so the 4.0 shim implementations apply verbatim. Place them under spark-3.x so 3.4 and 3.5 share them.
Adds a Spark 4.2, JDK 17 entry to the linux-test matrix in pr_build_linux.yml and a Spark 4.2, JDK 17, Scala 2.13 entry to the macos-aarch64-test matrix in pr_build_macos.yml. Also wires up the iceberg/jetty test dependencies for the spark-4.2 profile (matching the spark-4.1 setup, since iceberg-spark-runtime 4.2 is not yet published) and adds an isSpark42Plus helper. Spark 4.2 stays out of lint-java because semanticdb-scalac_2.13.18 is not yet published.
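The isSpark42Plus helper is presumably a one-liner mirroring the existing isSpark*Plus checks in CometSparkSessionExtensions; a sketch, assuming the usual string-comparison pattern (which is safe for adjacent single-digit minor versions like these):

```scala
object CometSparkSessionExtensions {
  // Sketch, assuming the same pattern as the existing helpers; true for
  // 4.2.0-preview4 and any later 4.x release.
  def isSpark42Plus: Boolean = org.apache.spark.SPARK_VERSION >= "4.2"
}
```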
Add isSpark42Plus branch to CometPlanStabilitySuite to route Spark 4.2 to dedicated approved-plans-{v1_4,v2_7}-spark4_2 directories. Spark 4.0 logic is unchanged. Also extends dev/regenerate-golden-files.sh to accept --spark-version 4.2.
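A sketch of that routing (directory names are taken from this commit; the surrounding suite code and the unsuffixed pre-4.0 default are assumptions, not the actual source):

```scala
import org.apache.comet.CometSparkSessionExtensions.{isSpark40Plus, isSpark42Plus}

// Pick the goldens directory for the running Spark version;
// tpcdsVersion is "v1_4" or "v2_7".
def goldenDirName(tpcdsVersion: String): String =
  if (isSpark42Plus) s"approved-plans-$tpcdsVersion-spark4_2"
  else if (isSpark40Plus) s"approved-plans-$tpcdsVersion-spark4_0"
  else s"approved-plans-$tpcdsVersion"
```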
Generated via SPARK_GENERATE_GOLDEN_FILES=1 against -Pspark-4.2. 22 of the generated files differ from the spark4_0 directory (q2, q5, q33, q49, q54, q56, q60, q66 in v1_4 and q5a, q14a, q49 in v2_7, both native_datafusion and native_iceberg_compat); the rest are byte-identical. The CometTPCDSV1_4 suite (194 tests) and CometTPCDSV2_7 suite (64 tests) both pass against the new goldens with 0 failures.
Which issue does this PR close?
Part of #4113.
Rationale for this change
#4119 added a build-only spark-4.2 Maven profile targeting Spark 4.2.0-preview4. To start exercising Comet against 4.2 in CI (rather than discovering everything at once when 4.2 GA lands), this PR turns on the existing PR test matrices for Spark 4.2 and adds dedicated TPC-DS plan-stability goldens. This mirrors the approach previously used to bring Spark 4.1 online before reverting (see commits 622e851e1 and 75e3b3116 on the spark-4.1.1 branch).

What changes are included in this PR?
- .github/workflows/pr_build_linux.yml: add Spark 4.2, JDK 17 to the linux-test matrix and a comment explaining why 4.1/4.2 are skipped from the lint-java matrix (semanticdb-scalac is not yet published for Scala 2.13.17/2.13.18).
- .github/workflows/pr_build_macos.yml: add Spark 4.2, JDK 17, Scala 2.13 to the macos-aarch64-test matrix.
- spark/pom.xml: wire iceberg/jetty test dependencies into the spark-4.2 profile (Iceberg falls back to the 4.0 runtime since 4.2 is not yet published; Jetty pinned at 11.0.26).
- spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: add isSpark42Plus helper.
- spark/src/test/scala/org/apache/spark/sql/comet/CometPlanStabilitySuite.scala: route isSpark42Plus to the new approved-plans-{v1_4,v2_7}-spark4_2 directories.
- dev/regenerate-golden-files.sh: accept --spark-version 4.2 and include 4.2 in the default version list.
- spark/src/test/resources/tpcds-plan-stability/approved-plans-{v1_4,v2_7}-spark4_2/: regenerated golden files. 22 of the generated files differ from the spark4_0 directory (q2, q5, q33, q49, q54, q56, q60, q66 in v1_4 and q5a, q14a, q49 in v2_7, both native_datafusion and native_iceberg_compat per query); the rest are byte-identical.

This PR does not attempt to fix any 4.2-specific runtime/test failures the new matrix entries surface; those will be tracked and addressed in follow-up PRs as we did for Spark 4.1.
How are these changes tested?
- The new CI matrix entries exercise -Pspark-4.2 end-to-end with JDK 17.
- Ran CometTPCDSV1_4_PlanStabilitySuite (194 tests) and CometTPCDSV2_7_PlanStabilitySuite (64 tests) against -Pspark-4.2 with SPARK_GENERATE_GOLDEN_FILES unset; both pass with 0 failures.