build: add spark-4.1 Maven profile and shim sources #4097
Open
andygrove wants to merge 3 commits into apache:main from
Conversation
Adds infrastructure to compile Comet against Spark 4.1 (currently 4.1.1) without enabling it in any CI workflow. After this change, ./mvnw -Pspark-4.1 succeeds locally for compile and test-compile.

Profile properties (root pom.xml and spark/pom.xml):
- scala.version 2.13.17, scala.binary.version 2.13
- spark.version 4.1.1, spark.version.short 4.1
- parquet.version 1.16.0, slf4j.version 2.0.17
- shims.majorVerSrc spark-4.1
- iceberg-spark-runtime-4.0 reused since the 4.1 artifact isn't published yet
- jetty 11.0.26 to match Spark 4.1.1's jetty.version

Shim trees (new):
- spark/src/main/spark-4.1/** and common/src/main/spark-4.1/** mirror their spark-4.0 counterparts, with three 4.1-specific deltas:
  * CometExprShim.binaryOutputStyle no longer parses BINARY_OUTPUT_STYLE as a string (Spark 4.1 made it an enumConf, so getConf returns the enum directly).
  * CometEvalModeUtil.sumEvalMode reads s.evalContext.evalMode (Spark 4.1 wraps EvalMode in a NumericEvalContext).
  * CometTypeShim.isVariantStruct wired through to VariantMetadata.isVariantStruct, matching the spark-4.0 shim that apache#4084 introduced.
- spark/src/test/spark-4.1/** mirrors the spark-4.0 test shim tree so JVM tests can compile under -Pspark-4.1.

No-op refactors that prepare existing profiles for the new shim:
- spark-3.4/3.5/4.0 CometExprShim.scala gain a CometEvalModeUtil.sumEvalMode helper that returns s.evalMode; aggregates.scala calls the helper instead of sum.evalMode directly. Behavior unchanged on 3.x/4.0; lets the spark-4.1 shim provide the alternate form.
- New MapStatusHelper.scala wraps MapStatus.apply so Java callers (CometBypassMergeSortShuffleWriter, CometUnsafeShuffleWriter) don't need to see Scala default parameters; required because Spark 4.1 added a checksumVal default param.
- CometShuffleManager swaps Collections.emptyMap() for new ConcurrentHashMap[Int, OpenHashSet[Long]]() in the reflective IndexShuffleBlockResolver constructor lookup (see the sketch after this commit message). ConcurrentHashMap is also a java.util.Map, so 3.x/4.0 still resolve; required because Spark 4.1 widened the third constructor parameter from Map to ConcurrentMap.
- isSpark41Plus added to CometSparkSessionExtensions for downstream use; not yet referenced anywhere.

Out of scope for this PR (will land in a follow-up):
- dev/diffs/4.1.1.diff for Spark SQL Tests
- Any .github/workflows/* matrix entries for 4.1
- Plan-stability golden files for 4.1
- Docs additions
- Runtime fixes (CometNativeWriteExec FileNameSpec, REMAINDER_BY_ZERO test branching, etc.)
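To make the ConcurrentHashMap swap concrete, here is a minimal sketch of a reflective constructor lookup of this shape. It is illustrative only: the real CometShuffleManager code differs in details, the object and method names (ResolverLookupSketch, createResolver) are hypothetical, and the file is assumed to sit under an org.apache.spark package, since IndexShuffleBlockResolver and OpenHashSet are private[spark].

```scala
import java.util.concurrent.ConcurrentHashMap

import org.apache.spark.SparkConf
import org.apache.spark.shuffle.IndexShuffleBlockResolver
import org.apache.spark.storage.BlockManager
import org.apache.spark.util.collection.OpenHashSet

// Hypothetical sketch; not the exact Comet source.
object ResolverLookupSketch {
  def createResolver(conf: SparkConf, bm: BlockManager): IndexShuffleBlockResolver = {
    // Locate the three-argument constructor without naming its parameter
    // types: the third parameter is java.util.Map on Spark <= 4.0 but
    // java.util.concurrent.ConcurrentMap on Spark 4.1.
    val ctor = classOf[IndexShuffleBlockResolver].getDeclaredConstructors
      .find(_.getParameterCount == 3)
      .getOrElse(sys.error("no 3-arg IndexShuffleBlockResolver constructor"))
    ctor.setAccessible(true)
    // A ConcurrentHashMap instance is assignable to both Map and
    // ConcurrentMap, whereas Collections.emptyMap() only satisfied the
    // pre-4.1 Map parameter.
    ctor
      .newInstance(conf, bm, new ConcurrentHashMap[Int, OpenHashSet[Long]]())
      .asInstanceOf[IndexShuffleBlockResolver]
  }
}
```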
Adds a new build-spark-4.1 job to pr_build_linux.yml that runs ./mvnw install -DskipTests -Dmaven.test.skip=true -Pspark-4.1. This verifies that the new profile and shim trees keep compiling on every PR without enabling any spark-4.1 tests in CI. Avoids the existing lint-java matrix because that job runs scalafix under -Psemanticdb, and semanticdb-scalac_2.13.17 is not yet published. Avoids the linux-test matrix because Comet's existing test suites have known failures under the spark-4.1 profile that are deferred to follow-up PRs. The job is independent: it needs only the lint gate, doesn't share artifacts with build-native, and doesn't change any other job.
GitHub Actions job IDs may only contain alphanumerics, dash, and underscore. The dot in build-spark-4.1 made the workflow invalid. Rename to build-spark-4-1 to match the dash style used elsewhere in this file. The display name (Build Spark 4.1, JDK 17) is unchanged.
Which issue does this PR close?
Rationale for this change
Adds the infrastructure needed to compile Comet against Spark 4.1 (currently 4.1.1) without enabling any spark-4.1 tests in CI. This is a preparation PR that lets the bigger Spark 4.1 enablement work (diff file, full test matrix entries, plan-stability golden files, docs) land in subsequent PRs against a code base that already has the profile and shims in place.
After this PR, ./mvnw install -DskipTests -Dmaven.test.skip=true -Pspark-4.1 succeeds, and a new compile-only PR-build job verifies that on every PR.

What changes are included in this PR?
New spark-4.1 Maven profile in pom.xml and spark/pom.xml. Pins versions to match what Spark 4.1.1 itself ships with: scala.version=2.13.17, parquet.version=1.16.0, slf4j.version=2.0.17, jetty=11.0.26. The iceberg-spark-runtime artifact for Spark 4.1 isn't published yet, so the profile reuses iceberg-spark-runtime-4.0_2.13:1.10.0 until it is.
New shim source trees at spark/src/main/spark-4.1/, common/src/main/spark-4.1/, and spark/src/test/spark-4.1/, mirroring their spark-4.0 counterparts. Three 4.1-specific deltas inside the trees (see the sketch after this list):
- CometExprShim.binaryOutputStyle no longer parses BINARY_OUTPUT_STYLE as a string. Spark 4.1 made it an enumConf, so getConf already returns the enum.
- CometEvalModeUtil.sumEvalMode reads s.evalContext.evalMode. Spark 4.1 wrapped Sum's EvalMode in a NumericEvalContext.
- CometTypeShim.isVariantStruct wires through to VariantMetadata.isVariantStruct, matching the helper that #4084 (fix: fall back for shredded Variant scans on Spark 4.0) added to the spark-4.0 shim.
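A hedged, side-by-side sketch of the first two deltas. Sum and EvalMode are real Spark catalyst classes, but the object names and the binaryOutputStyle plumbing below are illustrative stand-ins rather than the actual Comet shim source; in the real tree each variant lives in its own profile-specific source directory, so only one compiles per build.

```scala
import org.apache.spark.sql.catalyst.expressions.EvalMode
import org.apache.spark.sql.catalyst.expressions.aggregate.Sum

// What the spark-4.0 shim does (sketch): the conf value arrives as a string
// and must be parsed, and Sum exposes its eval mode directly.
object Spark40ShimSketch {
  def binaryOutputStyle(confValue: Option[String]): Option[String] =
    confValue.map(_.trim.toUpperCase) // stand-in for the real string-to-enum parse
  def sumEvalMode(s: Sum): EvalMode.Value = s.evalMode
}

// What the spark-4.1 shim does (sketch): getConf on an enumConf already
// returns the enum, and Sum's EvalMode now sits behind a NumericEvalContext.
object Spark41ShimSketch {
  def binaryOutputStyle[E](confValue: Option[E]): Option[E] =
    confValue // no parse step needed
  def sumEvalMode(s: Sum): EvalMode.Value = s.evalContext.evalMode
}
```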
No-op refactors that prepare existing profiles for the new shim (these compile to identical bytecode on 3.x/4.0 today):
- CometEvalModeUtil.sumEvalMode helper added to the spark-3.4 / spark-3.5 / spark-4.0 type shims. aggregates.scala updated to call CometEvalModeUtil.sumEvalMode(sum) instead of sum.evalMode. Lets the spark-4.1 shim provide the alternate s.evalContext.evalMode form.
- MapStatusHelper.scala wraps MapStatus.apply so the three Java shuffle-writer call sites (in CometBypassMergeSortShuffleWriter and CometUnsafeShuffleWriter) don't need to see Scala default parameters. Required because Spark 4.1 added a checksumVal: Long = 0 default. A sketch of the wrapper follows this list.
- CometShuffleManager swaps Collections.emptyMap() for new ConcurrentHashMap[Int, OpenHashSet[Long]]() in the reflective IndexShuffleBlockResolver constructor lookup. ConcurrentHashMap is also a java.util.Map, so 3.x/4.0 still resolve; required because Spark 4.1 widened the third constructor parameter from Map to ConcurrentMap.
- isSpark41Plus added to CometSparkSessionExtensions. Not yet referenced anywhere; lands here so downstream PRs can use it without churn.
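A minimal sketch of the wrapper idea, assuming it lives under an org.apache.spark package (MapStatus is private[spark]); the package, method name, and arity below are illustrative rather than copied from the PR.

```scala
package org.apache.spark.shuffle.comet // hypothetical location; MapStatus is private[spark]

import org.apache.spark.scheduler.MapStatus
import org.apache.spark.storage.BlockManagerId

object MapStatusHelper {
  // Java callers cannot supply Scala default arguments, so this fixed-arity
  // wrapper gives CometBypassMergeSortShuffleWriter / CometUnsafeShuffleWriter
  // one stable signature. On Spark 4.1, Scala fills in the new checksumVal
  // default here; on 3.x/4.0 the three-argument apply matches as-is.
  def createMapStatus(
      loc: BlockManagerId,
      uncompressedSizes: Array[Long],
      mapTaskId: Long): MapStatus =
    MapStatus(loc, uncompressedSizes, mapTaskId)
}
```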
New compile-only PR-build job in pr_build_linux.yml. Runs ./mvnw install -DskipTests -Dmaven.test.skip=true -Pspark-4.1 so the new profile keeps compiling on every PR. Sits alongside lint-java and the test matrix as an independent job. Excluded from lint-java because that job runs scalafix under -Psemanticdb and semanticdb-scalac_2.13.17 is not yet published. Excluded from linux-test because Comet's existing test suites have known failures under spark-4.1 that are deferred to follow-up PRs.

Out of scope (intentionally deferred to follow-up PRs):
- dev/diffs/4.1.1.diff for Spark SQL Tests
- .github/workflows/* matrix entries that would run spark-4.1 tests (lint or unit)
- Plan-stability golden files for 4.1 (CometPlanStabilitySuite branch)
- Docs additions
- Runtime fixes (CometNativeWriteExec FileNameSpec, REMAINDER_BY_ZERO test branching)

How are these changes tested?
Locally:
- ./mvnw install -DskipTests -Dmaven.test.skip=true -Pspark-4.1 passes (76s reactor time).
- ./mvnw test-compile -Pspark-4.1 passes.
- ./mvnw test-compile -Pspark-3.5 -Pscala-2.13 and ./mvnw test-compile -Pspark-4.0 pass (verifying the no-op refactors).
CI: the new build-spark-4.1 job runs the same compile command on every PR. The existing 3.4 / 3.5 / 4.0 PR-build matrices continue to exercise the no-op refactors against those profiles.