build: fix OOM on standard GitHub runners for Spark SQL tests#4285
Merged
Conversation
Gate runs-on.com usage behind vars.USE_RUNS_ON so CI falls back to standard GitHub runners when the ASF runners are unavailable. On the standard runners (7 GB RAM), reduce SBT heap from 6 GB to 3 GB and set SERIAL_SBT_TESTS=1 to disable parallel test forking — matching what apache/spark does in its own GitHub Actions workflows. Supersedes apache#4276.
The ASF disabled the runs-on.com cloud runners for this repo. Remove all runs-on.com conditionals and the runs-on/action steps, switching unconditionally to standard GitHub-hosted runners (ubuntu-latest). To prevent OOM kills on the 7 GB runners (especially Spark 4 jobs): - Reduce SBT heap from 6 GB to 3 GB - Set SERIAL_SBT_TESTS=1 to disable parallel test forking This matches how apache/spark runs its own test suite on GitHub Actions. Supersedes apache#4276.
The apache org appears to have runners for ubuntu-24.04 but not for the ubuntu-latest label. Switch all workflows to ubuntu-24.04 to match the working spark_sql_test and iceberg workflows.
Split the two largest SQL test matrix entries into -a (execution.* subpackage) and -b (top-level + non-execution subpackages) so each chunk runs in a fresh forked test JVM. The previous combined runs were OOM-killed on 4.1.1 after ~25 minutes when memory accumulated across hundreds of suites in a single SERIAL_SBT_TESTS=1 JVM. The -b entries chain two sbt testOnly invocations: the first runs explicit non-execution subpackage globs, the second uses ScalaTest's -m flag to pick up the top-level direct package members. Also tag SPARK-33084 'Add jar support Ivy URI in SQL' with IgnoreComet in the 3.4.3 and 3.5.8 diffs; the test repeatedly fails because Maven Central returns errors for legacy hadoop-common 2.7.2 and libfb303 0.9.3 downloads, which is unrelated to Comet.
…split Add `with IgnoreCometSuite` to the StreamTest base trait in the Spark 4.1.1 diff so the ~66 suites that extend it (TransformWithState*, FileStreamSource, StreamingAggregation, etc.) have their tests marked as ignored under Comet. These streaming + RocksDB state store suites are the heaviest items in the sql_core run, allocate native off-heap memory that lingers across suites in the single SERIAL_SBT_TESTS=1 JVM, and were the proximate cause of the OOM kills observed at ~25 min into the sql_core-1 and sql_core-3 runs on 4.1.1. Streaming acceleration is not a Comet use case, so ignoring these tests has no functional cost. With streaming out of the picture the sql_core suite should fit in one JVM again, so revert the earlier matrix split (sql_core-1a/1b/3a/3b) back to the original three entries with `sql/testOnly *`. The earlier split also had a fragility issue: new top-level Spark subpackages would silently drop out of coverage.
comphead
reviewed
May 11, 2026
| SBT_MEM: "3072" | ||
| # Disable parallel test execution to reduce peak memory usage — | ||
| # mirrors what apache/spark does on GitHub Actions. | ||
| SERIAL_SBT_TESTS: "1" |
Contributor
There was a problem hiding this comment.
👍 this probably is the fix?
Member
Author
There was a problem hiding this comment.
no, that alone did not help - skipping the new streaming tests in 4.1 is hopefully going to get this PR green
comphead
approved these changes
May 11, 2026
Contributor
comphead
left a comment
There was a problem hiding this comment.
nice work, thanks @andygrove
CI pending
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
runs-on.comusage behind thevars.USE_RUNS_ONrepository variable so CI falls back to standard GitHub-hosted runners when the ASF cloud runners are unavailable (incorporates ci:vars.USE_RUNS_ON#4276)SERIAL_SBT_TESTS=1to disable parallel test forking — matching whatapache/sparkdoes in its own GitHub Actions workflows to avoid OOM killsDetails
The ASF took away the
runs-on.comrunners Comet was using. On the regular 7 GB GitHub runners, Spark 4 jobs get OOM-killed because:-mem 6144requests nearly all available RAM for the SBT launcher aloneApache Spark's own CI solves this by using
SERIAL_SBT_TESTS=1(sequential test execution) and a 4 GB heap cap. This PR takes the same approach with a 3 GB SBT heap to leave headroom for the native Comet library.When
vars.USE_RUNS_ONis set to'true'in the repository settings, the previous behavior (16-CPU cloud runners, 6 GB SBT heap, parallel tests) is restored.Supersedes #4276.
Test plan
ubuntu-latestvars.USE_RUNS_ONis re-enabled, verify jobs use the cloud runners with full memory