
build: fix OOM on standard GitHub runners for Spark SQL tests #4285

Merged
andygrove merged 7 commits into apache:main from andygrove:ci/fix-oom-standard-runners
May 11, 2026

Conversation

@andygrove
Member

Summary

  • Gates runs-on.com usage behind the vars.USE_RUNS_ON repository variable so CI falls back to standard GitHub-hosted runners when the ASF cloud runners are unavailable (incorporates ci: vars.USE_RUNS_ON #4276)
  • On standard runners (7 GB RAM), reduces SBT heap from 6 GB to 3 GB and sets SERIAL_SBT_TESTS=1 to disable parallel test forking — matching what apache/spark does in its own GitHub Actions workflows to avoid OOM kills

Details

The ASF disabled the runs-on.com runners that Comet was using. On the standard 7 GB GitHub-hosted runners, Spark 4 jobs get OOM-killed because:

  1. -mem 6144 requests nearly all available RAM for the SBT launcher alone
  2. Parallel test forking spawns additional JVMs that push total usage over the limit

Apache Spark's own CI solves this by using SERIAL_SBT_TESTS=1 (sequential test execution) and a 4 GB heap cap. This PR takes the same approach with a 3 GB SBT heap to leave headroom for the native Comet library.
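
The memory argument above can be sketched as a rough budget calculation. This is only an illustration: the 7 GB runner size and the 6144/3072 MB SBT heap values come from this PR, while the number of parallel forked JVMs, their heap size, and the native allocation figure are assumptions chosen to make the arithmetic concrete.

```python
# Rough memory-budget model for a CI runner. All per-process figures other
# than the runner size and the SBT heap values are illustrative assumptions.

RUNNER_RAM_MB = 7 * 1024  # standard GitHub-hosted runner, per the PR text


def peak_usage_mb(sbt_heap_mb, forked_jvms, forked_heap_mb, native_mb):
    """Sum the SBT heap, any additional forked JVM heaps, and native memory."""
    return sbt_heap_mb + forked_jvms * forked_heap_mb + native_mb


def fits(usage_mb, ram_mb=RUNNER_RAM_MB):
    """True when the estimated peak stays within physical RAM."""
    return usage_mb <= ram_mb


# Previous settings: 6 GB SBT heap plus parallel forks
# (assumed: 2 extra JVMs of 1 GB each, 1 GB of native memory).
before = peak_usage_mb(6144, forked_jvms=2, forked_heap_mb=1024, native_mb=1024)

# New settings: 3 GB SBT heap, serial tests (modeled as no extra JVMs),
# same assumed native budget.
after = peak_usage_mb(3072, forked_jvms=0, forked_heap_mb=1024, native_mb=1024)

print(before, fits(before))
print(after, fits(after))
```

Under these assumed figures the old configuration exceeds the runner's RAM while the new one leaves room to spare. Real usage also includes JVM overhead beyond the heap, so the model is optimistic.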

When vars.USE_RUNS_ON is set to 'true' in the repository settings, the previous behavior (16-CPU cloud runners, 6 GB SBT heap, parallel tests) is restored.
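
As a minimal sketch of the gating described above, the selection logic might be modeled like this. The function name, dictionary keys, and runner description strings are hypothetical; the actual workflows express this with GitHub Actions expressions rather than Python, and an unset variable is modeled here as an empty string.

```python
def ci_settings(use_runs_on):
    """Return illustrative CI settings for a given USE_RUNS_ON value.

    Mirrors the 'true' check described in the PR text: any other value
    selects the standard-runner configuration.
    """
    if use_runs_on == "true":
        # Previous behavior: 16-CPU cloud runners, 6 GB SBT heap, parallel tests.
        return {"runner": "runs-on (16 CPU)", "sbt_mem_mb": 6144, "serial_tests": False}
    # Fallback: standard GitHub-hosted runner, 3 GB SBT heap, serial tests.
    return {"runner": "ubuntu-latest", "sbt_mem_mb": 3072, "serial_tests": True}


print(ci_settings("true"))
print(ci_settings(""))
```

Keeping the fallback as the default branch means a missing or mistyped variable lands on the memory-safe configuration.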

Supersedes #4276.

Test plan

  • Verify Spark 4.0/4.1 SQL test jobs no longer get OOM-killed on ubuntu-latest
  • Verify Spark 3.4/3.5 jobs still pass (they were less memory-hungry but benefit from the same fix)
  • When vars.USE_RUNS_ON is re-enabled, verify jobs use the cloud runners with full memory

andygrove added 5 commits May 11, 2026 07:11
Gate runs-on.com usage behind vars.USE_RUNS_ON so CI falls back to
standard GitHub runners when the ASF runners are unavailable. On the
standard runners (7 GB RAM), reduce SBT heap from 6 GB to 3 GB and
set SERIAL_SBT_TESTS=1 to disable parallel test forking — matching
what apache/spark does in its own GitHub Actions workflows.

Supersedes apache#4276.

The ASF disabled the runs-on.com cloud runners for this repo. Remove
all runs-on.com conditionals and the runs-on/action steps, switching
unconditionally to standard GitHub-hosted runners (ubuntu-latest).

To prevent OOM kills on the 7 GB runners (especially Spark 4 jobs):
- Reduce SBT heap from 6 GB to 3 GB
- Set SERIAL_SBT_TESTS=1 to disable parallel test forking

This matches how apache/spark runs its own test suite on GitHub Actions.

Supersedes apache#4276.

The apache org appears to have runners for ubuntu-24.04 but not for
the ubuntu-latest label. Switch all workflows to ubuntu-24.04 to
match the working spark_sql_test and iceberg workflows.
@andygrove andygrove marked this pull request as draft May 11, 2026 16:20
andygrove added 2 commits May 11, 2026 10:26
Split the two largest SQL test matrix entries into -a (execution.*
subpackage) and -b (top-level + non-execution subpackages) so each
chunk runs in a fresh forked test JVM. The previous combined runs
were OOM-killed on 4.1.1 after ~25 minutes when memory accumulated
across hundreds of suites in a single SERIAL_SBT_TESTS=1 JVM.

The -b entries chain two sbt testOnly invocations: the first runs
explicit non-execution subpackage globs, the second uses ScalaTest's
-m flag to pick up the top-level direct package members.

Also tag SPARK-33084 'Add jar support Ivy URI in SQL' with IgnoreComet
in the 3.4.3 and 3.5.8 diffs; the test repeatedly fails because Maven
Central returns errors for legacy hadoop-common 2.7.2 and libfb303
0.9.3 downloads, which is unrelated to Comet.
…split

Add `with IgnoreCometSuite` to the StreamTest base trait in the Spark 4.1.1
diff so the ~66 suites that extend it (TransformWithState*, FileStreamSource,
StreamingAggregation, etc.) have their tests marked as ignored under Comet.
These streaming + RocksDB state store suites are the heaviest items in the
sql_core run, allocate native off-heap memory that lingers across suites in
the single SERIAL_SBT_TESTS=1 JVM, and were the proximate cause of the OOM
kills observed at ~25 min into the sql_core-1 and sql_core-3 runs on 4.1.1.

Streaming acceleration is not a Comet use case, so ignoring these tests has
no functional cost.

With streaming out of the picture the sql_core suite should fit in one JVM
again, so revert the earlier matrix split (sql_core-1a/1b/3a/3b) back to the
original three entries with `sql/testOnly *`. The earlier split also had a
fragility issue: new top-level Spark subpackages would silently drop out of
coverage.
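
The fragility mentioned above can be illustrated with a small sketch. It assumes the -b chunk listed its non-execution subpackages explicitly and picked up top-level suites separately; the suite names and the `newpkg` subpackage are hypothetical, and Python glob matching stands in for the sbt `testOnly` filters.

```python
from fnmatch import fnmatchcase

# Hypothetical suite names; only the package layout matters here.
suites = [
    "org.apache.spark.sql.DataFrameSuite",
    "org.apache.spark.sql.execution.SortSuite",
    "org.apache.spark.sql.connector.DataSourceV2Suite",
    "org.apache.spark.sql.newpkg.BrandNewSuite",  # subpackage added after the split
]

# Chunk -a: the execution.* subpackage.
chunk_a = ["org.apache.spark.sql.execution.*"]
# Chunk -b: explicit globs for the other subpackages known at split time.
chunk_b = ["org.apache.spark.sql.connector.*"]

ROOT = "org.apache.spark.sql."


def is_top_level(name):
    """Direct member of the root package (no further dots after the prefix)."""
    return name.startswith(ROOT) and "." not in name[len(ROOT):]


def covered(name):
    """True if the suite is picked up by the top-level pass or by a chunk glob."""
    if is_top_level(name):
        return True
    return any(fnmatchcase(name, glob) for glob in chunk_a + chunk_b)


missed = [s for s in suites if not covered(s)]
print(missed)
```

In this model the suite in the new subpackage matches neither chunk and never runs, with no failure to flag it. A single `*` filter has no such gap, which is one reason to prefer the original three entries.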
@andygrove andygrove marked this pull request as ready for review May 11, 2026 17:46
SBT_MEM: "3072"
# Disable parallel test execution to reduce peak memory usage —
# mirrors what apache/spark does on GitHub Actions.
SERIAL_SBT_TESTS: "1"
Contributor

👍 this probably is the fix?

Member Author

no, that alone did not help - skipping the new streaming tests in 4.1 is hopefully going to get this PR green

Contributor

@comphead comphead left a comment

nice work, thanks @andygrove
CI pending

@andygrove andygrove merged commit aa37736 into apache:main May 11, 2026
125 checks passed
@andygrove andygrove deleted the ci/fix-oom-standard-runners branch May 11, 2026 18:47
@andygrove andygrove mentioned this pull request May 11, 2026

2 participants