build: fix OOM on standard GitHub runners for Spark SQL tests by andygrove · Pull Request #4285 · apache/datafusion-comet

andygrove · 2026-05-11T13:11:56Z

Summary

- Gates runs-on.com usage behind the vars.USE_RUNS_ON repository variable so CI falls back to standard GitHub-hosted runners when the ASF cloud runners are unavailable (incorporates ci: vars.USE_RUNS_ON #4276)
- On standard runners (7 GB RAM), reduces SBT heap from 6 GB to 3 GB and sets SERIAL_SBT_TESTS=1 to disable parallel test forking, matching what apache/spark does in its own GitHub Actions workflows to avoid OOM kills

Details

The ASF took away the runs-on.com runners Comet was using. On the regular 7 GB GitHub runners, Spark 4 jobs get OOM-killed because:

- `-mem 6144` requests nearly all available RAM for the SBT launcher alone
- Parallel test forking spawns additional JVMs that push total usage over the limit

Apache Spark's own CI solves this by using SERIAL_SBT_TESTS=1 (sequential test execution) and a 4 GB heap cap. This PR takes the same approach with a 3 GB SBT heap to leave headroom for the native Comet library.

When vars.USE_RUNS_ON is set to 'true' in the repository settings, the previous behavior (16-CPU cloud runners, 6 GB SBT heap, parallel tests) is restored.
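
The gating described above could be expressed in a workflow roughly as follows. This is a minimal sketch under assumptions, not the workflow changed by this PR: the workflow, job, and step names, the `hypothetical-cloud-runner-label` placeholder, and the way `SBT_MEM` reaches `build/sbt` are illustrative. Only `vars.USE_RUNS_ON`, `SBT_MEM`, `SERIAL_SBT_TESTS`, the heap sizes, and `sql/testOnly *` come from the PR text. Note that a later commit in this PR removes the conditional and always uses standard runners.

```yaml
# Sketch only: names marked "hypothetical" are not taken from the Comet workflows.
name: spark-sql-tests-sketch  # hypothetical workflow name
on: [pull_request]

jobs:
  spark-sql-tests:  # hypothetical job name
    # Use the cloud runner only when the repository variable is exactly 'true';
    # otherwise fall back to a standard GitHub-hosted runner.
    runs-on: ${{ vars.USE_RUNS_ON == 'true' && 'hypothetical-cloud-runner-label' || 'ubuntu-24.04' }}
    steps:
      - uses: actions/checkout@v4

      - name: Use the full heap on cloud runners
        if: ${{ vars.USE_RUNS_ON == 'true' }}
        run: echo "SBT_MEM=6144" >> "$GITHUB_ENV"

      - name: Limit memory on standard runners
        if: ${{ vars.USE_RUNS_ON != 'true' }}
        run: |
          echo "SBT_MEM=3072" >> "$GITHUB_ENV"
          # Disable parallel test forking to reduce peak memory usage.
          echo "SERIAL_SBT_TESTS=1" >> "$GITHUB_ENV"

      - name: Run Spark SQL tests
        # Assumptions: the working directory is a Spark checkout that provides
        # build/sbt, and the heap size is passed to the SBT launcher via -mem.
        run: build/sbt -mem "$SBT_MEM" "sql/testOnly *"
```

Writing the values to `$GITHUB_ENV` in conditional steps sidesteps the `cond && a || b` expression idiom for the environment values, which falls through to `b` whenever `a` is an empty string.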

Supersedes #4276.

Test plan

- Verify Spark 4.0/4.1 SQL test jobs no longer get OOM-killed on ubuntu-latest
- Verify Spark 3.4/3.5 jobs still pass (they were less memory-hungry but benefit from the same fix)
- When vars.USE_RUNS_ON is re-enabled, verify jobs use the cloud runners with full memory

Gate runs-on.com usage behind vars.USE_RUNS_ON so CI falls back to standard GitHub runners when the ASF runners are unavailable. On the standard runners (7 GB RAM), reduce SBT heap from 6 GB to 3 GB and set SERIAL_SBT_TESTS=1 to disable parallel test forking — matching what apache/spark does in its own GitHub Actions workflows. Supersedes apache#4276.

The ASF disabled the runs-on.com cloud runners for this repo. Remove all runs-on.com conditionals and the runs-on/action steps, switching unconditionally to standard GitHub-hosted runners (ubuntu-latest).

To prevent OOM kills on the 7 GB runners (especially Spark 4 jobs):

- Reduce SBT heap from 6 GB to 3 GB
- Set SERIAL_SBT_TESTS=1 to disable parallel test forking

This matches how apache/spark runs its own test suite on GitHub Actions. Supersedes apache#4276.

The apache org appears to have runners for ubuntu-24.04 but not for the ubuntu-latest label. Switch all workflows to ubuntu-24.04 to match the working spark_sql_test and iceberg workflows.

Split the two largest SQL test matrix entries into -a (execution.* subpackage) and -b (top-level + non-execution subpackages) so each chunk runs in a fresh forked test JVM. The previous combined runs were OOM-killed on 4.1.1 after ~25 minutes when memory accumulated across hundreds of suites in a single SERIAL_SBT_TESTS=1 JVM.

The -b entries chain two sbt testOnly invocations: the first runs explicit non-execution subpackage globs, the second uses ScalaTest's -m flag to pick up the top-level direct package members.

Also tag SPARK-33084 'Add jar support Ivy URI in SQL' with IgnoreComet in the 3.4.3 and 3.5.8 diffs; the test repeatedly fails because Maven Central returns errors for legacy hadoop-common 2.7.2 and libfb303 0.9.3 downloads, which is unrelated to Comet.
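
For illustration, the -a/-b split described in this commit might look like the matrix below. This is a hedged sketch, not the entries from the commit: the workflow and job names are hypothetical, the entry names borrow the sql_core-1a/1b naming used elsewhere in this PR, the `connector` and `streaming` globs are placeholders for the real non-execution subpackage list (which is not shown here), and the `org.apache.spark.sql` package root is assumed from Spark's layout.

```yaml
# Sketch only: matrix entries and package globs are illustrative placeholders.
name: sql-core-split-sketch  # hypothetical workflow name
on: [pull_request]

jobs:
  sql-core:  # hypothetical job name
    runs-on: ubuntu-24.04
    strategy:
      matrix:
        include:
          # -a: only the execution.* subpackage
          - name: sql_core-1a
            commands: |
              build/sbt "sql/testOnly org.apache.spark.sql.execution.*"
          # -b: two chained invocations, each starting its own sbt process
          - name: sql_core-1b
            commands: |
              build/sbt "sql/testOnly org.apache.spark.sql.connector.* org.apache.spark.sql.streaming.*"
              build/sbt "sql/testOnly -- -m org.apache.spark.sql"
    env:
      SERIAL_SBT_TESTS: "1"
    steps:
      - uses: actions/checkout@v4
      - name: Run ${{ matrix.name }}
        # Assumption: the working directory is a Spark checkout that provides build/sbt.
        run: ${{ matrix.commands }}
```

ScalaTest's `-m` argument selects suites that are direct members of the named package without descending into subpackages, which is why the second invocation in the -b entry picks up only the top-level suites.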

…split

Add `with IgnoreCometSuite` to the StreamTest base trait in the Spark 4.1.1 diff so the ~66 suites that extend it (TransformWithState*, FileStreamSource, StreamingAggregation, etc.) have their tests marked as ignored under Comet.

These streaming + RocksDB state store suites are the heaviest items in the sql_core run, allocate native off-heap memory that lingers across suites in the single SERIAL_SBT_TESTS=1 JVM, and were the proximate cause of the OOM kills observed at ~25 min into the sql_core-1 and sql_core-3 runs on 4.1.1. Streaming acceleration is not a Comet use case, so ignoring these tests has no functional cost.

With streaming out of the picture the sql_core suite should fit in one JVM again, so revert the earlier matrix split (sql_core-1a/1b/3a/3b) back to the original three entries with `sql/testOnly *`. The earlier split also had a fragility issue: new top-level Spark subpackages would silently drop out of coverage.

comphead · 2026-05-11T17:55:59Z

+          SBT_MEM: "3072"
+          # Disable parallel test execution to reduce peak memory usage —
+          # mirrors what apache/spark does on GitHub Actions.
+          SERIAL_SBT_TESTS: "1"


👍 this probably is the fix?

andygrove

no, that alone did not help - skipping the new streaming tests in 4.1 is hopefully going to get this PR green

comphead

nice work, thanks @andygrove
CI pending

andygrove added 5 commits May 11, 2026 07:11

drop one spark 4 run

70efede

build: use ubuntu-24.04 instead of ubuntu-latest

7c8f6fc

drop another run

77fbcbc

coderfender mentioned this pull request May 11, 2026

test: enable nested array cast coverage #4278

Open

andygrove marked this pull request as draft May 11, 2026 16:20

andygrove added 2 commits May 11, 2026 10:26

andygrove marked this pull request as ready for review May 11, 2026 17:46

comphead reviewed May 11, 2026

comphead approved these changes May 11, 2026

andygrove merged commit aa37736 into apache:main May 11, 2026
125 checks passed

andygrove deleted the ci/fix-oom-standard-runners branch May 11, 2026 18:47

andygrove mentioned this pull request May 11, 2026

ci: vars.USE_RUNS_ON #4276

Closed

andygrove merged 7 commits into apache:main from andygrove:ci/fix-oom-standard-runners
