
[SPARK-56768][PYTHON][INFRA] Share SBT compile artifact across pyspark CI jobs#55726

Draft
zhengruifeng wants to merge 11 commits into apache:master from zhengruifeng:share-sbt-compile-pyspark-sparkr

Conversation

@zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented May 7, 2026

What changes were proposed in this pull request?

This PR adds a single shared precompile CI job that runs Spark's SBT build once and uploads the resulting target/ trees as a GitHub Actions artifact. The 8 pyspark matrix entries plus the optional pyspark-install entry now consume that artifact instead of re-running the same SBT build themselves. The job is named generically because the same artifact can be reused by sparkr, R, or documentation jobs in follow-ups.

Concretely:

  • New precompile job in .github/workflows/build_and_test.yml runs the SBT build:
    ./build/sbt -Phadoop-3 -Pyarn -Pspark-ganglia-lgpl -Phadoop-cloud -Phive \
      -Pkubernetes -Pjvm-profiler -Pkinesis-asl -Phive-thriftserver \
      -Pdocker-integration-tests -Pvolcano \
      Test/package streaming-kinesis-asl-assembly/assembly connect/assembly assembly/package
    
    It tars every target/ directory (excluding ./build/ and ./.git/) with tar -cjf (bzip2) and uploads the archive as spark-compile-${{ github.run_id }} with retention-days: 1, so storage is reclaimed within 24h.
    The job's if: gate fires when any of pyspark, pyspark-pandas, or pyspark-install is true in the precondition output, so the artifact is always available for any matrix entry that needs it (including via inputs.jobs overrides used by scheduled / dispatched workflows).
  • The pyspark matrix job adds precompile to needs:, downloads and extracts the artifact before running tests, and sets SKIP_BUILD: true in env.
  • dev/run-tests.py now skips build_apache_spark and build_spark_assembly_sbt when SKIP_BUILD is set, matching the existing SKIP_UNIDOC / SKIP_MIMA pattern.
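The SKIP_BUILD gate can be sketched as follows. This is an illustrative reconstruction, not the literal dev/run-tests.py code: the helper name `env_flag_set` is hypothetical, but the shape matches the existing SKIP_UNIDOC / SKIP_MIMA pattern the PR description cites.

```python
import os

# Hypothetical helper mirroring how dev/run-tests.py treats presence of
# SKIP_UNIDOC / SKIP_MIMA as truthy; SKIP_BUILD follows the same pattern.
def env_flag_set(name):
    return os.environ.get(name) is not None

os.environ["SKIP_BUILD"] = "true"  # set by the pyspark matrix job's env

steps = []
if not env_flag_set("SKIP_BUILD"):
    steps.append("build_apache_spark")
    steps.append("build_spark_assembly_sbt")

# With SKIP_BUILD set, neither SBT build step runs in the matrix entry.
print(steps)  # prints []
```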

SBT invocations: before vs. after

Every pyspark matrix entry today drives dev/run-tests.py, which makes two SBT calls back-to-back (build_spark_sbt at dev/run-tests.py:647 then build_spark_assembly_sbt at dev/run-tests.py:656):

# build_spark_sbt
./build/sbt <11 profiles> Test/package streaming-kinesis-asl-assembly/assembly connect/assembly
# build_spark_assembly_sbt
./build/sbt <11 profiles> assembly/package

The 11 profiles, identical across all 8 entries: -Phadoop-3 -Pyarn -Phive -Phive-thriftserver -Pkubernetes -Phadoop-cloud -Pjvm-profiler -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pvolcano.

After this PR, with SKIP_BUILD: true set on the matrix job, both calls are gated off; no SBT compile runs in the matrix entry at all. The new precompile job runs one SBT invocation that combines all four goals (safe because SKIP_MIMA=true in the pyspark job, so the original split for dev/mima is moot here):

| | SBT compile invocations per pyspark matrix entry | Total SBT compile invocations across the matrix |
| --- | --- | --- |
| Before | 2 | 16 |
| After | 0 | 1 (in the new precompile job, all 4 goals combined) |

The produced target/ is byte-equivalent — same goals, same profiles, same Scala/Java/Hadoop versions.

Why are the changes needed?

Each of the 8 pyspark matrix jobs runs the same ~13m27s SBT compile independently. Across a single CI run that's roughly 108m of redundant compile time, against a per-run total of ~700m. This change deduplicates that work.

Estimated savings, based on a recent run of Build and test:

| | Per-run CI time |
| --- | --- |
| Redundant SBT compile today (8 pyspark matrix entries) | ~108m |
| Add back: shared build + artifact transfer | ~19m |
| Net CI compute saved per run | ~89m (~13% of total) |
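The savings arithmetic above can be sanity-checked directly from the stated figures (rounding to whole minutes):

```python
# Sanity check of the savings figures quoted in this section.
per_entry = 13 + 27 / 60          # ~13m27s SBT compile per matrix entry
redundant = 8 * per_entry         # 8 pyspark matrix entries
net_saved = redundant - 19        # minus shared build + artifact transfer
pct = net_saved / 700 * 100       # against the ~700m per-run total

print(round(redundant), round(net_saved), round(pct))  # prints: 108 89 13
```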

Wall clock of the workflow is roughly unchanged. The build was previously parallel-hidden inside each matrix runner; sharing it serializes one ~13m build before the matrix, but the slowest matrix runner shrinks by the same amount, so the critical path is similar (within a few minutes).

Does this PR introduce any user-facing change?

No. CI infrastructure change only.

How was this patch tested?

The change is exercised by the CI run of this PR itself:

  • If precompile succeeds and produces an artifact of reasonable size, the build phase works.
  • If the pyspark matrix completes normally on top of the downloaded artifact, the artifact is sufficient and SKIP_BUILD is correctly skipping the local compile.

A few things to watch in the first run:

  • Artifact size. Spark's combined target/ is roughly 1-3 GB raw; expect ~600 MB-1 GB after bzip2. The "Package compile output" step prints the size with ls -lh. If it ever gets close to GHA's 10 GB per-artifact cap we should slim the find pattern (e.g., exclude target/streams and intermediate scaladoc).
  • bzip2 in the test images. The pyspark Docker images need bzip2 for the extract step. It is in the bzip2 package and present in every standard Ubuntu base image.

The doctests in dev/sparktestsupport/utils.py continue to pass; no logic in is-changed.py or the module graph was changed.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

The pyspark matrix (8 jobs) and sparkr each rebuild the same Spark JARs
from scratch, costing ~13m27s of SBT compile per job. With sparkr included,
this is roughly 127m of redundant SBT compile per CI run.

This change adds a `precompile-pyspark` job that runs the same SBT build
(`Test/package + streaming-kinesis-asl-assembly/assembly + connect/assembly
+ assembly/package` with the 11 standard profiles) once, tars all `target/`
directories with `zstd -3 -T0`, and uploads them as a 1-day-retention
artifact. The pyspark matrix and sparkr jobs now `needs:` this job,
download and extract the artifact, and set `SKIP_BUILD=1` so
`dev/run-tests` skips the redundant compile.

Net CI compute saved: roughly 95-105m per run, ~13-14% of total. Wall
clock is roughly unchanged - the build is now serialized before the
matrix instead of parallel-hidden inside it.

Generated-by: Claude Code (Opus 4.7)
Drop the pyspark-specific name from the shared SBT compile job. The
artifact it produces is reusable from any job that needs the same
profile/goal set, e.g. R or doc builds in follow-ups.

Generated-by: Claude Code (Opus 4.7)
The pyspark/sparkr Docker images do not ship zstd, so the "Extract
precompiled artifact" step failed with `zstd: Cannot exec: No such file
or directory`. Switch to xz, which is in xz-utils and present in every
standard Ubuntu base image.

Use `XZ_OPT='-T0 -9'` so compression is multi-threaded and at the
highest level, which is also a slightly better ratio than zstd at -3.

Generated-by: Claude Code (Opus 4.7)
Use tar's `-j` codec (bzip2). bzip2 is in every standard Ubuntu base
image (same availability as xz), and its default level is 9, so no
extra options are needed.

Generated-by: Claude Code (Opus 4.7)
REVERT BEFORE MERGE.

Force the precondition step to emit only pyspark + pyspark-pandas as
true so the PR's CI iterations skip Maven/lint/docs/sparkr/etc. and
only exercise the path this PR touches.

Generated-by: Claude Code (Opus 4.7)
Revert the sparkr job to its original shape: no `precompile` in `needs`,
no `SKIP_BUILD` env, no artifact download/extract. Also drop sparkr from
the `precompile` job's `if:` gate since it is no longer a consumer.

Generated-by: Claude Code (Opus 4.7)
@zhengruifeng zhengruifeng changed the title [INFRA] Share SBT compile artifact across pyspark and sparkr CI jobs [INFRA] Share SBT compile artifact across pyspark CI jobs May 7, 2026
@zhengruifeng zhengruifeng changed the title [INFRA] Share SBT compile artifact across pyspark CI jobs [SPARK-56768][PYTHON][INFRA] Share SBT compile artifact across pyspark CI jobs May 7, 2026
The TEMP override that forced pyspark/pyspark-pandas only was added for
iteration on this PR. With the implementation validated, restore the
normal precondition logic so the full set of jobs runs.

Generated-by: Claude Code (Opus 4.7)
The pyspark-install matrix entry has its own gate independent of the
umbrella `pyspark` flag. Through the normal precondition path the two
are correlated (pyspark-install belongs to the pyspark module list),
but via `inputs.jobs` they can be set independently. Add pyspark-install
to the precompile job's `if:` so the artifact is always available when
the matrix entry runs.

Generated-by: Claude Code (Opus 4.7)
…inishes

Adds a final job that runs after the pyspark matrix completes (whether
the matrix succeeded, failed, or was cancelled — gated only on the
precompile job succeeding) and deletes the spark-compile-<run_id>
artifact via the GitHub Actions REST API.

The artifact's `retention-days: 1` already auto-expires it within 24h,
so this is a "reclaim immediately" optimization rather than a leak fix.
Best-effort: a failed delete does not fail the workflow.

Generated-by: Claude Code (Opus 4.7)
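The best-effort delete described above can be sketched against the GitHub Actions REST API (list a run's artifacts, then DELETE by id). This is an illustrative stand-alone version; the actual workflow step and its token handling may differ, and `repo`/`token` here are placeholders.

```python
import json
import urllib.error
import urllib.request

def delete_run_artifact(repo, run_id, token):
    """Best-effort delete of the spark-compile-<run_id> artifact.

    Uses GET /repos/{repo}/actions/runs/{run_id}/artifacts to find the
    artifact id, then DELETE /repos/{repo}/actions/artifacts/{id}.
    Failures are swallowed so cleanup never fails the workflow.
    """
    base = f"https://api.github.com/repos/{repo}"
    headers = {"Authorization": f"Bearer {token}"}
    try:
        req = urllib.request.Request(
            f"{base}/actions/runs/{run_id}/artifacts", headers=headers
        )
        with urllib.request.urlopen(req) as resp:
            artifacts = json.load(resp)["artifacts"]
        for artifact in artifacts:
            if artifact["name"] == f"spark-compile-{run_id}":
                delete_req = urllib.request.Request(
                    f"{base}/actions/artifacts/{artifact['id']}",
                    headers=headers,
                    method="DELETE",
                )
                urllib.request.urlopen(delete_req)
    except (urllib.error.URLError, KeyError):
        pass  # best-effort: a failed delete must not fail the workflow
```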
The precompile job runs on a fresh ubuntu-latest runner with ~14 GB free
out of the box. The full SBT build plus the resulting bzip2 artifact fits
comfortably; the disk-cleanup step (which removes Android SDK, .NET, etc.
from the runner image) added ~10s for no benefit.

Generated-by: Claude Code (Opus 4.7)
@gaogaotiantian
Contributor

First of all, I think this is super useful.

This extra task, however, takes an extra slot from our 20-concurrent-job limit. I'm definitely not saying we shouldn't do this (we should definitely do it, without any question), but we also need to think about whether it's worth giving it a separate slot, or whether we can combine it with some other prereq job (for example, when we detect that we need to build, we just build).

Another observation is this

| Task | download/upload | tar/untar |
| --- | --- | --- |
| Compile | 1m2s | 4m52s |
| Use | 25s | 2m38s |

which makes me wonder: maybe we should use a less aggressive compression algorithm, like zstd or gzip? We are spending much more time compressing/decompressing than uploading/downloading.
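The tradeoff can be probed with a quick sketch using Python's stdlib codecs. The payload here is synthetic; real target/ trees (class files, JARs) compress very differently, so treat any numbers as illustrative only.

```python
import bz2
import gzip
import time

# Synthetic, mildly compressible payload standing in for a target/ tree.
data = (b"org/apache/spark/SparkContext.class " * 8192) + bytes(range(256)) * 256

for name, compress in (("gzip", gzip.compress), ("bz2", bz2.compress)):
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    ratio = len(out) / len(data)
    print(f"{name}: ratio={ratio:.3f}, time={elapsed * 1000:.1f} ms")
```

On typical inputs bzip2 compresses harder but markedly slower than gzip, which is the asymmetry the timing table above reflects.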
