
[SPARK-56768][PYTHON][INFRA] Share SBT compile artifact across pyspark CI jobs#55726

Draft
zhengruifeng wants to merge 11 commits into apache:master from zhengruifeng:share-sbt-compile-pyspark-sparkr

Conversation

@zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented May 7, 2026

What changes were proposed in this pull request?

This PR adds a single shared precompile CI job that runs Spark's SBT build once and uploads the resulting target/ trees as a GitHub Actions artifact. The 8 pyspark matrix entries plus the optional pyspark-install entry now consume that artifact instead of re-running the same SBT build themselves. The job is named generically because the same artifact can be reused by sparkr, R, or documentation jobs in follow-ups.

Concretely:

  • New precompile job in .github/workflows/build_and_test.yml runs the SBT build:
    ./build/sbt -Phadoop-3 -Pyarn -Pspark-ganglia-lgpl -Phadoop-cloud -Phive \
      -Pkubernetes -Pjvm-profiler -Pkinesis-asl -Phive-thriftserver \
      -Pdocker-integration-tests -Pvolcano \
      Test/package streaming-kinesis-asl-assembly/assembly connect/assembly assembly/package
    
    It tars every target/ directory (excluding ./build/ and ./.git/) with tar -cjf (bzip2) and uploads the archive as spark-compile-${{ github.run_id }} with retention-days: 1, so storage is reclaimed within 24h.
    The job's if: gate fires when any of pyspark, pyspark-pandas, or pyspark-install is true in the precondition output, so the artifact is always available for any matrix entry that needs it (including via inputs.jobs overrides used by scheduled / dispatched workflows).
  • The pyspark matrix job adds precompile to needs:, downloads and extracts the artifact before running tests, and sets SKIP_BUILD: true in env.
  • dev/run-tests.py now skips build_apache_spark and build_spark_assembly_sbt when SKIP_BUILD is set, matching the existing SKIP_UNIDOC / SKIP_MIMA pattern.
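The SKIP_BUILD gate can be sketched as follows. This is an illustrative reconstruction, not the literal dev/run-tests.py code: the helper name `env_flag_set` is hypothetical, but the shape matches the existing SKIP_UNIDOC / SKIP_MIMA pattern the PR description cites.

```python
import os

# Hypothetical helper mirroring how dev/run-tests.py treats presence of
# SKIP_UNIDOC / SKIP_MIMA as truthy; SKIP_BUILD follows the same pattern.
def env_flag_set(name):
    return os.environ.get(name) is not None

os.environ["SKIP_BUILD"] = "true"  # set by the pyspark matrix job's env

steps = []
if not env_flag_set("SKIP_BUILD"):
    steps.append("build_apache_spark")
    steps.append("build_spark_assembly_sbt")

# With SKIP_BUILD set, neither SBT build step runs in the matrix entry.
print(steps)  # prints []
```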

SBT invocations: before vs. after

Every pyspark matrix entry today drives dev/run-tests.py, which makes two SBT calls back-to-back (build_spark_sbt at dev/run-tests.py:647 then build_spark_assembly_sbt at dev/run-tests.py:656):

# build_spark_sbt
./build/sbt <11 profiles> Test/package streaming-kinesis-asl-assembly/assembly connect/assembly
# build_spark_assembly_sbt
./build/sbt <11 profiles> assembly/package

The 11 profiles, identical across all 8 entries: -Phadoop-3 -Pyarn -Phive -Phive-thriftserver -Pkubernetes -Phadoop-cloud -Pjvm-profiler -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pvolcano.

After this PR, with SKIP_BUILD: true set on the matrix job, both calls are gated off; no SBT compile runs in the matrix entry at all. The new precompile job runs one SBT invocation that combines all four goals (safe because SKIP_MIMA=true in the pyspark job, so the original split for dev/mima is moot here):

| | SBT compile invocations per pyspark matrix entry | Total SBT compile invocations across the matrix |
| --- | --- | --- |
| Before | 2 | 16 |
| After | 0 | 1 (in the new precompile job, all 4 goals combined) |

The produced target/ is byte-equivalent — same goals, same profiles, same Scala/Java/Hadoop versions.

Why are the changes needed?

Each of the 8 pyspark matrix jobs runs the same ~13m27s SBT compile independently. Across a single CI run that's roughly 108m of redundant compile time, against a per-run total of ~700m. This change deduplicates that work.

Estimated savings, based on a recent run of Build and test:

| | Per-run CI time |
| --- | --- |
| Redundant SBT compile today (8 pyspark matrix entries) | ~108m |
| Add back: shared build + artifact transfer | ~19m |
| Net CI compute saved per run | ~89m (~13% of total) |
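The savings arithmetic above can be sanity-checked directly from the stated figures (rounding to whole minutes):

```python
# Sanity check of the savings figures quoted in this section.
per_entry = 13 + 27 / 60          # ~13m27s SBT compile per matrix entry
redundant = 8 * per_entry         # 8 pyspark matrix entries
net_saved = redundant - 19        # minus shared build + artifact transfer
pct = net_saved / 700 * 100       # against the ~700m per-run total

print(round(redundant), round(net_saved), round(pct))  # prints: 108 89 13
```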

Wall clock of the workflow is roughly unchanged. The build was previously parallel-hidden inside each matrix runner; sharing it serializes one ~13m build before the matrix, but the slowest matrix runner shrinks by the same amount, so the critical path is similar (within a few minutes).

Does this PR introduce any user-facing change?

No. CI infrastructure change only.

How was this patch tested?

The change is exercised by the CI run of this PR itself:

  • If precompile succeeds and produces an artifact of reasonable size, the build phase works.
  • If the pyspark matrix completes normally on top of the downloaded artifact, the artifact is sufficient and SKIP_BUILD is correctly skipping the local compile.

A few things to watch in the first run:

  • Artifact size. Spark's combined target/ is roughly 1-3 GB raw; expect ~600 MB-1 GB after bzip2. The "Package compile output" step prints the size with ls -lh. If it ever gets close to GHA's 10 GB per-artifact cap we should slim the find pattern (e.g., exclude target/streams and intermediate scaladoc).
  • bzip2 in the test images. The pyspark Docker images need bzip2 for the extract step. It is in the bzip2 package and present in every standard Ubuntu base image.

The doctests in dev/sparktestsupport/utils.py continue to pass; no logic in is-changed.py or the module graph was changed.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

The pyspark matrix (8 jobs) and sparkr each rebuild the same Spark JARs
from scratch, costing ~13m27s of SBT compile per job. With sparkr included,
this is roughly 127m of redundant SBT compile per CI run.

This change adds a `precompile-pyspark` job that runs the same SBT build
(`Test/package + streaming-kinesis-asl-assembly/assembly + connect/assembly
+ assembly/package` with the 11 standard profiles) once, tars all `target/`
directories with `zstd -3 -T0`, and uploads them as a 1-day-retention
artifact. The pyspark matrix and sparkr jobs now `needs:` this job,
download and extract the artifact, and set `SKIP_BUILD=1` so
`dev/run-tests` skips the redundant compile.

Net CI compute saved: roughly 95-105m per run, ~13-14% of total. Wall
clock is roughly unchanged - the build is now serialized before the
matrix instead of parallel-hidden inside it.

Generated-by: Claude Code (Opus 4.7)
Drop the pyspark-specific name from the shared SBT compile job. The
artifact it produces is reusable from any job that needs the same
profile/goal set, e.g. R or doc builds in follow-ups.

Generated-by: Claude Code (Opus 4.7)
The pyspark/sparkr Docker images do not ship zstd, so the "Extract
precompiled artifact" step failed with `zstd: Cannot exec: No such file
or directory`. Switch to xz, which is in xz-utils and present in every
standard Ubuntu base image.

Use `XZ_OPT='-T0 -9'` so compression is multi-threaded and at the
highest level, which is also a slightly better ratio than zstd at -3.

Generated-by: Claude Code (Opus 4.7)
Use tar's `-j` codec (bzip2). bzip2 is in every standard Ubuntu base
image (same availability as xz), and its default level is 9, so no
extra options are needed.

Generated-by: Claude Code (Opus 4.7)
REVERT BEFORE MERGE.

Force the precondition step to emit only pyspark + pyspark-pandas as
true so the PR's CI iterations skip Maven/lint/docs/sparkr/etc. and
only exercise the path this PR touches.

Generated-by: Claude Code (Opus 4.7)
Revert the sparkr job to its original shape: no `precompile` in `needs`,
no `SKIP_BUILD` env, no artifact download/extract. Also drop sparkr from
the `precompile` job's `if:` gate since it is no longer a consumer.

Generated-by: Claude Code (Opus 4.7)
@zhengruifeng zhengruifeng changed the title [INFRA] Share SBT compile artifact across pyspark and sparkr CI jobs [INFRA] Share SBT compile artifact across pyspark CI jobs May 7, 2026
@zhengruifeng zhengruifeng changed the title [INFRA] Share SBT compile artifact across pyspark CI jobs [SPARK-56768][PYTHON][INFRA] Share SBT compile artifact across pyspark CI jobs May 7, 2026
The TEMP override that forced pyspark/pyspark-pandas only was added for
iteration on this PR. With the implementation validated, restore the
normal precondition logic so the full set of jobs runs.

Generated-by: Claude Code (Opus 4.7)
The pyspark-install matrix entry has its own gate independent of the
umbrella `pyspark` flag. Through the normal precondition path the two
are correlated (pyspark-install belongs to the pyspark module list),
but via `inputs.jobs` they can be set independently. Add pyspark-install
to the precompile job's `if:` so the artifact is always available when
the matrix entry runs.

Generated-by: Claude Code (Opus 4.7)
…inishes

Adds a final job that runs after the pyspark matrix completes (whether
the matrix succeeded, failed, or was cancelled — gated only on the
precompile job succeeding) and deletes the spark-compile-<run_id>
artifact via the GitHub Actions REST API.

The artifact's `retention-days: 1` already auto-expires it within 24h,
so this is a "reclaim immediately" optimization rather than a leak fix.
Best-effort: a failed delete does not fail the workflow.

Generated-by: Claude Code (Opus 4.7)
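The best-effort delete described above can be sketched against the GitHub Actions REST API (list a run's artifacts, then DELETE by id). This is an illustrative stand-alone version; the actual workflow step and its token handling may differ, and `repo`/`token` here are placeholders.

```python
import json
import urllib.error
import urllib.request

def delete_run_artifact(repo, run_id, token):
    """Best-effort delete of the spark-compile-<run_id> artifact.

    Uses GET /repos/{repo}/actions/runs/{run_id}/artifacts to find the
    artifact id, then DELETE /repos/{repo}/actions/artifacts/{id}.
    Failures are swallowed so cleanup never fails the workflow.
    """
    base = f"https://api.github.com/repos/{repo}"
    headers = {"Authorization": f"Bearer {token}"}
    try:
        req = urllib.request.Request(
            f"{base}/actions/runs/{run_id}/artifacts", headers=headers
        )
        with urllib.request.urlopen(req) as resp:
            artifacts = json.load(resp)["artifacts"]
        for artifact in artifacts:
            if artifact["name"] == f"spark-compile-{run_id}":
                delete_req = urllib.request.Request(
                    f"{base}/actions/artifacts/{artifact['id']}",
                    headers=headers,
                    method="DELETE",
                )
                urllib.request.urlopen(delete_req)
    except (urllib.error.URLError, KeyError):
        pass  # best-effort: a failed delete must not fail the workflow
```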
The precompile job runs on a fresh ubuntu-latest runner with ~14 GB free
out of the box. The full SBT build plus the resulting bzip2 artifact fits
comfortably; the disk-cleanup step (which removes Android SDK, .NET, etc.
from the runner image) added ~10s for no benefit.

Generated-by: Claude Code (Opus 4.7)
@gaogaotiantian
Contributor

First of all, I think this is super useful.

This extra task, however, takes an extra slot from our 20-concurrent-job limit. I'm definitely not saying we shouldn't do this (we should definitely do it, without any question), but we also need to think about whether it's worth giving it a separate slot, or whether we can combine it with some other prereq job (for example, when we detect that we need to build, we just build).

Another observation is this

| Task | download/upload | tar/untar |
| --- | --- | --- |
| Compile | 1m2s | 4m52s |
| Use | 25s | 2m38s |

which makes me wonder: maybe we should use a less aggressive compression algorithm, like zstd or gzip? We are spending much more time compressing/decompressing than uploading/downloading.
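The tradeoff can be probed with a quick sketch using Python's stdlib codecs. The payload here is synthetic; real target/ trees (class files, JARs) compress very differently, so treat any numbers as illustrative only.

```python
import bz2
import gzip
import time

# Synthetic, mildly compressible payload standing in for a target/ tree.
data = (b"org/apache/spark/SparkContext.class " * 8192) + bytes(range(256)) * 256

for name, compress in (("gzip", gzip.compress), ("bz2", bz2.compress)):
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    ratio = len(out) / len(data)
    print(f"{name}: ratio={ratio:.3f}, time={elapsed * 1000:.1f} ms")
```

On typical inputs bzip2 compresses harder but markedly slower than gzip, which is the asymmetry the timing table above reflects.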
