[SPARK-56934][4.X][INFRA] Make build_infra_images_cache workflow error tolerant by zhengruifeng · Pull Request #56004 · apache/spark

zhengruifeng · 2026-05-20T01:50:47Z

Backport of #55972 to branch-4.x. Cherry-picked from master commit 83b2d07; resolved a trivial conflict on the docker/build-push-action SHA pin (kept the version already in use on branch-4.x).

What changes were proposed in this pull request?

Make the build_infra_images_cache.yml workflow tolerant of individual image build failures:

- Add continue-on-error: true to each of the 12 Build and push steps so a failure in one does not abort the remaining builds. In particular, a failure of the base ./dev/infra/ image build should no longer prevent the other image builds from running.
- Add a final "Fail if any image build failed" step that runs with if: always(), prints each build step's outcome, and exits non-zero if any was failure.
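The decision logic of the final aggregator step can be modeled roughly as follows. This is a Python sketch for illustration only: the real step lives in the GitHub Actions workflow, and the function name `summarize_outcomes` and the step ids in the usage lines are hypothetical.

```python
from typing import Dict


def summarize_outcomes(outcomes: Dict[str, str]) -> int:
    """Print each build step's outcome and return a process exit code.

    `outcomes` maps a (hypothetical) step id to the value GitHub Actions
    exposes as `steps.<id>.outcome`: "success", "failure", "cancelled",
    or "skipped".
    """
    failed = []
    for step_id, outcome in outcomes.items():
        print(f"{step_id}: {outcome}")
        if outcome == "failure":
            failed.append(step_id)
    if failed:
        print("Failed image builds: " + ", ".join(failed))
        return 1  # non-zero: at least one image build failed
    return 0


# Hypothetical step ids; the real workflow defines its own.
exit_code = summarize_outcomes({"build_base": "failure", "build_docs": "success"})
```

The aggregator has to read `outcome` rather than `conclusion`: with `continue-on-error: true`, a failed step still reports `conclusion` as `success`, while `outcome` keeps the result from before that setting is applied.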

Why are the changes needed?

Today, a single image build failure aborts the workflow immediately, leaving the remaining cache layers stale until someone re-triggers the job. This is especially impactful when the first step (the ./dev/infra/ base image) fails, because every subsequent image build is then skipped on that run. With this change every image still gets a chance to build and refresh its cache on each run, while the overall workflow still fails if any image build did not succeed.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

YAML parses cleanly (python3 -c "import yaml; yaml.safe_load(...)"). Verified all 12 build steps received continue-on-error: true and that the final aggregator step references every build step's outcome.
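A check along these lines could automate the verification described above. It is a sketch under assumptions, not the procedure actually used for this patch: it operates on the dictionary that `yaml.safe_load` would return for the workflow file, and the function name `find_problems`, the sample step ids, and the sample `run` script are all hypothetical.

```python
import json
from typing import Any, Dict, List

AGGREGATOR_NAME = "Fail if any image build failed"


def find_problems(workflow: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the check passed."""
    problems: List[str] = []
    for job_name, job in workflow.get("jobs", {}).items():
        steps = job.get("steps", [])
        builds = [s for s in steps
                  if str(s.get("name", "")).startswith("Build and push")]
        # Serialize the aggregator step so references in `run` or `env` are both found.
        aggregator_text = json.dumps(
            [s for s in steps if s.get("name") == AGGREGATOR_NAME])
        for step in builds:
            label = f"{job_name}/{step.get('id', step.get('name'))}"
            if step.get("continue-on-error") is not True:
                problems.append(f"{label}: missing continue-on-error: true")
            if f"steps.{step.get('id')}.outcome" not in aggregator_text:
                problems.append(f"{label}: outcome not referenced by aggregator")
    return problems


# Hypothetical, heavily reduced stand-in for the parsed workflow file.
sample = {"jobs": {"main": {"steps": [
    {"name": "Build and push", "id": "build_base", "continue-on-error": True},
    {"name": "Build and push (Documentation)", "id": "build_docs",
     "continue-on-error": True},
    {"name": AGGREGATOR_NAME, "if": "always()",
     "run": "echo ${{ steps.build_base.outcome }} ${{ steps.build_docs.outcome }}"},
]}}}

print(find_problems(sample))
```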

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

…erant

Make the `build_infra_images_cache.yml` workflow tolerant of individual image build failures:

- Add `continue-on-error: true` to each of the 12 `Build and push` steps so a failure in one does not abort the remaining builds. In particular, a failure of the base `./dev/infra/` image build should no longer prevent the other image builds from running.
- Add a final "Fail if any image build failed" step that runs with `if: always()`, prints each build step's `outcome`, and exits non-zero if any was `failure`.

Today, a single image build failure aborts the workflow immediately, leaving the remaining cache layers stale until someone re-triggers the job. This is especially impactful when the first step (the `./dev/infra/` base image) fails, because every subsequent image build is then skipped on that run. With this change every image still gets a chance to build and refresh its cache on each run, while the overall workflow still fails if any image build did not succeed.

No.

YAML parses cleanly (`python3 -c "import yaml; yaml.safe_load(...)"`). Verified all 12 build steps received `continue-on-error: true` and that the final aggregator step references every build step's `outcome`.

Generated-by: Claude Code (Claude Opus 4.7)

Closes apache#55972 from zhengruifeng/build-infra-images-continue-on-error.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>

…r tolerant

Backport of #55972 to `branch-4.x`. Cherry-picked from master commit 83b2d07; resolved a trivial conflict on the `docker/build-push-action` SHA pin (kept the version already in use on `branch-4.x`).

### What changes were proposed in this pull request?

Make the `build_infra_images_cache.yml` workflow tolerant of individual image build failures:

- Add `continue-on-error: true` to each of the 12 `Build and push` steps so a failure in one does not abort the remaining builds. In particular, a failure of the base `./dev/infra/` image build should no longer prevent the other image builds from running.
- Add a final "Fail if any image build failed" step that runs with `if: always()`, prints each build step's `outcome`, and exits non-zero if any was `failure`.

### Why are the changes needed?

Today, a single image build failure aborts the workflow immediately, leaving the remaining cache layers stale until someone re-triggers the job. This is especially impactful when the first step (the `./dev/infra/` base image) fails, because every subsequent image build is then skipped on that run. With this change every image still gets a chance to build and refresh its cache on each run, while the overall workflow still fails if any image build did not succeed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

YAML parses cleanly (`python3 -c "import yaml; yaml.safe_load(...)"`). Verified all 12 build steps received `continue-on-error: true` and that the final aggregator step references every build step's `outcome`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

Closes #56004 from zhengruifeng/build-infra-images-continue-on-error-4.x.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>

zhengruifeng · 2026-05-20T02:01:30Z

merged to 4.x


…_pr.py

### What changes were proposed in this pull request?

When `dev/merge_spark_pr.py` merges a PR whose target branch is not the repository's default (e.g. backport PRs against `branch-X.Y`), explicitly close the PR through the GitHub REST API after the push. Specifically:

- Add a small `close_pr(pr_num)` helper that issues an authenticated `PATCH /pulls/{n}` with `{"state": "closed"}`.
- After a successful `git push` in `merge_pr()`, fetch the PR via `GET /pulls/{n}` and, if the state is still `"open"`, call `close_pr(pr_num)`.

### Why are the changes needed?

The squash-merge commit message already contains `Closes #N from ...`, which GitHub treats as a closing keyword. However, GitHub honors closing keywords **only when the commit lands on the repository's default branch**. Backport PRs that target `branch-X.Y` therefore got merged successfully but stayed open on GitHub, and the committer had to close them manually. #56004 was a recent example.

Checking the PR state after the push (instead of comparing `target_ref` to the default branch up front) keeps the change small and self-correcting: it does the right thing whenever GitHub's auto-close did not fire, without having to encode the default-branch rule in the script.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Compile-checked with `python3 -m py_compile dev/merge_spark_pr.py` and the existing doctests still pass via `python3 -m doctest dev/merge_spark_pr.py`. End-to-end merge behavior will be validated the next time a committer runs the script on a backport PR.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

Closes #56007 from zhengruifeng/merge-script-close-backport-pr.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit 4988ef1)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
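The close-after-push behavior described in this commit message could look roughly like the sketch below. It is not the code from `dev/merge_spark_pr.py`: the names `build_close_request` and `close_if_still_open`, the `API_BASE` constant, the token handling, and the header choices are assumptions made for illustration. Only the `PATCH /pulls/{n}` call with `{"state": "closed"}` and the "close only if still open" check come from the description.

```python
import json
import urllib.request
from typing import Callable

# Assumed API base for the repository; the real script may build this differently.
API_BASE = "https://api.github.com/repos/apache/spark"


def build_close_request(pr_num: int, token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated PATCH that closes the PR."""
    body = json.dumps({"state": "closed"}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/pulls/{pr_num}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )


def close_if_still_open(
    pr_num: int,
    get_state: Callable[[int], str],
    close: Callable[[int], None],
) -> bool:
    """Close the PR only when GitHub's auto-close did not fire.

    `get_state` and `close` are injected so the logic can be exercised
    without network access. Returns True if `close` was called.
    """
    if get_state(pr_num) == "open":
        close(pr_num)
        return True
    return False
```

Sending the request would be a matter of passing the result of `build_close_request` to `urllib.request.urlopen`. Checking the state first means the helper does nothing on default-branch merges, where the closing keyword has already closed the PR.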

zhengruifeng changed the title ~~[SPARK-56934][INFRA] Make build_infra_images_cache workflow error tolerant~~ [SPARK-56934][4.X][INFRA] Make build_infra_images_cache workflow error tolerant May 20, 2026

zhengruifeng closed this May 20, 2026

zhengruifeng deleted the build-infra-images-continue-on-error-4.x branch May 20, 2026 02:12

zhengruifeng mentioned this pull request May 20, 2026

[SPARK-56963][INFRA] Auto-close non-default-branch PRs in merge_spark_pr.py #56007

Closed

zhengruifeng wants to merge 1 commit into apache:branch-4.x from zhengruifeng:build-infra-images-continue-on-error-4.x
