[SPARK-56934][4.X][INFRA] Make build_infra_images_cache workflow error tolerant #56004
Closed
zhengruifeng wants to merge 1 commit into
…erant
### What changes were proposed in this pull request?
Make the `build_infra_images_cache.yml` workflow tolerant of individual image build failures:
- Add `continue-on-error: true` to each of the 12 `Build and push` steps so a failure in one does not abort the remaining builds. In particular, a failure of the base `./dev/infra/` image build should no longer prevent the other image builds from running.
- Add a final "Fail if any image build failed" step that runs with `if: always()`, prints each build step's `outcome`, and exits non-zero if any was `failure`.
### Why are the changes needed?
Today, a single image build failure aborts the workflow immediately, leaving the remaining cache layers stale until someone re-triggers the job. This is especially impactful when the first step (the `./dev/infra/` base image) fails, because every subsequent image build is then skipped on that run. With this change every image still gets a chance to build and refresh its cache on each run, while the overall workflow still fails if any image build did not succeed.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
YAML parses cleanly (`python3 -c "import yaml; yaml.safe_load(...)"`). Verified all 12 build steps received `continue-on-error: true` and that the final aggregator step references every build step's `outcome`.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.7)
Closes apache#55972 from zhengruifeng/build-infra-images-continue-on-error.
Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
zhengruifeng added a commit that referenced this pull request on May 20, 2026
…r tolerant
Backport of #55972 to `branch-4.x`. Cherry-picked from master commit 83b2d07; resolved a trivial conflict on the `docker/build-push-action` SHA pin (kept the version already in use on `branch-4.x`).
### What changes were proposed in this pull request?
Make the `build_infra_images_cache.yml` workflow tolerant of individual image build failures:
- Add `continue-on-error: true` to each of the 12 `Build and push` steps so a failure in one does not abort the remaining builds. In particular, a failure of the base `./dev/infra/` image build should no longer prevent the other image builds from running.
- Add a final "Fail if any image build failed" step that runs with `if: always()`, prints each build step's `outcome`, and exits non-zero if any was `failure`.
### Why are the changes needed?
Today, a single image build failure aborts the workflow immediately, leaving the remaining cache layers stale until someone re-triggers the job. This is especially impactful when the first step (the `./dev/infra/` base image) fails, because every subsequent image build is then skipped on that run. With this change every image still gets a chance to build and refresh its cache on each run, while the overall workflow still fails if any image build did not succeed.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
YAML parses cleanly (`python3 -c "import yaml; yaml.safe_load(...)"`). Verified all 12 build steps received `continue-on-error: true` and that the final aggregator step references every build step's `outcome`.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.7)
Closes #56004 from zhengruifeng/build-infra-images-continue-on-error-4.x.
Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
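The final "Fail if any image build failed" step described in this commit message can be modeled in a few lines. The following is only a Python sketch of that logic, not the step's actual implementation (which is not shown here); the step ids in the usage lines are hypothetical. It relies on the fact that, in GitHub Actions, a step's `outcome` is its result before `continue-on-error` is applied, so a failed build still reports `failure` there even though the job keeps running.

```python
def summarize_outcomes(outcomes):
    """Print every build step's outcome and return the ids that failed.

    `outcomes` maps a step id to the string GitHub Actions would expose as
    `steps.<id>.outcome` ("success", "failure", "cancelled" or "skipped").
    """
    failed = []
    for step_id, outcome in outcomes.items():
        print(f"{step_id}: {outcome}")
        if outcome == "failure":
            failed.append(step_id)
    return failed


def exit_code(outcomes):
    """Return 1 if any build step failed, otherwise 0."""
    return 1 if summarize_outcomes(outcomes) else 0


# Hypothetical step ids, for illustration only.
example = {"build_base": "failure", "build_docs": "success"}
status = exit_code(example)  # non-zero because build_base failed
```

With this shape, a failure of the base image no longer hides the results of the other builds, yet the run as a whole is still reported as failed.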
Contributor (Author)
merged to 4.x
zhengruifeng added a commit that referenced this pull request on May 20, 2026
…_pr.py
### What changes were proposed in this pull request?
When `dev/merge_spark_pr.py` merges a PR whose target branch is not the repository's default (e.g. backport PRs against `branch-X.Y`), explicitly close the PR through the GitHub REST API after the push. Specifically:
- Add a small `close_pr(pr_num)` helper that issues an authenticated `PATCH /pulls/{n}` with `{"state": "closed"}`.
- After a successful `git push` in `merge_pr()`, fetch the PR via `GET /pulls/{n}` and, if the state is still `"open"`, call `close_pr(pr_num)`.
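A helper like the `close_pr` described above might look roughly as follows. This is a minimal sketch using only the standard library, not the code in `dev/merge_spark_pr.py`: the `GITHUB_API_BASE` constant, the explicit `token` parameter, and the `build_close_pr_request` helper are assumptions made for illustration.

```python
import json
import urllib.request

# Assumed base URL; the real script may construct it differently.
GITHUB_API_BASE = "https://api.github.com/repos/apache/spark"


def build_close_pr_request(pr_num, token):
    """Build (but do not send) an authenticated PATCH that closes a PR."""
    body = json.dumps({"state": "closed"}).encode("utf-8")
    return urllib.request.Request(
        f"{GITHUB_API_BASE}/pulls/{pr_num}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )


def close_pr(pr_num, token):
    """Send the PATCH and return the decoded JSON response."""
    request = build_close_pr_request(pr_num, token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Splitting request construction from sending keeps the network call in one place, so the request itself can be inspected without contacting GitHub.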
### Why are the changes needed?
The squash-merge commit message already contains `Closes #N from ...`, which GitHub treats as a closing keyword. However, GitHub honors closing keywords **only when the commit lands on the repository's default branch**. Backport PRs that target `branch-X.Y` therefore got merged successfully but stayed open on GitHub, and the committer had to close them manually. #56004 was a recent example.
Checking the PR state after the push (instead of comparing `target_ref` to the default branch up front) keeps the change small and self-correcting: it does the right thing whenever GitHub's auto-close did not fire, without having to encode the default-branch rule in the script.
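The post-push check can be sketched as below. This is an illustration under stated assumptions rather than the script's actual code: `close_if_still_open` is a hypothetical name, and the fetch and close operations are passed in as callables so the sketch runs without network access.

```python
def close_if_still_open(pr_num, get_pr, close_pr):
    """Close the PR only if GitHub did not already close it.

    `get_pr(pr_num)` is assumed to return the decoded JSON of
    `GET /pulls/{n}`; `close_pr(pr_num)` is assumed to issue the PATCH.
    Returns True if an explicit close was issued.
    """
    pr = get_pr(pr_num)
    if pr.get("state") == "open":
        close_pr(pr_num)
        return True
    return False


# Usage with stand-in callables: a backport PR that stayed open after the push.
closed_prs = []
issued = close_if_still_open(56004, lambda n: {"state": "open"}, closed_prs.append)
```

Because the decision depends only on the observed state, a PR that GitHub auto-closed (a merge into the default branch) is left untouched, and no branch comparison is needed.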
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Compile-checked with `python3 -m py_compile dev/merge_spark_pr.py` and the existing doctests still pass via `python3 -m doctest dev/merge_spark_pr.py`. End-to-end merge behavior will be validated the next time a committer runs the script on a backport PR.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.7)
Closes #56007 from zhengruifeng/merge-script-close-backport-pr.
Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
zhengruifeng added a commit that referenced this pull request on May 20, 2026
…_pr.py
(cherry picked from commit 4988ef1)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
Backport of #55972 to `branch-4.x`. Cherry-picked from master commit 83b2d07; resolved a trivial conflict on the `docker/build-push-action` SHA pin (kept the version already in use on `branch-4.x`).
### What changes were proposed in this pull request?
Make the `build_infra_images_cache.yml` workflow tolerant of individual image build failures:
- Add `continue-on-error: true` to each of the 12 `Build and push` steps so a failure in one does not abort the remaining builds. In particular, a failure of the base `./dev/infra/` image build should no longer prevent the other image builds from running.
- Add a final "Fail if any image build failed" step that runs with `if: always()`, prints each build step's `outcome`, and exits non-zero if any was `failure`.
### Why are the changes needed?
Today, a single image build failure aborts the workflow immediately, leaving the remaining cache layers stale until someone re-triggers the job. This is especially impactful when the first step (the `./dev/infra/` base image) fails, because every subsequent image build is then skipped on that run. With this change every image still gets a chance to build and refresh its cache on each run, while the overall workflow still fails if any image build did not succeed.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
YAML parses cleanly (`python3 -c "import yaml; yaml.safe_load(...)"`). Verified all 12 build steps received `continue-on-error: true` and that the final aggregator step references every build step's `outcome`.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.7)
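The manual verification described under "How was this patch tested?" could be automated along these lines. This is a hedged sketch, not a check that ships with the PR: `check_workflow` is a hypothetical helper, it assumes the build steps' names start with `Build and push` and that each has an `id`, and it takes an already-parsed workflow (for example the result of `yaml.safe_load`) so that it runs without PyYAML.

```python
def check_workflow(workflow, aggregator_name="Fail if any image build failed"):
    """Return a list of human-readable problems found in a parsed workflow."""
    problems = []
    for job_name, job in workflow.get("jobs", {}).items():
        steps = job.get("steps", [])
        build_steps = [
            s for s in steps if str(s.get("name", "")).startswith("Build and push")
        ]
        aggregator = next(
            (s for s in steps if s.get("name") == aggregator_name), None
        )
        if build_steps and aggregator is None:
            problems.append(f"{job_name}: no '{aggregator_name}' step")
        # Stringify the whole step so references in `run` or `env` both count.
        aggregator_text = str(aggregator) if aggregator is not None else ""
        for step in build_steps:
            label = step.get("id") or step.get("name")
            if step.get("continue-on-error") is not True:
                problems.append(f"{job_name}/{label}: missing continue-on-error: true")
            if "id" not in step:
                problems.append(f"{job_name}/{label}: step has no id")
            elif aggregator is not None and (
                f"steps.{step['id']}.outcome" not in aggregator_text
            ):
                problems.append(f"{job_name}/{label}: outcome not checked by aggregator")
    return problems
```

An empty return value corresponds to the state the PR describes: every build step tolerates failure, and the aggregator inspects every build step's `outcome`.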