[SPARK-56934][4.X][INFRA] Make build_infra_images_cache workflow error tolerant #56004

Closed
zhengruifeng wants to merge 1 commit into apache:branch-4.x from zhengruifeng:build-infra-images-continue-on-error-4.x

Conversation

@zhengruifeng
Contributor

Backport of #55972 to `branch-4.x`. Cherry-picked from master commit 83b2d07; resolved a trivial conflict on the `docker/build-push-action` SHA pin (kept the version already in use on `branch-4.x`).

What changes were proposed in this pull request?

Make the `build_infra_images_cache.yml` workflow tolerant of individual image build failures:

- Add `continue-on-error: true` to each of the 12 `Build and push` steps so a failure in one does not abort the remaining builds. In particular, a failure of the base `./dev/infra/` image build should no longer prevent the other image builds from running.
- Add a final "Fail if any image build failed" step that runs with `if: always()`, prints each build step's `outcome`, and exits non-zero if any was `failure`.
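
The decision made by the final aggregator step can be sketched in Python. This is only an illustration of the logic described above, not the step as written in the workflow; the function names and the step ids in the usage note are hypothetical. In GitHub Actions, a step's `outcome` is its result before `continue-on-error` is applied, which is why the aggregator inspects `outcome` rather than `conclusion`.

```python
def failed_builds(outcomes: dict[str, str]) -> list[str]:
    """Print every build step's outcome and return the ids that failed.

    `outcomes` maps a (hypothetical) step id to its outcome string:
    "success", "failure", "cancelled", or "skipped".
    """
    failed = []
    for step_id, outcome in outcomes.items():
        print(f"{step_id}: {outcome}")
        if outcome == "failure":
            failed.append(step_id)
    return failed


def exit_code(outcomes: dict[str, str]) -> int:
    """Return 1 if any build step failed, else 0."""
    return 1 if failed_builds(outcomes) else 0
```

For example, `exit_code({"build_base": "failure", "build_docs": "success"})` returns 1, so the overall run would still be reported as failed even though the later builds were allowed to run.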

Why are the changes needed?

Today, a single image build failure aborts the workflow immediately, leaving the remaining cache layers stale until someone re-triggers the job. This is especially impactful when the first step (the ./dev/infra/ base image) fails, because every subsequent image build is then skipped on that run. With this change every image still gets a chance to build and refresh its cache on each run, while the overall workflow still fails if any image build did not succeed.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

YAML parses cleanly (`python3 -c "import yaml; yaml.safe_load(...)"`). Verified all 12 build steps received `continue-on-error: true` and that the final aggregator step references every build step's `outcome`.
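
A check along these lines could be automated. The sketch below assumes the workflow file has already been parsed into a plain dict (for instance with `yaml.safe_load`) and that build steps can be recognized by a name starting with `Build and push`; the function name is hypothetical and this is not the verification the author actually ran.

```python
def steps_missing_continue_on_error(
    workflow: dict, name_prefix: str = "Build and push"
) -> list[str]:
    """Return names of matching steps that lack `continue-on-error: true`.

    `workflow` is the parsed workflow mapping; every job's step list is
    scanned, and only steps whose name starts with `name_prefix` are checked.
    """
    missing = []
    for job in workflow.get("jobs", {}).values():
        for step in job.get("steps", []):
            name = step.get("name", "")
            if not name.startswith(name_prefix):
                continue
            # YAML `true` parses to the Python boolean True.
            if step.get("continue-on-error") is not True:
                missing.append(name)
    return missing
```

An empty result means every matching step carries the flag.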

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

…erant

Closes apache#55972 from zhengruifeng/build-infra-images-continue-on-error.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
@zhengruifeng changed the title from "[SPARK-56934][INFRA] Make build_infra_images_cache workflow error tolerant" to "[SPARK-56934][4.X][INFRA] Make build_infra_images_cache workflow error tolerant" on May 20, 2026
zhengruifeng added a commit that referenced this pull request May 20, 2026
…r tolerant

Closes #56004 from zhengruifeng/build-infra-images-continue-on-error-4.x.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
@zhengruifeng
Contributor Author

merged to 4.x

@zhengruifeng zhengruifeng deleted the build-infra-images-continue-on-error-4.x branch May 20, 2026 02:12
zhengruifeng added a commit that referenced this pull request May 20, 2026
…_pr.py

### What changes were proposed in this pull request?

When `dev/merge_spark_pr.py` merges a PR whose target branch is not the repository's default (e.g. backport PRs against `branch-X.Y`), explicitly close the PR through the GitHub REST API after the push. Specifically:

- Add a small `close_pr(pr_num)` helper that issues an authenticated `PATCH /pulls/{n}` with `{"state": "closed"}`.
- After a successful `git push` in `merge_pr()`, fetch the PR via `GET /pulls/{n}` and, if the state is still `"open"`, call `close_pr(pr_num)`.
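
As a rough illustration of the first bullet, the request such a helper sends could be built as follows. This is a sketch, not the code in `dev/merge_spark_pr.py`: the function name, the `token` and `repo` parameters, and the choice of `urllib` are assumptions made for the example. The endpoint shape (`PATCH /repos/{owner}/{repo}/pulls/{n}` with a JSON body) follows the GitHub REST API.

```python
import json
import urllib.request


def build_close_request(
    pr_num: int, token: str, repo: str = "apache/spark"
) -> urllib.request.Request:
    """Build (but do not send) an authenticated PATCH that closes a PR."""
    url = f"https://api.github.com/repos/{repo}/pulls/{pr_num}"
    body = json.dumps({"state": "closed"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            # Token handling is an assumption; the real script may differ.
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would issue the call; the sketch stops short of that so it can be inspected without network access.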

### Why are the changes needed?

The squash-merge commit message already contains `Closes #N from ...`, which GitHub treats as a closing keyword. However, GitHub honors closing keywords **only when the commit lands on the repository's default branch**. Backport PRs that target `branch-X.Y` therefore got merged successfully but stayed open on GitHub, and the committer had to close them manually. #56004 was a recent example.

Checking the PR state after the push (instead of comparing `target_ref` to the default branch up front) keeps the change small and self-correcting: it does the right thing whenever GitHub's auto-close did not fire, without having to encode the default-branch rule in the script.
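
The state-check approach described here can be sketched as a small function that takes the two API operations as callables. The names `close_if_still_open`, `get_pr`, and `close_pr` as parameters are illustrative, and injecting them is a choice made for this example so the logic can be exercised without GitHub access; the actual script may be structured differently.

```python
from typing import Callable


def close_if_still_open(
    pr_num: int,
    get_pr: Callable[[int], dict],
    close_pr: Callable[[int], None],
) -> bool:
    """Close the PR only if GitHub did not already auto-close it.

    `get_pr` should return the PR's JSON as a dict (as from GET /pulls/{n});
    `close_pr` should perform the PATCH. Returns True if a close was issued.
    """
    pr = get_pr(pr_num)
    if pr.get("state") == "open":
        close_pr(pr_num)
        return True
    return False
```

Because the function looks only at the observed state, it makes no assumption about which branch is the default: a PR already closed by the `Closes #N` keyword is left alone, and one that stayed open is closed.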

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Compile-checked with `python3 -m py_compile dev/merge_spark_pr.py` and the existing doctests still pass via `python3 -m doctest dev/merge_spark_pr.py`. End-to-end merge behavior will be validated the next time a committer runs the script on a backport PR.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

Closes #56007 from zhengruifeng/merge-script-close-backport-pr.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
zhengruifeng added a commit that referenced this pull request May 20, 2026
…_pr.py

(cherry picked from commit 4988ef1)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>