
set continue-on-error:true for all actions/report-to-backend #273

Closed
jp wants to merge 1 commit into game-ci:main from jp:disable-report-to-backend

Conversation

jp commented Dec 8, 2025

Changes

  • ...

Checklist

  • Read the contribution guide and accept the code of conduct
  • Readme (updated or not needed)

Summary by CodeRabbit

  • Chores
    • Enhanced workflow error handling to improve build and deployment pipeline resilience by allowing processes to continue when reporting steps encounter errors.



github-actions Bot commented Dec 8, 2025

Cat Gif


coderabbitai Bot commented Dec 8, 2025

Walkthrough

This PR adds continue-on-error: true to three reporting steps across eleven GitHub Actions workflows. The change allows workflows to continue execution when backend reporting steps fail, instead of terminating the entire job on reporting errors.
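As a rough illustration of the pattern described above (not copied from the PR diff), a reporting step with the flag might look like the sketch below. The local action path is an assumption based on the PR title, and the `status` input and the workflow name are hypothetical.

```yaml
# Minimal sketch of one reporting step; names marked below are assumptions.
name: example-image-build # hypothetical workflow name
on: workflow_dispatch

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # A local action can only be resolved after the repository is checked out.
      - uses: actions/checkout@v4

      - name: Report new build
        # Assumed path; the PR title only says "actions/report-to-backend".
        uses: ./.github/actions/report-to-backend
        # A failure in this step is recorded but does not fail the job.
        continue-on-error: true
        with:
          # Hypothetical input name and value, for illustration only.
          status: started

      - name: Build image
        run: echo "this step still runs if the report above failed"
```

Without `continue-on-error: true`, a failing report step would cause every later step with the default `if` condition to be skipped, and the job would be marked as failed.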

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Ubuntu Image Workflows: .github/workflows/new-ubuntu-base-image-requested.yml, new-ubuntu-hub-image-requested.yml, new-ubuntu-legacy-editor-image-requested.yml, new-ubuntu-post-2019-2-editor-image-requested.yml | Added continue-on-error: true to Report new build, Report publication, and Report failure steps |
| Windows Image Workflows: .github/workflows/new-windows-base-image-requested.yml, new-windows-hub-image-requested.yml, new-windows-legacy-editor-image-requested.yml, new-windows-post-2019-2-editor-image-requested.yml | Added continue-on-error: true to Report new build, Report publication, and Report failure steps |
| Retry Editor Image Workflows: .github/workflows/retry-ubuntu-editor-image-requested.yml, retry-windows-editor-image-requested.yml | Added continue-on-error: true to Report new build, Report publication, and Report failure steps |
| Test Workflow: .github/workflows/test.yml | Added continue-on-error: true to Report new build, Report build failure, and Report publication steps |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

These changes are highly repetitive and homogeneous—the same single-flag modification applied consistently across eleven workflow files with no logic changes, control flow complexity, or functional alterations.

Poem

🐰 Three reporting steps once feared to fail,
Now continue their journey through workflow's trail,
Resilient and steadfast, they press ahead,
Jobs complete their mission, no longer dread! 🚀

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The description uses only placeholder content ('...') for the Changes section, providing no actual explanation of what was modified or why. | Replace the placeholder with a detailed explanation of the changes made, such as: 'Added continue-on-error: true to all report-to-backend steps across 11 workflow files to prevent reporting failures from failing the entire workflow.' |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: adding continue-on-error:true to all report-to-backend actions across 11 workflow files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a94952 and c022174.

📒 Files selected for processing (11)
  • .github/workflows/new-ubuntu-base-image-requested.yml (3 hunks)
  • .github/workflows/new-ubuntu-hub-image-requested.yml (3 hunks)
  • .github/workflows/new-ubuntu-legacy-editor-image-requested.yml (3 hunks)
  • .github/workflows/new-ubuntu-post-2019-2-editor-image-requested.yml (3 hunks)
  • .github/workflows/new-windows-base-image-requested.yml (3 hunks)
  • .github/workflows/new-windows-hub-image-requested.yml (3 hunks)
  • .github/workflows/new-windows-legacy-editor-image-requested.yml (3 hunks)
  • .github/workflows/new-windows-post-2019-2-editor-image-requested.yml (3 hunks)
  • .github/workflows/retry-ubuntu-editor-image-requested.yml (3 hunks)
  • .github/workflows/retry-windows-editor-image-requested.yml (3 hunks)
  • .github/workflows/test.yml (3 hunks)
🔇 Additional comments (17)
.github/workflows/new-ubuntu-post-2019-2-editor-image-requested.yml (1)

51-64: Reporting steps now continue on error — good approach for resilience.

The addition of continue-on-error: true to the three reporting steps allows transient backend failures to not block the entire workflow. The conditional if guards are preserved, so the steps still only run when appropriate.

.github/workflows/new-ubuntu-hub-image-requested.yml (1)

39-50: Consistent application of resilient error handling across reporting steps.

The changes follow the same pattern as other workflows in this PR: reporting steps can fail without blocking the job. This improves operational resilience for transient backend issues.

Also applies to: 115-132, 133-146

.github/workflows/new-windows-base-image-requested.yml (1)

75-86: Reporting resilience pattern applied consistently to Windows workflows.

Identical approach to the Ubuntu variants. Cross-platform consistency is good here.

Also applies to: 149-167, 168-181

.github/workflows/test.yml (1)

51-64: Reporting tests now resilient to backend transience — appropriate for test workflows.

Since this job specifically tests the reporting system with dryRun data, adding continue-on-error: true ensures test flakiness doesn't block PR validation.

Also applies to: 64-79, 79-98

.github/workflows/new-windows-legacy-editor-image-requested.yml (1)

91-104: Consistent resilience pattern applied to legacy Windows editor workflow.

No concerns. The continue-on-error: true directive is appropriately placed on all three reporting steps.

Also applies to: 215-235, 236-252

.github/workflows/retry-windows-editor-image-requested.yml (1)

33-46: Resilience pattern correctly applied to Windows editor retry workflow.

Consistent approach across all reporting steps.

Also applies to: 154-174, 175-191

.github/workflows/retry-ubuntu-editor-image-requested.yml (1)

33-46: Ubuntu editor retry workflow updated with consistent resilience approach.

All three reporting steps properly configured to continue on error.

Also applies to: 147-167, 167-183

.github/workflows/new-windows-post-2019-2-editor-image-requested.yml (1)

99-112: Final Windows editor workflow updated consistently with reporting resilience.

All changes follow the same verified pattern. PR objective is fully realized across all 11 workflows.

Also applies to: 222-242, 243-259

.github/workflows/new-ubuntu-legacy-editor-image-requested.yml (3)

51-63: Approved: Report new build resilience.

Adding continue-on-error: true to the early reporting step is a sensible improvement that prevents backend reporting failures from blocking the build process. The change is applied consistently with the PR objective.


167-186: Approved: Report publication resilience.

Guarding the publication report with continue-on-error: true ensures that backend reporting failures won't fail an otherwise successful build. Good practice for decoupling reporting from core workflow status.


188-202: Approved: Report failure resilience.

The failure reporting step already runs conditionally (if: ${{ failure() || cancelled() }}), and adding continue-on-error: true prevents cascading errors if the backend is unavailable. Solid improvement.
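To make the interaction between the `if` guard and `continue-on-error` concrete, here is a hedged sketch of a `steps` fragment. The step id, the action path, and the `status` input are assumptions; the last step is not part of this PR and only shows one way a swallowed reporting error could still be surfaced.

```yaml
# Fragment of a job's steps; ids, paths and inputs marked below are assumptions.
steps:
  - name: Build image
    run: exit 1 # placeholder standing in for a failing build

  - name: Report failure
    id: report_failure # hypothetical step id
    # Runs only when an earlier step failed or the run was cancelled.
    if: ${{ failure() || cancelled() }}
    # If the backend is unreachable, this step's error is not propagated.
    continue-on-error: true
    uses: ./.github/actions/report-to-backend # assumed path
    with:
      status: failed # hypothetical input

  - name: Surface a swallowed reporting error
    # "outcome" is the result before continue-on-error is applied,
    # "conclusion" the result after it, so outcome still reads 'failure'.
    if: ${{ always() && steps.report_failure.outcome == 'failure' }}
    run: echo "::warning::Report failure step did not reach the backend"
```

In this fragment the job still ends as failed because the build step failed; `continue-on-error` only keeps the report step from adding a second failure on top of it.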

.github/workflows/new-windows-hub-image-requested.yml (3)

75-86: Approved: Report new build resilience (Windows platform).

Consistent application of continue-on-error: true to the reporting step. The change follows the same pattern as the Ubuntu workflows and improves resilience across platforms.


155-172: Approved: Report publication resilience (Windows platform).

Publication reporting is now resilient to backend failures on the Windows build path as well. Change is consistent with the broader PR objective.


174-187: Approved: Report failure resilience (Windows platform).

Failure reporting benefits from the same resilience guard, ensuring that a broken backend doesn't create secondary reporting failures.

.github/workflows/new-ubuntu-base-image-requested.yml (3)

39-49: Approved: Report new build resilience (base image).

Adding continue-on-error: true to the early reporting checkpoint allows the base image build to proceed even if backend reporting is unavailable. Consistent with the PR objective.


109-126: Approved: Report publication resilience (base image).

The publication report is now decoupled from workflow status, preventing reporting failures from retroactively failing a successful build.


127-140: Approved: Report failure resilience (base image).

Consistent application of the safety guard to the failure reporting step, improving overall workflow resilience.



@webbertakken
Member

This isn't expected behaviour. The backend is the leading system for which images to build.

Especially without additional in depth rationale of how it would benefit the system as a whole, I'll have to close this.

@jp
Author

jp commented Dec 8, 2025

@webbertakken the backend report is the failing step in all of the recent build attempts. Example: https://github.com/game-ci/docker/actions/runs/19115630726

Not a single GameCI build has gone through in more than a month, because the backend responds with a 500 error and breaks the pipeline.

@webbertakken
Member

I understand. But I prefer to report exactly that and fix the root cause, not break the entire flow between backend, workflows and versions page. The backend is leading for which builds to retry.

cc: @davidmfinol do you think you could check out the 500 errors?

@jp
Author

jp commented Dec 8, 2025

@webbertakken the calls to the reporting backend seem to be an unnecessary point of failure for the whole setup.
If the reporting has an issue, it breaks the whole CI pipeline.

I understand this would occasionally break the reporting when the backend is down, especially this page: https://game.ci/docs/docker/versions, but it seems more important to keep the main feature reliable than to have the pipeline break because reporting is broken.

Ideally both would work. This PR only decouples the tightly coupled CI and reporting backend; it does not remove the reporting.

@webbertakken
Member

Respectfully disagree; the backend is the core of the whole system.

If something goes wrong in it, we need to solve that, not the symptoms.

@jp
Author

jp commented Dec 8, 2025

You can find more context here: https://discord.com/channels/710946343828455455/1432711901107851275/1438231240724709640

The backend fails when sending debug info to Discord. Both the backend and Discord are single points of failure (SPOF).

@ysalmi567

@webbertakken it looks like the backend is failing because the Discord bot is/was blocked for sending too many messages (after a previous fix unblocked dozens of images that all got built within 24 hours). Again, please take a look at the Discord thread for the context.

Unfortunately the community does not have access to the Discord bot or its logs. Perhaps there's another way to unblock the build queue?

davidmfinol was looking into this, but he seems to be stretched very thin at the moment.

@webbertakken
Member

webbertakken commented Dec 8, 2025

Sure, I'm all for fixing it. All I'm saying is that we can't detach the workflow from the backend, because it's the backend that schedules the workflow in the first place, and reporting back in is an essential step (otherwise the backend will mark the build as cancelled because it never reported back in, which sometimes legitimately happens as well).

So all I'm saying is that we need a proper fix. And perhaps detach the Discord bot from the actual functionality in favour of a different way of logging (which we do not have atm).

I asked David to work on it because I was stretched thin myself as well. And that's still the case. Let's wait for David's response and perhaps one of the other maintainers to take a stab at it.

This is what the log shows right now

image

It's something that needs to be properly fixed.

@wilg

wilg commented Dec 9, 2025

A rate-limited Discord response includes a header with a retry time, so it seems like the builder should just keep retrying until the request goes through? https://discord.com/developers/docs/topics/rate-limits

Or, as a quick fix: since the Discord rate limit is fairly high (at least for this use case) at 50 requests per second, something equivalent to sleep(rand(0, 120)) would probably mitigate the issue.
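One way the random-delay idea could be sketched on the workflow side is a small step placed before the report. This assumes the bursts come from many jobs reporting at the same moment, which the thread does not confirm; the action path is likewise an assumption.

```yaml
# Fragment of a job's steps; a sketch of the jitter idea, not a tested fix.
steps:
  - name: Stagger backend reports
    # bash is available on both Ubuntu and Windows GitHub-hosted runners.
    shell: bash
    # $RANDOM is 0..32767 in bash, so this sleeps a random 0 to 120 seconds.
    run: sleep $((RANDOM % 121))

  - name: Report new build
    uses: ./.github/actions/report-to-backend # assumed path
```

This only spreads requests out. It does not read the retry time that Discord returns, so honouring that header would still need a change wherever the Discord call is actually made.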


4 participants