fix(ci): retry docker pulls in integration Preload Images step #7600
Open
sandy2008 wants to merge 1 commit into
The integration matrix jobs intermittently failed in the "Preload Images" step when a `docker pull` from Docker Hub timed out: `Get "https://registry-1.docker.io/v2/": context deadline exceeded`. The step ran bare `docker pull`s with no retry, so a single transient registry hiccup failed the whole job (issue cortexproject#7598).

Wrap each pull in a small `retry()` helper (3 attempts, 5s then 10s exponential backoff). Transient failures are retried; a genuine, persistent failure still fails the step because the final `return 1` propagates under `bash -e`.

`docker login` is not an option here: the integration job runs on fork PRs, which have no repository secrets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
Drafted with AI assistance (Claude Code, via a panel of 6 agents) and reviewed/validated before submission, per the Generative AI Contribution Policy.
What this PR does
Wraps every `docker pull` in the `integration` job's `Preload Images` step (`.github/workflows/test-build-deploy.yml`) in a small inline `retry()` helper (3 attempts, 5s → 10s exponential backoff). The image set, the `TEST_TAGS` conditional branches, and the step's purpose are unchanged; the only change is each pull's resilience to transient registry errors.

Why this is reachable
The step ran a bare sequence of `docker pull` commands with no retry. Public registries occasionally return transient errors; a single one fails the whole step (exit code 1) and with it the integration job, even though the code under test is fine.

Observed on a recent run: job `integration (ubuntu-24.04, amd64, integration_overrides)` failed on the very first pull (`minio/minio`), unrelated to the change under test. See #7598.

`docker login` is not a viable mitigation for this job: the `integration` matrix runs on pull requests, including those from forks, which do not have access to repository secrets (`secrets.DOCKER_REGISTRY_USER/PASSWORD` are only used by the `deploy` job). The robust, fork-PR-safe fix is client-side retry.

How the fix resolves it
A `retry()` shell function is defined once at the top of the `run:` block, and each `docker pull` becomes `retry docker pull …`.

Correctness under `bash -e` (GitHub Actions' default for `run:`):

- `docker pull` runs as the condition of `until`, so an intermediate failure does not trip `errexit`; it just drives another retry.
- Once the attempts are exhausted, the helper's `return 1` surfaces at a plain (non-conditional) call site, so `errexit` does abort the step. Genuine, persistent failures (bad tag, deleted image, real outage) still fail loudly, with a clear message naming the command and attempt count.

All pulls are wrapped (Docker Hub and quay.io); quay.io can blip too, and uniform retry keeps the step simple.

Backoff is bounded: the worst case is the `integration_backward_compatibility` branch (11 images), with at most 15s of added sleep per image (5s + 10s), or about 165s in total. That is comfortably inside the job's `timeout-minutes: 50` (and the separate 45-min test step).

Scope
In scope: retry logic for the `Preload Images` step only.

Out of scope (deliberately, to keep this focused):

- The `Load Docker Images` step (`make load-images`) does `docker load` from local artifacts: no network, so it can't hit this failure mode and needs no retry.
- The `container:` image pull (`quay.io/cortexproject/build-image:…`) is performed by the runner before any `run:` step and can't be wrapped here; that's a separate, runner-level concern.
- `docker login` / authenticated-pull changes.

Which issue(s) this PR fixes
Fixes #7598
Checklist
- `CHANGELOG.md`: no entry needed. This is a CI-only change with no user-facing behavior; Cortex's CHANGELOG tracks operator/user-visible changes.
- `make doc`: N/A; no flags or config changed.

Test plan
A workflow flake can't be deterministically reproduced, but the change was validated:
- The `integration` job's `Preload Images` step has the expected 13 `retry docker pull` invocations (3 base + 6 backward-compat + 2 query-fuzz + 2 trailing).
- `bash -n` on the extracted `run:` script: clean.
- `bash -e -o pipefail` (with `sleep` stubbed): a command that fails once then succeeds is retried and the script continues (an intermediate failure does not abort); a command that always fails exhausts retries and aborts the step with a non-zero exit (real failures still fail).
- Each pull changes only from `docker pull` to `retry docker pull`.
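
For reviewers who want to repeat the `bash -e -o pipefail` check, a minimal sketch follows. The PR's actual helper is not reproduced in this description, so the `retry()` body here is an assumption reconstructed from the stated behavior (3 attempts, 5s then 10s backoff, `until` as the loop condition, final `return 1`). `flaky` is a hypothetical stand-in for a `docker pull` that fails once.

```shell
#!/usr/bin/env bash
set -e -o pipefail

# Stub sleep so the simulation finishes instantly, as in the test plan.
sleep() { :; }

# Assumed shape of the helper: up to 3 attempts, sleeping 5s then 10s between them.
retry() {
  local attempt=1 max=3 delay=5
  # The command runs as the `until` condition, so its failure does not trip errexit.
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "failed after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s: $*" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Hypothetical stand-in for a pull that hits one transient error.
calls=0
flaky() {
  calls=$((calls + 1))
  [ "$calls" -ge 2 ]   # fails on the first call, succeeds afterwards
}

# Case 1: a transient failure is retried and the script keeps going.
retry flaky
echo "flaky succeeded after $calls calls"

# Case 2: a persistent failure must abort under errexit. Run it in a child
# bash so this script can observe the exit status; `declare -f` copies the
# function definitions into the child.
status=0
bash -e -o pipefail -c "$(declare -f sleep retry); retry false; echo 'not reached'" || status=$?
echo "always-failing command exited with status $status"
```

In this sketch `flaky` runs twice before the script continues, and the child shell exits with status 1 without ever printing `not reached`. The second case uses a separate `bash` process because errexit is ignored inside an `if` or `||` context of the same shell, which would hide the abort.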