fix(ci): retry docker pulls in integration Preload Images step #7600

Open
sandy2008 wants to merge 1 commit into cortexproject:master from sandy2008:fix/ci-preload-images-retry

Conversation

@sandy2008

Contributor

Drafted with AI assistance (Claude Code, via a panel of 6 agents) and reviewed/validated before submission, per the Generative AI Contribution Policy.

What this PR does

Wraps every docker pull in the integration job's Preload Images step (.github/workflows/test-build-deploy.yml) in a small inline retry() helper (3 attempts, 5s → 10s exponential backoff). The image set, the TEST_TAGS conditional branches, and the step's purpose are unchanged; the only change is each pull's resilience to transient registry errors.

Why this is reachable

The step ran a bare sequence of docker pull commands with no retry. Public registries occasionally return transient errors; a single one fails the whole step (exit code 1) and the integration job, even though the code under test is fine:

Error response from daemon: Get "https://registry-1.docker.io/v2/": context deadline exceeded
##[error]Process completed with exit code 1.

Observed on a recent run: job integration (ubuntu-24.04, amd64, integration_overrides) failed on the very first pull (minio/minio), unrelated to the change under test. See #7598.

docker login is not a viable mitigation for this job: the integration matrix runs on pull requests including forks, which do not have access to repository secrets (secrets.DOCKER_REGISTRY_USER/PASSWORD are only used by the deploy job). The robust, fork-PR-safe fix is client-side retry.

How the fix resolves it

A retry() shell function is defined once at the top of the run: block and each docker pull becomes retry docker pull …:

retry() {
  local max_attempts=3 attempt=1 delay=5
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "ERROR: '$*' failed after ${max_attempts} attempts." >&2
      return 1
    fi
    echo "WARNING: '$*' failed (attempt ${attempt}/${max_attempts}); retrying in ${delay}s..." >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

Correctness under bash -e (GitHub Actions' default for run:):

  • The failing docker pull runs as the condition of until, so an intermediate failure does not trip errexit — it just drives another retry.
  • When all attempts are exhausted, retry returns 1. Because retry itself is invoked as a plain (non-conditional) command, errexit does abort the step. Genuine, persistent failures (bad tag, deleted image, real outage) still fail loudly, with a clear message naming the command and attempt count.
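The two errexit cases above can be exercised locally with a rough sketch like the following. It is an illustration, not part of the workflow: the helper file, the `flaky` function, the `MARKER` variable, and the stubbed `sleep` are all hypothetical test scaffolding, and each case runs in its own `bash -e` child process so that errexit behaves as it would in a `run:` step.

```shell
# Scratch directory for the demo (hypothetical scaffolding, not in the PR).
demo_dir=$(mktemp -d)

# Helper file sourced by each child shell: the retry() function from this PR,
# plus two test-only stand-ins.
cat > "$demo_dir/helpers.sh" <<'EOF'
retry() {
  local max_attempts=3 attempt=1 delay=5
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "ERROR: '$*' failed after ${max_attempts} attempts." >&2
      return 1
    fi
    echo "WARNING: '$*' failed (attempt ${attempt}/${max_attempts}); retrying in ${delay}s..." >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Stub: shadow sleep with a no-op function so the demo finishes instantly.
sleep() { :; }

# Stand-in for a flaky pull: fails on the first call, succeeds afterwards.
flaky() { [ -e "$MARKER" ] || { touch "$MARKER"; return 1; }; }
EOF

# Case 1: transient failure. The failed attempt is the `until` condition,
# so errexit does not fire and the script reaches the echo.
transient_status=0
transient_out=$(MARKER="$demo_dir/marker" bash -ec '. "$1"; retry flaky; echo continued' _ "$demo_dir/helpers.sh") || transient_status=$?

# Case 2: persistent failure. retry returns 1 at a plain call site,
# so errexit aborts the child before the echo runs.
persistent_status=0
persistent_out=$(bash -ec '. "$1"; retry false; echo continued' _ "$demo_dir/helpers.sh") || persistent_status=$?

echo "transient:  status=${transient_status} output='${transient_out}'"
echo "persistent: status=${persistent_status} output='${persistent_out}'"
```

Running each case in a separate child process avoids a common pitfall: a subshell tested with `||` or `if` in the same shell runs with errexit suppressed, which would hide the abort in the second case.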

All pulls are wrapped (Docker Hub and quay.io); quay.io can blip too, and uniform retry keeps the step simple. Backoff is bounded: the worst case is the integration_backward_compatibility branch (11 images), with at most 15s of added sleep per image (5s + 10s), or about 165s in total. That is comfortably inside the job's timeout-minutes: 50 (and the separate 45-min test step).
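As a quick check of the backoff bound, the arithmetic can be reproduced with a short sketch. The variable names are illustrative; the attempt count, initial delay, image count, and timeout are the values quoted in this description. Note that this counts only the added sleep, not the time the failed pull attempts themselves take.

```shell
# Values taken from the PR description.
max_attempts=3
delay=5
images=11           # integration_backward_compatibility branch
timeout_minutes=50  # job-level timeout-minutes

# Sum the sleeps between attempts: one sleep after each failed attempt
# except the last, doubling every time.
per_pull=0
attempt=1
while [ "$attempt" -lt "$max_attempts" ]; do
  per_pull=$((per_pull + delay))
  delay=$((delay * 2))
  attempt=$((attempt + 1))
done

total=$((per_pull * images))
budget=$((timeout_minutes * 60))
echo "worst-case sleep per pull: ${per_pull}s"
echo "worst-case sleep total:    ${total}s of a ${budget}s job budget"
```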

Scope

In scope: retry logic for the Preload Images step only.

Out of scope (deliberately, to keep this focused):

  • The Load Docker Images step (make load-images) does docker load from local artifacts — no network, so it can't hit this failure mode and needs no retry.
  • The per-job container: image pull (quay.io/cortexproject/build-image:…) is performed by the runner before any run: step and can't be wrapped here; that's a separate, runner-level concern.
  • No docker login / authenticated-pull changes.

Which issue(s) this PR fixes

Fixes #7598

Checklist

  • CHANGELOG.md: no entry needed. This is a CI-only change with no user-facing behavior; Cortex's CHANGELOG tracks operator/user-visible changes.
  • Documentation (make doc) — N/A; no flags or config changed.
  • Commit signed off (DCO).

Test plan

A workflow flake can't be deterministically reproduced, but the change was validated:

  • YAML parses (Psych) — the integration job's Preload Images step has the expected 13 retry docker pull invocations (3 base + 6 backward-compat + 2 query-fuzz + 2 trailing).
  • bash -n on the extracted run: script — clean.
  • Control flow under bash -e -o pipefail (with sleep stubbed): a command that fails once then succeeds is retried and the script continues (intermediate failure does not abort); a command that always fails exhausts retries and aborts the step non-zero (real failures still fail).
  • The exact image set and conditional branches are byte-for-byte preserved (only docker pull → retry docker pull).
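The static checks in this list could look roughly like the sketch below. It runs against a small hypothetical stand-in for the extracted run: script: the image names and the tag value are placeholders, not the real workflow contents, so the expected count here is 3 rather than 13.

```shell
# Hypothetical stand-in for the extracted run: script (placeholder images).
script=$(mktemp)
cat > "$script" <<'EOF'
retry docker pull example/image-a:1.0
if [ "$TEST_TAGS" = "example_tag" ]; then
  retry docker pull example/image-b:2.0
fi
retry docker pull example/image-c:3.0
EOF

# Syntax check only; bash -n parses the script without executing it.
if bash -n "$script"; then syntax_ok=yes; else syntax_ok=no; fi

# Count wrapped pulls, and any pull that was left bare.
wrapped=$(grep -c 'retry docker pull' "$script" || true)
bare=$(grep 'docker pull' "$script" | grep -vc 'retry docker pull' || true)

echo "syntax ok: ${syntax_ok}, wrapped pulls: ${wrapped}, bare pulls: ${bare}"
```

The `|| true` guards are there because grep exits non-zero when it selects no lines, which is the expected outcome for the bare-pull count.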

The integration matrix jobs intermittently failed in the "Preload Images"
step when a `docker pull` from Docker Hub timed out:
`Get "https://registry-1.docker.io/v2/": context deadline exceeded`.
The step ran bare `docker pull`s with no retry, so a single transient
registry hiccup failed the whole job (issue cortexproject#7598).

Wrap each pull in a small `retry()` helper (3 attempts, 5s then 10s
exponential backoff). Transient failures are retried; a genuine,
persistent failure still fails the step because the final `return 1`
propagates under `bash -e`. `docker login` is not an option here: the
integration job runs on fork PRs, which have no repository secrets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@dosubot dosubot Bot added ci/cd type/chore Something that needs to be done; not a bug or a feature labels Jun 7, 2026

@SungJin1212 SungJin1212 left a comment

Member

lgtm

@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label Jun 7, 2026

Labels

ci/cd lgtm This PR has been approved by a maintainer size/M type/chore Something that needs to be done; not a bug or a feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CI: flaky integration jobs — Preload Images step fails on transient Docker Hub pull (context deadline exceeded)

2 participants