CI: Retry Trivy scanner image pull to absorb transient Docker Hub timeouts #16660

Open
wombatu-kun wants to merge 1 commit into apache:main from wombatu-kun:ci-trivy-pull-retry

@wombatu-kun
Contributor

@wombatu-kun wombatu-kun commented Jun 2, 2026

Problem

The CVE Scan workflow intermittently fails while pulling the Trivy scanner image. Recent examples are #16657 (job flink-runtime-1.20) and #16652 and #16669 (job open-api-test-fixtures-runtime), all of which failed the same way within hours of each other:

Running Trivy in sandboxed container (aquasec/trivy:0.69.3@sha256:bcc376...)...
Unable to find image 'aquasec/trivy:...' locally
docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": context deadline exceeded
##[error]Process completed with exit code 125

lhotari/sandboxed-trivy-action runs Trivy inside a Docker container. The scanner image is not cached on the runner, so Docker pulls it from Docker Hub, and that pull occasionally times out (context deadline exceeded, exit code 125), failing the job and blocking unrelated PRs. It hits different matrix entries on different PRs, which marks it as transient infrastructure flakiness rather than a code issue.

This is a transient Docker Hub availability blip, not a rate limit: the error is a network timeout rather than an HTTP 429, and GitHub-hosted runners are exempt from Docker Hub's anonymous pull limits for public images.

Change

Pre-pull the scanner image before the scan, with a bounded retry and backoff. The action's docker run uses Docker's default --pull=missing, so once the image is present locally it is reused and the registry is not contacted again. The image is defined once as a job-level TRIVY_IMAGE env var and passed to the action via its trivy-image input, so the pre-pulled image and the scanned image are guaranteed identical (and the digest pin is preserved). The retry is bounded to 5 attempts with linear backoff, so it stays polite to the registry and fails cleanly if Docker Hub is genuinely down.
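
The bounded retry with linear backoff described above can be sketched roughly as follows. This is a minimal illustration, not the step actually added to cve-scan.yml: the function name `retry_with_linear_backoff` and the 10-second base delay in the usage comment are assumptions made for the example, while the 5-attempt bound and the `TRIVY_IMAGE` variable come from the description.

```shell
# Hypothetical sketch of a bounded pre-pull retry; not the exact workflow step.
#
# retry_with_linear_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Waits BASE_DELAY_SECONDS * attempt between tries (linear backoff).
retry_with_linear_backoff() {
  local max_attempts
  local base_delay
  local attempt
  local delay
  max_attempts="$1"
  base_delay="$2"
  shift 2

  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Success on any attempt ends the loop immediately.
    if "$@"; then
      return 0
    fi
    # Only sleep if another attempt is still allowed.
    if [ "$attempt" -lt "$max_attempts" ]; then
      delay=$(( base_delay * attempt ))
      echo "Attempt ${attempt}/${max_attempts} failed; retrying in ${delay}s..." >&2
      sleep "$delay"
    fi
    attempt=$(( attempt + 1 ))
  done

  # Fail cleanly so the job reports a real outage instead of hanging.
  echo "Giving up after ${max_attempts} attempts." >&2
  return 1
}

# Example use in a workflow step (base delay of 10s is an assumed value):
#   retry_with_linear_backoff 5 10 docker pull "${TRIVY_IMAGE}"
```

With these assumed values the waits would be 10s, 20s, 30s, and 40s, so a genuinely unavailable registry fails the step after well under two minutes of sleeping. Once the pull succeeds, the action's later docker run finds the image locally under the default --pull=missing and does not contact the registry again.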

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the INFRA label Jun 2, 2026
@wombatu-kun
Contributor Author

Gentle ping on this CI fix. The Trivy image-pull flake it addresses keeps hitting fresh PRs: it took down #16669 for the first time earlier today (job open-api-test-fixtures-runtime, Docker Hub pull timing out), on top of the earlier #16657 / #16652 cases. Since the CVE scan is a blocking check on PRs, each hit red-marks an otherwise-green, unrelated PR and forces a committer to manually re-run the job.

The change is intentionally minimal and self-contained: +19/-0 in cve-scan.yml, a bounded pre-pull retry (5 attempts, linear backoff) that reuses the digest-pinned image, so it stays polite to the registry and touches nothing else.

@kevinjqliu you set up and own the CVE scan (#16291, #16287) - would you be able to take a quick look when you get a chance? @stevenzwu pulling you in as a backup in case Kevin is tied up.
