CI: Retry Trivy scanner image pull to absorb transient Docker Hub timeouts #16660
wombatu-kun wants to merge 1 commit into
…eouts Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Gentle ping on this CI fix. The Trivy image-pull flake it addresses keeps hitting fresh PRs: it took down #16669 for the first time earlier today (job

The change is intentionally minimal and self-contained: +19/-0 in

@kevinjqliu you set up and own the CVE scan (#16291, #16287). Would you be able to take a quick look when you get a chance? @stevenzwu pulling you in as a backup in case Kevin is tied up.
Problem
The CVE Scan workflow intermittently fails while pulling the Trivy scanner image. Recent examples are #16657 (job `flink-runtime-1.20`) and #16652, #16669 (job `open-api-test-fixtures-runtime`), all of which failed the same way within hours of each other.

`lhotari/sandboxed-trivy-action` runs Trivy inside a Docker container. The scanner image is not cached on the runner, so Docker pulls it from Docker Hub, and that pull occasionally times out (`context deadline exceeded`, exit code 125), failing the job and blocking unrelated PRs. It hits different matrix entries on different PRs, which marks it as transient infrastructure flakiness rather than a code issue.

This is a transient Docker Hub availability blip, not a rate limit: the error is a network timeout rather than an HTTP 429, and GitHub-hosted runners are exempt from Docker Hub's anonymous pull limits for public images.
Change
Pre-pull the scanner image before the scan, with a bounded retry and backoff. The action's `docker run` uses Docker's default `--pull=missing`, so once the image is present locally it is reused and the registry is not contacted again. The image is defined once as a job-level `TRIVY_IMAGE` env var and passed to the action via its `trivy-image` input, so the pre-pulled image and the scanned image are guaranteed identical (and the digest pin is preserved). The retry is bounded to 5 attempts with linear backoff, so it stays polite to the registry and fails cleanly if Docker Hub is genuinely down.
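
For reviewers who want to see the shape of the retry without opening the diff, here is a minimal sketch of a bounded retry with linear backoff around `docker pull`. It is an illustration, not the exact step from this PR: the helper name `retry_with_linear_backoff` and the 10-second base delay are assumptions. Only the 5-attempt bound and the `TRIVY_IMAGE` variable come from the description above.

```shell
# Sketch only: retry a command up to MAX_ATTEMPTS times, sleeping
# attempt * BASE_DELAY seconds between failures (linear backoff).
# The function name and the base delay used below are hypothetical.
retry_with_linear_backoff() {
  local max_attempts
  local base_delay
  local attempt
  local delay
  max_attempts="$1"
  base_delay="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Run the wrapped command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    # Sleep only if another attempt will follow.
    if [ "$attempt" -lt "$max_attempts" ]; then
      delay=$(( attempt * base_delay ))
      echo "attempt ${attempt}/${max_attempts} failed; retrying in ${delay}s" >&2
      sleep "$delay"
    fi
    attempt=$(( attempt + 1 ))
  done
  echo "giving up after ${max_attempts} attempts" >&2
  return 1
}

# Usage in a workflow step, assuming TRIVY_IMAGE is set at job level
# (5 attempts is from the PR description; 10s base delay is an assumption):
# retry_with_linear_backoff 5 10 docker pull "${TRIVY_IMAGE}"
```

With these assumed values the waits would be 10s, 20s, 30s and 40s, so a step that never succeeds fails after at most 100 seconds of sleeping plus the pull timeouts themselves.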