
KPO: treat registry 5xx errors as transient during pod startup #65490

Merged
potiuk merged 1 commit into apache:main from potiuk:fix-k8s-pod-manager-registry-outage
Apr 19, 2026

@potiuk potiuk commented Apr 19, 2026

When a pod is starting, kubelet automatically retries failed image pulls with exponential backoff. KubernetesPodOperator monitors the pod via detect_pod_terminate_early_issues() in pod_manager.py and aborts early on what it deems a fatal pull error — but the transient-error pattern list did not cover registry 5xx / gateway outages. A short Docker Hub outage (502/503/504 from auth.docker.io or the registry) was being classified as fatal and would abort the pod before kubelet's next retry, even though the pull would have succeeded once upstream recovered.

Observed in this run where test_pod_hostnetwork failed with:

failed to authorize: failed to fetch anonymous token:
  unexpected status from GET https://auth.docker.io/token?...: 502 Bad Gateway
→ PodLaunchFailedException: Image cannot be pulled, unable to start: ErrImagePull

Changes

  • Add canonical gateway error phrases (bad gateway, service unavailable, gateway timeout) to TRANSIENT_ERROR_PATTERNS so these conditions let kubelet keep retrying within the caller's startup_timeout budget.
  • Also treat kubelet's own ImagePullBackOff back-off pulling cooldown message as transient — without it we would still bail the moment kubelet moves into its retry backoff state.
  • Add direct unit tests for detect_pod_terminate_early_issues covering both the new transient patterns and the fatal cases.
  • Update the existing test_await_pod_completion_breaks_on_early_termination_issue to use InvalidImageName (still fatal) instead of the now-transient ImagePullBackOff / back-off pulling combo.

startup_timeout still bounds the overall wait, and genuinely fatal pull errors (InvalidImageName, ErrImageNeverPull, manifest unknown, unauthorized, …) remain fatal.
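
As a rough illustration of the classification described above, here is a minimal sketch. It is not the provider's actual code: `is_transient_pull_error` and `FATAL_REASONS` are hypothetical names, and the pattern tuple lists only the three phrases this PR adds, not whatever entries the real `TRANSIENT_ERROR_PATTERNS` already held.

```python
# Hypothetical sketch: names and structure are assumptions for illustration,
# not the implementation in pod_manager.py.

# Only the three gateway phrases named in this PR; the real list has other entries.
TRANSIENT_ERROR_PATTERNS = (
    "bad gateway",
    "service unavailable",
    "gateway timeout",
)

# Waiting-state reasons the PR says remain fatal regardless of the message.
FATAL_REASONS = {"InvalidImageName", "ErrImageNeverPull"}


def is_transient_pull_error(reason: str, message: str) -> bool:
    """Return True when a pull failure looks like a temporary registry outage."""
    if reason in FATAL_REASONS:
        return False
    lowered = (message or "").lower()
    # Case-insensitive substring match against the known transient phrases.
    return any(pattern in lowered for pattern in TRANSIENT_ERROR_PATTERNS)
```

With this sketch, the `502 Bad Gateway` message from the failing run is classified as transient, while a `manifest unknown` message is not.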


Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (Opus 4.7, 1M context)

Generated-by: Claude Code (Opus 4.7, 1M context) following the guidelines

@boring-cyborg boring-cyborg Bot added the area:providers and provider:cncf-kubernetes (Kubernetes (k8s) provider related issues) labels Apr 19, 2026

When a pod is starting, kubelet automatically retries failed image pulls
with exponential backoff. `KubernetesPodOperator` monitors the pod via
`detect_pod_terminate_early_issues()` and aborts early on what it deems a
fatal pull error — but the transient-error pattern list did not cover
registry 5xx / gateway outages. A short Docker Hub outage (502/503/504
from `auth.docker.io` or the registry) was being classified as fatal and
would abort the pod before kubelet's next retry, even though the pull
would have succeeded once upstream recovered.

Add `bad gateway`, `service unavailable`, `gateway timeout` to
`TRANSIENT_ERROR_PATTERNS` so these conditions let kubelet keep retrying
within the caller's `startup_timeout` budget.

On kubelet >= 1.32 the `ImagePullBackOff` message is
`Back-off pulling image "X": <prev_error>`, so these patterns also match
during the backoff state. On older kubelets the bare message carries no
detail and we keep the existing fail-fast behaviour — matching
`back-off pulling` unconditionally would have caused a 120s wait for a
genuinely missing image instead of an immediate error (thanks Jens for
catching this).
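
To make the kubelet version difference concrete, the sketch below applies the same substring check to two illustrative `ImagePullBackOff` messages. Both message strings are assumptions built from the format quoted above (the image name is a placeholder), and `backoff_message_is_transient` is a hypothetical helper, not a function from the provider.

```python
# Hypothetical sketch: the helper name and sample messages are assumptions.
TRANSIENT_PHRASES = ("bad gateway", "service unavailable", "gateway timeout")


def backoff_message_is_transient(message: str) -> bool:
    """Match transient phrases anywhere in an ImagePullBackOff message."""
    lowered = message.lower()
    return any(phrase in lowered for phrase in TRANSIENT_PHRASES)


# kubelet >= 1.32 style: the previous pull error is appended after the colon.
detailed = 'Back-off pulling image "X": unexpected status: 502 Bad Gateway'

# Older kubelet style: the bare message carries no detail about the cause.
bare = 'Back-off pulling image "X"'

detailed_is_transient = backoff_message_is_transient(detailed)
bare_is_transient = backoff_message_is_transient(bare)
```

Because `back-off pulling` itself is not a pattern, the bare message does not match and the existing fail-fast path still applies.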

`startup_timeout` still bounds the overall wait, and genuinely fatal pull
errors (`InvalidImageName`, `ErrImageNeverPull`, `manifest unknown`,
`unauthorized`, …) remain fatal.
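
The interaction between transient errors and `startup_timeout` can be sketched as a deadline-bounded polling loop. This is an assumption-laden illustration rather than the operator's real loop: `await_start`, the status strings, and both exception classes are hypothetical, and the clock and sleep functions are injectable only to keep the example deterministic.

```python
import time
from typing import Callable


class PodLaunchFailed(Exception):
    """Hypothetical: raised when a fatal startup error is detected."""


class PodLaunchTimeout(Exception):
    """Hypothetical: raised when the startup budget is exhausted."""


def await_start(
    poll: Callable[[], str],
    startup_timeout: float,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until the pod runs, fails fatally, or the timeout budget is spent."""
    deadline = clock() + startup_timeout
    while True:
        status = poll()  # assumed to return "running", "fatal", or "transient"
        if status == "running":
            return
        if status == "fatal":
            # Fatal errors abort immediately, without waiting for the deadline.
            raise PodLaunchFailed("image cannot be pulled")
        if clock() >= deadline:
            # Transient errors keep the loop alive only within the budget.
            raise PodLaunchTimeout("pod did not start in time")
        sleep(interval)
```

In this sketch a transient status never extends the wait beyond the deadline, and a fatal status never waits for it.
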
@potiuk potiuk force-pushed the fix-k8s-pod-manager-registry-outage branch from 7fcd476 to 7f6b953 April 19, 2026 13:42
@potiuk potiuk merged commit 938ccf3 into apache:main Apr 19, 2026
111 checks passed
@potiuk potiuk deleted the fix-k8s-pod-manager-registry-outage branch April 19, 2026 15:19
