KPO: treat registry 5xx errors as transient during pod startup by potiuk · Pull Request #65490 · apache/airflow

potiuk · 2026-04-19T13:24:11Z

When a pod is starting, kubelet automatically retries failed image pulls with exponential backoff. KubernetesPodOperator monitors the pod via detect_pod_terminate_early_issues() in pod_manager.py and aborts early on what it deems a fatal pull error — but the transient-error pattern list did not cover registry 5xx / gateway outages. A short Docker Hub outage (502/503/504 from auth.docker.io or the registry) was being classified as fatal and would abort the pod before kubelet's next retry, even though the pull would have succeeded once upstream recovered.

Observed in this run where test_pod_hostnetwork failed with:

failed to authorize: failed to fetch anonymous token:
  unexpected status from GET https://auth.docker.io/token?...: 502 Bad Gateway
→ PodLaunchFailedException: Image cannot be pulled, unable to start: ErrImagePull

Changes

Add canonical gateway error phrases (bad gateway, service unavailable, gateway timeout) to TRANSIENT_ERROR_PATTERNS so these conditions let kubelet keep retrying within the caller's startup_timeout budget.
Also treat kubelet's own ImagePullBackOff back-off pulling cooldown message as transient — without it we would still bail the moment kubelet moves into its retry backoff state.
Add direct unit tests for detect_pod_terminate_early_issues covering both the new transient patterns and the fatal cases.
Update the existing test_await_pod_completion_breaks_on_early_termination_issue to use InvalidImageName (still fatal) instead of the now-transient ImagePullBackOff / back-off pulling combo.

startup_timeout still bounds the overall wait, and genuinely fatal pull errors (InvalidImageName, ErrImageNeverPull, manifest unknown, unauthorized, …) remain fatal.
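To make the classification concrete, here is a minimal sketch of case-insensitive phrase matching on a pull error message. It is not the code in `pod_manager.py`: the helper `is_transient_pull_error` and the tuple `GATEWAY_ERROR_PATTERNS` are hypothetical names, and only the three gateway phrases listed above are included (the provider's real `TRANSIENT_ERROR_PATTERNS` has other entries that are not reproduced here).

```python
from __future__ import annotations

# Illustrative subset only: the three gateway phrases named in this PR.
# The real TRANSIENT_ERROR_PATTERNS list in the provider is not reproduced.
GATEWAY_ERROR_PATTERNS = ("bad gateway", "service unavailable", "gateway timeout")


def is_transient_pull_error(message: str | None) -> bool:
    """Hypothetical helper: True if the message looks like a registry gateway outage."""
    if not message:
        return False
    lowered = message.lower()  # kubelet messages use mixed case, e.g. "502 Bad Gateway"
    return any(pattern in lowered for pattern in GATEWAY_ERROR_PATTERNS)


# The error observed in the failing run is classified as transient.
observed = (
    "failed to authorize: failed to fetch anonymous token: "
    "unexpected status from GET https://auth.docker.io/token?...: 502 Bad Gateway"
)
print(is_transient_pull_error(observed))
# A genuinely fatal condition does not match any gateway phrase.
print(is_transient_pull_error("manifest unknown"))
```

Under these assumptions a transient result only means "let kubelet retry"; the overall wait is still capped by `startup_timeout`.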

Was generative AI tooling used to co-author this PR?

Yes — Claude Code (Opus 4.7, 1M context)

Generated-by: Claude Code (Opus 4.7, 1M context) following the guidelines

When a pod is starting, kubelet automatically retries failed image pulls with exponential backoff. `KubernetesPodOperator` monitors the pod via `detect_pod_terminate_early_issues()` and aborts early on what it deems a fatal pull error, but the transient-error pattern list did not cover registry 5xx / gateway outages. A short Docker Hub outage (502/503/504 from `auth.docker.io` or the registry) was being classified as fatal and would abort the pod before kubelet's next retry, even though the pull would have succeeded once upstream recovered.

Add `bad gateway`, `service unavailable`, `gateway timeout` to `TRANSIENT_ERROR_PATTERNS` so these conditions let kubelet keep retrying within the caller's `startup_timeout` budget.

On kubelet >= 1.32 the `ImagePullBackOff` message is `Back-off pulling image "X": <prev_error>`, so these patterns also match during the backoff state. On older kubelets the bare message carries no detail and we keep the existing fail-fast behaviour: matching `back-off pulling` unconditionally would have caused a 120s wait for a genuinely missing image instead of an immediate error (thanks Jens for catching this).

`startup_timeout` still bounds the overall wait, and genuinely fatal pull errors (`InvalidImageName`, `ErrImageNeverPull`, `manifest unknown`, `unauthorized`, …) remain fatal.
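The back-off behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the merged implementation: `backoff_is_transient` and `TRANSIENT_PHRASES` are hypothetical names, and the sample messages are assembled from the message shape and the 502 error quoted in this PR.

```python
# Illustrative subset of transient phrases; not the provider's full list.
TRANSIENT_PHRASES = ("bad gateway", "service unavailable", "gateway timeout")


def backoff_is_transient(message: str) -> bool:
    """Hypothetical helper for an ImagePullBackOff waiting message.

    "back-off pulling" is deliberately NOT a pattern: only an embedded
    gateway phrase from the previous pull error keeps the operator waiting.
    """
    lowered = message.lower()
    return any(phrase in lowered for phrase in TRANSIENT_PHRASES)


# kubelet >= 1.32 shape: the previous pull error is appended after the colon.
detailed = (
    'Back-off pulling image "X": failed to fetch anonymous token: '
    "unexpected status: 502 Bad Gateway"
)
# Older kubelets: bare message, nothing to match, so the caller fails fast.
bare = 'Back-off pulling image "X"'
# A missing image on a newer kubelet still fails fast.
missing = 'Back-off pulling image "X": manifest unknown'

for msg in (detailed, bare, missing):
    print(backoff_is_transient(msg), msg)
```

Because the bare message never matches, a genuinely missing image on an older kubelet is reported immediately rather than after the full startup wait.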

potiuk requested review from hussein-awala, jedcunningham and jscheffl as code owners April 19, 2026 13:24

boring-cyborg Bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Apr 19, 2026

jscheffl reviewed Apr 19, 2026

Comment thread providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/utils/pod_manager.py Outdated

potiuk force-pushed the fix-k8s-pod-manager-registry-outage branch from 7fcd476 to 7f6b953 Compare April 19, 2026 13:42

jscheffl approved these changes Apr 19, 2026

eladkal approved these changes Apr 19, 2026

potiuk merged commit 938ccf3 into apache:main Apr 19, 2026
111 checks passed

potiuk deleted the fix-k8s-pod-manager-registry-outage branch April 19, 2026 15:19

shahar1 mentioned this pull request Apr 22, 2026

Status of testing Providers that were prepared on April 21, 2026 #65702

Open

59 tasks
