KPO: treat registry 5xx errors as transient during pod startup #65490
Merged
potiuk merged 1 commit into apache:main on Apr 19, 2026
jscheffl
reviewed
Apr 19, 2026
When a pod is starting, kubelet automatically retries failed image pulls with exponential backoff. `KubernetesPodOperator` monitors the pod via `detect_pod_terminate_early_issues()` and aborts early on what it deems a fatal pull error, but the transient-error pattern list did not cover registry 5xx / gateway outages. A short Docker Hub outage (502/503/504 from `auth.docker.io` or the registry) was classified as fatal and aborted the pod before kubelet's next retry, even though the pull would have succeeded once upstream recovered.

Add `bad gateway`, `service unavailable`, and `gateway timeout` to `TRANSIENT_ERROR_PATTERNS` so that these conditions let kubelet keep retrying within the caller's `startup_timeout` budget.

On kubelet >= 1.32 the `ImagePullBackOff` message is `Back-off pulling image "X": <prev_error>`, so these patterns also match during the backoff state. On older kubelets the bare message carries no detail, and we keep the existing fail-fast behaviour: matching `back-off pulling` unconditionally would have caused a 120s wait for a genuinely missing image instead of an immediate error (thanks Jens for catching this).

`startup_timeout` still bounds the overall wait, and genuinely fatal pull errors (`InvalidImageName`, `ErrImageNeverPull`, `manifest unknown`, `unauthorized`, …) remain fatal.
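The matching rule described above can be illustrated with a minimal sketch. This is not the code in `pod_manager.py`: the helper name `is_transient_pull_error`, the tuple `REGISTRY_5XX_PATTERNS`, the case-insensitive comparison, and the sample messages are all assumptions made for illustration. Only the three pattern strings and the `Back-off pulling image "X": <prev_error>` message shape come from the description.

```python
from typing import Optional

# Hypothetical stand-in for the three entries this PR adds to
# TRANSIENT_ERROR_PATTERNS; the provider's full list is not reproduced here.
REGISTRY_5XX_PATTERNS = ("bad gateway", "service unavailable", "gateway timeout")


def is_transient_pull_error(message: Optional[str]) -> bool:
    """Return True if a waiting-state message looks like a registry 5xx outage.

    Assumption: a case-insensitive substring match on the message text.
    """
    if not message:
        return False
    lowered = message.lower()
    return any(pattern in lowered for pattern in REGISTRY_5XX_PATTERNS)


# Illustrative messages only, not captured kubelet output.
direct_failure = 'failed to pull image "busybox": 503 Service Unavailable'
# kubelet >= 1.32 shape: the previous error is appended to the back-off message.
backoff_with_detail = 'Back-off pulling image "busybox": 502 Bad Gateway'
# Older kubelet shape: bare back-off message with no detail.
backoff_bare = 'Back-off pulling image "busybox"'
fatal = "manifest unknown"

for msg in (direct_failure, backoff_with_detail, backoff_bare, fatal):
    verdict = "transient" if is_transient_pull_error(msg) else "fatal"
    print(f"{verdict}: {msg}")
```

Under these assumptions the bare back-off message is not matched, which corresponds to the fail-fast behaviour kept for older kubelets, while the same back-off message with a 5xx detail appended is treated as transient.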
force-pushed from 7fcd476 to 7f6b953
jscheffl
approved these changes
Apr 19, 2026
eladkal
approved these changes
Apr 19, 2026
When a pod is starting, kubelet automatically retries failed image pulls with exponential backoff. `KubernetesPodOperator` monitors the pod via `detect_pod_terminate_early_issues()` in `pod_manager.py` and aborts early on what it deems a fatal pull error, but the transient-error pattern list did not cover registry 5xx / gateway outages. A short Docker Hub outage (502/503/504 from `auth.docker.io` or the registry) was classified as fatal and aborted the pod before kubelet's next retry, even though the pull would have succeeded once upstream recovered.

Observed in this run, where `test_pod_hostnetwork` failed.

Changes

- Add registry 5xx patterns (`bad gateway`, `service unavailable`, `gateway timeout`) to `TRANSIENT_ERROR_PATTERNS` so these conditions let kubelet keep retrying within the caller's `startup_timeout` budget.
- Treat the `ImagePullBackOff` `back-off pulling` cooldown message as transient; without it we would still bail the moment kubelet moves into its retry backoff state.
- Add tests for `detect_pod_terminate_early_issues` covering both the new transient patterns and the fatal cases.
- Update `test_await_pod_completion_breaks_on_early_termination_issue` to use `InvalidImageName` (still fatal) instead of the now-transient `ImagePullBackOff` / `back-off pulling` combo.

`startup_timeout` still bounds the overall wait, and genuinely fatal pull errors (`InvalidImageName`, `ErrImageNeverPull`, `manifest unknown`, `unauthorized`, …) remain fatal.

Was generative AI tooling used to co-author this PR?
Generated-by: Claude Code (Opus 4.7, 1M context) following the guidelines