Apache Airflow Provider(s)
cncf-kubernetes
Versions of Apache Airflow Providers
apache-airflow-providers-cncf-kubernetes (reproduced against current main; same code path exists at least back to 10.x).
Apache Airflow version
Reproduced on 2.x with provider 10.x; the buggy code path also exists on main.
What happened
When KubernetesPodOperator runs in deferrable mode and the Kubernetes garbage collector deletes the pod in the window between the trigger firing a success/error/timeout event and the worker re-entering the task, trigger_reentry crashes.
The unguarded self.pod = self.hook.get_pod(pod_name, pod_namespace) call raises kubernetes.client.exceptions.ApiException: (404) and that exception escapes trigger_reentry. On older provider versions where _clean does not yet guard self.pod is None, the finally block additionally crashes with AttributeError: 'NoneType' object has no attribute 'metadata', masking the original cause. Either way, a task whose pod actually completed successfully is marked failed and retried.
Real-world traceback we hit (one of many; the cluster reclaims completed pods aggressively):
File ".../airflow/providers/cncf/kubernetes/operators/pod.py", line 834, in trigger_reentry
self.pod = self.hook.get_pod(pod_name, pod_namespace)
...
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",
"message":"pods \"load-chiba-lotte-marines-player-tracking-bq-t0x42m45\" not found",
"reason":"NotFound","details":{"name":"...","kind":"pods"},"code":404}
During handling of the above exception, another exception occurred:
File ".../airflow/providers/cncf/kubernetes/operators/pod.py", line 895, in trigger_reentry
self._clean(event, context)
File ".../airflow/providers/cncf/kubernetes/operators/pod.py", line 905, in _clean
self.pod = self.pod_manager.await_pod_completion(
File ".../airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 808, in read_pod
return self._client.read_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
AttributeError: 'NoneType' object has no attribute 'metadata'
The trigger had already emitted status: success — the pod ran to completion, was GC'd, and then the worker resumed and tried to fetch it.
What you think should happen instead
The author's clear intent in the existing code is to translate a missing pod into the operator-level PodNotFoundException:
self.pod = self.hook.get_pod(pod_name, pod_namespace)
if not self.pod:
raise PodNotFoundException("Could not find pod after resuming from deferral")
But hook.get_pod() never returns None — it raises ApiException(404), so the if not self.pod branch is dead code. The 404 escapes uncaught.
PR #39296 added (HTTPError, ApiException) handling around _write_logs(), and PR #56976 added a self.pod is None early-return in _clean. Neither of those help here: both run after the unguarded get_pod() call.
Proposed fix: wrap the get_pod() call in try/except ApiException and:
- On non-404, re-raise unchanged.
- On 404 +
event["status"] == "success": log a warning and return (the pod already completed successfully per the trigger — logs/XCom are unrecoverable but the task succeeded).
- On 404 + non-success event: raise
PodNotFoundException (matches the existing dead-code intent).
I have a draft PR with the fix and unit tests linked below.
How to reproduce
Run any KubernetesPodOperator with deferrable=True against a cluster that GCs pods aggressively (or just kubectl delete pod the running pod between the trigger firing success and the worker re-entering). The worker resumes, get_pod returns 404, and the task fails with the traceback above. We see this routinely under GKE Autopilot when daemonsets preempt nodes hosting just-completed task pods.
Anything else
Affects any deployment that uses KubernetesPodOperator in deferrable mode with a cluster that aggressively reclaims completed pods (GKE Autopilot, EKS with node-pressure eviction, preemption-by-priority-class). Workaround until this is fixed is a SafeKubernetesPodOperator subclass that catches ApiException(404) (and AttributeError on older providers) in trigger_reentry.
Are you willing to submit PR?
Code of Conduct
Apache Airflow Provider(s)
cncf-kubernetes
Versions of Apache Airflow Providers
apache-airflow-providers-cncf-kubernetes(reproduced against currentmain; same code path exists at least back to 10.x).Apache Airflow version
Reproduced on 2.x with provider 10.x; the buggy code path also exists on
main.What happened
When
KubernetesPodOperatorruns in deferrable mode and the Kubernetes garbage collector deletes the pod in the window between the trigger firing a success/error/timeout event and the worker re-entering the task,trigger_reentrycrashes.The unguarded
self.pod = self.hook.get_pod(pod_name, pod_namespace)call raiseskubernetes.client.exceptions.ApiException: (404)and that exception escapestrigger_reentry. On older provider versions where_cleandoes not yet guardself.pod is None, thefinallyblock additionally crashes withAttributeError: 'NoneType' object has no attribute 'metadata', masking the original cause. Either way, a task whose pod actually completed successfully is marked failed and retried.Real-world traceback we hit (one of many; the cluster reclaims completed pods aggressively):
The trigger had already emitted
status: success— the pod ran to completion, was GC'd, and then the worker resumed and tried to fetch it.What you think should happen instead
The author's clear intent in the existing code is to translate a missing pod into the operator-level
PodNotFoundException:But
hook.get_pod()never returnsNone— it raisesApiException(404), so theif not self.podbranch is dead code. The 404 escapes uncaught.PR #39296 added
(HTTPError, ApiException)handling around_write_logs(), and PR #56976 added aself.pod is Noneearly-return in_clean. Neither of those help here: both run after the unguardedget_pod()call.Proposed fix: wrap the
get_pod()call intry/except ApiExceptionand:event["status"] == "success": log a warning and return (the pod already completed successfully per the trigger — logs/XCom are unrecoverable but the task succeeded).PodNotFoundException(matches the existing dead-code intent).I have a draft PR with the fix and unit tests linked below.
How to reproduce
Run any
KubernetesPodOperatorwithdeferrable=Trueagainst a cluster that GCs pods aggressively (or justkubectl delete podthe running pod between the trigger firingsuccessand the worker re-entering). The worker resumes,get_podreturns 404, and the task fails with the traceback above. We see this routinely under GKE Autopilot when daemonsets preempt nodes hosting just-completed task pods.Anything else
Affects any deployment that uses
KubernetesPodOperatorin deferrable mode with a cluster that aggressively reclaims completed pods (GKE Autopilot, EKS with node-pressure eviction, preemption-by-priority-class). Workaround until this is fixed is aSafeKubernetesPodOperatorsubclass that catchesApiException(404)(andAttributeErroron older providers) intrigger_reentry.Are you willing to submit PR?
Code of Conduct