Skip to content

KubernetesPodOperator deferrable: trigger_reentry crashes when pod is GC'd before re-entry #66715

@jekeanyanwu

Description

@jekeanyanwu

Apache Airflow Provider(s)

cncf-kubernetes

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes (reproduced against current main; same code path exists at least back to 10.x).

Apache Airflow version

Reproduced on 2.x with provider 10.x; the buggy code path also exists on main.

What happened

When KubernetesPodOperator runs in deferrable mode and the Kubernetes garbage collector deletes the pod in the window between the trigger firing a success/error/timeout event and the worker re-entering the task, trigger_reentry crashes.

The unguarded self.pod = self.hook.get_pod(pod_name, pod_namespace) call raises kubernetes.client.exceptions.ApiException: (404) and that exception escapes trigger_reentry. On older provider versions where _clean does not yet guard self.pod is None, the finally block additionally crashes with AttributeError: 'NoneType' object has no attribute 'metadata', masking the original cause. Either way, a task whose pod actually completed successfully is marked failed and retried.

Real-world traceback we hit (one of many; the cluster reclaims completed pods aggressively):

File ".../airflow/providers/cncf/kubernetes/operators/pod.py", line 834, in trigger_reentry
    self.pod = self.hook.get_pod(pod_name, pod_namespace)
...
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",
  "message":"pods \"load-chiba-lotte-marines-player-tracking-bq-t0x42m45\" not found",
  "reason":"NotFound","details":{"name":"...","kind":"pods"},"code":404}

During handling of the above exception, another exception occurred:

File ".../airflow/providers/cncf/kubernetes/operators/pod.py", line 895, in trigger_reentry
    self._clean(event, context)
File ".../airflow/providers/cncf/kubernetes/operators/pod.py", line 905, in _clean
    self.pod = self.pod_manager.await_pod_completion(
File ".../airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 808, in read_pod
    return self._client.read_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
AttributeError: 'NoneType' object has no attribute 'metadata'

The trigger had already emitted status: success — the pod ran to completion, was GC'd, and then the worker resumed and tried to fetch it.

What you think should happen instead

The author's clear intent in the existing code is to translate a missing pod into the operator-level PodNotFoundException:

self.pod = self.hook.get_pod(pod_name, pod_namespace)

if not self.pod:
    raise PodNotFoundException("Could not find pod after resuming from deferral")

But hook.get_pod() never returns None — it raises ApiException(404), so the if not self.pod branch is dead code. The 404 escapes uncaught.

PR #39296 added (HTTPError, ApiException) handling around _write_logs(), and PR #56976 added a self.pod is None early-return in _clean. Neither of those help here: both run after the unguarded get_pod() call.

Proposed fix: wrap the get_pod() call in try/except ApiException and:

  • On non-404, re-raise unchanged.
  • On 404 + event["status"] == "success": log a warning and return (the pod already completed successfully per the trigger — logs/XCom are unrecoverable but the task succeeded).
  • On 404 + non-success event: raise PodNotFoundException (matches the existing dead-code intent).

I have a draft PR with the fix and unit tests linked below.

How to reproduce

Run any KubernetesPodOperator with deferrable=True against a cluster that GCs pods aggressively (or just kubectl delete pod the running pod between the trigger firing success and the worker re-entering). The worker resumes, get_pod returns 404, and the task fails with the traceback above. We see this routinely under GKE Autopilot when daemonsets preempt nodes hosting just-completed task pods.

Anything else

Affects any deployment that uses KubernetesPodOperator in deferrable mode with a cluster that aggressively reclaims completed pods (GKE Autopilot, EKS with node-pressure eviction, preemption-by-priority-class). Workaround until this is fixed is a SafeKubernetesPodOperator subclass that catches ApiException(404) (and AttributeError on older providers) in trigger_reentry.

Are you willing to submit PR?

Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    area:providerskind:bugThis is a clearly a bugpriority:highHigh priority bug that should be patched quickly but does not require immediate new releaseprovider:cncf-kubernetesKubernetes (k8s) provider related issues

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions