KubernetesPodOperator deferrable: trigger_reentry crashes when pod is GC'd before re-entry

### Apache Airflow Provider(s)

cncf-kubernetes

### Versions of Apache Airflow Providers

`apache-airflow-providers-cncf-kubernetes` (reproduced against current `main`; same code path exists at least back to 10.x).

### Apache Airflow version

Reproduced on 2.x with provider 10.x; the buggy code path also exists on `main`.

### What happened

When `KubernetesPodOperator` runs in deferrable mode and the Kubernetes garbage collector deletes the pod in the window between the trigger firing a success/error/timeout event and the worker re-entering the task, `trigger_reentry` crashes.

The unguarded `self.pod = self.hook.get_pod(pod_name, pod_namespace)` call raises `kubernetes.client.exceptions.ApiException: (404)` and that exception escapes `trigger_reentry`. On older provider versions where `_clean` does not yet guard `self.pod is None`, the `finally` block additionally crashes with `AttributeError: 'NoneType' object has no attribute 'metadata'`, masking the original cause. Either way, a task whose pod actually **completed successfully** is marked failed and retried.

Real-world traceback we hit (one of many; the cluster reclaims completed pods aggressively):

```
File ".../airflow/providers/cncf/kubernetes/operators/pod.py", line 834, in trigger_reentry
    self.pod = self.hook.get_pod(pod_name, pod_namespace)
...
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",
  "message":"pods \"load-chiba-lotte-marines-player-tracking-bq-t0x42m45\" not found",
  "reason":"NotFound","details":{"name":"...","kind":"pods"},"code":404}

During handling of the above exception, another exception occurred:

File ".../airflow/providers/cncf/kubernetes/operators/pod.py", line 895, in trigger_reentry
    self._clean(event, context)
File ".../airflow/providers/cncf/kubernetes/operators/pod.py", line 905, in _clean
    self.pod = self.pod_manager.await_pod_completion(
File ".../airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 808, in read_pod
    return self._client.read_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
AttributeError: 'NoneType' object has no attribute 'metadata'
```

The trigger had already emitted `status: success` — the pod ran to completion, was GC'd, and then the worker resumed and tried to fetch it.

### What you think should happen instead

The author's clear intent in the existing code is to translate a missing pod into the operator-level `PodNotFoundException`:

```python
self.pod = self.hook.get_pod(pod_name, pod_namespace)

if not self.pod:
    raise PodNotFoundException("Could not find pod after resuming from deferral")
```

But `hook.get_pod()` never returns `None` — it raises `ApiException(404)`, so the `if not self.pod` branch is dead code. The 404 escapes uncaught.

PR #39296 added `(HTTPError, ApiException)` handling around `_write_logs()`, and PR #56976 added a `self.pod is None` early-return in `_clean`. Neither of those help here: both run *after* the unguarded `get_pod()` call.

Proposed fix: wrap the `get_pod()` call in `try/except ApiException` and:
- On non-404, re-raise unchanged.
- On 404 + `event["status"] == "success"`: log a warning and return (the pod already completed successfully per the trigger — logs/XCom are unrecoverable but the task succeeded).
- On 404 + non-success event: raise `PodNotFoundException` (matches the existing dead-code intent).

I have a draft PR with the fix and unit tests linked below.

### How to reproduce

Run any `KubernetesPodOperator` with `deferrable=True` against a cluster that GCs pods aggressively (or just `kubectl delete pod` the running pod between the trigger firing `success` and the worker re-entering). The worker resumes, `get_pod` returns 404, and the task fails with the traceback above. We see this routinely under GKE Autopilot when daemonsets preempt nodes hosting just-completed task pods.

### Anything else

Affects any deployment that uses `KubernetesPodOperator` in deferrable mode with a cluster that aggressively reclaims completed pods (GKE Autopilot, EKS with node-pressure eviction, preemption-by-priority-class). Workaround until this is fixed is a `SafeKubernetesPodOperator` subclass that catches `ApiException(404)` (and `AttributeError` on older providers) in `trigger_reentry`.

### Are you willing to submit PR?

- [x] Here it is: https://github.com/apache/airflow/pull/66716

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KubernetesPodOperator deferrable: trigger_reentry crashes when pod is GC'd before re-entry #66715

Apache Airflow Provider(s)

Versions of Apache Airflow Providers

Apache Airflow version

What happened

What you think should happen instead

How to reproduce

Anything else

Are you willing to submit PR?

Code of Conduct

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

KubernetesPodOperator deferrable: trigger_reentry crashes when pod is GC'd before re-entry #66715

Description

Apache Airflow Provider(s)

Versions of Apache Airflow Providers

Apache Airflow version

What happened

What you think should happen instead

How to reproduce

Anything else

Are you willing to submit PR?

Code of Conduct

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions