Fix deferrable KPO trigger_reentry crash when pod is GC'd before re-entry by jekeanyanwu · Pull Request #66716 · apache/airflow

jekeanyanwu · 2026-05-11T14:03:00Z

Problem

When KubernetesPodOperator runs in deferrable mode and the Kubernetes garbage collector deletes the pod in the window between the trigger firing a success/error/timeout event and the worker re-entering the task, trigger_reentry crashes.

The unguarded self.pod = self.hook.get_pod(pod_name, pod_namespace) raises ApiException(404) and escapes trigger_reentry. On provider versions before #56976 (which added if self.pod is None: return to _clean), the finally block additionally crashes with AttributeError: 'NoneType' object has no attribute 'metadata', masking the original cause.

The existing dead-code branch right below the call:

self.pod = self.hook.get_pod(pod_name, pod_namespace)

if not self.pod:
    raise PodNotFoundException("Could not find pod after resuming from deferral")

was clearly intended to handle this, but hook.get_pod() raises rather than returning None, so the translation never happens.
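To illustrate why that branch is unreachable, here is a minimal sketch that uses stand-in classes rather than the real provider code. `ApiException`, `PodNotFoundException`, `RaisingHook`, `reentry_before_fix`, and `observed_exception` are all defined locally for illustration and only mimic the behaviour described above.

```python
class ApiException(Exception):
    """Local stand-in for kubernetes.client.exceptions.ApiException (assumption)."""

    def __init__(self, status):
        super().__init__(f"({status})")
        self.status = status


class PodNotFoundException(Exception):
    """Local stand-in for the provider's PodNotFoundException (assumption)."""


class RaisingHook:
    """Hypothetical hook whose get_pod raises on a missing pod, as described above."""

    def get_pod(self, name, namespace):
        raise ApiException(status=404)


def reentry_before_fix(hook, pod_name, pod_namespace):
    # Mirrors the shape of the unguarded call plus the None check.
    pod = hook.get_pod(pod_name, pod_namespace)
    if not pod:
        # Never reached when get_pod raises instead of returning None.
        raise PodNotFoundException("Could not find pod after resuming from deferral")
    return pod


def observed_exception(hook):
    """Return the name of the exception type that escapes, or None."""
    try:
        reentry_before_fix(hook, "some-pod", "some-namespace")
    except Exception as exc:
        return type(exc).__name__
    return None
```

With a hook that raises, `observed_exception(RaisingHook())` reports `ApiException`, not `PodNotFoundException`, which is the gap this PR closes.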

We hit this routinely on a Kubernetes cluster that aggressively reclaims completed pods:

File ".../operators/pod.py", line 834, in trigger_reentry
    self.pod = self.hook.get_pod(pod_name, pod_namespace)
kubernetes.client.exceptions.ApiException: (404) Not Found
{"message":"pods \"load-chiba-lotte-marines-player-tracking-bq-t0x42m45\" not found", ...}

During handling of the above exception, another exception occurred:
...
File ".../operators/pod.py", line 905, in _clean
    self.pod = self.pod_manager.await_pod_completion(...)
File ".../utils/pod_manager.py", line 808, in read_pod
    return self._client.read_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
AttributeError: 'NoneType' object has no attribute 'metadata'

The trigger had already emitted status: success — the pod ran to completion successfully, was GC'd, and the worker resumed only to fail the task.

Solution

Wrap the get_pod call so that:

Non-404 ApiException re-raises unchanged.
404 + event["status"] == "success" logs a warning and returns. The trigger already observed the pod completed successfully; logs/XCom are unrecoverable but the task itself succeeded, so retrying is wrong.
404 + non-success event raises PodNotFoundException, matching the existing dead-code intent.

The pre-existing if not self.pod: branch is kept as a defensive guard for any subclass override that returns None instead of raising.
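A rough sketch of the three branches, written against local stand-ins so it runs without Airflow or the Kubernetes client installed. The names `fetch_pod_on_reentry` and `FakeHook` are hypothetical, and returning `None` for the "pod gone after success" case is an assumption made for this sketch; the merged change may be structured differently.

```python
import logging

log = logging.getLogger(__name__)


class ApiException(Exception):
    """Local stand-in for kubernetes.client.exceptions.ApiException (assumption)."""

    def __init__(self, status):
        super().__init__(f"({status})")
        self.status = status


class PodNotFoundException(Exception):
    """Local stand-in for the provider's PodNotFoundException (assumption)."""


class FakeHook:
    """Hypothetical hook: raises the given error, otherwise returns the given pod."""

    def __init__(self, error=None, pod=None):
        self._error = error
        self._pod = pod

    def get_pod(self, name, namespace):
        if self._error is not None:
            raise self._error
        return self._pod


def fetch_pod_on_reentry(hook, event, pod_name, pod_namespace):
    """Return the pod, or None when it is gone but the trigger reported success."""
    try:
        pod = hook.get_pod(pod_name, pod_namespace)
    except ApiException as exc:
        if exc.status != 404:
            # Non-404 errors propagate unchanged.
            raise
        if event.get("status") == "success":
            # The trigger already saw the pod complete; logs/XCom are unrecoverable.
            log.warning("Pod %s was deleted before re-entry; treating as success", pod_name)
            return None
        raise PodNotFoundException("Could not find pod after resuming from deferral") from exc
    if not pod:
        # Defensive guard for overrides that return None instead of raising.
        raise PodNotFoundException("Could not find pod after resuming from deferral")
    return pod
```

In this sketch a caller would treat a `None` result as "nothing left to clean up" and return early, which corresponds to the success branch described above.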

Tests

Three new unit tests in TestKubernetesPodOperatorAsync covering the three branches:

test_async_trigger_reentry_returns_when_pod_gcd_on_success
test_async_trigger_reentry_raises_pod_not_found_on_failure
test_async_trigger_reentry_propagates_non_404_api_exception

uv run --project providers/cncf/kubernetes pytest \
  providers/cncf/kubernetes/tests/unit/cncf/kubernetes/operators/test_pod.py::TestKubernetesPodOperatorAsync \
  -k trigger_reentry -q

Related prior work

#39296 ("Handling exception getting logs when pods finish success") added (HTTPError, ApiException) handling around _write_logs(). That code runs after the unguarded get_pod(), so it doesn't help here.
#56976 ("improve deferrable KPO handling of deleted pods in between polls") added if self.pod is None: return to _clean. That also runs after the unguarded get_pod(), so it doesn't help either.
This PR closes the remaining gap at the unguarded get_pod() call itself.

boring-cyborg · 2026-05-11T14:03:07Z

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide.
Here are some useful points:

Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example Dag that shows how users should use it.
Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
Be sure to read the Airflow Coding style.
Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

shahar1

Static checks currently fail

jscheffl · 2026-05-11T21:57:25Z

This PR has a (code) overlap with #66705. Can you check the other one as well?

jekeanyanwu · 2026-05-11T23:03:04Z

Thanks for the heads-up @jscheffl — I had a look at #66705.

The two PRs touch the same trigger_reentry block but address orthogonal failure modes:

#66705 ("Re-defer KubernetesPodOperator when pod is still running after trigger error") handles event["status"] == "error" when the pod is still alive (transient triggerer/API issues) by re-deferring up to MAX_REDEFER_ATTEMPTS. The get_pod call itself is left unguarded and is moved outside the try/finally in that PR.
#66716 (this one, "Fix deferrable KPO trigger_reentry crash when pod is GC'd before re-entry") handles the case where the pod is already gone by the time trigger_reentry runs: get_pod raises ApiException(404) and escapes uncaught. On success events the trigger already observed completion, so we return cleanly; otherwise we raise PodNotFoundException.

So they're complementary rather than conflicting in intent, but whichever lands second will need a small rebase. Happy to do that on this side if #66705 goes in first — the fix here collapses to a small try/except ApiException around the (now unguarded) get_pod call in that PR's structure.

Also just pushed a ruff-format fix for the static-checks failure (function signatures had been wrapped where ruff wants them on one line).

jscheffl · 2026-05-12T21:15:39Z

So they're complementary rather than conflicting in intent, but whichever lands second will need a small rebase. Happy to do that on this side if #66705 goes in first — the fix here collapses to a small try/except ApiException around the (now unguarded) get_pod call in that PR's structure.

Oh sorry, I just wanted to connect both streams. Of course both PRs are valid! I did not want to question this. I actually just thought that whichever PR we merge first, the other needs to resolve conflicts.

Sorry, yours was a bit later than the other, so your PR is now the victim that needs a conflict resolved. After that, I think it is also good to merge.

jekeanyanwu · 2026-05-13T01:41:24Z

Sounds great - thanks. I've resolved the conflicts on my end 👍

…ntry

When KubernetesPodOperator runs in deferrable mode and the pod is reclaimed by Kubernetes between the trigger firing and the worker re-entering the task, the unguarded `self.pod = self.hook.get_pod(...)` in `trigger_reentry` raises `ApiException(404)` and escapes. The dead-code `if not self.pod:` branch intended to translate this to `PodNotFoundException` is never reached because `get_pod` raises rather than returning `None`.

Wrap the `get_pod` call so that:
- Non-404 ApiExceptions re-raise unchanged.
- 404 + event status "success" returns cleanly (the trigger already observed the pod completed successfully; logs/XCom are unrecoverable but the task itself succeeded).
- 404 + non-success event raises `PodNotFoundException` (matches existing dead-code intent).

Refs apache#66715.

boring-cyborg · 2026-05-13T20:46:23Z

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

boring-cyborg Bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels May 11, 2026

jekeanyanwu mentioned this pull request May 11, 2026

KubernetesPodOperator deferrable: trigger_reentry crashes when pod is GC'd before re-entry #66715

Closed

2 tasks

jekeanyanwu force-pushed the fix-kpo-trigger-reentry-404 branch from dd49cd3 to faf14fe Compare May 11, 2026 14:25

jekeanyanwu marked this pull request as ready for review May 11, 2026 14:34

jekeanyanwu requested review from hussein-awala, jedcunningham and jscheffl as code owners May 11, 2026 14:34

shahar1 reviewed May 11, 2026

jscheffl mentioned this pull request May 11, 2026

Re-defer KubernetesPodOperator when pod is still running after trigger error #66705

Merged

1 task

jekeanyanwu force-pushed the fix-kpo-trigger-reentry-404 branch 3 times, most recently from 2cf9421 to ea23ac3 Compare May 12, 2026 13:48

jekeanyanwu requested a review from shahar1 May 12, 2026 13:49

jekeanyanwu force-pushed the fix-kpo-trigger-reentry-404 branch 3 times, most recently from 16ca555 to 5cb4475 Compare May 12, 2026 20:12

jekeanyanwu force-pushed the fix-kpo-trigger-reentry-404 branch from 5cb4475 to 1beecde Compare May 13, 2026 01:34

jekeanyanwu force-pushed the fix-kpo-trigger-reentry-404 branch 4 times, most recently from 81fbc07 to daab151 Compare May 13, 2026 15:41

jekeanyanwu added 2 commits May 13, 2026 13:06

fix: apply ruff-format to new test signatures

0888aeb

jekeanyanwu force-pushed the fix-kpo-trigger-reentry-404 branch from daab151 to 0888aeb Compare May 13, 2026 17:06

jscheffl approved these changes May 13, 2026

jscheffl merged commit bda472d into apache:main May 13, 2026
112 checks passed

jscheffl mentioned this pull request May 19, 2026

Status of testing Providers that were prepared on May 19, 2026 #67213

Open

jscheffl merged 2 commits into apache:main from jekeanyanwu:fix-kpo-trigger-reentry-404

jekeanyanwu commented May 11, 2026 • edited by eladkal

3 participants
