Fix deferrable KPO trigger_reentry crash when pod is GC'd before re-entry #66716

Merged
jscheffl merged 2 commits into apache:main from jekeanyanwu:fix-kpo-trigger-reentry-404 on May 13, 2026

Contributor

@jekeanyanwu jekeanyanwu commented May 11, 2026

closes: #66715

Problem

When KubernetesPodOperator runs in deferrable mode and the Kubernetes garbage collector deletes the pod in the window between the trigger firing a success/error/timeout event and the worker re-entering the task, trigger_reentry crashes.

The unguarded self.pod = self.hook.get_pod(pod_name, pod_namespace) raises ApiException(404) and escapes trigger_reentry. On provider versions before #56976 (which added if self.pod is None: return to _clean), the finally block additionally crashes with AttributeError: 'NoneType' object has no attribute 'metadata', masking the original cause.

The existing dead-code branch right below the call:

self.pod = self.hook.get_pod(pod_name, pod_namespace)

if not self.pod:
    raise PodNotFoundException("Could not find pod after resuming from deferral")

was clearly intended to handle this, but hook.get_pod() raises rather than returning None, so the translation never happens.
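
Why the guard is unreachable can be seen in a minimal, self-contained sketch. `ApiException`, `PodNotFoundException` and `FakeHook` below are hypothetical stand-ins rather than the real kubernetes client or provider classes, and the pod name and namespace are placeholders; only the control flow is meant to match.

```python
class ApiException(Exception):
    """Hypothetical stand-in for kubernetes.client.exceptions.ApiException."""

    def __init__(self, status):
        super().__init__(f"({status})")
        self.status = status


class PodNotFoundException(Exception):
    """Hypothetical stand-in for the provider exception of the same name."""


class FakeHook:
    """Assumed to mimic a hook whose get_pod raises when the pod is missing."""

    def get_pod(self, name, namespace):
        raise ApiException(404)


def unguarded_reentry(hook):
    pod = hook.get_pod("some-pod", "some-namespace")  # raises here
    if not pod:  # never reached when get_pod raises instead of returning None
        raise PodNotFoundException("Could not find pod after resuming from deferral")
    return pod


def observed_exception(hook):
    """Return the name of the exception type that escapes unguarded_reentry."""
    try:
        unguarded_reentry(hook)
    except Exception as exc:
        return type(exc).__name__
    return None
```

In this sketch `observed_exception(FakeHook())` reports the stand-in `ApiException`, not `PodNotFoundException`, because the assignment never completes.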

We hit this routinely on a Kubernetes cluster that aggressively reclaims completed pods:

File ".../operators/pod.py", line 834, in trigger_reentry
    self.pod = self.hook.get_pod(pod_name, pod_namespace)
kubernetes.client.exceptions.ApiException: (404) Not Found
{"message":"pods \"load-chiba-lotte-marines-player-tracking-bq-t0x42m45\" not found", ...}

During handling of the above exception, another exception occurred:
...
File ".../operators/pod.py", line 905, in _clean
    self.pod = self.pod_manager.await_pod_completion(...)
File ".../utils/pod_manager.py", line 808, in read_pod
    return self._client.read_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
AttributeError: 'NoneType' object has no attribute 'metadata'

The trigger had already emitted status: success — the pod ran to completion successfully, was GC'd, and the worker resumed only to fail the task.
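
The masking effect in the traceback above can be reproduced with a simplified model. This is only a sketch: `SketchOperator` is hypothetical and far smaller than the real operator, `ApiException` is a stand-in, and `_clean` is assumed to dereference `self.pod` without a `None` check, as described for provider versions before #56976.

```python
class ApiException(Exception):
    """Hypothetical stand-in for kubernetes.client.exceptions.ApiException."""


class SketchOperator:
    """Hypothetical, heavily simplified model of the re-entry control flow."""

    def __init__(self):
        self.pod = None  # get_pod raised, so self.pod was never assigned

    def _clean(self):
        # Assumed pre-guard behaviour: no `if self.pod is None: return`.
        return self.pod.metadata.name

    def trigger_reentry(self):
        try:
            # Simulates the unguarded get_pod call failing with a 404.
            raise ApiException("(404) Not Found")
        finally:
            # Raises AttributeError while the ApiException is still propagating.
            self._clean()


def describe_failure():
    """Return (visible exception type, original exception type) as names."""
    try:
        SketchOperator().trigger_reentry()
    except AttributeError as exc:
        return type(exc).__name__, type(exc.__context__).__name__
    return None
```

The exception that reaches the caller is the `AttributeError`; the original 404 survives only as its `__context__`, which is what produces the "During handling of the above exception, another exception occurred" section of the traceback.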

Solution

Wrap the get_pod call so that:

  • Non-404 ApiException re-raises unchanged.
  • 404 + event["status"] == "success" logs a warning and returns. The trigger already observed the pod completed successfully; logs/XCom are unrecoverable but the task itself succeeded, so retrying is wrong.
  • 404 + non-success event raises PodNotFoundException, matching the existing dead-code intent.

The pre-existing if not self.pod: branch is kept as a defensive guard for any subclass override that returns None instead of raising.
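
The three branches plus the defensive guard can be sketched as a standalone function. This is an illustration under stated assumptions, not the merged diff: `resolve_pod_on_reentry` is a hypothetical helper, the exception classes are stand-ins for the kubernetes client and provider ones, and the event keys `name` and `namespace` are assumed.

```python
import logging

log = logging.getLogger(__name__)


class ApiException(Exception):
    """Hypothetical stand-in for kubernetes.client.exceptions.ApiException."""

    def __init__(self, status):
        super().__init__(f"({status})")
        self.status = status


class PodNotFoundException(Exception):
    """Hypothetical stand-in for the provider exception of the same name."""


def resolve_pod_on_reentry(hook, event):
    """Return the pod, or None if it was deleted after a successful run."""
    try:
        pod = hook.get_pod(event["name"], event["namespace"])
    except ApiException as exc:
        if exc.status != 404:
            raise  # non-404 errors propagate unchanged
        if event.get("status") == "success":
            # The trigger already saw the pod succeed; only logs/XCom are lost.
            log.warning("Pod %s was deleted before re-entry", event["name"])
            return None
        raise PodNotFoundException(
            "Could not find pod after resuming from deferral"
        ) from exc
    if not pod:
        # Defensive guard for a get_pod override that returns None.
        raise PodNotFoundException("Could not find pod after resuming from deferral")
    return pod
```

A caller would treat a `None` return as "task succeeded, nothing left to fetch" and skip log and XCom retrieval.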

Tests

Three new unit tests in TestKubernetesPodOperatorAsync covering the three branches:

  • test_async_trigger_reentry_returns_when_pod_gcd_on_success
  • test_async_trigger_reentry_raises_pod_not_found_on_failure
  • test_async_trigger_reentry_propagates_non_404_api_exception

uv run --project providers/cncf/kubernetes pytest \
  providers/cncf/kubernetes/tests/unit/cncf/kubernetes/operators/test_pod.py::TestKubernetesPodOperatorAsync \
  -k trigger_reentry -q

Related prior work

@boring-cyborg boring-cyborg Bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels May 11, 2026

boring-cyborg Bot commented May 11, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide.
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example Dag that shows how users should use it.
  • Consider using the Breeze environment for testing locally. It is a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow the ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.

Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

Contributor

@shahar1 shahar1 left a comment

Static checks currently fail

@jscheffl
Contributor

This PR has a code overlap with #66705. Can you check the other one as well?

@jekeanyanwu
Contributor Author

jekeanyanwu commented May 11, 2026

Thanks for the heads-up @jscheffl — I had a look at #66705.

The two PRs touch the same trigger_reentry block but address orthogonal failure modes:

So they're complementary rather than conflicting in intent, but whichever lands second will need a small rebase. Happy to do that on this side if #66705 goes in first — the fix here collapses to a small try/except ApiException around the (now unguarded) get_pod call in that PR's structure.

Also just pushed a ruff-format fix for the static-checks failure (function signatures had been wrapped where ruff wants them on one line).

@jekeanyanwu jekeanyanwu force-pushed the fix-kpo-trigger-reentry-404 branch 3 times, most recently from 2cf9421 to ea23ac3 Compare May 12, 2026 13:48
@jekeanyanwu jekeanyanwu requested a review from shahar1 May 12, 2026 13:49
@jekeanyanwu jekeanyanwu force-pushed the fix-kpo-trigger-reentry-404 branch 3 times, most recently from 16ca555 to 5cb4475 Compare May 12, 2026 20:12
@jscheffl
Contributor

So they're complementary rather than conflicting in intent, but whichever lands second will need a small rebase. Happy to do that on this side if #66705 goes in first — the fix here collapses to a small try/except ApiException around the (now unguarded) get_pod call in that PR's structure.

Oh sorry, I just wanted to connect both streams. Of course both PRs are valid! I did not want to question this. I actually just thought that whichever PR we merge first, the other will need to resolve conflicts.

Sorry, yours was a bit later than the other, so your PR is now the victim that needs a conflict resolved. After that I think it is also good to merge.

@jekeanyanwu jekeanyanwu force-pushed the fix-kpo-trigger-reentry-404 branch from 5cb4475 to 1beecde Compare May 13, 2026 01:34
@jekeanyanwu
Contributor Author

Sounds great - thanks. I've resolved the conflicts on my end 👍

@jekeanyanwu jekeanyanwu force-pushed the fix-kpo-trigger-reentry-404 branch 4 times, most recently from 81fbc07 to daab151 Compare May 13, 2026 15:41
…ntry

When KubernetesPodOperator runs in deferrable mode and the pod is reclaimed by
Kubernetes between the trigger firing and the worker re-entering the task,
the unguarded `self.pod = self.hook.get_pod(...)` in `trigger_reentry` raises
`ApiException(404)` and escapes. The dead-code `if not self.pod:` branch
intended to translate this to `PodNotFoundException` is never reached because
`get_pod` raises rather than returning `None`.

Wrap the `get_pod` call so that:
- Non-404 ApiExceptions re-raise unchanged.
- 404 + event status "success" returns cleanly (the trigger already observed
  the pod completed successfully; logs/XCom are unrecoverable but the task
  itself succeeded).
- 404 + non-success event raises `PodNotFoundException` (matches existing
  dead-code intent).

Refs apache#66715.
@jekeanyanwu jekeanyanwu force-pushed the fix-kpo-trigger-reentry-404 branch from daab151 to 0888aeb Compare May 13, 2026 17:06
@jscheffl jscheffl merged commit bda472d into apache:main May 13, 2026
112 checks passed

boring-cyborg Bot commented May 13, 2026

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

Development

Successfully merging this pull request may close these issues.

KubernetesPodOperator deferrable: trigger_reentry crashes when pod is GC'd before re-entry

3 participants