Fix deferrable KPO trigger_reentry crash when pod is GC'd before re-entry#66716
Conversation
|
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide
|
dd49cd3 to
faf14fe
Compare
shahar1
left a comment
There was a problem hiding this comment.
Static checks currently fail
|
This PR has an (code) overlap with #66705 - can you check the other as well? |
|
Thanks for the heads-up @jscheffl — I had a look at #66705. The two PRs touch the same
So they're complementary rather than conflicting in intent, but whichever lands second will need a small rebase. Happy to do that on this side if #66705 goes in first — the fix here collapses to a small Also just pushed a ruff-format fix for the static-checks failure (function signatures had been wrapped where ruff wants them on one line). |
2cf9421 to
ea23ac3
Compare
16ca555 to
5cb4475
Compare
Oh sorry just wanted to connect both streams - of course both PRs are vaild! Did not want to question this. Actually just thought of whatever PR we merge first the other need to resolve conflicts. Sorry your's was a nit later than the other so you PR is now a victim that needs a conluict resolved. After I thik also good to merge. |
5cb4475 to
1beecde
Compare
|
Sounds great - thanks. I've resolved the conflicts on my end 👍 |
81fbc07 to
daab151
Compare
…ntry When KubernetesPodOperator runs in deferrable mode and the pod is reclaimed by Kubernetes between the trigger firing and the worker re-entering the task, the unguarded `self.pod = self.hook.get_pod(...)` in `trigger_reentry` raises `ApiException(404)` and escapes. The dead-code `if not self.pod:` branch intended to translate this to `PodNotFoundException` is never reached because `get_pod` raises rather than returning `None`. Wrap the `get_pod` call so that: - Non-404 ApiExceptions re-raise unchanged. - 404 + event status "success" returns cleanly (the trigger already observed the pod completed successfully; logs/XCom are unrecoverable but the task itself succeeded). - 404 + non-success event raises `PodNotFoundException` (matches existing dead-code intent). Refs apache#66715.
daab151 to
0888aeb
Compare
|
Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions. |
closes: #66715
Problem
When
KubernetesPodOperatorruns in deferrable mode and the Kubernetes garbage collector deletes the pod in the window between the trigger firing a success/error/timeout event and the worker re-entering the task,trigger_reentrycrashes.The unguarded
self.pod = self.hook.get_pod(pod_name, pod_namespace)raisesApiException(404)and escapestrigger_reentry. On provider versions before #56976 (which addedif self.pod is None: returnto_clean), thefinallyblock additionally crashes withAttributeError: 'NoneType' object has no attribute 'metadata', masking the original cause.The existing dead-code branch right below the call:
was clearly intended to handle this, but
hook.get_pod()raises rather than returningNone, so the translation never happens.We hit this routinely on a Kubernetes cluster that aggressively reclaims completed pods:
The trigger had already emitted
status: success— the pod ran to completion successfully, was GC'd, and the worker resumed only to fail the task.Solution
Wrap the
get_podcall so that:ApiExceptionre-raises unchanged.event["status"] == "success"logs a warning and returns. The trigger already observed the pod completed successfully; logs/XCom are unrecoverable but the task itself succeeded, so retrying is wrong.PodNotFoundException, matching the existing dead-code intent.The pre-existing
if not self.pod:branch is kept as a defensive guard for any subclass override that returnsNoneinstead of raising.Tests
Three new unit tests in
TestKubernetesPodOperatorAsynccovering the three branches:test_async_trigger_reentry_returns_when_pod_gcd_on_successtest_async_trigger_reentry_raises_pod_not_found_on_failuretest_async_trigger_reentry_propagates_non_404_api_exceptionRelated prior work
(HTTPError, ApiException)handling around_write_logs()— runs after the unguardedget_pod(), doesn't help.if self.pod is None: returnto_clean— runs after, doesn't help.get_pod()call itself.