Fix KubernetesPodTrigger.get_task_state KeyError on mapped TIs (#67296) #67297
Merged
jscheffl merged 1 commit on May 21, 2026
…e#67296) The execution API's /states endpoint encodes the response key as ``f"{task_id}_{map_index}"`` for mapped TIs but the trigger was looking the value up by plain ``task_id``. For any mapped deferrable KubernetesPodOperator task that lookup raised KeyError, which cleanup()'s broad ``except Exception`` swallowed and skipped ``hook.delete_pod()`` -- so Mark Failed in the UI left the pod running until ``active_deadline_seconds`` expired. Compose the lookup key with the ``_{map_index}`` suffix when the TI is mapped, matching how the API serialises the response. cleanup() now sees the real state, ``safe_to_cancel()`` returns the right value, and mark-failed actually deletes the pod within the grace period. Co-authored-by: Cursor <cursoragent@cursor.com>
jscheffl approved these changes on May 21, 2026
Fix a key-shape mismatch between the execution API's `/states` endpoint and `KubernetesPodTrigger.get_task_state`: the endpoint suffixes the response key with `_{map_index}` for mapped TIs, but the trigger looked the value up by plain `task_id`, so the lookup `KeyError`-ed for every mapped deferrable `KubernetesPodOperator` task.

`cleanup()`'s broad `except Exception` swallowed the error and defensively skipped `hook.delete_pod()` -- leaving Mark Failed in the UI useless on mapped deferrable KPO tasks. The pod stayed `Running` until `active_deadline_seconds` expired (often hours).

For continuous deferrable pollers expanded via `.expand(...)` this also caused overlapping-writer races against external systems on the next schedule, because the failed run's pod was still alive when the next run spawned its own pod.
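To make the failure mode concrete, here is a minimal, self-contained sketch of how a plain-`task_id` lookup combined with a broad `except Exception` ends up skipping pod deletion. It is not the provider's code: the flat `states` dict, the function signatures, the `delete_pod` callback, and the "delete unless still deferred" rule are all simplifying assumptions made for illustration.

```python
import logging
from typing import Callable

log = logging.getLogger("kpo_cleanup_sketch")


def get_task_state(states: dict[str, str], task_id: str) -> str:
    # Pre-fix behaviour (simplified): the lookup ignores map_index entirely.
    return states[task_id]


def cleanup(states: dict[str, str], task_id: str, delete_pod: Callable[[], None]) -> bool:
    """Return True if the pod was deleted, False if deletion was skipped."""
    try:
        state = get_task_state(states, task_id)
    except Exception:
        # The KeyError from the mapped-TI lookup lands here and is swallowed.
        log.warning("Could not determine task state during cleanup; skipping pod deletion to be safe.")
        return False
    # Assumption for this sketch: only delete when the task is no longer deferred.
    if state != "deferred":
        delete_pod()
        return True
    return False


# Hypothetical response for a mapped TI: the key carries the _{map_index} suffix.
states = {"map_group.task_a_2": "failed"}
deleted: list[str] = []
result = cleanup(states, "map_group.task_a", lambda: deleted.append("task_a-2"))
```

With these inputs `cleanup` returns `False` and `deleted` stays empty, even though the task was marked failed. That mirrors the symptom described above.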
This PR composes the lookup key with the `_{map_index}` suffix when the TI is mapped, matching how the API serialises the response.
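The key composition can be sketched roughly as follows. This is an illustration, not the merged diff: `compose_state_key` and `lookup_state` are hypothetical helper names, the response is flattened to a single dict, and the sketch relies on Airflow's convention that unmapped task instances carry `map_index == -1`.

```python
def compose_state_key(task_id: str, map_index: int = -1) -> str:
    """Build the key under which a TI's state is expected in the response."""
    # Mapped TIs have map_index >= 0; unmapped TIs use -1 and keep the plain task_id.
    if map_index >= 0:
        return f"{task_id}_{map_index}"
    return task_id


def lookup_state(states: dict[str, str], task_id: str, map_index: int = -1) -> str:
    # Hypothetical flattened lookup; a missing key still raises KeyError for the caller to wrap.
    return states[compose_state_key(task_id, map_index)]


# Hypothetical response containing one mapped and one unmapped TI.
states = {"map_group.task_a_2": "failed", "plain_task": "running"}
mapped_state = lookup_state(states, "map_group.task_a", 2)
plain_state = lookup_state(states, "plain_task")
```

Keeping the unmapped branch unchanged is what the non-mapped regression test in the test plan guards.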
closes: #67296
Test plan
Unit tests
Three new tests in `providers/cncf/kubernetes/tests/unit/cncf/kubernetes/triggers/test_pod.py`:

- `test_get_task_state_uses_task_id_for_non_mapped_ti` -- regression guard for the non-mapped branch (response keyed by plain `task_id`).
- `test_get_task_state_uses_composite_key_for_mapped_ti` -- the bug repro: mapped TI with `task_id="map_group.task_a"` + `map_index=2` returns the state stored under key `"map_group.task_a_2"`.
- `test_get_task_state_raises_when_mapped_key_missing` -- pins the wrapped `AirflowException` shape so callers (e.g. `safe_to_cancel`'s `except Exception`) keep matching.

Local run:
```
uv run --no-sync --project providers/cncf/kubernetes pytest \
providers/cncf/kubernetes/tests/unit/cncf/kubernetes/triggers/test_pod.py \
-k "get_task_state or safe_to_cancel"
```
Result: `5 passed, 2 skipped, 1 warning in 2.40s` (the 2 skips are legacy Airflow < 3.3 paths unrelated to this PR).

End-to-end smoke test on an EKS sandbox cluster
Airflow 3.2.1 with `apache-airflow-providers-cncf-kubernetes==10.16.0`, deferrable KPO tasks inside a mapped `@task_group.expand(...)` (the same shape as the issue's repro DAG).

Before this PR (provider 10.16.0 unpatched) -- one of three TI logs after Mark Failed:
```
[17:46:25.889914Z] WARNING - Could not determine task state during cleanup;
skipping pod deletion to be safe.
AirflowException: ('TaskInstance with dag_id: %s, ...', '...',
'map_group.task_a', '...', 2)
,KeyError: 'map_group.task_a'
File pod.py, line 399 in get_task_state
```
Pods stayed `Running` for the full 600s sleep; nothing in the kubelet event stream until `active_deadline_seconds` (would have been 7200s here).

After this PR -- same DAG, same Mark Failed:
```
18:17:54 pod: starting task_a-0 (duration=600s exit_code=0) (x3)
18:20:07 kubelet: Killing: Stopping container base (x3)
18:20:07 pod: received SIGTERM (x3)
18:20:07 runtime: Task ... deleted with exit code 0 (x3)
```
End-to-end `Mark Failed` -> `delete_pod` -> kubelet `SIGTERM` -> pod-side `trap` -> container `exit 0` -> kubelet cleanup completes in ~1 second for all three mapped TIs. The previous `Could not determine task state` warning no longer fires.
Static checks

- `prek run ruff --files <changed>`: pass
- `prek run ruff-format --files <changed>`: pass

Was generative AI tooling used to co-author this PR?
Generated-by: Cursor / Claude Opus 4.7 following the guidelines
Made with Cursor