
Recover stuck TIs when direct terminal-state API call fails #66574

Merged
vatsrahul1001 merged 4 commits into apache:main from potiuk:fix/tasksdk-terminal-state-after-api-success
May 19, 2026

Conversation

@potiuk
Member

@potiuk potiuk commented May 7, 2026

Summary

The supervisor's _handle_request for SucceedTask, RetryTask, DeferTask, and RescheduleTask set _terminal_state before calling the matching client.task_instances.{succeed,retry,defer,reschedule}() API. If that API call raised (transient network blip, server 5xx, etc.), _terminal_state was set on the supervisor but the server never saw the transition. The supervisor's update_task_state_if_needed then saw final_state in STATES_SENT_DIRECTLY and short-circuited the recovery finish() call — leaving the TaskInstance stuck RUNNING on the server forever, blocking downstream dependencies and triggering false alerts.

Fix (two parts)

1. Set _terminal_state after the direct API call succeeds

Make the direct API call first. Only set _terminal_state and the new _terminal_state_synced_to_server flag after the call returns successfully. If the API raises, both stay unset and the exception propagates to handle_requests, where the existing catch-all sends an ErrorResponse to the task subprocess.
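
The ordering change can be sketched roughly as follows. This is a simplified stand-in, not the actual supervisor code: `FlakyClient`, `MiniSupervisor`, `handle_succeed`, and the string state value are hypothetical names chosen for illustration. Only `_terminal_state` and `_terminal_state_synced_to_server` are taken from the description above.

```python
from __future__ import annotations


class FlakyClient:
    """Hypothetical stand-in for the API client: fails the first `failures` calls."""

    def __init__(self, failures: int = 0) -> None:
        self.failures = failures
        self.succeeded: list[str] = []

    def succeed(self, ti_id: str) -> None:
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("simulated transient network failure")
        self.succeeded.append(ti_id)


class MiniSupervisor:
    """Hypothetical, heavily simplified supervisor."""

    def __init__(self, client: FlakyClient, ti_id: str) -> None:
        self.client = client
        self.ti_id = ti_id
        self._terminal_state: str | None = None
        self._terminal_state_synced_to_server = False

    def handle_succeed(self) -> None:
        # API call first: if it raises, the two assignments below never run.
        self.client.succeed(self.ti_id)
        self._terminal_state = "success"
        self._terminal_state_synced_to_server = True


supervisor = MiniSupervisor(FlakyClient(failures=1), "ti-1")
try:
    supervisor.handle_succeed()
except ConnectionError:
    # Stand-in for the catch-all that reports the error to the task subprocess.
    pass
```

After the failed call both attributes keep their initial values, so later code can tell that the server has not yet seen the transition.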

2. Recovery in update_task_state_if_needed

Always call finish() when _terminal_state_synced_to_server is False, regardless of what final_state happens to return. The finish() API takes the state value, so a SUCCESS / DEFERRED / etc. transition that originally failed is re-attempted via finish() on subprocess exit. Pre-existing semantics for the no-direct-API states (FAILED, UP_FOR_RETRY without RetryTask, etc.) are preserved — those land in the same finish() branch.
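
A minimal sketch of the recovery rule, assuming a client whose `finish()` accepts the state to report. `RecordingClient` and the function signature below are hypothetical and much simpler than the real supervisor method; the point is only that the decision depends on the synced flag, not on which state was computed.

```python
from __future__ import annotations


class RecordingClient:
    """Hypothetical client that records every finish() call it receives."""

    def __init__(self) -> None:
        self.finished: list[tuple[str, str]] = []

    def finish(self, ti_id: str, state: str) -> None:
        self.finished.append((ti_id, state))


def update_task_state_if_needed(
    client: RecordingClient, ti_id: str, final_state: str, synced_to_server: bool
) -> bool:
    """Report final_state via finish() unless a direct call already delivered it.

    Returns True when finish() was called.
    """
    if synced_to_server:
        # Happy path: the direct API call succeeded, nothing left to send.
        return False
    # Either the state has no direct API, or the direct call failed earlier.
    client.finish(ti_id, final_state)
    return True
```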

Tests

  • _terminal_state not set when succeed() raises.
  • update_task_state_if_needed calls finish() when synced flag is False, even with final_state == SUCCESS.
  • update_task_state_if_needed skips finish() when synced flag is True (preserves the existing happy-path optimisation).
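
The three cases listed above could be exercised along these lines. This is an illustrative sketch using `unittest.mock`, not the tests from the PR: `TinySupervisor` and its method bodies are hypothetical reductions of the supervisor to the two attributes under test.

```python
from unittest import mock


class TinySupervisor:
    """Hypothetical stand-in reduced to the attributes the tests care about."""

    def __init__(self, client):
        self.client = client
        self._terminal_state = None
        self._terminal_state_synced_to_server = False

    def handle_succeed(self):
        self.client.succeed()
        self._terminal_state = "success"
        self._terminal_state_synced_to_server = True

    def update_task_state_if_needed(self, final_state):
        if not self._terminal_state_synced_to_server:
            self.client.finish(final_state)


def test_terminal_state_not_set_when_succeed_raises():
    client = mock.Mock()
    client.succeed.side_effect = RuntimeError("simulated server 5xx")
    supervisor = TinySupervisor(client)
    try:
        supervisor.handle_succeed()
    except RuntimeError:
        pass
    assert supervisor._terminal_state is None
    assert supervisor._terminal_state_synced_to_server is False


def test_finish_called_when_not_synced():
    client = mock.Mock()
    supervisor = TinySupervisor(client)
    supervisor.update_task_state_if_needed("success")
    client.finish.assert_called_once_with("success")


def test_finish_skipped_when_synced():
    client = mock.Mock()
    supervisor = TinySupervisor(client)
    supervisor.handle_succeed()
    supervisor.update_task_state_if_needed("success")
    client.finish.assert_not_called()
```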

Reported by

L3 ASVS sweep — apache/tooling-agents#24 (FINDING-007).


Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (Opus 4.7)

Generated-by: Claude Code (Opus 4.7) following the guidelines

@potiuk potiuk requested review from amoghrajesh, ashb and kaxil as code owners May 7, 2026 22:28
@potiuk potiuk force-pushed the fix/tasksdk-terminal-state-after-api-success branch from 1c22726 to 6e75113 Compare May 8, 2026 00:46
Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py Outdated
@potiuk potiuk force-pushed the fix/tasksdk-terminal-state-after-api-success branch from 6e75113 to 2f7437b Compare May 17, 2026 19:31
@potiuk
Member Author

potiuk commented May 17, 2026

I'd love to get this one merged — and would love it in 3.2.2 if it's not too late. cc @vatsrahul1001 (3.2.2 RM)


Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting

@potiuk potiuk added this to the Airflow 3.2.2 milestone May 17, 2026
@potiuk potiuk added the backport-to-v3-2-test label May 17, 2026
Comment thread task-sdk/tests/task_sdk/execution_time/test_supervisor.py Outdated
@vatsrahul1001
Contributor

@potiuk can you resolve the comments?

Contributor

@amoghrajesh amoghrajesh left a comment


Looks ok, needs update to test

Comment thread task-sdk/tests/task_sdk/execution_time/test_supervisor.py Outdated
potiuk added a commit to potiuk/airflow that referenced this pull request May 19, 2026
…tates

Address review feedback on apache#66574:

- Extract `_send_terminal_state_msg` helper so the per-msg-type dispatch
  for succeed / retry / defer / reschedule lives in one place. Both
  `_handle_request` and `_replay_pending_terminal_state_msg` now go
  through it instead of duplicating the four-branch isinstance chain.
- Parametrize the two recovery tests over all four terminal-state
  message types (was only Succeed + Defer); add UP_FOR_RETRY and
  UP_FOR_RESCHEDULE coverage.
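
A rough sketch of what a single dispatch helper and a table of cases might look like. The message classes below are empty stand-ins that only share their names with the real types, the client is a `Mock`, and `send_terminal_state_msg`, `CASES`, and `check_dispatch` are hypothetical; a plain case table stands in for test parametrization.

```python
from dataclasses import dataclass
from unittest import mock


# Empty stand-ins for the four message types named above (the real ones carry fields).
@dataclass
class SucceedTask:
    pass


@dataclass
class RetryTask:
    pass


@dataclass
class DeferTask:
    pass


@dataclass
class RescheduleTask:
    pass


def send_terminal_state_msg(client, ti_id, msg):
    """Single dispatch point: route a terminal-state message to its client method."""
    if isinstance(msg, SucceedTask):
        client.succeed(ti_id)
    elif isinstance(msg, RetryTask):
        client.retry(ti_id)
    elif isinstance(msg, DeferTask):
        client.defer(ti_id)
    elif isinstance(msg, RescheduleTask):
        client.reschedule(ti_id)
    else:
        raise TypeError(f"not a terminal-state message: {type(msg).__name__}")


# One table of cases covers all four message types.
CASES = [
    (SucceedTask(), "succeed"),
    (RetryTask(), "retry"),
    (DeferTask(), "defer"),
    (RescheduleTask(), "reschedule"),
]


def check_dispatch(msg, expected_method):
    client = mock.Mock()
    send_terminal_state_msg(client, "ti-1", msg)
    getattr(client, expected_method).assert_called_once_with("ti-1")
    # Exactly one client method was invoked.
    assert len(client.method_calls) == 1
```

Keeping the `isinstance` chain in one function means a new message type only has to be added in one place, whichever caller is handling it.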
potiuk and others added 3 commits May 19, 2026 19:14
The supervisor's _handle_request for SucceedTask, RetryTask, DeferTask,
and RescheduleTask set _terminal_state BEFORE calling the matching
client.task_instances.{succeed,retry,defer,reschedule}() API. If that
API call raised (transient network blip, server 5xx, etc.),
_terminal_state was set on the supervisor but the server never saw
the transition. The supervisor's update_task_state_if_needed then
saw final_state in STATES_SENT_DIRECTLY and short-circuited the
recovery finish() call -- leaving the TaskInstance stuck RUNNING
on the server forever, blocking downstream dependencies and
triggering false alerts.

Two-part fix:

1. Make the direct API call FIRST. Only set _terminal_state and the
   new _terminal_state_synced_to_server flag after the call returns
   successfully. If the API raises, both stay unset and the exception
   propagates to handle_requests, where the existing catch-all sends
   an ErrorResponse to the task subprocess.

2. Have update_task_state_if_needed always call finish() when
   _terminal_state_synced_to_server is False, regardless of what
   final_state happens to return. The finish() API takes the state
   value, so a SUCCESS / DEFERRED / etc. transition that originally
   failed is re-attempted via finish() on subprocess exit.
   Pre-existing semantics for the no-direct-API states (FAILED,
   UP_FOR_RETRY without RetryTask, etc.) preserved -- those land in
   the same finish() branch.

Tests added:

- _terminal_state not set when succeed() raises.
- update_task_state_if_needed calls finish() when synced flag is
  False, even with final_state == SUCCESS.
- update_task_state_if_needed skips finish() when synced flag is
  True (preserves the existing happy-path optimisation).

Reported by the L3 ASVS sweep at apache/tooling-agents#24 (FINDING-007).
The _pending_terminal_state_msg field was annotated as BaseModel | None, but _send_terminal_state_msg
expects SucceedTask | RetryTask | DeferTask | RescheduleTask. mypy
couldn't prove the narrowing at the _replay_pending_terminal_state_msg
call site. Tighten the field type to the exact union the setter assigns
and the consumer accepts.
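
The typing issue can be illustrated with a small sketch. The classes are empty stand-ins, and `TerminalStateMsg`, `Holder`, `send`, and `replay` are hypothetical names; the idea is that annotating the field with the exact union lets a `None` check alone narrow it to what the consumer accepts.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class SucceedTask:
    pass


@dataclass
class RetryTask:
    pass


@dataclass
class DeferTask:
    pass


@dataclass
class RescheduleTask:
    pass


# Hypothetical alias for the exact union the consumer accepts.
TerminalStateMsg = Union[SucceedTask, RetryTask, DeferTask, RescheduleTask]


def send(msg: TerminalStateMsg) -> str:
    """Consumer that only accepts the four terminal-state message types."""
    return type(msg).__name__


class Holder:
    def __init__(self) -> None:
        # Annotated with the exact union rather than a broad base class.
        self._pending_terminal_state_msg: Optional[TerminalStateMsg] = None

    def replay(self) -> Optional[str]:
        msg = self._pending_terminal_state_msg
        if msg is None:
            return None
        # After the None check a type checker sees TerminalStateMsg here.
        return send(msg)
```

With a broader annotation such as a common base class, the same `None` check would leave the base class type behind, which the consumer's signature does not accept.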
@vatsrahul1001 vatsrahul1001 merged commit 173c2a1 into apache:main May 19, 2026
113 checks passed
@github-actions
Contributor

Backport failed to create: v3-2-test. View the failure log Run details

Note: As of "Merging PRs targeted for Airflow 3.X", the committer who merges the PR is responsible for backporting PRs that are bug fixes (generally speaking) to the maintenance branches.

If in doubt, please ask in the #release-management Slack channel.

Status Branch Result
v3-2-test Commit Link

You can attempt to backport this manually by running:

cherry_picker 173c2a1 v3-2-test

This should apply the commit to the v3-2-test branch and leave the commit in conflict state marking
the files that need manual conflict resolution.

After you have resolved the conflicts, you can continue the backport process by running:

cherry_picker --continue

If you don't have cherry-picker installed, see the installation guide.

@vatsrahul1001
Contributor

Manual backport for review #67204

vatsrahul1001 added a commit that referenced this pull request May 20, 2026
…ls (#66574) (#67204)

* Recover stuck TIs when direct terminal-state API call fails (#66574)

* Recover stuck TIs when direct terminal-state API call fails

The supervisor's _handle_request for SucceedTask, RetryTask, DeferTask,
and RescheduleTask set _terminal_state BEFORE calling the matching
client.task_instances.{succeed,retry,defer,reschedule}() API. If that
API call raised (transient network blip, server 5xx, etc.),
_terminal_state was set on the supervisor but the server never saw
the transition. The supervisor's update_task_state_if_needed then
saw final_state in STATES_SENT_DIRECTLY and short-circuited the
recovery finish() call -- leaving the TaskInstance stuck RUNNING
on the server forever, blocking downstream dependencies and
triggering false alerts.

Two-part fix:

1. Make the direct API call FIRST. Only set _terminal_state and the
   new _terminal_state_synced_to_server flag after the call returns
   successfully. If the API raises, both stay unset and the exception
   propagates to handle_requests, where the existing catch-all sends
   an ErrorResponse to the task subprocess.

2. Have update_task_state_if_needed always call finish() when
   _terminal_state_synced_to_server is False, regardless of what
   final_state happens to return. The finish() API takes the state
   value, so a SUCCESS / DEFERRED / etc. transition that originally
   failed is re-attempted via finish() on subprocess exit.
   Pre-existing semantics for the no-direct-API states (FAILED,
   UP_FOR_RETRY without RetryTask, etc.) preserved -- those land in
   the same finish() branch.

Tests added:

- _terminal_state not set when succeed() raises.
- update_task_state_if_needed calls finish() when synced flag is
  False, even with final_state == SUCCESS.
- update_task_state_if_needed skips finish() when synced flag is
  True (preserves the existing happy-path optimisation).

Reported by the L3 ASVS sweep at apache/tooling-agents#24 (FINDING-007).

* Refactor terminal-state dispatch and parametrize tests across all 4 states

Address review feedback on #66574:

- Extract `_send_terminal_state_msg` helper so the per-msg-type dispatch
  for succeed / retry / defer / reschedule lives in one place. Both
  `_handle_request` and `_replay_pending_terminal_state_msg` now go
  through it instead of duplicating the four-branch isinstance chain.
- Parametrize the two recovery tests over all four terminal-state
  message types (was only Succeed + Defer); add UP_FOR_RETRY and
  UP_FOR_RESCHEDULE coverage.

* Narrow _pending_terminal_state_msg type to satisfy mypy

The field was annotated as BaseModel | None, but _send_terminal_state_msg
expects SucceedTask | RetryTask | DeferTask | RescheduleTask. mypy
couldn't prove the narrowing at the _replay_pending_terminal_state_msg
call site. Tighten the field type to the exact union the setter assigns
and the consumer accepts.

---------

Co-authored-by: vatsrahul1001 <rah.sharma11@gmail.com>
Co-authored-by: Rahul Vats <43964496+vatsrahul1001@users.noreply.github.com>
(cherry picked from commit 173c2a1)

* Don't pass retry_delay_seconds/retry_reason to retry() — not in v3-2-test signature

The cherry-picked _send_terminal_state_msg dispatcher passed
retry_delay_seconds and retry_reason kwargs (via getattr defensive
fallback) to client.task_instances.retry(). The v3-2-test version of
retry() in task-sdk/src/airflow/sdk/api/client.py only accepts
(id, end_date, rendered_map_index) — those kwargs don't exist on this
branch yet.

In mock-based unit tests the extra kwargs were silently accepted by
Mock but tripped assert_called_once_with. In real DB tests (Postgres
test_task_instance_history_is_created_when_ti_goes_for_retry,
MySQL/SQLite equivalents) the retry() call raised TypeError, the API
server never received the retry transition, and TaskInstanceHistory
never got created — the test's UUID-rotation assertion failed.

---------

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
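
The difference between mock-based and real-call behaviour described in the last commit of the backport can be reproduced in isolation. In this sketch `retry` is a hypothetical function with the narrower parameter list mentioned above, not the client method itself; `Mock` and `create_autospec` are standard `unittest.mock` APIs.

```python
from unittest import mock


def retry(id, end_date, rendered_map_index=None):
    """Hypothetical stand-in with the narrower signature described above."""
    return (id, end_date, rendered_map_index)


# A plain Mock accepts any keyword arguments, so the mismatch is invisible at call time.
loose = mock.Mock()
loose(id="ti-1", end_date="2026-05-20", retry_reason="boom")

# A mock built from the function enforces its signature, like the real call would.
strict = mock.create_autospec(retry)
try:
    strict(id="ti-1", end_date="2026-05-20", retry_reason="boom")
except TypeError as exc:
    strict_error = exc
else:
    strict_error = None
```

Speccing mocks from the real callable is one way to surface this kind of signature drift in unit tests, before it reaches tests that use a real database.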
vatsrahul1001 added a commit that referenced this pull request May 21, 2026
…ls (#66574) (#67204)


Labels

area:task-sdk, backport-to-v3-2-test


5 participants