Tighten on_task_instance_failed error type to BaseException | None #66399
Draft
1fanwang wants to merge 3 commits into apache:main
The ``error`` arg on ``on_task_instance_failed`` was typed ``None | str | BaseException``. The string variant only ever appeared on the manual-set FAILED state path on the API server, where the call site passed a hard-coded human-readable message. Listener implementations had to ``isinstance(error, str)`` to detect that path even though ``str(error)`` worked uniformly across both branches. This change wraps the manual-set message in a ``RuntimeError`` at the call site, tightens the hookspec type to ``BaseException | None``, and updates the example listener and test fixture accordingly. Listeners now always receive an exception type with ``str(error)`` carrying the message; the ``msg`` arg added in apache#66394 (``msg == "manually_set_to_failed"``) remains the canonical signal for "this came from the API path". Backwards compatibility: listeners relying on ``isinstance(error, str)`` will need to read the message via ``str(error)`` and route on ``msg``.
This was referenced May 5, 2026
Description

The ``error`` argument on ``on_task_instance_failed`` was typed ``None | str | BaseException``. The string variant only ever appeared on the manual-set FAILED state path on the API server, where the call site passed a hard-coded human-readable message. Listener implementations had to check ``isinstance(error, str)`` to detect that path even though ``str(error)`` worked uniformly across both branches.

This PR:

- Wraps the manual-set message in a ``RuntimeError`` at the API call site in ``_emit_state_listener_hooks``.
- Tightens the ``on_task_instance_failed`` hookspec type to ``BaseException | None``.
- Updates the ``ClassBasedListener`` test fixture to use the new type, and exposes ``last_error`` on the fixture.
- Extends the ``test_patch_task_instance_notifies_listeners`` test: when the new state is ``failed``, asserts the listener received a ``RuntimeError`` whose ``str()`` carries the human-readable manual-set message.

Why now

Listeners now have a uniform contract: ``error`` is always either an exception or ``None``. The ``msg`` arg added in #66394 already lets a listener distinguish manual-set from worker paths (``msg == "manually_set_to_failed"``); the string ``error`` was a second, redundant signal carrying the same intent in lossier form.
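A minimal sketch of the uniform contract, assuming a simplified call site. The helper ``emit_manual_failure``, the ``RecordingListener`` class, and the message text are hypothetical stand-ins, not Airflow's actual code; only the ``RuntimeError`` wrapping and the ``msg`` value come from this PR and #66394.

```python
from typing import Optional

# Hypothetical message text; the real string lives at the API call site.
MANUAL_FAILED_MESSAGE = "task instance was manually set to failed"


class RecordingListener:
    """Hypothetical listener that remembers the last failure it was told about."""

    def __init__(self) -> None:
        self.last_error: Optional[BaseException] = None
        self.last_msg: Optional[str] = None

    def on_task_instance_failed(
        self,
        previous_state: Optional[str],
        task_instance: object,
        error: Optional[BaseException],
        msg: Optional[str] = None,
    ) -> None:
        # Record what the hook received so it can be inspected afterwards.
        self.last_error = error
        self.last_msg = msg


def emit_manual_failure(listener: RecordingListener, task_instance: object) -> None:
    """Stand-in for the call site: wrap the message instead of passing a bare str."""
    listener.on_task_instance_failed(
        previous_state="success",
        task_instance=task_instance,
        error=RuntimeError(MANUAL_FAILED_MESSAGE),
        msg="manually_set_to_failed",
    )


listener = RecordingListener()
emit_manual_failure(listener, task_instance=object())
```

After the call, ``listener.last_error`` is an exception instance and ``str(listener.last_error)`` yields the message, so the listener needs no ``str`` special case.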
Backwards compatibility

This is a typed-API change. Listener implementations that rely on ``isinstance(error, str)`` to detect the manual-set path will need to read the message via ``str(error)`` and route on ``msg`` (or ``isinstance(error, RuntimeError)``).

The existing ``msg`` keyword argument is the recommended dispatch axis going forward: it doesn't depend on inspecting ``error`` at all.
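On the listener side, the migration could look roughly like the sketch below. ``describe_failure`` is a hypothetical helper, not part of Airflow; it assumes the hook passes ``msg`` as described in #66394 and treats any value other than ``"manually_set_to_failed"`` as a worker-side failure.

```python
from typing import Optional


def describe_failure(error: Optional[BaseException], msg: Optional[str] = None) -> str:
    """Hypothetical helper a listener could call from on_task_instance_failed."""
    # Route on msg rather than on the type of error.
    origin = "api" if msg == "manually_set_to_failed" else "worker"
    # str(error) works for any exception; None means no error was supplied.
    detail = str(error) if error is not None else "no error supplied"
    return f"[{origin}] {detail}"
```

A listener that previously branched on ``isinstance(error, str)`` would call a helper like this with the ``error`` and ``msg`` it receives, and never inspect the type of ``error``.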
E2E validation

Listener authors who previously did ``isinstance(error, str)`` to detect the manual-set path now read ``str(error)`` for the message and route on ``msg == "manually_set_to_failed"`` (introduced in #66394).

Real e2e validation (Airflow standalone)
Re-ran with ``airflow standalone`` against the worktree's editable install. Triggered a successful task run, then PATCHed its state to ``failed`` via the API server's ``/api/v2/dags/.../taskInstances/{task}`` endpoint to exercise ``_emit_state_listener_hooks``. A recording listener captured the type of the ``error`` arg, and the result was compared to the same path on PR #66394 (without this PR applied).
Same human-readable message, but the listener now receives a ``RuntimeError`` instance, so ``isinstance(error, BaseException)`` works uniformly across worker-side and API-driven failure paths. Listener authors no longer need an ``isinstance(error, str)`` check to detect the manual-set path.

Integrated mega-branch validation (all 7 PRs composed)
This PR was independently validated, plus all seven PRs in this stack (#66394, #66395, #66397, #66399, #66402, #66405, #66410) were merged onto a single branch and exercised end-to-end through real services: ``airflow standalone`` running scheduler + API server + LocalExecutor + Postgres-equivalent (sqlite for the test). A single listener plugin declaring every new hook and parameter was registered, then 5 DAGs covering every state-transition path were triggered and a manual-set-state PATCH was issued via the public API. In the listener log, every annotation maps a line to the PR that introduced it.

What this validates jointly:

- ``msg=...`` (``started``, ``success``, ``failed``, ``skipped``, ``up_for_retry``, ``manually_set_to_failed``)
- ``error: BaseException | None``: ``RuntimeError`` (was ``str`` on PR-A alone)
- ``AirflowTaskCheckpointed``: ``running → checkpointed`` transition observed at the listener and at the supervisor message boundary
- ``failure_details`` kwarg: ``failure_details=None`` flowing through every failure (no executor populates yet)
- ``checkpointed task=checkpoint_task checkpoint_data={'step': 5, ...}``

Repro
Bugs surfaced and fixed during this validation
This step caught 6 bugs that the layer-2 unit-test pass missed — every fix is a separate commit on its respective PR's branch:
- Missing ``AirflowTaskCheckpointed`` import in ``run()`` (``NameError``)
- ``_generated.py`` ``TaskInstanceState`` missing CHECKPOINTED (``AttributeError``)
- ``TaskState`` supervisor message Literal rejected CHECKPOINTED (Pydantic ``ValidationError``)
- ``_generated.py`` ``IntermediateTIState`` missing CHECKPOINTED
- ``failure_details=None`` default silently received ``None``
- ``on_task_instance_failed`` call site missing ``failure_details`` kwarg (``HookCallError: hook call must provide argument``)

The last two would have broken every task failure on apache/airflow ``main`` if the foundation PRs landed without the call-site fixes. The standalone-against-editable-install harness is a fast catch for this class.

Documented gap (deliberately not fixed in this stack)
``task-sdk/.../supervisor.py:STATES_SENT_DIRECTLY`` lists the states the worker sends to the supervisor with a dedicated direct-send branch. CHECKPOINTED is not in that list, so it falls back to ``client.task_instances.finish()``, which the API server constrains to terminal states. The mega listener log shows the worker successfully logging ``Task checkpointed; reporting CHECKPOINTED state.`` and ``on_task_instance_checkpointed`` firing with the correct payload, but the DB row eventually transitions to ``failed`` because the supervisor cannot persist CHECKPOINTED through ``finish()``. This is the AIP-96 design knob (auto-resume vs manual-resume-only) we deliberately want the discussion to settle, not silently pick. Documented in #66402.
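The fallback described above can be pictured with a toy model. Everything in it is an assumption made for illustration: the state names are plain strings, the set memberships are invented rather than copied from ``supervisor.py`` or the API server, and the step that leaves the row as ``failed`` is a simplification of whatever the real supervisor does when ``finish()`` rejects a state.

```python
# Toy model only: state names are plain strings and set membership is
# illustrative, not copied from supervisor.py or the API server.
STATES_SENT_DIRECTLY = {"success", "skipped", "up_for_retry"}
TERMINAL_STATES = {"success", "failed", "skipped"}


def finish(state: str) -> str:
    """Stand-in for client.task_instances.finish(): terminal states only."""
    if state not in TERMINAL_STATES:
        raise ValueError(f"finish() cannot persist non-terminal state {state!r}")
    return state


def persist_final_state(state: str) -> str:
    """Return the state the toy DB row ends up in."""
    if state in STATES_SENT_DIRECTLY:
        # Dedicated direct-send branch: the state is stored as reported.
        return state
    try:
        # Fallback path for every state without a direct-send branch.
        return finish(state)
    except ValueError:
        # Assumption: a rejected state leaves the row as failed.
        return "failed"
```

In this model ``persist_final_state("checkpointed")`` ends in ``failed`` because ``"checkpointed"`` has no direct-send branch and is not terminal, which mirrors the gap the stack leaves open.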