Honor AirflowFailException raised inside on_retry_callback by 1fanwang · Pull Request #66781 · apache/airflow

1fanwang · 2026-05-12T15:20:35Z

_run_task_state_change_callbacks in task-sdk/src/airflow/sdk/execution_time/task_runner.py catches every exception from a callback and logs it. That's the right default for noisy cleanup work, but it also swallows the explicit AirflowFailException signal: a user raising it inside on_retry_callback to say "fail without retrying" has no way to actually fail the task. The state stays UP_FOR_RETRY and another attempt is scheduled.

The fix narrows the catch in the retry-callback path:

- AirflowFailException is re-raised so the caller can react.
- Every other Exception is still logged and swallowed. This keeps the behavior Wei asked for on the prior attempt (#64198, "Fix AirflowFailException in on_retry_callback not preventing retries"): cleanup callbacks that may fail must not change the task outcome.
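The narrowed catch can be sketched roughly as follows. This is a minimal illustration, not the code in task_runner.py: run_retry_callbacks is a hypothetical name, and AirflowFailException is defined locally as a stand-in so the snippet runs without Airflow installed.

```python
import logging

log = logging.getLogger(__name__)


class AirflowFailException(Exception):
    """Local stand-in for Airflow's AirflowFailException (assumption for this sketch)."""


def run_retry_callbacks(callbacks, context):
    """Run each callback; re-raise AirflowFailException, log and swallow anything else."""
    for callback in callbacks:
        try:
            callback(context)
        except AirflowFailException:
            # Explicit "fail without retrying" signal: let the caller react.
            raise
        except Exception:
            # A failing cleanup callback must not change the task outcome.
            log.exception("Failed to run retry callback %r", callback)
```

The order of the except clauses matters: the specific clause has to come first, otherwise the generic `except Exception` would catch the signal again.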

run() now defers the supervisor RetryTask message until after the retry callback runs. finalize() gained an optional msg parameter: when the retry callback raises AirflowFailException, finalize promotes the state to FAILED, replaces the pending RetryTask with TaskState(FAILED), and runs the failure-path finalizers (on_failure_callback, listener hook, email_on_failure).
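To make the deferred-message flow concrete, here is a small sketch under stated assumptions. The classes below are local stand-ins for the real RetryTask, TaskState and TaskInstanceState types, and the finalize signature (with on_retry, on_failure and send parameters) is hypothetical; the actual function in the PR takes different arguments.

```python
import logging
from dataclasses import dataclass
from enum import Enum

log = logging.getLogger(__name__)


class AirflowFailException(Exception):
    """Local stand-in for Airflow's AirflowFailException (assumption)."""


class TaskInstanceState(str, Enum):
    UP_FOR_RETRY = "up_for_retry"
    FAILED = "failed"


@dataclass
class RetryTask:
    """Stand-in for the supervisor message that schedules another attempt."""


@dataclass
class TaskState:
    """Stand-in for the supervisor message that records a terminal state."""
    state: TaskInstanceState


def finalize(state, msg, on_retry, on_failure, send):
    """Run the callbacks for `state`, then send exactly one message to the supervisor."""
    if state is TaskInstanceState.UP_FOR_RETRY:
        try:
            on_retry()
        except AirflowFailException:
            # Promote to FAILED and swap the pending RetryTask for a terminal message.
            state = TaskInstanceState.FAILED
            msg = TaskState(TaskInstanceState.FAILED)
            on_failure()
        except Exception:
            # Any other callback error is logged and the retry still goes ahead.
            log.exception("error in retry callback")
    elif state is TaskInstanceState.FAILED:
        on_failure()
    if msg is not None:
        send(msg)
    return state
```

Because the message is only sent at the end of finalize, the supervisor never sees a RetryTask for an attempt the callback has vetoed.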

Two prior attempts (#60415, closed 2026-03-05, and #64198, closed 2026-05-05) tried to solve this. This PR picks up the design shaped by Wei's review feedback and ships it with a regression test.

Tests

task-sdk/tests/task_sdk/execution_time/test_task_runner.py::TestTaskRunnerCallsCallbacks::test_airflow_fail_exception_in_on_retry_callback_fails_task exercises the full path: the retry callback raises AirflowFailException, the failure callback runs, and the supervisor receives TaskState(FAILED) instead of RetryTask. The test fails on main and passes with this PR. Existing callback tests in the file are unchanged in behavior (a generic Exception from a retry callback is still swallowed and the retry is still scheduled).

Reproducer

Reverting just the production changes (task-sdk/src/airflow/sdk/execution_time/task_runner.py and supervisor.py) back to upstream/main while keeping the new test surfaces the bug directly. With the call site adapted to the older finalize() signature (no msg kwarg), runtime_ti.state lands on UP_FOR_RETRY instead of FAILED:

task-sdk/tests/task_sdk/execution_time/test_task_runner.py::TestTaskRunnerCallsCallbacks::test_airflow_fail_exception_in_on_retry_callback_fails_task FAILED [100%]

=================================== FAILURES ===================================
_ TestTaskRunnerCallsCallbacks.test_airflow_fail_exception_in_on_retry_callback_fails_task _
task-sdk/tests/task_sdk/execution_time/test_task_runner.py:4542: in test_airflow_fail_exception_in_on_retry_callback_fails_task
    assert runtime_ti.state == TaskInstanceState.FAILED
E   AssertionError: assert <TaskInstanceState.UP_FOR_RETRY: 'up_for_retry'> == <TaskInstanceState.FAILED: 'failed'>
E
E     - failed
E     + up_for_retry
========================= 1 failed, 1 warning in 9.30s =========================

Restoring the production code (the full PR diff) flips the same test to PASSED:

task-sdk/tests/task_sdk/execution_time/test_task_runner.py::TestTaskRunnerCallsCallbacks::test_airflow_fail_exception_in_on_retry_callback_fails_task PASSED [100%]

========================= 1 passed, 1 warning in 1.72s =========================

The UP_FOR_RETRY line is the literal symptom from #60172 — the user raised AirflowFailException inside on_retry_callback and the task still ended up scheduled for another attempt. With the fix, the same callback raise promotes the task to FAILED and the failure-path finalizers fire under that state.
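For context, a user-side callback that relies on this behavior might look like the sketch below. The callback name and the choice of PermissionError as a "not worth retrying" error are purely illustrative assumptions; the import falls back to a local stand-in so the snippet also runs where Airflow is not installed.

```python
try:
    from airflow.exceptions import AirflowFailException
except ImportError:
    class AirflowFailException(Exception):
        """Local stand-in used only when Airflow is not installed."""


def fail_fast_on_auth_error(context):
    """Hypothetical on_retry_callback: skip remaining retries for non-transient errors."""
    exc = context.get("exception")
    if isinstance(exc, PermissionError):
        # With this PR, raising here promotes the task to FAILED instead of retrying.
        raise AirflowFailException("credentials rejected; retrying will not help") from exc
```

Such a function would be passed as on_retry_callback on a task; for transient errors it returns normally and the retry proceeds as before.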

SameerMesiah97

Approach looks fine to me but someone more familiar with this area should weigh in. I have left some comments.

SameerMesiah97 · 2026-05-12T21:33:47Z

+        # For UP_FOR_RETRY, defer sending the message until after on_retry_callback has run
+        # (finalize() sends it). This lets an AirflowFailException raised inside the callback
+        # promote the state to FAILED instead of letting the supervisor record a retry that
+        # the user explicitly asked to skip. See #60172.


nit: this could be clearer:

    # Delay reporting UP_FOR_RETRY to the supervisor until after
    # on_retry_callback runs so AirflowFailException can promote
    # the task to FAILED and suppress the retry.

I don't think the issue reference is needed. But this is a more subjective point.

1fanwang · 2026-05-13

Done in 193d2d6: applied the shorter wording and dropped the issue reference.

SameerMesiah97 · 2026-05-12T21:47:49Z

+            except Exception:
+                log.exception("error calling listener")
+            if error and task.email_on_retry and task.email:
+                _send_error_email_notification(task, ti, context, error, log)


Right now, this is a bit hard to follow. I would recommend extracting the listener and email notification into a helper like this:

    def _handle_failure_notifications(
        *,
        task,
        ti,
        context,
        error,
        log,
        send_email: bool,
    ) -> None:
        try:
            get_listener_manager().hook.on_task_instance_failed(
                previous_state=TaskInstanceState.RUNNING,
                task_instance=ti,
                error=error,
            )
        except Exception:
            log.exception("error calling listener")
        if send_email and task.email:
            _send_error_email_notification(task, ti, context, error, log)

Then plug _handle_failure_notifications in so that lines 1960-1976 are replaced by this:

        _handle_failure_notifications(
            task=task,
            ti=ti,
            context=context,
            error=error,
            log=log,
            send_email=task.email_on_failure,
        )
    else:
        _handle_failure_notifications(
            task=task,
            ti=ti,
            context=context,
            error=error,
            log=log,
            send_email=bool(error and task.email_on_retry),
        )

1fanwang · 2026-05-13

Done in 193d2d6: added _handle_failure_notifications and applied it across all three failure-path branches in finalize (AirflowFailException, retry, and FAILED) so the listener + email pattern lives in only one place.

SameerMesiah97 · 2026-05-12T21:49:08Z

+        """
+        AirflowFailException raised in on_retry_callback should fail the task without retrying.
+
+        Regression test for #60172.


Remove this.

1fanwang · 2026-05-13

Done in f59182e: dropped the docstring; the test name already describes the behavior.

SameerMesiah97 · 2026-05-12T21:53:27Z

+        # Both callbacks should have run (retry callback first, then failure callback after
+        # AirflowFailException promoted the state to FAILED).
+        assert len(retry_callback_calls) == 1
+        assert len(failure_callback_calls) == 1


Maybe it would be better to assert the state transition here instead of the number of calls:

    assert retry_callback_calls == [TaskInstanceState.UP_FOR_RETRY]
    assert failure_callback_calls == [TaskInstanceState.FAILED]

1fanwang · 2026-05-13

Done in f59182e: retry_callback_calls == [UP_FOR_RETRY] and failure_callback_calls == [FAILED] pin both the count and the state at callback time.
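The idea behind asserting on recorded states rather than call counts can be shown with a toy example. Everything here is a simplified assumption: FakeTaskInstance and make_recorder are hypothetical helpers, and the real test drives the task runner instead of flipping the state by hand.

```python
from enum import Enum


class TaskInstanceState(str, Enum):
    """Stand-in for Airflow's TaskInstanceState (only the two members needed here)."""
    UP_FOR_RETRY = "up_for_retry"
    FAILED = "failed"


class FakeTaskInstance:
    """Hypothetical minimal task instance that only tracks its state."""

    def __init__(self):
        self.state = TaskInstanceState.UP_FOR_RETRY


def make_recorder(calls, ti):
    """Return a callback that records ti.state at the moment it is invoked."""
    def callback(context):
        calls.append(ti.state)
    return callback


ti = FakeTaskInstance()
retry_callback_calls = []
failure_callback_calls = []
on_retry = make_recorder(retry_callback_calls, ti)
on_failure = make_recorder(failure_callback_calls, ti)

# Simulate the fixed flow: retry callback, promotion to FAILED, failure callback.
on_retry({})
ti.state = TaskInstanceState.FAILED
on_failure({})

# List equality checks both how many times each callback ran and what it saw.
assert retry_callback_calls == [TaskInstanceState.UP_FOR_RETRY]
assert failure_callback_calls == [TaskInstanceState.FAILED]
```

A count-only assertion would still pass if the failure callback ran before the state was promoted; the list comparison would not.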

Raising AirflowFailException is the documented way to fail a task without retrying. Until now that signal was silently swallowed when raised from on_retry_callback: the catch-all `except Exception` inside `_run_task_state_change_callbacks` ate it, the task stayed UP_FOR_RETRY, and another attempt was scheduled.

The retry path now defers sending its terminal message until after on_retry_callback has run. If the callback raises AirflowFailException, the state is promoted to FAILED, the pending RetryTask is replaced with TaskState(FAILED), and the failure-path finalizers (on_failure_callback, listener.on_task_instance_failed, email_on_failure) run as if the task had failed without ever attempting a retry.

Other exceptions from on_retry_callback are still logged and swallowed, so callbacks that optimistically clean up partial data continue to work unchanged.

closes: apache#60172

…mment

- Rename newsfragment to match PR number (60172 -> 66781).
- Tighten the comment in `run()` explaining why UP_FOR_RETRY messages are deferred until after on_retry_callback runs.
- Extract `_handle_failure_notifications` for the listener + email pattern repeated across the three failure-path branches in `finalize` (AirflowFailException, retry, and FAILED).

Signed-off-by: 1fanwang <1fannnw@gmail.com>

…transitions

- Drop the docstring on `test_airflow_fail_exception_in_on_retry_callback_fails_task`; the test name already describes the assertion, and the project style guide prefers no docstrings in tests that merely repeat the function name.
- Assert the recorded callback states directly (`retry_callback_calls == [UP_FOR_RETRY]`, `failure_callback_calls == [FAILED]`) instead of the call counts. Pins both the count AND the state at the moment each callback ran -- proving on_retry_callback fired while the task was UP_FOR_RETRY and on_failure_callback fired after AirflowFailException promoted it to FAILED.

Signed-off-by: 1fanwang <1fannnw@gmail.com>

1fanwang requested review from amoghrajesh, ashb and kaxil as code owners May 12, 2026 15:20

boring-cyborg Bot added the area:task-sdk label May 12, 2026

SameerMesiah97 reviewed May 12, 2026


1fanwang force-pushed the fix/airflow-fail-in-retry-callback branch from f59182e to 8f538c8 Compare May 13, 2026 21:25

potiuk added the ready for maintainer review Set after triaging when all criteria pass. label May 19, 2026

1fanwang added 3 commits May 19, 2026 11:52

1fanwang force-pushed the fix/airflow-fail-in-retry-callback branch from 8f538c8 to c47b237 Compare May 19, 2026 18:53

1fanwang wants to merge 3 commits into apache:main from 1fanwang:fix/airflow-fail-in-retry-callback
