Honor AirflowFailException raised inside on_retry_callback #66781

Open
1fanwang wants to merge 3 commits into apache:main from 1fanwang:fix/airflow-fail-in-retry-callback

@1fanwang 1fanwang commented May 12, 2026

Closes #60172.

_run_task_state_change_callbacks in task-sdk/src/airflow/sdk/execution_time/task_runner.py catches every exception from a callback and logs it. That's the right default for noisy cleanup work, but it also swallows the explicit AirflowFailException signal — a user raising it inside on_retry_callback to say "fail without retrying" had no way to actually fail the task. The state stayed UP_FOR_RETRY and another attempt was scheduled.
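
To make the symptom concrete, here is a minimal sketch of a catch-all callback runner. It is not the real task-runner code: `run_callbacks_swallowing_all` is a hypothetical simplification, and `AirflowFailException` is a local stand-in class so the snippet runs without Airflow installed.

```python
import logging

log = logging.getLogger(__name__)


class AirflowFailException(Exception):
    """Local stand-in for the Airflow exception of the same name."""


def run_callbacks_swallowing_all(callbacks, context):
    """Hypothetical simplification: every callback error is logged and dropped."""
    for callback in callbacks:
        try:
            callback(context)
        except Exception:
            # AirflowFailException is an Exception subclass, so it lands here too.
            log.exception("callback raised")


def on_retry(context):
    # The user's intent: fail now, do not schedule another attempt.
    raise AirflowFailException("give up")


state = "up_for_retry"
run_callbacks_swallowing_all([on_retry], {"state": state})
# Nothing propagated to the caller, so it has no signal to change the state.
```

After the call, `state` is still `"up_for_retry"`, which is the behaviour reported in the issue.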

The fix narrows the catch in the retry-callback path:

run() now defers the supervisor RetryTask message until after the retry callback runs. finalize() gained an optional msg parameter: when the retry callback raises AirflowFailException, finalize promotes the state to FAILED, replaces the pending RetryTask with TaskState(FAILED), and runs the failure-path finalizers (on_failure_callback, listener hook, email_on_failure).
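
The control flow described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in the diff: `finalize_sketch`, `Message`, and `State` are hypothetical stand-ins, and the ordering of the send relative to the failure callback is assumed here rather than taken from the PR.

```python
from dataclasses import dataclass
from enum import Enum


class State(str, Enum):
    UP_FOR_RETRY = "up_for_retry"
    FAILED = "failed"


class AirflowFailException(Exception):
    """Local stand-in for the Airflow exception of the same name."""


@dataclass
class Message:
    """Stand-in for the RetryTask / TaskState messages sent to the supervisor."""

    kind: str
    state: State


def finalize_sketch(state, pending_msg, on_retry_callback, on_failure_callback, send):
    if state is State.UP_FOR_RETRY:
        try:
            on_retry_callback(state)
        except AirflowFailException:
            # Promote the state and swap the deferred message.
            state = State.FAILED
            pending_msg = Message("TaskState", State.FAILED)
        except Exception:
            pass  # other callback errors stay swallowed (the real code logs them)
    if pending_msg is not None:
        send(pending_msg)  # the deferred message goes out only now
    if state is State.FAILED:
        on_failure_callback(state)
    return state


def retry_cb(state):
    raise AirflowFailException("do not retry")


sent = []
failure_states = []
final = finalize_sketch(
    State.UP_FOR_RETRY,
    Message("RetryTask", State.UP_FOR_RETRY),
    retry_cb,
    failure_states.append,
    sent.append,
)
```

In this sketch the supervisor stand-in receives a single `TaskState`-style message carrying `FAILED`, and the failure callback observes the promoted state. A retry callback that raises any other exception would leave the pending `RetryTask`-style message untouched.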

Two prior attempts (#60415 closed 2026-03-05, #64198 closed 2026-05-05) tried to solve this; this PR picks up the Wei-shaped design and ships it with a regression test.

Tests

task-sdk/tests/task_sdk/execution_time/test_task_runner.py::TestTaskRunnerCallsCallbacks::test_airflow_fail_exception_in_on_retry_callback_fails_task exercises the full path: the retry callback raises AirflowFailException, the failure callback runs, and the supervisor receives TaskState(FAILED) instead of RetryTask. The test fails on main and passes with this PR; existing callback tests in the file behave unchanged (a generic Exception from a retry callback is still swallowed and the retry is still scheduled).

Reproducer

Reverting just the production changes (task-sdk/src/airflow/sdk/execution_time/task_runner.py and supervisor.py) back to upstream/main while keeping the new test surfaces the bug directly. With the call site adapted to the older finalize() signature (no msg kwarg), runtime_ti.state lands on UP_FOR_RETRY instead of FAILED:

task-sdk/tests/task_sdk/execution_time/test_task_runner.py::TestTaskRunnerCallsCallbacks::test_airflow_fail_exception_in_on_retry_callback_fails_task FAILED [100%]

=================================== FAILURES ===================================
_ TestTaskRunnerCallsCallbacks.test_airflow_fail_exception_in_on_retry_callback_fails_task _
task-sdk/tests/task_sdk/execution_time/test_task_runner.py:4542: in test_airflow_fail_exception_in_on_retry_callback_fails_task
    assert runtime_ti.state == TaskInstanceState.FAILED
E   AssertionError: assert <TaskInstanceState.UP_FOR_RETRY: 'up_for_retry'> == <TaskInstanceState.FAILED: 'failed'>
E
E     - failed
E     + up_for_retry
========================= 1 failed, 1 warning in 9.30s =========================

Restoring the production code (the full PR diff) flips the same test to PASSED:

task-sdk/tests/task_sdk/execution_time/test_task_runner.py::TestTaskRunnerCallsCallbacks::test_airflow_fail_exception_in_on_retry_callback_fails_task PASSED [100%]

========================= 1 passed, 1 warning in 1.72s =========================

The UP_FOR_RETRY line is the literal symptom from #60172 — the user raised AirflowFailException inside on_retry_callback and the task still ended up scheduled for another attempt. With the fix, the same callback raise promotes the task to FAILED and the failure-path finalizers fire under that state.

@SameerMesiah97 SameerMesiah97 left a comment

Approach looks fine to me but someone more familiar with this area should weigh in. I have left some comments.

Comment thread airflow-core/newsfragments/66781.bugfix.rst
# For UP_FOR_RETRY, defer sending the message until after on_retry_callback has run
# (finalize() sends it). This lets an AirflowFailException raised inside the callback
# promote the state to FAILED instead of letting the supervisor record a retry that
# the user explicitly asked to skip. See #60172.
Contributor

nit: this could be clearer:


# Delay reporting UP_FOR_RETRY to the supervisor until after
# on_retry_callback runs so AirflowFailException can promote
# the task to FAILED and suppress the retry.

I don't think the issue reference is needed. But this is a more subjective point.

Contributor Author

Done in 193d2d6 — applied the shorter wording and dropped the issue reference.

except Exception:
    log.exception("error calling listener")
if error and task.email_on_retry and task.email:
    _send_error_email_notification(task, ti, context, error, log)
Contributor

Right now, this is a bit hard to follow. I would recommend extracting the listener and email notification into a helper like this:

def _handle_failure_notifications(
    *,
    task,
    ti,
    context,
    error,
    log,
    send_email: bool,
) -> None:
    try:
        get_listener_manager().hook.on_task_instance_failed(
            previous_state=TaskInstanceState.RUNNING,
            task_instance=ti,
            error=error,
        )
    except Exception:
        log.exception("error calling listener")

    if send_email and task.email:
        _send_error_email_notification(task, ti, context, error, log)

Then plug _handle_failure_notifications in so that lines 1960-1976 are replaced by this:

        _handle_failure_notifications(
            task=task,
            ti=ti,
            context=context,
            error=error,
            log=log,
            send_email=task.email_on_failure,
        )

    else:
        _handle_failure_notifications(
            task=task,
            ti=ti,
            context=context,
            error=error,
            log=log,
            send_email=bool(error and task.email_on_retry),
        )

Contributor Author

Done in 193d2d6 — added _handle_failure_notifications and applied it across all three failure-path branches in finalize (AirflowFailException, retry, and FAILED) so the listener+email pattern only lives in one place.

Comment on lines +4489 to +4516
"""
AirflowFailException raised in on_retry_callback should fail the task without retrying.

Regression test for #60172.
Contributor

Remove this.

Contributor Author

Done in f59182e — dropped the docstring; the test name already describes the behavior.

# Both callbacks should have run (retry callback first, then failure callback after
# AirflowFailException promoted the state to FAILED).
assert len(retry_callback_calls) == 1
assert len(failure_callback_calls) == 1
Contributor

Maybe it would be better to assert the state transition here instead of the number of calls:

assert retry_callback_calls == [TaskInstanceState.UP_FOR_RETRY]
assert failure_callback_calls == [TaskInstanceState.FAILED]

Contributor Author

Done in f59182e: retry_callback_calls == [UP_FOR_RETRY] and failure_callback_calls == [FAILED] pin both the count and the state at callback time.
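
A small illustration of why the state-list assertion is stricter than the count assertion. The callbacks and the "buggy" ordering below are hypothetical and use plain strings in place of TaskInstanceState; this is not the test in the PR.

```python
retry_callback_calls = []
failure_callback_calls = []


def on_retry(state):
    retry_callback_calls.append(state)


def on_failure(state):
    failure_callback_calls.append(state)


# Simulate a buggy run in which the failure callback fires before the
# state has been promoted to "failed".
on_retry("up_for_retry")
on_failure("up_for_retry")

# The count-based check cannot see the problem...
count_check_passes = (
    len(retry_callback_calls) == 1 and len(failure_callback_calls) == 1
)
# ...while the state-based check does.
state_check_passes = (
    retry_callback_calls == ["up_for_retry"]
    and failure_callback_calls == ["failed"]
)
```

Here `count_check_passes` is true and `state_check_passes` is false, so only the second form would catch a callback that ran under the wrong state.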

1fanwang force-pushed the fix/airflow-fail-in-retry-callback branch from f59182e to 8f538c8, May 13, 2026 21:25
potiuk added the ready for maintainer review label, May 19, 2026
1fanwang added 3 commits, May 19, 2026 11:52
Raising AirflowFailException is the documented way to fail a task without
retrying. Until now that signal was silently swallowed when raised from
on_retry_callback: the catch-all `except Exception` inside
`_run_task_state_change_callbacks` ate it, the task stayed UP_FOR_RETRY,
and another attempt was scheduled.

The retry path now defers sending its terminal message until after
on_retry_callback has run. If the callback raises AirflowFailException,
the state is promoted to FAILED, the pending RetryTask is replaced with
TaskState(FAILED), and the failure-path finalizers (on_failure_callback,
listener.on_task_instance_failed, email_on_failure) run as if the task
had failed without ever attempting a retry. Other exceptions from
on_retry_callback are still logged and swallowed, so callbacks that
optimistically clean up partial data continue to work unchanged.

closes: apache#60172
…mment

- Rename newsfragment to match PR number (60172 -> 66781).
- Tighten the comment in `run()` explaining why UP_FOR_RETRY messages are
  deferred until after on_retry_callback runs.
- Extract `_handle_failure_notifications` for the listener + email pattern
  repeated across the three failure-path branches in `finalize`
  (AirflowFailException, retry, and FAILED).

Signed-off-by: 1fanwang <1fannnw@gmail.com>
…transitions

- Drop the docstring on `test_airflow_fail_exception_in_on_retry_callback_fails_task`;
  the test name already describes the assertion, and the project style guide
  prefers no docstrings in tests that merely repeat the function name.
- Assert the recorded callback states directly
  (`retry_callback_calls == [UP_FOR_RETRY]`, `failure_callback_calls == [FAILED]`)
  instead of the call counts. Pins both the count AND the state at the moment
  each callback ran -- proving on_retry_callback fired while the task was
  UP_FOR_RETRY and on_failure_callback fired after AirflowFailException
  promoted it to FAILED.

Signed-off-by: 1fanwang <1fannnw@gmail.com>
1fanwang force-pushed the fix/airflow-fail-in-retry-callback branch from 8f538c8 to c47b237, May 19, 2026 18:53
Labels

area:task-sdk, ready for maintainer review

Development

Successfully merging this pull request may close these issues.

Fix handling AirflowFailException on_retry_callback

3 participants