Handle and log exceptions raised during task callback #17347

SamWheating · 2021-07-30T16:41:39Z

Currently, an exception raised in a task-level callback goes unhandled, so potentially important parts of the callback (for example, sending a Slack notification after a job failure) may be skipped. Because the traceback never reaches the task logs, errors in callbacks can be very difficult to identify and fix.

To demonstrate, here's a simple DAG whose callback function contains an error:

from airflow.models import DAG
from airflow import utils
from airflow.operators.python import PythonOperator

def message():
    print('Hello, world.')

def broken_callback(context=None):
    print("Callback started")
    print(f"Task {context['taskID']} finished successfully")  # raises keyError
    print("Callback complete.")

with DAG('dag-with-broken-callback', schedule_interval=None, start_date=utils.dates.days_ago(1)) as dag:
    
    task = PythonOperator(
        task_id='task',
        python_callable=message,
        on_success_callback=broken_callback,
    )

The task execution log stops at the point where the exception is raised:

[2021-07-30, 16:25:44 UTC] {logging_mixin.py:109} INFO - Hello, world.
[2021-07-30, 16:25:44 UTC] {python.py:151} INFO - Done. Returned value was: None
[2021-07-30, 16:25:44 UTC] {taskinstance.py:1144} INFO - Marking task as SUCCESS. dag_id=dag-with-broken-callback, task_id=task, execution_date=20210730T162540, start_date=20210730T162544, end_date=20210730T162544
[2021-07-30, 16:25:44 UTC] {local_task_job.py:151} INFO - Task exited with return code 0
[2021-07-30, 16:25:44 UTC] {logging_mixin.py:109} INFO - Callback started
<nothing>

The unhandled exception shows up only in the executor logs:

Running <TaskInstance: dag-with-broken-callback.task 2021-07-30T16:25:40.187231+00:00 [queued]> on host 466ea2c2e992
Traceback (most recent call last):
  File "/usr/local/bin/airflow", line 33, in <module>
    sys.exit(load_entry_point('apache-airflow', 'console_scripts', 'airflow')())
  File "/opt/airflow/airflow/__main__.py", line 40, in main
    args.func(args)
  File "/opt/airflow/airflow/cli/cli_parser.py", line 48, in command
    return func(*args, **kwargs)
  File "/opt/airflow/airflow/utils/cli.py", line 91, in wrapper
    return f(*args, **kwargs)
  File "/opt/airflow/airflow/cli/commands/task_command.py", line 256, in task_run
    _run_task_by_selected_method(args, dag, ti)
  File "/opt/airflow/airflow/cli/commands/task_command.py", line 84, in _run_task_by_selected_method
    _run_task_by_local_task_job(args, ti)
  File "/opt/airflow/airflow/cli/commands/task_command.py", line 141, in _run_task_by_local_task_job
    run_job.run()
  File "/opt/airflow/airflow/jobs/base_job.py", line 245, in run
    self._execute()
  File "/opt/airflow/airflow/jobs/local_task_job.py", line 128, in _execute
    self.handle_task_exit(return_code)
  File "/opt/airflow/airflow/jobs/local_task_job.py", line 163, in handle_task_exit
    self.task_instance._run_finished_callback(error=error)
  File "/opt/airflow/airflow/models/taskinstance.py", line 1377, in _run_finished_callback
    task.on_success_callback(context)
  File "/opt/airflow/dags/test_callback.py", line 10, in broken_callback
    print(f"Task {context['taskID']} finished successfully")  # raises keyError
KeyError: 'taskID'
[2021-07-30 16:25:45,078] {sequential_executor.py:66} ERROR - Failed to execute task Command '['airflow', 'tasks', 'run', 'dag-with-broken-callback', 'task', '2021-07-30T16:25:40.187231+00:00', '--local', '--pool', 'default_pool', '--subdir', '/opt/airflow/dags/test_callback.py']' returned non-zero exit status 1..
[2021-07-30 16:25:45,079] {scheduler_job.py:577} INFO - Executor reports execution of dag-with-broken-callback.task execution_date=2021-07-30 16:25:40.187231+00:00 exited with status failed for try_number 1
[2021-07-30 16:25:45,129] {dagrun.py:435} INFO - Marking run <DagRun dag-with-broken-callback @ 2021-07-30 16:25:40.187231+00:00: manual__2021-07-30T16:25:40.187231+00:00, externally triggered: True> successful

By handling this exception and logging it, we can make callback failures much easier to identify and fix. After this change, the task logs look like this:

[2021-07-30, 16:26:45 UTC] {logging_mixin.py:109} INFO - Hello, world.
[2021-07-30, 16:26:45 UTC] {python.py:151} INFO - Done. Returned value was: None
[2021-07-30, 16:26:45 UTC] {taskinstance.py:1144} INFO - Marking task as SUCCESS. dag_id=dag-with-broken-callback, task_id=task, execution_date=20210730T162642, start_date=20210730T162645, end_date=20210730T162645
[2021-07-30, 16:26:45 UTC] {local_task_job.py:151} INFO - Task exited with return code 0
[2021-07-30, 16:26:45 UTC] {logging_mixin.py:109} INFO - Callback started
[2021-07-30, 16:26:45 UTC] {taskinstance.py:1357} ERROR - Failed when executing on_success_callback
Traceback (most recent call last):
  File "/opt/airflow/airflow/models/taskinstance.py", line 1355, in _run_finished_callback
    task.on_success_callback(context)
  File "/opt/airflow/dags/test_callback.py", line 10, in broken_callback
    print(f"Task {context['taskID']} finished successfully")  # raises keyError
KeyError: 'taskID'
[2021-07-30, 16:26:45 UTC] {local_task_job.py:258} INFO - 0 downstream tasks scheduled from follow-on schedule check

This change also removes the stack trace from the executor logs. If we want to avoid changing that behaviour, we could re-raise the exception after logging it. Thoughts?
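A minimal sketch of the handle-and-log pattern discussed above, written as a standalone helper rather than Airflow's actual TaskInstance._run_finished_callback. The helper name run_callback and its reraise flag are hypothetical; reraise=True corresponds to the option of re-raising after logging, which would keep the stack trace in the executor logs. The log message mirrors the "Failed when executing on_success_callback" line in the task log above.

```python
import logging
from typing import Any, Callable, Dict, Optional

log = logging.getLogger(__name__)


# Hypothetical helper: illustrates the try/except pattern only, it is not
# Airflow's TaskInstance._run_finished_callback.
def run_callback(
    callback: Optional[Callable[[Dict[str, Any]], None]],
    context: Dict[str, Any],
    callback_name: str,
    reraise: bool = False,
) -> bool:
    """Run a task-finished callback and log any exception it raises.

    Returns True if the callback completed (or none was set) and False if it
    raised. With reraise=True the exception propagates after being logged.
    """
    if callback is None:
        return True
    try:
        callback(context)
    except Exception:
        # log.exception logs at ERROR level and appends the full traceback.
        log.exception("Failed when executing %s", callback_name)
        if reraise:
            raise
        return False
    return True


def broken_callback(context=None):
    print("Callback started")
    print(f"Task {context['taskID']} finished successfully")  # raises KeyError


logging.basicConfig(level=logging.INFO)
ok = run_callback(broken_callback, {"task_id": "task"}, "on_success_callback")
print(ok)  # prints False; the KeyError traceback is logged instead of propagating
```

Catching Exception rather than BaseException leaves KeyboardInterrupt and SystemExit alone, so a broken callback is reported without swallowing signals meant to stop the process.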

airflow/models/taskinstance.py

ephraimbuddy

Please also add tests for on_failure_callback and on_retry_callback to guard against regressions.

tests/models/test_taskinstance.py
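A self-contained sketch of the kind of regression test requested above, covering on_success_callback, on_failure_callback and on_retry_callback. It does not reproduce the tests actually added to tests/models/test_taskinstance.py, which exercise a real TaskInstance; here run_callback is a hypothetical stand-in for the patched callback runner, and the standard library's unittest assertLogs checks that each failure is logged with its traceback instead of propagating.

```python
import logging
import unittest

log = logging.getLogger("callback_demo")


def run_callback(callback, context, callback_name):
    """Hypothetical stand-in for the patched runner: log the error, don't raise."""
    if callback is None:
        return
    try:
        callback(context)
    except Exception:
        log.exception("Failed when executing %s", callback_name)


def broken_callback(context):
    raise KeyError("taskID")


class TestCallbackExceptionsAreLogged(unittest.TestCase):
    # One sub-test per task-finished callback, as requested in review.
    CALLBACKS = ("on_success_callback", "on_failure_callback", "on_retry_callback")

    def test_exception_is_logged_not_raised(self):
        for name in self.CALLBACKS:
            with self.subTest(callback=name):
                with self.assertLogs(log, level="ERROR") as captured:
                    run_callback(broken_callback, {}, name)  # must not raise
                record = captured.output[0]
                self.assertIn(f"Failed when executing {name}", record)
                # The formatted record includes the traceback, so the
                # original exception type should appear in it.
                self.assertIn("KeyError", record)
```

Saved as a module, this can be run with python -m unittest; using subTest keeps a failure in one callback type from hiding results for the other two.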

github-actions · 2021-08-03T07:54:42Z

The PR most likely needs to run the full matrix of tests because it modifies core parts of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

ephraimbuddy · 2021-08-03T11:29:48Z

Can you rebase, @SamWheating?

ephraimbuddy · 2021-08-04T11:14:44Z

Closing and reopening to trigger tests

ephraimbuddy · 2021-08-04T11:15:12Z

@ashb please take a look

ephraimbuddy · 2021-08-05T19:08:13Z

@SamWheating please rebase again

SamWheating requested review from ashb, kaxil and XD-DENG as code owners July 30, 2021 16:41

subkanthi approved these changes Jul 30, 2021

ephraimbuddy reviewed Jul 30, 2021

airflow/models/taskinstance.py (resolved)

ephraimbuddy requested changes Aug 2, 2021

tests/models/test_taskinstance.py (outdated, resolved)

SamWheating force-pushed the sw-verbose-callback-failures branch from 5acd814 to 2120671 August 2, 2021 18:29

ephraimbuddy approved these changes Aug 3, 2021

github-actions bot added the "full tests needed" label (We need to run full set of tests for this PR to merge) Aug 3, 2021

ephraimbuddy closed this Aug 3, 2021

ephraimbuddy reopened this Aug 3, 2021

ephraimbuddy added this to the Airflow 2.1.3 milestone Aug 3, 2021

ephraimbuddy closed this Aug 3, 2021

ephraimbuddy reopened this Aug 3, 2021

SamWheating force-pushed the sw-verbose-callback-failures branch from 2120671 to 0bd7b3b August 3, 2021 18:11

ephraimbuddy closed this Aug 4, 2021

ephraimbuddy reopened this Aug 4, 2021

ashb approved these changes Aug 4, 2021

ephraimbuddy closed this Aug 5, 2021

ephraimbuddy reopened this Aug 5, 2021

SamWheating added 3 commits August 5, 2021 14:24

Add missing exception handling in success/retry/failure callbacks

b907f0b

Added test and replaced log with self.log

dfad1da

Expand test coverage to all three task-finished callbacks.

5dd6848

SamWheating force-pushed the sw-verbose-callback-failures branch from 0bd7b3b to 5dd6848 August 5, 2021 21:25

ephraimbuddy merged commit faf9f73 into apache:main Aug 6, 2021

jhtimmins pushed a commit that referenced this pull request Aug 9, 2021

Handle and log exceptions raised during task callback (#17347)

798237a

Add missing exception handling in success/retry/failure callbacks (cherry picked from commit faf9f73)

jhtimmins pushed a commit that referenced this pull request Aug 13, 2021

Handle and log exceptions raised during task callback (#17347)

0678a94

Add missing exception handling in success/retry/failure callbacks (cherry picked from commit faf9f73)

kaxil pushed a commit that referenced this pull request Aug 17, 2021

Handle and log exceptions raised during task callback (#17347)

e04188d

Add missing exception handling in success/retry/failure callbacks (cherry picked from commit faf9f73)

jhtimmins pushed a commit that referenced this pull request Aug 17, 2021

Handle and log exceptions raised during task callback (#17347)

87c999a

Add missing exception handling in success/retry/failure callbacks (cherry picked from commit faf9f73)
