
Properly handle ti state difference between executor and scheduler #17819

Merged: 8 commits merged into apache:main from fix-task-callback-scheduler on Sep 21, 2021

Conversation

@ephraimbuddy (Contributor) commented Aug 24, 2021

When a task fails to start, the executor fails it, and its state in the
scheduler is queued while its state in the executor is failed. Currently
we fail this task without retries to avoid getting stuck.

This PR changes this to only fail the task if the callback cannot be
executed. This ensures the task does not get stuck.

closes: #16625
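
In rough outline, the idea is that when the executor reports a failure for a task the scheduler still considers queued, the scheduler should let the normal retry accounting decide the next state rather than force-failing outright (the merged version, per the commit message further down, only fails the task when no retries are left). The following is a toy sketch only, not the PR's actual diff: the State and TaskInstance classes below are simplified stand-ins for Airflow's, and reconcile_executor_failure is a hypothetical name used for illustration.

# Toy sketch only -- not the PR's actual diff. Simplified stand-ins for
# Airflow's State and TaskInstance; reconcile_executor_failure is a
# hypothetical name used here for illustration.
from dataclasses import dataclass
from enum import Enum


class State(str, Enum):
    QUEUED = "queued"
    UP_FOR_RETRY = "up_for_retry"
    FAILED = "failed"


@dataclass
class TaskInstance:
    task_id: str
    state: State
    try_number: int = 1
    max_tries: int = 3


def reconcile_executor_failure(ti: TaskInstance, executor_state: State) -> None:
    """Executor reports FAILED while the scheduler still sees QUEUED."""
    if executor_state is not State.FAILED or ti.state is not State.QUEUED:
        return  # states agree (or nothing failed); nothing to reconcile
    if ti.try_number <= ti.max_tries:
        # The old behaviour failed the task outright here; the change lets
        # the normal retry machinery pick it up instead.
        ti.state = State.UP_FOR_RETRY
    else:
        ti.state = State.FAILED


ti = TaskInstance(task_id="task_wrong_arg", state=State.QUEUED, try_number=1, max_tries=2)
reconcile_executor_failure(ti, executor_state=State.FAILED)
print(ti.state)  # State.UP_FOR_RETRY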



@boring-cyborg boring-cyborg bot added the "area:Scheduler" (Scheduler or dag parsing Issues) label Aug 24, 2021
@kaxil (Member) left a comment


We don't want to run any user code in Scheduler. That is why callbacks are currently run in the DAG file processor. Long term they should be run in the Worker

@ephraimbuddy (Contributor Author):

> We don't want to run any user code in Scheduler. That is why callbacks are currently run in the DAG file processor. Long term they should be run in the Worker

Alright, let me think of another way...Thanks!

@ephraimbuddy ephraimbuddy deleted the fix-task-callback-scheduler branch August 24, 2021 23:43
@ephraimbuddy ephraimbuddy restored the fix-task-callback-scheduler branch August 25, 2021 11:21
@ephraimbuddy ephraimbuddy reopened this Aug 25, 2021
@ephraimbuddy ephraimbuddy marked this pull request as draft August 25, 2021 11:22
@ephraimbuddy ephraimbuddy changed the title from "Handle task callback inside the scheduler" to "Promptly handle task callback from _process_executor_events" Aug 25, 2021
@ephraimbuddy (Contributor Author):

Hi @kaxil, what do you think about this implementation? still working on tests...

@ephraimbuddy (Contributor Author):

This is how to reproduce the error: run this DAG and confirm that it succeeds. Then uncomment the depend_on_past argument; it is deliberately misspelled (depend_on_past instead of depends_on_past). You will get an import error in the UI. Run the DAG and the task will enter the queued state and retry twice before failing.

import time
from datetime import datetime, timedelta

from airflow import DAG


def on_failure(ctx):
    print('hello world')
    print(ctx)


default_args = {'on_failure_callback': on_failure}

dag = DAG(
    dag_id='Give-wrong-arg',
    schedule_interval=None,
    catchup=False,
    start_date=datetime(2021, 7, 12),
    default_args=default_args,
)


# The commented-out kwarg is deliberately misspelled: it should be
# depends_on_past, not depend_on_past. Uncommenting it triggers the bug.
@dag.task(retries=2, retry_delay=timedelta(seconds=20))  # , depend_on_past=False)
def task_wrong_arg():
    time.sleep(5)


@dag.task
def myfunc():
    return 1


task_wrong_arg() >> myfunc()

@ephraimbuddy ephraimbuddy force-pushed the fix-task-callback-scheduler branch 2 times, most recently from 392f0d5 to f363df6 on August 25, 2021 15:42
@ephraimbuddy ephraimbuddy marked this pull request as ready for review August 25, 2021 15:49
@ashb ashb self-assigned this Aug 30, 2021
@ashb (Member) commented Aug 30, 2021

By changing a dag file to have a parse error and then triggering callbacks you've hit a different problem too, so that isn't a good way of triggering the behaviour you are trying to test.

(Because to run the on_failure_callback we need the actual loaded DAG file in many cases.)

@ephraimbuddy ephraimbuddy deleted the fix-task-callback-scheduler branch September 3, 2021 15:31
@ephraimbuddy ephraimbuddy restored the fix-task-callback-scheduler branch September 3, 2021 17:03
@ephraimbuddy ephraimbuddy reopened this Sep 3, 2021
@ephraimbuddy ephraimbuddy changed the title from "Promptly handle task callback from _process_executor_events" to "Properly handle ti state difference between executor and scheduler" Sep 3, 2021
Review comments on airflow/jobs/scheduler_job.py (outdated, resolved)
ephraimbuddy and others added 8 commits September 21, 2021 20:50
When a task fails to start, the executor fails it, and the report says that its state in the scheduler is queued while its state in the executor is failed.

Currently we fail this task without retries to avoid getting stuck.

This change modifies the above to only fail the task if there are no retries left.
@kaxil (Member) commented Sep 21, 2021

All the reviews have been addressed

@kaxil kaxil requested a review from ashb September 21, 2021 19:51
@github-actions (bot):

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions github-actions bot added the "full tests needed" (We need to run full set of tests for this PR to merge) label Sep 21, 2021
@kaxil kaxil merged commit 44f601e into apache:main Sep 21, 2021
@kaxil kaxil deleted the fix-task-callback-scheduler branch September 21, 2021 20:17
@kaxil (Member) commented Sep 21, 2021

Well done @ephraimbuddy 👏

@taylorfinnell:

@ephraimbuddy @kaxil

We applied this patch directly on top of the 2.1.4 tag and noticed issues almost instantly.

  • Tasks would queue
  • Tasks would throw an exception; in this case the exception came from our DAG code signaling we are not ready to process and should retry later. It was not an internal Airflow exception
  • Task logs would indicate that the state was being set to UP_FOR_RETRY
  • The UI would show the task still RUNNING. The task would have ~16 retries left and would sit on attempt 1
  • Clearing the task would result in the shutdown state
  • To get the task to run again we first had to fail it, then clear it again

We also tried the main branch up to this commit and saw very similar issues.

Here are some logs.

...
[2021-09-22 04:47:15,214] {taskinstance.py:1463} ERROR - Task failed with exception
...
raise FileNotFoundError(f"{file} does not exist")

[2021-09-22 04:47:15,299] {logging_mixin.py:109} WARNING - /opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/strategies.py:911 SAWarning: Multiple rows returned with uselist=False for lazily-loaded attribute 'DagRun.task_instances'
[2021-09-22 04:47:15,301] {taskinstance.py:1512} INFO - Marking task as UP_FOR_RETRY. dag_id=foo_dag_split, task_id=foo_task, execution_date=20210921T043000, start_date=20210922T044714, end_date=20210922T044715
[2021-09-22 04:47:15,302] {logging_mixin.py:109} WARNING - /opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/session.py:2193 SAWarning: Instance <TaskInstance at 0x7f01f5adc070> is already pending in this Session yet is being merged again; this is probably not what you want to do
[2021-09-22 04:47:15,367] {local_task_job.py:151} INFO - Task exited with return code 1
[2021-09-22 04:47:15,517] {logging_mixin.py:109} WARNING - /opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/strategies.py:911 SAWarning: Multiple rows returned with uselist=False for lazily-loaded attribute 'DagRun.task_instances'
[2021-09-22 04:47:15,517] {taskinstance.py:1512} INFO - Marking task as UP_FOR_RETRY. dag_id=foo_dag_split, task_id=foo_task, execution_date=20210921T043000, start_date=20210922T044714, end_date=20210922T044715
[2021-09-22 04:47:15,518] {logging_mixin.py:109} WARNING - /opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/session.py:2193 SAWarning: Instance <TaskInstance at 0x7f01f5ad6580> is already pending in this Session yet is being merged again; this is probably not what you want to do

@ephraimbuddy (Contributor Author) commented Sep 22, 2021

Thanks for checking this @taylorfinnell.
Can you modify your patch to use ti.try_number += 1 instead of ti._try_number += 1 on this line:

self._try_number += 1

If I understand correctly, the exception above being raised is expected and not a bug?

@taylorfinnell:

That's correct; we expect that exception to be raised when data is not in place for the DAG to process. We then rely on the retry mechanism to process the data at a later time, when it is available.

Do you have any suggestions on how we could reproduce this in a test case? If we can get a test that is fixed by your suggestion, I would feel more comfortable trying the change. Unfortunately, we didn't see the issue until we got to production scale.

@ephraimbuddy (Contributor Author) commented Sep 22, 2021

I don't have an idea for a test case other than the one we have in the unit tests for this PR.

To reproduce manually with a DAG, raise AirflowException after this line:

[embedded code permalink]

That is, add raise AirflowException after the linked line and run a DAG with retries.
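
For the "run a DAG with retries" part, any simple DAG with retries set should do. Here is a minimal illustrative example (not taken from the PR; the dag_id, task name and retry settings are arbitrary, and it assumes the AirflowException has been injected into the scheduler/executor path as described above):

# Minimal DAG with retries for the manual reproduction described above.
# Illustrative only: dag_id, task name and retry settings are arbitrary.
from datetime import datetime, timedelta

from airflow import DAG

dag = DAG(
    dag_id='retry-repro',
    schedule_interval=None,
    catchup=False,
    start_date=datetime(2021, 9, 1),
)


@dag.task(retries=2, retry_delay=timedelta(seconds=30))
def simple_task():
    # The task body doesn't matter; the failure is injected on the
    # scheduler/executor side, before the task ever runs.
    return 'done'


simple_task()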

@WattsInABox:

Thanks for that; we might be able to push a DAG like that to our staging environment. Theoretically, we could also write an integration test for this? Is there a good document to follow for integration testing, other than the random blog articles I've found?

@ephraimbuddy (Contributor Author):

There’s currently no community doc about integration testing that I’m aware of.

Let us know what happens when you apply the patch above for try_number.

@ephraimbuddy (Contributor Author):

@WattsInABox, I have tested this in deployment and it works as expected. I think you should create an issue for the behaviour you are seeing.

Labels: area:Scheduler (Scheduler or dag parsing Issues), full tests needed (We need to run full set of tests for this PR to merge)

Successfully merging this pull request may close these issues: Task is not retried when worker pod fails to start

6 participants