Properly handle ti state difference between executor and scheduler #17819
Conversation
We don't want to run any user code in the Scheduler. That is why callbacks are currently run in the DAG file processor. Long term, they should be run in the Worker.
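For context, a failure callback is arbitrary user code supplied by the DAG author. A minimal illustrative sketch (the callback body here is hypothetical) of why such code should not run inside the scheduler loop:

```python
import time

# A user-defined callback can do anything, e.g. block for a long time or call
# external systems. If the scheduler executed it inline, one slow callback
# would stall scheduling for every other DAG, which is why callbacks are run
# in the DAG file processor (and, longer term, should move to the worker).
def on_failure(context):
    time.sleep(300)                       # e.g. waiting on a slow alerting system
    print("task failed:", context["ti"])
```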
Alright, let me think of another way... Thanks!
Force-pushed from 0de102f to da09f48.
Hi @kaxil, what do you think about this implementation? Still working on tests...
This is how to reproduce the error: run this DAG and assert that it's successful. Then uncomment the commented-out `depend_on_past=False` argument in the task decorator and run it again:

```python
import time
from datetime import datetime, timedelta

from airflow import DAG


def on_failure(ctx):
    print('hello world')
    print(ctx)


default_args = {'on_failure_callback': on_failure}

dag = DAG(
    dag_id='Give-wrong-arg',
    schedule_interval=None,
    catchup=False,
    start_date=datetime(2021, 7, 12),
    default_args=default_args,
)


@dag.task(retries=2, retry_delay=timedelta(seconds=20))  # , depend_on_past=False)
def task_wrong_arg():
    time.sleep(5)


@dag.task
def myfunc():
    return 1


task_wrong_arg() >> myfunc()
```
Force-pushed from 392f0d5 to f363df6.
Force-pushed from f363df6 to f45f313.
By changing a DAG file to have a parse error and then triggering callbacks, you've hit a different problem too, so that isn't a good way of triggering the behaviour you are trying to test. (Because running the callbacks might need to run the …
Force-pushed from f45f313 to 2913cc3.
When a task fails to start, the executor fails it, and the report says that its state in the scheduler is queued while its state in the executor is failed. Currently we fail this task without retries to avoid getting stuck. This change modifies the above to only fail the task if there are no retries left.
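A minimal sketch of the behaviour described above, using a simplified stand-in for the TaskInstance fields (`try_number`, `max_tries`, `state`); this is illustrative only, not the actual Airflow scheduler code:

```python
from dataclasses import dataclass


@dataclass
class TI:
    """Simplified stand-in for a TaskInstance row."""
    try_number: int
    max_tries: int
    state: str = "queued"


def handle_failure_reported_by_executor(ti: TI) -> None:
    # The executor reported "failed" while the scheduler still has the TI
    # queued (the task never actually started). Only fail it outright when
    # no retries remain; otherwise let the normal retry path handle it.
    if ti.try_number <= ti.max_tries:
        ti.state = "up_for_retry"
    else:
        ti.state = "failed"


ti = TI(try_number=1, max_tries=2)
handle_failure_reported_by_executor(ti)
assert ti.state == "up_for_retry"
```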
Force-pushed from 9da31ab to abb8d0a.
All the reviews have been addressed.
The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it to the latest main at your convenience, or amend the last commit of the PR and push it with --force-with-lease.
Well done @ephraimbuddy 👏
We applied this patch directly on top of the 2.1.4 tag and noticed issues almost instantly. We also tried the … Here are some logs.
Thanks for checking this @taylorfinnell. See airflow/airflow/models/taskinstance.py, line 1727 at 13a558d. If I understand correctly, throwing the above exception is not a bug?
That's correct, we expect that exception to be raised when data is not in place for the DAG to process. We then rely on the retry mechanism to try to process the data at a later time, when it is available. Do you have any suggestions on how we could reproduce this in a test case? If we can get a test that is fixed by your suggestion, I would feel more comfortable trying the change. Unfortunately, we didn't see the issue until we got to production scale.
I don't have an idea for a test case other than the one we have in the unit test for this PR. To reproduce with a DAG manually, raise AirflowException after this line: airflow/airflow/cli/commands/task_command.py, line 279 at 13a558d. That is, add raise AirflowException after the above line and run a DAG with retries (see the sketch below).
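Roughly, the injection looks like this; the function shown is only a placeholder, and the real location is the task_command.py line referenced above:

```python
from airflow.exceptions import AirflowException


def _run_task_placeholder(args, ti):  # placeholder for the function containing the referenced line
    ...                               # the line referenced above
    # Injected purely to reproduce the bug: the task now "fails to start",
    # so the executor reports a failure while the scheduler still sees the
    # task instance as queued, exercising the retry handling in this PR.
    raise AirflowException("simulating a task that fails to start")
```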
Thanks for that, we might be able to push a DAG like that to our staging environment. Theoretically, we could also write an integration test for this? Is there a good document to follow for integration testing other than the random blog articles I've found?
There’s currently no community doc about integration testing that I’m aware of. Let us know what happens when you apply the patch above for try_number.
@WattsInABox, I have tested this in deployment and it works as expected. I think you should create an issue for the behaviour you are seeing.
When a task fails to start, the executor fails it, and its state in the scheduler is queued while its state in the executor is failed. Currently we fail this task without retries to avoid getting stuck.
This PR changes this to only fail the task if the callback cannot be executed, which ensures the task does not get stuck.
closes: #16625
Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.