doc: add more instructions for up_for_retry
#38100
Conversation
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
I noticed that in Sensors and base.py, it is mentioned that `reschedule` will free up a worker slot when the criteria are not yet met.
^ Based on this information I have a few more questions:
- If it can be done for `mode='reschedule'`, why can't Airflow add a similar mechanism for `up_for_retry`? From a user's perspective, we don't expect retrying to take worker slots.
- In the "Difference between Mode='reschedule' and Deferrable=True in Sensors" table above, I think it would be better to stress that `mode='reschedule'` will also free up worker slots while sleeping. I just hope it won't cause any misunderstanding, because currently this only appears in the `deferrable=True` column.
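The slot behaviour being asked about can be illustrated with a toy model. This is plain Python, not Airflow code: `WorkerPool`, `poke_sensor`, and `reschedule_sensor` are hypothetical names, sketching only the idea that `mode='poke'` holds a slot for the whole wait while `mode='reschedule'` takes a slot only for each check.

```python
class WorkerPool:
    """Toy model of executor slots (not real Airflow code)."""

    def __init__(self, slots):
        self.free = slots

    def acquire(self):
        assert self.free > 0, "no free slots"
        self.free -= 1

    def release(self):
        self.free += 1


def poke_sensor(pool, checks):
    # mode='poke': the slot is held for the entire wait.
    pool.acquire()
    free_during_wait = []
    for criteria_met in checks:
        free_during_wait.append(pool.free)
        if criteria_met:
            break
        # the sensor sleeps here while still holding the slot
    pool.release()
    return free_during_wait


def reschedule_sensor(pool, checks):
    # mode='reschedule': the slot is taken only for each check,
    # then released until the next scheduled poke.
    free_between_pokes = []
    for criteria_met in checks:
        pool.acquire()
        done = criteria_met
        pool.release()
        free_between_pokes.append(pool.free)
        if done:
            break
    return free_between_pokes
```

With a single-slot pool, the poke-mode sensor reports zero free slots for the whole wait, while the reschedule-mode sensor leaves the slot free between checks.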
+--------------------------------------------------------+--------------------------------------------------------+
| state='up_for_retry'                                   | state='deferred'                                       |
+========================================================+========================================================+
| Keeps resources while waiting.                         | Releases resources, pauses execution when idle,        |
task state `up_for_retry` is a state where the task waits before making another attempt. During this wait time, resources won't be consumed.
Yes. I think @CongyueZhang - our earlier discussion could be a little misleading (I realized I was talking about a different retry than you were), so it's great that you proposed the documentation here.
There are several things that you can call "retry" in Airflow:

1. Retry done by the task itself in a loop - where the task simply retries a certain operation and sleeps - usually done by tenacity or similar mechanisms. This is what I referred to when talking about retrying something that is in progress.
2. Retry that results in the task "failing" and having a retry count (`up_for_retry`). Here, as @dirrao explained, the task effectively fails and the slot and resources are freed. But in this case the state of the task in progress is not saved: whatever the task had done so far is lost, and when retrying, the whole task needs to restart from the beginning. Waiting can only be time-based - when the retry time passes, the task restarts and redoes the job from the beginning. While the task is in the `up_for_retry` state, resources are indeed not used, but when the task retries it has to redo whatever was done in the previous attempt, because we do not keep the state of the originally failed task. This MIGHT lead to increased resource usage, because every re-run attempt has to repeat the same "preparation" (whatever that preparation is).
3. Deferring is a mechanism where you can defer the task and serialize its state to disk, letting the Triggerer do the conditional wait. This means your task effectively remains in a half-done state, its progress is preserved, and you can resume where you left off: when the condition is met (which might be time-based, waiting for an external job to complete, or any other async-io compatible condition), the state of the task is restored from disk and it resumes with the state it had before it deferred.
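The defer/resume idea in point 3 can be sketched in a few lines of plain Python. This is not Airflow's triggerer API - `run_task` and its state dict are hypothetical - only the serialize-and-resume pattern is the point.

```python
import pickle


def run_task(state=None):
    # Hypothetical task protocol: returns ("deferred", payload) to pause,
    # or ("done", result) when finished. Not Airflow's real API.
    if state is None:
        state = {"prepared": True, "progress": 0}  # expensive init, done once
    if state["progress"] < 3:
        state["progress"] += 1
        # Serialize the half-done state so the worker slot can be freed;
        # only this small payload is kept while waiting.
        return "deferred", pickle.dumps(state)
    return "done", state["progress"]


status, payload = run_task()
while status == "deferred":
    # A real triggerer would await the condition here, slot-free,
    # then restore the state and resume the task.
    status, payload = run_task(pickle.loads(payload))
```

Each resume picks up from the restored `progress` counter instead of redoing the initialization - which is the benefit over a plain `up_for_retry` restart.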
So in your progression, roughly speaking:

1. takes resources while waiting, but initialization is done only once;
2. retrying causes excess resource use by redoing the initialization part of the task (for tasks that need reinitialization), but waiting does not keep resources;
3. the state is preserved between deferrals and initialization is done only once - so it roughly combines the benefits of 1) and 2), with the extra overhead of a triggerer that does the waiting for potentially thousands of such deferred tasks.
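The cost difference between points 1) and 2) can be made concrete with a small sketch. All names here are hypothetical and `PREP_COST` is an arbitrary unit; the point is only that an in-task retry pays for preparation once, while an `up_for_retry` restart pays for it on every attempt.

```python
import itertools

PREP_COST = 5  # arbitrary cost units for the one-time "preparation"


def make_flaky(fail_times):
    """Return an operation that fails its first `fail_times` calls."""
    calls = itertools.count()

    def op():
        if next(calls) < fail_times:
            raise RuntimeError("transient failure")
        return "ok"

    return op


def in_task_retry(op, attempts):
    # Type 1: retry inside the running task (tenacity-style loop).
    # Preparation happens once; the slot is held throughout.
    cost = PREP_COST  # one-time preparation
    for _ in range(attempts):
        try:
            return op(), cost
        except RuntimeError:
            cost += 1  # only the retried operation is repeated
    raise RuntimeError("retries exhausted")


def up_for_retry(op, attempts):
    # Type 2: the task fails, frees its slot, and restarts from scratch,
    # so preparation is redone on every attempt.
    cost = 0
    for _ in range(attempts):
        cost += PREP_COST  # preparation redone each restart
        try:
            return op(), cost
        except RuntimeError:
            pass
    raise RuntimeError("retries exhausted")
```

With an operation that fails twice before succeeding, the in-task loop pays the preparation cost once, while the restart-from-scratch path pays it three times - the "increased resource usage" described above.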
Thank you very much! I will update the documentation based on what you explained here.
And just to confirm: this means a long `retry_delay` wouldn't cause an issue, right? Resources and slots are freed while it's waiting.
Yeah. This would only matter if you use "low-level" retry - such as tenacity decorators on your external method call. I think that was a bit of a confusion in our past discussion, where I explained how it works and assumed the retry you were talking about was that kind of retry, not "task failures" with retries set.
I'm going to go ahead and close this out. Seems like the confusion has been sorted, and task instances in the
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named `{pr_number}.significant.rst` or `{issue_number}.significant.rst`, in newsfragments.

Related discussion: Why up_for_retry takes worker slot? #37986. Thanks again for Jarek's detailed explanation.