
doc: add more instructions for up_for_retry #38100

Closed
wants to merge 1 commit

Conversation

CongyueZhang


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

Related discussion: Why up_for_retry takes worker slot? #37986. Thanks again for Jarek's detailed explanation.

Copy link

boring-cyborg bot commented Mar 13, 2024

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally. It's a heavy Docker image, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

Author

@CongyueZhang left a comment


I noticed that in Sensors and base.py, it is mentioned that reschedule will free up a worker slot when the criteria are not yet met.

^ Based on this information, I have a few more questions:

  1. If it can be done for mode=reschedule, why can't Airflow add a similar mechanism for up_for_retry? I think from a user's perspective we don't expect retrying to take worker slots (see the sketch after these questions for the reschedule behaviour I mean).

  2. In the above "Difference between Mode='reschedule' and Deferrable=True in Sensors" table, I think it would be better to stress that mode='reschedule' will also free up worker slots when sleeping. I just hope it won't cause any misunderstanding, because it currently only shows in the column under deferrable=True.
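For illustration, here is a minimal sketch (not part of this PR's diff) of the poke vs. reschedule behaviour referenced in question 1; the DAG id, file path and intervals are made-up placeholders, while mode and poke_interval are standard sensor parameters:

```python
from datetime import datetime

from airflow import DAG
from airflow.sensors.filesystem import FileSensor

with DAG(dag_id="sensor_mode_example", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    # mode="poke" (default): the sensor keeps its worker slot while sleeping
    # between pokes, for the whole time the criteria are not met.
    wait_poke = FileSensor(
        task_id="wait_poke",
        filepath="/tmp/data_ready.flag",  # placeholder path
        mode="poke",
        poke_interval=60,
    )

    # mode="reschedule": between pokes the task is rescheduled and its worker
    # slot is freed, so no slot is held while the criteria are not met.
    wait_reschedule = FileSensor(
        task_id="wait_reschedule",
        filepath="/tmp/data_ready.flag",  # placeholder path
        mode="reschedule",
        poke_interval=300,
    )
```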

+---------------------------------+--------------------------------------------------+
| state='up_for_retry'            | state='deferred'                                 |
+=================================+==================================================+
| Keeps resources while waiting.  | Releases resources, pauses execution when idle,  |
Collaborator


The task state up_for_retry is a state where the task waits before doing another attempt. During this wait time, resources won't be consumed.

Member


Yes. I think @CongyueZhang - our earlier discussion could be a little misleading (and I realized I was talking about a different retry than you were), so it's great that you proposed the documentation here.

There are several things that you can call "retry" in Airflow:

  1. Retry done by waiting and retrying by the task itself in a loop - where the task simply performs a retry of a certain operation and sleeps - usually done by tenacity or other mechanisms like that -> this is what I referred to when it comes to retrying something that is in progress.

  2. Retry that results in the task "failing" and having a retry count (up_for_retry).

Here @dirrao explained - the task effectively fails, the slot is freed and resources are freed as well. But in this case the state of the task in progress is not saved. Whatever the task had done so far is lost, and when retrying, the whole task needs to be restarted from the beginning. Waiting can only be done on a "time" basis - the task will restart from the beginning when the retry time passes and will redo the job from the beginning. While the task is in the up_for_retry state, resources are indeed not used, but when you retry the task, you need to re-do whatever was done in the previous attempt, because we do not keep the state of the originally failed task. This MIGHT lead to increased resource usage, because every time the task attempts to re-run it will have to effectively do the same "preparation" (whatever the preparation is).

  3. Deferring is a mechanism where you can defer the task and serialize its state to disk and let the Triggerer do the conditional wait. This means that effectively your task remains in a half-done state, the state of doing it is preserved, and you can effectively resume where you left off - because when the condition is met (this might be time-based, or possibly waiting for external job completion, or another async-io compatible condition), the state of the task is restored from disk and it resumes with the state restored to what it was before it deferred.
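A hedged sketch (not taken from this PR) contrasting mechanisms 2 and 3 above: the dag id, command and timings are illustrative placeholders, while retries, retry_delay and TimeDeltaSensorAsync are standard Airflow features.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.time_delta import TimeDeltaSensorAsync

with DAG(dag_id="retry_vs_defer_example", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    # Mechanism 2: task-level retries. On failure the task goes to up_for_retry,
    # its worker slot and resources are freed, and after retry_delay it restarts
    # from the beginning - no state is preserved between attempts.
    flaky_task = BashOperator(
        task_id="flaky_task",
        bash_command="curl --fail https://example.com/data",  # placeholder command
        retries=3,
        retry_delay=timedelta(minutes=10),
    )

    # Mechanism 3: deferral. The task serializes its state, hands the wait to the
    # triggerer, and resumes where it left off once the trigger condition is met.
    wait_deferred = TimeDeltaSensorAsync(
        task_id="wait_deferred",
        delta=timedelta(hours=1),
    )

    wait_deferred >> flaky_task
```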

So in your progression, roughly speaking:

  1. takes resources while waiting but initialization is done only once
  2. retrying causes excessive resource use for redoing the initialization part of the task (for tasks that need reinitialization), but waiting does not keep resources
  3. the state is preserved between deferrals and initialization is done only once - so it roughly combines the benefits of 1) and 2) - with the extra overhead of the triggerer that does the waiting for potentially 1000s of such deferred tasks.

Author

Thank you very much! I will update the documentation based on what you explained here.

And just to confirm: it means a long retry_delay wouldn't cause an issue, right? Resources and slots are freed while it's waiting.

Member


Yeah. This would only matter if you use "low-level" retry - such as tenacity decorators on your external method call. I think that was a bit of a confusion in our past discussion, where I explained how it works and assumed the retry you were talking about was that kind of retry, not "task failures" with retries set.
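To make the distinction concrete, here is a minimal sketch (assumed, not from this PR) of such a "low-level" retry: tenacity keeps retrying inside the running task, so the worker slot stays occupied for the whole wait, unlike the up_for_retry state. The URL and function names are placeholders.

```python
import requests
from tenacity import retry, stop_after_attempt, wait_fixed

from airflow.decorators import task


@task
def fetch_data():
    # The retry loop runs inside the task process: between attempts the task
    # sleeps while still holding its worker slot - this is the "low-level"
    # retry described above, as opposed to the up_for_retry task state.
    @retry(stop=stop_after_attempt(5), wait=wait_fixed(30))
    def call_external_api():
        response = requests.get("https://example.com/api/data", timeout=10)  # placeholder URL
        response.raise_for_status()
        return response.json()

    return call_external_api()
```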

@RNHTTR
Collaborator

RNHTTR commented Apr 6, 2024

I'm going to go ahead and close this out. Seems like the confusion has been sorted, and task instances in the up_for_retry state indeed do not consume a worker slot.

@RNHTTR RNHTTR closed this Apr 6, 2024