
doc: add more instructions for up_for_retry #38100

Closed
wants to merge 1 commit

Conversation

CongyueZhang


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

Related discussion: Why up_for_retry takes worker slot? #37986. Thanks again for Jarek's detailed explanation.

Copy link

boring-cyborg bot commented Mar 13, 2024

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally. It's a heavy Docker image, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

Author

@CongyueZhang left a comment


I noticed that in Sensors and base.py, it is mentioned that reschedule will free up a worker slot when the criteria are not yet met.

^ Based on this information, I have a few more questions:

  1. If it can be done for mode=reschedule, why can't Airflow add a similar mechanism for up_for_retry? I think from a user's perspective we don't expect retrying to take worker slots (see the sketch after these questions for the reschedule behaviour I mean).

  2. In the above "Difference between Mode='reschedule' and Deferrable=True in Sensors" table, I think it would be better to stress that mode='reschedule' will also free up worker slots when sleeping. I just hope it won't cause any misunderstanding, because it currently only shows in the column under deferrable=True.
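For illustration, here is a minimal sketch (not part of this PR's diff) of the poke vs. reschedule behaviour referenced in question 1; the DAG id, file path and intervals are made-up placeholders, while mode and poke_interval are standard sensor parameters:

```python
from datetime import datetime

from airflow import DAG
from airflow.sensors.filesystem import FileSensor

with DAG(dag_id="sensor_mode_example", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    # mode="poke" (default): the sensor keeps its worker slot while sleeping
    # between pokes, for the whole time the criteria are not met.
    wait_poke = FileSensor(
        task_id="wait_poke",
        filepath="/tmp/data_ready.flag",  # placeholder path
        mode="poke",
        poke_interval=60,
    )

    # mode="reschedule": between pokes the task is rescheduled and its worker
    # slot is freed, so no slot is held while the criteria are not met.
    wait_reschedule = FileSensor(
        task_id="wait_reschedule",
        filepath="/tmp/data_ready.flag",  # placeholder path
        mode="reschedule",
        poke_interval=300,
    )
```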

+---------------------------------+--------------------------------------------------+
| state='up_for_retry'            | state='deferred'                                 |
+=================================+==================================================+
| Keeps resources while waiting.  | Releases resources, pauses execution when idle,  |
Collaborator


The task state up_for_retry is a state where the task waits before doing another attempt. During this wait time, resources won't be consumed.

Member


Yes. I think @CongyueZhang - our earlier discussion could be a little misleading (and I realized I was talking about a different retry than you were), so it's great that you proposed the documentation here.

There are several things that you can call "retry" in Airflow:

  1. Retry done by waiting and retrying by the task itself in a loop - where the task simply performs a retry of a certain operation and sleeps - usually done by tenacity or other mechanisms like that -> this is what I referred to when it comes to retrying something that is in progress.

  2. Retry that results in the task "failing" and having a retry count (up_for_retry).

Here @dirrao explained - the task effectively fails, the slot is freed and resources are freed as well. But in this case the state of the task in progress is not saved. Whatever the task had done so far is lost, and when retrying, the whole task needs to be restarted from the beginning. Waiting can only be done on a "time" basis - the task will restart from the beginning when the retry time passes and will redo the job from the beginning. While the task is in the up_for_retry state, resources are indeed not used, but when you retry the task, you need to re-do whatever was done in the previous attempt, because we do not keep the state of the originally failed task. This MIGHT lead to increased resource usage, because every time the task attempts to re-run it will have to effectively do the same "preparation" (whatever the preparation is).

  3. Deferring is a mechanism where you can defer the task and serialize its state to disk and let the Triggerer do the conditional wait. This means that effectively your task remains in a half-done state, the state of doing it is preserved, and you can effectively resume where you left off - because when the condition is met (this might be time-based, or possibly waiting for external job completion, or another async-io compatible condition), the state of the task is restored from disk and it resumes with the state restored to what it was before it deferred.
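A hedged sketch (not taken from this PR) contrasting mechanisms 2 and 3 above: the dag id, command and timings are illustrative placeholders, while retries, retry_delay and TimeDeltaSensorAsync are standard Airflow features.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.time_delta import TimeDeltaSensorAsync

with DAG(dag_id="retry_vs_defer_example", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    # Mechanism 2: task-level retries. On failure the task goes to up_for_retry,
    # its worker slot and resources are freed, and after retry_delay it restarts
    # from the beginning - no state is preserved between attempts.
    flaky_task = BashOperator(
        task_id="flaky_task",
        bash_command="curl --fail https://example.com/data",  # placeholder command
        retries=3,
        retry_delay=timedelta(minutes=10),
    )

    # Mechanism 3: deferral. The task serializes its state, hands the wait to the
    # triggerer, and resumes where it left off once the trigger condition is met.
    wait_deferred = TimeDeltaSensorAsync(
        task_id="wait_deferred",
        delta=timedelta(hours=1),
    )

    wait_deferred >> flaky_task
```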

So in your progression, roughly speaking:

  1. takes resources while waiting but initialization is done only once
  2. retrying causes excessive resource use for redoing the initialization part of the task (for tasks that need reinitialization), but waiting does not keep resources
  3. the state is preserved between deferrals and initialization is done only once - so it roughly combines the benefits of 1) and 2) - with the extra overhead of the triggerer that does the waiting for potentially 1000s of such deferred tasks.

Author

Thank you very much! I will update the documentation based on what you explained here.

And just to confirm: it means a long retry_delay wouldn't cause an issue, right? Resources and slots are freed while it's waiting.

Member


Yeah. This would only matter if you use "low-level" retry - such as tenacity decorators on your external method call. I think that was a bit of a confusion in our past discussion, where I explained how it works and assumed the retry you were talking about was that kind of retry, not "task failures" with retries set.
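To make the distinction concrete, here is a minimal sketch (assumed, not from this PR) of such a "low-level" retry: tenacity keeps retrying inside the running task, so the worker slot stays occupied for the whole wait, unlike the up_for_retry state. The URL and function names are placeholders.

```python
import requests
from tenacity import retry, stop_after_attempt, wait_fixed

from airflow.decorators import task


@task
def fetch_data():
    # The retry loop runs inside the task process: between attempts the task
    # sleeps while still holding its worker slot - this is the "low-level"
    # retry described above, as opposed to the up_for_retry task state.
    @retry(stop=stop_after_attempt(5), wait=wait_fixed(30))
    def call_external_api():
        response = requests.get("https://example.com/api/data", timeout=10)  # placeholder URL
        response.raise_for_status()
        return response.json()

    return call_external_api()
```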

@RNHTTR
Collaborator

RNHTTR commented Apr 6, 2024

I'm going to go ahead and close this out. Seems like the confusion has been sorted, and task instances in the up_for_retry state indeed do not consume a worker slot.

@RNHTTR RNHTTR closed this Apr 6, 2024