Scheduler to handle incrementing of try_number #39336

dstandish · 2024-04-30T21:21:39Z

Previously, there was a lot of bad stuff happening around try_number.

We incremented it when task started running. And because of that, we had this logic to return "_try_number + 1" when task not running. But this gave the "right" try number before it ran, and the wrong number after it ran. And, since it was naively incremented when task starts running -- i.e. without regard to why it is running -- we decremented it when deferring or exiting on a reschedule.

What I do here is try to remove all of that stuff:

no more private _try_number attr
no more getter logic
no more decrementing
no more incrementing as part of task execution

Now what we do is increment only when the task is set to scheduled and only when it's not coming out of deferral or "up_for_reschedule". So the try_number will be more stable. It will not change throughout the course of task execution. The only time it will be incremented is when there's legitimately a new try.

One consequence of this is that try number will no longer be incremented if you run either airlfow tasks run or ti.run() in isolation. But because airflow assumes that all tasks runs are scheduled by the scheduler, I do not regard this to be a breaking change.

If user code or provider code has implemented hacks to get the "right" try_number when looking at it at the wrong time (because previously it gave the wrong answer), unfortunately that code will just have to be patched. There are only two cases I know of in the providers codebase -- openlineage listener, and dbt openlineage.

As a courtesy for backcompat we also add property _try_number which is just a proxy for try_number, so you'll still be able to access this attr. But, it will not behave the same as it did before.

airflow/models/taskinstance.py

tests/sensors/test_base.py

jedcunningham

Overall looks good!

@SamWheating, you might be a good reviewer on this one as well.

hussein-awala

Great work! Working with try_number was confusing for many contributors, this change helps to make it easier.

I just added some nits, but it looks good to merge. LGTM 🚀

airflow/jobs/backfill_job_runner.py

airflow/models/dagrun.py

dstandish · 2024-05-09T02:55:51Z

alright, yolo, here we go

thanks @ashb @hussein-awala @jedcunningham

potiuk · 2024-05-09T05:48:01Z

A few nits, otherwise it looks good, and I don't know why we (I) did it this terrible hacky way in the first place!

I think we will find out when it is released :). But I also think it's worth to take risk to finally get rid of that one.

BTW. Shall we mark it for 2.9.2 ? I think we can treat it as a bug-fix ?

dstandish · 2024-05-09T05:52:06Z

I feel like it is not unreasonable to think of it as bug fix. But I think the point of that convention is that patch releases should get more stable not less. So since this has some stability risk it may violate spirit of convention to include it. Though there could also be cherry pick convenience arguments that force it. If it doesn’t go in 2.9.x then incidentally it may not ever see a release because of Jed’s coming changes re TIs :)

potiuk · 2024-05-09T05:56:39Z

Maybe just a bit more manual test scrutiny for 2.9.2 around try_num will be enough of stability ? I think the sooner we get it out, the easier will be to fix any potential problems there.

dstandish · 2024-05-09T06:00:08Z

I do plan to do more manual testing. I had some trouble with backfill but thought i got it no worse than main (backfill generally (ie separate from this change ) has issues and doesn’t work right in all scenarios). But plan to do a little more. I don’t personally object to including it.

Non backfill is simpler

Previously, there was a lot of bad stuff happening around try_number. We incremented it when task started running. And because of that, we had this logic to return "_try_number + 1" when task not running. But this gave the "right" try number before it ran, and the wrong number after it ran. And, since it was naively incremented when task starts running -- i.e. without regard to why it is running -- we decremented it when deferring or exiting on a reschedule. What I do here is try to remove all of that stuff: no more private _try_number attr no more getter logic no more decrementing no more incrementing as part of task execution Now what we do is increment only when the task is set to scheduled and only when it's not coming out of deferral or "up_for_reschedule". So the try_number will be more stable. It will not change throughout the course of task execution. The only time it will be incremented is when there's legitimately a new try. One consequence of this is that try number will no longer be incremented if you run either airlfow tasks run or ti.run() in isolation. But because airflow assumes that all tasks runs are scheduled by the scheduler, I do not regard this to be a breaking change. If user code or provider code has implemented hacks to get the "right" try_number when looking at it at the wrong time (because previously it gave the wrong answer), unfortunately that code will just have to be patched. There are only two cases I know of in the providers codebase -- openlineage listener, and dbt openlineage. As a courtesy for backcompat we also add property _try_number which is just a proxy for try_number, so you'll still be able to access this attr. But, it will not behave the same as it did before. --------- Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

jedcunningham · 2024-05-10T22:56:41Z

#protm

eladkal · 2024-05-11T18:04:02Z

airflow/providers/dbt/cloud/CHANGELOG.rst

+main
+.....
+
+In Airflow 2.10.0, we fix the way try_number works, so that it no longer returns different values depending on task instance state.  Importantly, after the task is done, it no longer shows current_try + 1. Thus in 3.8.1 we patch this provider to fix try_number references so they no longer adjust for the old, bad behavior.
+


Why is this here?
How is this PR affect dbt provider?
Providers are supported for older versions of Airflow as well. I'm out of context for what this PR achieves but I am not sure if this note in change log is the right place.

What if I upgrade to provider version while I was on Airflow 2.7? Months later when I will upgrade to Airflow 2.10 I might miss that.

We are discussing this very thing in here with @kacpermuda and @mobuchowski :

Run unit tests with airflow installed from packages #39513 (comment)

But also I need @dstandish to confirm if the SMTP change #39513 (comment) - seem that the DBT change was good actually - as it handles back-compatibility nicely.

This is the case for which I started to implement #39513 (for 2.9.0 compatibility) and follow-ups for 2.8 (#39606) and 2.7 (will be next) will be really helpful, because we are going to run the full suite of provider tests for past versions of airlfow once it is complete, so any kind of changes like that when we change both airflow and provider behaviour will have to be handled either as separate PRs or (if they are implemented as single PR) it will have to pass the back-compatibility tests with supported versions of Airflow.

So those kind of changes will be caught very early in the process - in the PR that modifies tests of provider to accomodate for changed behaviour of Airflow (and it will need to be handled by the author).

BTW. Most likely those kind of changes will have to be done as single PRs, because otherwise it will be difficult to coordinate such change in case provider test implicitly depends on some "airflow" internals - like it was in this case where provider behaviour (and tests) are depending on when try_number gets updated.

I am still not clear about this entry. I will hold release for the relevant providers till clarification

Hi @eladkal

This newsfragment is in reference to the change to airflow/providers/dbt/cloud/utils/openlineage.py in this PR.

The old code did this:

# generate same run id of current task instance parent_run_id = OpenLineageAdapter.build_task_instance_run_id( dag_id=task_instance.dag_id, task_id=operator.task_id, execution_date=task_instance.execution_date, try_number=task_instance.try_number - 1, )

Notice that previously it took the "public" try number attr and minused one. Presumably, this is because when this bit of code is fired, the task has completed and the try_number has been incremented (thus showing the "wrong" try_number). So the old code was correcting for the bad behavior of try_number in airflow.

In the present PR, we correct this behavior. The try_number is now only incremented when a new try is scheduled, and it remains unchanged throughout the task try lifecycle. The code in this provider which previously hacked around the bad behavior is now no longer necessary, and if left in there, it will give the wrong result. Hence the introduction of the compat shim _get_try_number which is added to the same module.

Previously, there was a lot of bad stuff happening around try_number. We incremented it when task started running. And because of that, we had this logic to return "_try_number + 1" when task not running. But this gave the "right" try number before it ran, and the wrong number after it ran. And, since it was naively incremented when task starts running -- i.e. without regard to why it is running -- we decremented it when deferring or exiting on a reschedule. What I do here is try to remove all of that stuff: no more private _try_number attr no more getter logic no more decrementing no more incrementing as part of task execution Now what we do is increment only when the task is set to scheduled and only when it's not coming out of deferral or "up_for_reschedule". So the try_number will be more stable. It will not change throughout the course of task execution. The only time it will be incremented is when there's legitimately a new try. One consequence of this is that try number will no longer be incremented if you run either airlfow tasks run or ti.run() in isolation. But because airflow assumes that all tasks runs are scheduled by the scheduler, I do not regard this to be a breaking change. If user code or provider code has implemented hacks to get the "right" try_number when looking at it at the wrong time (because previously it gave the wrong answer), unfortunately that code will just have to be patched. There are only two cases I know of in the providers codebase -- openlineage listener, and dbt openlineage. As a courtesy for backcompat we also add property _try_number which is just a proxy for try_number, so you'll still be able to access this attr. But, it will not behave the same as it did before. --------- Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

Previously we had code to compensate for the fact that we were decrementing try_number when deferring or rescheduling. We can remove this code now. Just missed this in apache#39336.

Previously we had code to compensate for the fact that we were decrementing try_number when deferring or rescheduling. We can remove this code now. Just missed this in #39336.

Previously we had code to compensate for the fact that we were decrementing try_number when deferring or rescheduling. We can remove this code now. Just missed this in apache#39336.

boring-cyborg bot added area:API Airflow's REST/HTTP API area:logging area:plugins area:providers area:Scheduler Scheduler or dag parsing Issues area:serialization area:webserver Webserver related Issues provider:amazon-aws AWS/Amazon - related issues provider:elasticsearch labels Apr 30, 2024

dstandish force-pushed the remove-try-number-shenanigans branch 2 times, most recently from 1d54320 to 310f923 Compare May 1, 2024 20:15

dstandish commented May 1, 2024

View reviewed changes

airflow/models/taskinstance.py Show resolved Hide resolved

dstandish commented May 1, 2024

View reviewed changes

tests/sensors/test_base.py Show resolved Hide resolved

dstandish marked this pull request as ready for review May 1, 2024 22:05

dstandish requested review from eladkal, o-nikolas, ryanahamilton, ashb, bbovenzi, pierrejeambrun, ephraimbuddy, kaxil and XD-DENG as code owners May 1, 2024 22:05

dstandish requested review from uranusjr and Lee-W May 1, 2024 22:33

jedcunningham reviewed May 2, 2024

View reviewed changes

dstandish mentioned this pull request May 2, 2024

Increment try_number while clearing deferred tasks. #38984

Closed

dstandish force-pushed the remove-try-number-shenanigans branch 2 times, most recently from 36fcb5a to ed839d4 Compare May 6, 2024 19:28

dstandish requested a review from mobuchowski as a code owner May 6, 2024 19:28

hussein-awala approved these changes May 8, 2024

View reviewed changes

airflow/jobs/backfill_job_runner.py Show resolved Hide resolved

airflow/jobs/backfill_job_runner.py Outdated Show resolved Hide resolved

airflow/models/dagrun.py Show resolved Hide resolved

dstandish added 3 commits May 8, 2024 14:34

use try_number=1 for k8s synth ti

3428b89

fix test

2277e22

tweaks

da82a1c

dstandish merged commit 3938f71 into apache:main May 9, 2024
67 checks passed

dstandish deleted the remove-try-number-shenanigans branch May 9, 2024 02:56

eladkal reviewed May 11, 2024

View reviewed changes

eladkal mentioned this pull request May 12, 2024

Status of testing Providers that were prepared on May 12, 2024 #39578

Closed

66 tasks

kacpermuda mentioned this pull request May 13, 2024

Run unit tests with airflow installed from packages #39513

Merged

kacpermuda mentioned this pull request May 14, 2024

feat: Add custom provider runtime checks #39609

Merged

dstandish mentioned this pull request May 14, 2024

Remove webserver try_number adjustment #39623

Merged

eladkal mentioned this pull request May 26, 2024

Status of testing Providers that were prepared on May 26, 2024 #39842

Closed

utkarsharma2 added type:improvement Changelog: Improvements type:misc/internal Changelog: Misc changes that should appear in change log and removed type:improvement Changelog: Improvements type:misc/internal Changelog: Misc changes that should appear in change log labels Jun 3, 2024

utkarsharma2 added this to the Airflow 2.10.0 milestone Jun 4, 2024

dstandish mentioned this pull request Jun 25, 2024

Ensure try_number incremented for empty operator #40426

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scheduler to handle incrementing of try_number #39336

Scheduler to handle incrementing of try_number #39336

dstandish commented Apr 30, 2024 •

edited

Loading

jedcunningham left a comment

hussein-awala left a comment

dstandish commented May 9, 2024

potiuk commented May 9, 2024

dstandish commented May 9, 2024

potiuk commented May 9, 2024

dstandish commented May 9, 2024 •

edited

Loading

jedcunningham commented May 10, 2024

eladkal May 11, 2024 •

edited

Loading

potiuk May 14, 2024 •

edited

Loading

potiuk May 14, 2024 •

edited

Loading

eladkal May 20, 2024

dstandish May 20, 2024

Scheduler to handle incrementing of try_number #39336

Scheduler to handle incrementing of try_number #39336

Conversation

dstandish commented Apr 30, 2024 • edited Loading

jedcunningham left a comment

Choose a reason for hiding this comment

hussein-awala left a comment

Choose a reason for hiding this comment

dstandish commented May 9, 2024

potiuk commented May 9, 2024

dstandish commented May 9, 2024

potiuk commented May 9, 2024

dstandish commented May 9, 2024 • edited Loading

jedcunningham commented May 10, 2024

eladkal May 11, 2024 • edited Loading

Choose a reason for hiding this comment

potiuk May 14, 2024 • edited Loading

Choose a reason for hiding this comment

potiuk May 14, 2024 • edited Loading

Choose a reason for hiding this comment

eladkal May 20, 2024

Choose a reason for hiding this comment

dstandish May 20, 2024

Choose a reason for hiding this comment

dstandish commented Apr 30, 2024 •

edited

Loading

dstandish commented May 9, 2024 •

edited

Loading

eladkal May 11, 2024 •

edited

Loading

potiuk May 14, 2024 •

edited

Loading

potiuk May 14, 2024 •

edited

Loading