Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduler to handle incrementing of try_number #39336

Merged
merged 53 commits into from
May 9, 2024

Conversation

dstandish
Copy link
Contributor

@dstandish dstandish commented Apr 30, 2024

Previously, there was a lot of bad stuff happening around try_number.

We incremented it when task started running. And because of that, we had this logic to return "_try_number + 1" when task not running. But this gave the "right" try number before it ran, and the wrong number after it ran. And, since it was naively incremented when task starts running -- i.e. without regard to why it is running -- we decremented it when deferring or exiting on a reschedule.

What I do here is try to remove all of that stuff:

  • no more private _try_number attr
  • no more getter logic
  • no more decrementing
  • no more incrementing as part of task execution

Now what we do is increment only when the task is set to scheduled and only when it's not coming out of deferral or "up_for_reschedule". So the try_number will be more stable. It will not change throughout the course of task execution. The only time it will be incremented is when there's legitimately a new try.

One consequence of this is that try number will no longer be incremented if you run either airlfow tasks run or ti.run() in isolation. But because airflow assumes that all tasks runs are scheduled by the scheduler, I do not regard this to be a breaking change.

If user code or provider code has implemented hacks to get the "right" try_number when looking at it at the wrong time (because previously it gave the wrong answer), unfortunately that code will just have to be patched. There are only two cases I know of in the providers codebase -- openlineage listener, and dbt openlineage.

As a courtesy for backcompat we also add property _try_number which is just a proxy for try_number, so you'll still be able to access this attr. But, it will not behave the same as it did before.

@boring-cyborg boring-cyborg bot added area:API Airflow's REST/HTTP API area:logging area:plugins area:providers area:Scheduler Scheduler or dag parsing Issues area:serialization area:webserver Webserver related Issues provider:amazon-aws AWS/Amazon - related issues provider:elasticsearch labels Apr 30, 2024
@dstandish dstandish force-pushed the remove-try-number-shenanigans branch 2 times, most recently from 1d54320 to 310f923 Compare May 1, 2024 20:15
Copy link
Member

@jedcunningham jedcunningham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good!

@SamWheating, you might be a good reviewer on this one as well.

@dstandish dstandish force-pushed the remove-try-number-shenanigans branch 2 times, most recently from 36fcb5a to ed839d4 Compare May 6, 2024 19:28
Copy link
Member

@hussein-awala hussein-awala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Working with try_number was confusing for many contributors, this change helps to make it easier.

I just added some nits, but it looks good to merge. LGTM 🚀

airflow/jobs/backfill_job_runner.py Show resolved Hide resolved
airflow/jobs/backfill_job_runner.py Outdated Show resolved Hide resolved
airflow/models/dagrun.py Show resolved Hide resolved
@dstandish
Copy link
Contributor Author

alright, yolo, here we go

thanks @ashb @hussein-awala @jedcunningham

@dstandish dstandish merged commit 3938f71 into apache:main May 9, 2024
67 checks passed
@dstandish dstandish deleted the remove-try-number-shenanigans branch May 9, 2024 02:56
@potiuk
Copy link
Member

potiuk commented May 9, 2024

A few nits, otherwise it looks good, and I don't know why we (I) did it this terrible hacky way in the first place!

I think we will find out when it is released :). But I also think it's worth to take risk to finally get rid of that one.

BTW. Shall we mark it for 2.9.2 ? I think we can treat it as a bug-fix ?

@dstandish
Copy link
Contributor Author

I feel like it is not unreasonable to think of it as bug fix. But I think the point of that convention is that patch releases should get more stable not less. So since this has some stability risk it may violate spirit of convention to include it. Though there could also be cherry pick convenience arguments that force it. If it doesn’t go in 2.9.x then incidentally it may not ever see a release because of Jed’s coming changes re TIs :)

@potiuk
Copy link
Member

potiuk commented May 9, 2024

Maybe just a bit more manual test scrutiny for 2.9.2 around try_num will be enough of stability ? I think the sooner we get it out, the easier will be to fix any potential problems there.

@dstandish
Copy link
Contributor Author

dstandish commented May 9, 2024

I do plan to do more manual testing. I had some trouble with backfill but thought i got it no worse than main (backfill generally (ie separate from this change ) has issues and doesn’t work right in all scenarios). But plan to do a little more. I don’t personally object to including it.

Non backfill is simpler

RodrigoGanancia pushed a commit to RodrigoGanancia/airflow that referenced this pull request May 10, 2024
Previously, there was a lot of bad stuff happening around try_number.

We incremented it when task started running. And because of that, we had this logic to return "_try_number + 1" when task not running. But this gave the "right" try number before it ran, and the wrong number after it ran. And, since it was naively incremented when task starts running -- i.e. without regard to why it is running -- we decremented it when deferring or exiting on a reschedule.

What I do here is try to remove all of that stuff:

no more private _try_number attr
no more getter logic
no more decrementing
no more incrementing as part of task execution
Now what we do is increment only when the task is set to scheduled and only when it's not coming out of deferral or "up_for_reschedule". So the try_number will be more stable. It will not change throughout the course of task execution. The only time it will be incremented is when there's legitimately a new try.

One consequence of this is that try number will no longer be incremented if you run either airlfow tasks run or ti.run() in isolation. But because airflow assumes that all tasks runs are scheduled by the scheduler, I do not regard this to be a breaking change.

If user code or provider code has implemented hacks to get the "right" try_number when looking at it at the wrong time (because previously it gave the wrong answer), unfortunately that code will just have to be patched. There are only two cases I know of in the providers codebase -- openlineage listener, and dbt openlineage.

As a courtesy for backcompat we also add property _try_number which is just a proxy for try_number, so you'll still be able to access this attr. But, it will not behave the same as it did before.

---------

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
@jedcunningham
Copy link
Member

#protm

Comment on lines +31 to +35
main
.....

In Airflow 2.10.0, we fix the way try_number works, so that it no longer returns different values depending on task instance state. Importantly, after the task is done, it no longer shows current_try + 1. Thus in 3.8.1 we patch this provider to fix try_number references so they no longer adjust for the old, bad behavior.

Copy link
Contributor

@eladkal eladkal May 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this here?
How is this PR affect dbt provider?
Providers are supported for older versions of Airflow as well. I'm out of context for what this PR achieves but I am not sure if this note in change log is the right place.

What if I upgrade to provider version while I was on Airflow 2.7? Months later when I will upgrade to Airflow 2.10 I might miss that.

Copy link
Member

@potiuk potiuk May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are discussing this very thing in here with @kacpermuda and @mobuchowski :

But also I need @dstandish to confirm if the SMTP change #39513 (comment) - seem that the DBT change was good actually - as it handles back-compatibility nicely.

This is the case for which I started to implement #39513 (for 2.9.0 compatibility) and follow-ups for 2.8 (#39606) and 2.7 (will be next) will be really helpful, because we are going to run the full suite of provider tests for past versions of airlfow once it is complete, so any kind of changes like that when we change both airflow and provider behaviour will have to be handled either as separate PRs or (if they are implemented as single PR) it will have to pass the back-compatibility tests with supported versions of Airflow.

So those kind of changes will be caught very early in the process - in the PR that modifies tests of provider to accomodate for changed behaviour of Airflow (and it will need to be handled by the author).

Copy link
Member

@potiuk potiuk May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW. Most likely those kind of changes will have to be done as single PRs, because otherwise it will be difficult to coordinate such change in case provider test implicitly depends on some "airflow" internals - like it was in this case where provider behaviour (and tests) are depending on when try_number gets updated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still not clear about this entry. I will hold release for the relevant providers till clarification

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @eladkal

This newsfragment is in reference to the change to airflow/providers/dbt/cloud/utils/openlineage.py in this PR.

The old code did this:

        # generate same run id of current task instance
        parent_run_id = OpenLineageAdapter.build_task_instance_run_id(
            dag_id=task_instance.dag_id,
            task_id=operator.task_id,
            execution_date=task_instance.execution_date,
            try_number=task_instance.try_number - 1,
        )

Notice that previously it took the "public" try number attr and minused one. Presumably, this is because when this bit of code is fired, the task has completed and the try_number has been incremented (thus showing the "wrong" try_number). So the old code was correcting for the bad behavior of try_number in airflow.

In the present PR, we correct this behavior. The try_number is now only incremented when a new try is scheduled, and it remains unchanged throughout the task try lifecycle. The code in this provider which previously hacked around the bad behavior is now no longer necessary, and if left in there, it will give the wrong result. Hence the introduction of the compat shim _get_try_number which is added to the same module.

pateash pushed a commit to pateash/airflow that referenced this pull request May 13, 2024
Previously, there was a lot of bad stuff happening around try_number.

We incremented it when task started running. And because of that, we had this logic to return "_try_number + 1" when task not running. But this gave the "right" try number before it ran, and the wrong number after it ran. And, since it was naively incremented when task starts running -- i.e. without regard to why it is running -- we decremented it when deferring or exiting on a reschedule.

What I do here is try to remove all of that stuff:

no more private _try_number attr
no more getter logic
no more decrementing
no more incrementing as part of task execution
Now what we do is increment only when the task is set to scheduled and only when it's not coming out of deferral or "up_for_reschedule". So the try_number will be more stable. It will not change throughout the course of task execution. The only time it will be incremented is when there's legitimately a new try.

One consequence of this is that try number will no longer be incremented if you run either airlfow tasks run or ti.run() in isolation. But because airflow assumes that all tasks runs are scheduled by the scheduler, I do not regard this to be a breaking change.

If user code or provider code has implemented hacks to get the "right" try_number when looking at it at the wrong time (because previously it gave the wrong answer), unfortunately that code will just have to be patched. There are only two cases I know of in the providers codebase -- openlineage listener, and dbt openlineage.

As a courtesy for backcompat we also add property _try_number which is just a proxy for try_number, so you'll still be able to access this attr. But, it will not behave the same as it did before.

---------

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
dstandish added a commit to astronomer/airflow that referenced this pull request May 14, 2024
Previously we had code to compensate for the fact that we were decrementing try_number when deferring or rescheduling.  We can remove this code now.  Just missed this in apache#39336.
dstandish added a commit that referenced this pull request May 14, 2024
Previously we had code to compensate for the fact that we were decrementing try_number when deferring or rescheduling.  We can remove this code now.  Just missed this in #39336.
RNHTTR pushed a commit to RNHTTR/airflow that referenced this pull request Jun 1, 2024
Previously we had code to compensate for the fact that we were decrementing try_number when deferring or rescheduling.  We can remove this code now.  Just missed this in apache#39336.
@utkarsharma2 utkarsharma2 added type:improvement Changelog: Improvements type:misc/internal Changelog: Misc changes that should appear in change log and removed type:improvement Changelog: Improvements type:misc/internal Changelog: Misc changes that should appear in change log labels Jun 3, 2024
@utkarsharma2 utkarsharma2 added this to the Airflow 2.10.0 milestone Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:API Airflow's REST/HTTP API area:logging area:plugins area:providers area:Scheduler Scheduler or dag parsing Issues area:serialization area:webserver Webserver related Issues full tests needed We need to run full set of tests for this PR to merge provider:amazon-aws AWS/Amazon - related issues provider:elasticsearch type:improvement Changelog: Improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants