Scheduler to handle incrementing of try_number #39336
Overall looks good!
@SamWheating, you might be a good reviewer on this one as well.
Great work! Working with try_number was confusing for many contributors; this change makes it easier.
I just added some nits, but it looks good to merge. LGTM 🚀
alright, yolo, here we go thanks @ashb @hussein-awala @jedcunningham
I think we will find out when it is released :). But I also think it's worth taking the risk to finally get rid of that one. BTW, shall we mark it for 2.9.2? I think we can treat it as a bug fix?
I feel like it is not unreasonable to think of it as a bug fix. But I think the point of that convention is that patch releases should get more stable, not less. So since this has some stability risk, it may violate the spirit of the convention to include it. Though there could also be cherry-pick convenience arguments that force it. If it doesn't go in 2.9.x then incidentally it may not ever see a release because of Jed's coming changes re TIs :)
Maybe just a bit more manual test scrutiny for 2.9.2 around try_num will be
I do plan to do more manual testing. I had some trouble with backfill but thought I got it no worse than main (backfill generally, i.e. separate from this change, has issues and doesn't work right in all scenarios). But I plan to do a little more. I don't personally object to including it. Non-backfill is simpler.
Previously, there was a lot of bad stuff happening around try_number.
We incremented it when the task started running. And because of that, we had this logic to return "_try_number + 1" when the task was not running. But this gave the "right" try number before it ran, and the wrong number after it ran. And, since it was naively incremented when the task starts running -- i.e. without regard to why it is running -- we decremented it when deferring or exiting on a reschedule.
What I do here is try to remove all of that stuff:
- no more private _try_number attr
- no more getter logic
- no more decrementing
- no more incrementing as part of task execution
Now what we do is increment only when the task is set to scheduled and only when it's not coming out of deferral or "up_for_reschedule". So the try_number will be more stable. It will not change throughout the course of task execution. The only time it will be incremented is when there's legitimately a new try.
One consequence of this is that try number will no longer be incremented if you run either airflow tasks run or ti.run() in isolation. But because Airflow assumes that all task runs are scheduled by the scheduler, I do not regard this to be a breaking change.
If user code or provider code has implemented hacks to get the "right" try_number when looking at it at the wrong time (because previously it gave the wrong answer), unfortunately that code will just have to be patched. There are only two cases I know of in the providers codebase -- openlineage listener, and dbt openlineage.
As a courtesy for backcompat we also add property _try_number, which is just a proxy for try_number, so you'll still be able to access this attr. But it will not behave the same as it did before.
---------
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
#protm
main
.....

In Airflow 2.10.0, we fix the way try_number works, so that it no longer returns different values depending on task instance state. Importantly, after the task is done, it no longer shows current_try + 1. Thus in 3.8.1 we patch this provider to fix try_number references so they no longer adjust for the old, bad behavior.
Why is this here?
How does this PR affect the dbt provider?
Providers are supported for older versions of Airflow as well. I'm out of context for what this PR achieves, but I am not sure that a note in the changelog is the right place for this.
What if I upgrade to this provider version while I am on Airflow 2.7? Months later, when I upgrade to Airflow 2.10, I might miss that.
We are discussing this very thing in here with @kacpermuda and @mobuchowski :
But I also need @dstandish to confirm the SMTP change #39513 (comment). It seems that the DBT change was actually good, as it handles back-compatibility nicely.
This is the case for which I started to implement #39513 (for 2.9.0 compatibility). It and the follow-ups for 2.8 (#39606) and 2.7 (will be next) will be really helpful, because we are going to run the full suite of provider tests against past versions of Airflow once it is complete. Any change like this, where we change both Airflow and provider behaviour, will have to be handled either as separate PRs or, if implemented as a single PR, it will have to pass the back-compatibility tests with supported versions of Airflow.
So those kinds of changes will be caught very early in the process: in the PR that modifies the provider's tests to accommodate the changed behaviour of Airflow (and they will need to be handled by the author).
BTW, most likely those kinds of changes will have to be done as single PRs, because otherwise it will be difficult to coordinate such a change when a provider test implicitly depends on some Airflow internals, as in this case, where provider behaviour (and tests) depend on when try_number gets updated.
I am still not clear about this entry. I will hold release for the relevant providers till clarification
Hi @eladkal
This newsfragment is in reference to the change to airflow/providers/dbt/cloud/utils/openlineage.py in this PR.
The old code did this:
    # generate same run id of current task instance
    parent_run_id = OpenLineageAdapter.build_task_instance_run_id(
        dag_id=task_instance.dag_id,
        task_id=operator.task_id,
        execution_date=task_instance.execution_date,
        try_number=task_instance.try_number - 1,
    )
Notice that previously it took the "public" try_number attr and subtracted one. Presumably, this is because when this bit of code fires, the task has completed and the try_number has been incremented (thus showing the "wrong" try_number). So the old code was correcting for the bad behavior of try_number in Airflow.
In the present PR, we correct this behavior. The try_number is now only incremented when a new try is scheduled, and it remains unchanged throughout the task try lifecycle. The code in this provider which previously hacked around the bad behavior is no longer necessary, and if left in place, it will give the wrong result. Hence the introduction of the compat shim _get_try_number, which is added to the same module.
Previously we had code to compensate for the fact that we were decrementing try_number when deferring or rescheduling. We can remove this code now. Just missed this in apache#39336.
Previously, there was a lot of bad stuff happening around try_number.
We incremented it when the task started running. And because of that, we had this logic to return "_try_number + 1" when the task was not running. But this gave the "right" try number before it ran, and the wrong number after it ran. And, since it was naively incremented when the task starts running -- i.e. without regard to why it is running -- we decremented it when deferring or exiting on a reschedule.
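For illustration only, the old behavior can be modelled with a toy class. OldStyleTI is a hypothetical stand-in, not Airflow's TaskInstance, and the state names are plain strings chosen for this sketch.

```python
class OldStyleTI:
    """Toy model of the old try_number behavior (hypothetical, for illustration)."""

    def __init__(self):
        self._try_number = 0
        self.state = None

    @property
    def try_number(self):
        # Old getter logic: the stored value while running, stored value + 1 otherwise.
        if self.state == "running":
            return self._try_number
        return self._try_number + 1

    def start(self):
        # Old behavior: increment as part of task execution.
        self._try_number += 1
        self.state = "running"

    def finish(self):
        self.state = "success"


ti = OldStyleTI()
before = ti.try_number  # looks right: the upcoming try
ti.start()
during = ti.try_number  # same value while running
ti.finish()
after = ti.try_number   # one too high for the try that just ran
```

In this model the three reads give 1, 1 and 2, which is the "right before, wrong after" problem described above.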
What I do here is try to remove all of that stuff:
- no more private _try_number attr
- no more getter logic
- no more decrementing
- no more incrementing as part of task execution
Now what we do is increment only when the task is set to scheduled and only when it's not coming out of deferral or "up_for_reschedule". So the try_number will be more stable. It will not change throughout the course of task execution. The only time it will be incremented is when there's legitimately a new try.
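The new rule can be sketched with a small simulation. This is a simplified model under stated assumptions, not the scheduler's actual code: ToyTI, set_scheduled and NO_NEW_TRY_STATES are hypothetical names, and "coming out of deferral" is modelled simply as the previous state being "deferred".

```python
from dataclasses import dataclass
from typing import Optional

# Assumption for this sketch: resuming from these states is not a new try.
NO_NEW_TRY_STATES = frozenset({"deferred", "up_for_reschedule"})


@dataclass
class ToyTI:
    """Hypothetical stand-in for a task instance."""
    try_number: int = 0
    state: Optional[str] = None


def set_scheduled(ti: ToyTI) -> None:
    """Increment try_number only when scheduling a genuinely new try."""
    if ti.state not in NO_NEW_TRY_STATES:
        ti.try_number += 1
    ti.state = "scheduled"


ti = ToyTI()
history = []

set_scheduled(ti)                 # first try
history.append(ti.try_number)

ti.state = "deferred"             # task deferred itself mid-try
set_scheduled(ti)                 # resuming: same try
history.append(ti.try_number)

ti.state = "up_for_reschedule"    # e.g. a sensor in reschedule mode
set_scheduled(ti)                 # resuming: same try
history.append(ti.try_number)

ti.state = "up_for_retry"         # the try failed and a retry is due
set_scheduled(ti)                 # legitimately a new try
history.append(ti.try_number)
```

Nothing in the running, deferring or rescheduling paths touches try_number in this model, so there is nothing to decrement later.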
One consequence of this is that try number will no longer be incremented if you run either airflow tasks run or ti.run() in isolation. But because Airflow assumes that all task runs are scheduled by the scheduler, I do not regard this to be a breaking change.
If user code or provider code has implemented hacks to get the "right" try_number when looking at it at the wrong time (because previously it gave the wrong answer), unfortunately that code will just have to be patched. There are only two cases I know of in the providers codebase -- openlineage listener, and dbt openlineage.
As a courtesy for backcompat we also add property _try_number, which is just a proxy for try_number, so you'll still be able to access this attr. But it will not behave the same as it did before.
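A minimal sketch of such a proxy, assuming a read-only property is enough for the point being made. ProxyTI is a hypothetical class for illustration; the real property on the task instance may differ in details such as deprecation warnings or a setter.

```python
class ProxyTI:
    """Hypothetical stand-in showing _try_number as a plain proxy."""

    def __init__(self, try_number: int = 1):
        self.try_number = try_number

    @property
    def _try_number(self) -> int:
        # Back-compat alias: no state-dependent logic, just the public value.
        return self.try_number


ti = ProxyTI(try_number=1)
proxied = ti._try_number

ti.try_number = 2          # e.g. after a retry has been scheduled
proxied_after_retry = ti._try_number
```

Code that read _try_number to get "one less than try_number" for a finished task will therefore see a different value than before, which is why such code needs patching.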