[AIRFLOW-1424] make the next execution date of DAGs visible #2460
The scheduler's DAG run creation logic can be tricky, and it is easy to get confused by the start_date + interval and period-end way of thinking about scheduling. It would ease Airflow's usage to add a *next execution* field to DAGs so that we can easily see the (in)famous *period end* after which the scheduler will create a new DAG run for our workflows. These patches are a simple way to implement this on the DAG model and use it in the interface.
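To illustrate the period-end idea described above, here is a minimal sketch in plain Python. It assumes a fixed `timedelta` interval (real DAGs may use cron expressions), and the function name `period_end` is hypothetical: it is not part of this patch or of Airflow.

```python
from datetime import datetime, timedelta


def period_end(execution_date: datetime, interval: timedelta) -> datetime:
    """Return the moment at which a run for `execution_date` becomes due.

    Assumption for illustration: the run that covers the period starting at
    `execution_date` is only created once that period has ended.
    """
    return execution_date + interval


start_date = datetime(2015, 12, 1)
interval = timedelta(hours=1)

# The first run has execution_date == start_date, but it is only created
# once the first hourly period is over.
first_trigger = period_end(start_date, interval)
print(first_trigger)
```

The gap between `execution_date` and the moment the run is actually created is what a *next execution* field would make visible.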
@ultrabug, thanks for your PR! By analyzing the history of the files in this pull request, we identified @mistercrunch, @bolkedebruin and @jlowin to be potential reviewers.
This PR replaces #2457, thanks for your consideration.
As a side note, if this PR is merged I will open another one to clean up the code of the
Codecov Report
@@ Coverage Diff @@
## master #2460 +/- ##
==========================================
+ Coverage 69.4% 69.47% +0.07%
==========================================
Files 146 146
Lines 11289 11306 +17
==========================================
+ Hits 7835 7855 +20
+ Misses 3454 3451 -3
Good idea. However, I'd prefer to avoid adding a new column to the DAGs table; it is already too wide for smaller screens. Maybe it would make sense to add the next execution date as a tooltip for the schedule column values instead?
@pdambrauskas thanks.
I think this is worth seeing without a tooltip.
That's a responsiveness problem, right? ;) But still, maybe we could add this info to the "schedule" column: one label with the interval, and another one under it with the next execution date?
Will wait for another opinion on the web UI before updating the PR.
I would find this really helpful for our operations as well (randomly chiming in after reading over this ticket). Though I'm not sure how to best add this information in terms of presentation.. Not any help but just reaffirming this feature request would benefit others too, namely me :D |
This patch addresses AIRFLOW-1424, but hasn't yet been merged. See apache#2460 for context
Do we hope to see this any time soon?
I would love that too, obviously :) Friendly ping @bolkedebruin
This would be so useful for me too! Can we have an update on this @bolkedebruin @pdambrauskas @mistercrunch please?
@ultrabug you have a merge conflict, please rebase this sweet patch
There is a CLI command merged to master by @XD-DENG that shows the next execution date. Also duplicate Jira ticket for this:
I like the feature and it will be very useful.
As @ron819 mentioned, there is (now) similar logic in the CLI that should be refactored to use these methods.
Additionally, there is probably similar logic in get_template_context that should also be reworked to use these methods.
airflow/models.py
@property
def next_execution_date(self):
    """
    Returns the next execution date at which the dag will be scheduled by
It's not clear to me how these two methods differ. What's the intent behind these two methods?
The code base and terminology make a distinction between run dates and execution dates.
I may be mistaken, but run is related to the DAG while execution is related to the scheduler. So the actual execution date (which is what we want to know here) is derived from the next run date of the DAG.
That's why the logic here adds a next_run_date property, which is derived from the DAG's tasks, while the following next_execution_date uses this next_run_date property to calculate its own execution time.
I just re-read the code and it still looks valid to me. If not, tell me and I'll adjust my mistakes!
Cheers
I don't think there is a distinction between execution date and run date. It's not something I'm familiar with, and to my mind they were two names for the same concept.
I think the overall logic is roughly this:
- If there are any previous dag runs for the dag, then the next date is following_schedule(latest_dag_run.execution_date) (possibly filtering out "adhoc" or backfilled dagruns. Not sure)
- Otherwise it is start date
I think it's that simple. I'm not 100% familiar with all the details of the scheduler so I'm not 100% sure. This is what the normalize_schedule function does.
This bit of jobs.py seems to be relevant https://github.com/apache/incubator-airflow/blob/e53182cf83501a789fbdd74ee472d29052a5e321/airflow/jobs.py#L840-L851
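The two-branch logic described in this comment can be sketched roughly as follows. This is a simplified illustration, not the scheduler's actual code: it assumes a fixed `timedelta` interval instead of a cron schedule, and `next_run_date_sketch` is a hypothetical name.

```python
from datetime import datetime, timedelta
from typing import Optional


def next_run_date_sketch(
    latest_execution_date: Optional[datetime],
    start_date: Optional[datetime],
    interval: timedelta,
) -> Optional[datetime]:
    """Simplified stand-in for the logic discussed in this thread.

    - If a previous run exists, the next date follows the latest one.
    - Otherwise the next date is the start date (None if there is none).
    """
    if latest_execution_date is not None:
        # Plays the role of following_schedule(latest_dag_run.execution_date)
        return latest_execution_date + interval
    return start_date


daily = timedelta(days=1)
print(next_run_date_sketch(datetime(2019, 1, 1), datetime(2018, 1, 1), daily))
print(next_run_date_sketch(None, datetime(2018, 1, 1), daily))
```

Filtering out ad hoc or backfilled runs, which the comment mentions as a possibility, is deliberately left out of this sketch.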
This is exactly the logic I've followed and you can see that my proposed implementation is based on the scheduler's jobs.py code indeed
I can shrink the next_run_date function into one to make it simple tho indeed, gonna update
@XD-DENG as you see the scheduler does otherwise and having None as a result looks strange to me since there's always a period end at which the scheduler itself will execute the DAG (if it has a start_date that is).
So just because no previous execution has happened doesn't mean there won't be any, right? And the scheduler code above shows how it does it.
Still, @XD-DENG I think your link has a nice example and I'll make sure to validate the fact that this PR behaves exactly as the documentation says. Sound good to you?
Yeah this would be great to have! To be clear: the implementation should live in the model (as you have in this PR) and be called from multiple places.
OK @ashb, I understand your clarification, and since you're a member of the project I'll work on this and update this PR in the hope that it will be considered. I hope this time we make it :)
Hello, I've rebased the branch onto current master and modified the CLI to use the new property so that everything is clean. LGTM now, tested OK
Thanks @ultrabug. Unfortunately Travis isn't entirely happy:
Thanks for pointing this out @Fokko, I totally forgot to check the cli test suite
airflow/models.py
next_run_date = None
if not self.latest_execution_date:
    # First run
    task_start_dates = [t.start_date for t in self.tasks]
I'm not sure if this function will make sense when latest_execution_date is not available.
I understand you're trying to decide the next run using start_date. But catchup of the DAG can be either True or False, and it will affect the actual next execution date (ref: https://airflow.apache.org/scheduler.html?highlight=backfill#backfill-and-catchup).
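To make the catchup concern concrete, here is a rough model of how the first execution date could differ between the two settings. It is a sketch under stated assumptions, not the scheduler's implementation: it uses a fixed `timedelta` interval aligned on `start_date`, and `first_execution_date` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def first_execution_date(
    start_date: datetime,
    interval: timedelta,
    now: datetime,
    catchup: bool,
) -> datetime:
    """Model the execution date of the first run the scheduler would create.

    Assumption: with catchup=True every period since start_date is scheduled,
    so the first run is at start_date. With catchup=False only the most
    recent fully elapsed period is scheduled.
    """
    if catchup or now < start_date + interval:
        return start_date
    # Number of complete periods that have elapsed since start_date
    elapsed = (now - start_date) // interval
    # The latest completed period starts one interval before the last boundary
    return start_date + (elapsed - 1) * interval


start = datetime(2015, 12, 1)
hourly = timedelta(hours=1)
now = datetime(2015, 12, 1, 5, 30)

print(first_execution_date(start, hourly, now, catchup=True))
print(first_execution_date(start, hourly, now, catchup=False))
```

Under this model the same DAG yields different first execution dates depending on `catchup`, which is why a value derived from `start_date` alone may not match what the scheduler does.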
Interesting to see this: the author of the CLI implementation assumed that there can't be a next schedule if there has not been any execution yet
    # `next_execution` function is inapplicable if no execution record found
    # It prints `None` in such cases
    self.assertEqual(stdout[-1], "None")
which contradicts my implementation where, when this is the case, I derive the first planned run date from the tasks and normalize it to calculate the actual first execution date:
    @property
    def next_run_date(self):
        """
        Returns the next run date for which the dag will be scheduled
        """
        next_run_date = None
        if not self.latest_execution_date:
            # First run
            task_start_dates = [t.start_date for t in self.tasks]
            if task_start_dates:
                next_run_date = self.normalize_schedule(min(task_start_dates))
@ashb @Fokko and all others, I must admit that I'm not sure which point of view is the right one here. Since the CLI implementation has already been accepted in the code base, does that mean that it's right?
Codecov Report
@@ Coverage Diff @@
## master #2460 +/- ##
==========================================
+ Coverage 76.67% 76.69% +0.02%
==========================================
Files 199 199
Lines 16212 16233 +21
==========================================
+ Hits 12430 12450 +20
- Misses 3782 3783 +1
@XD-DENG bad news: the example DAG in the documentation breaks the scheduler on master, so even the documentation is wrong. On a fresh installation, if I run the scheduler using the DAG:
"""
Code that goes along with the Airflow tutorial located at:
https://github.com/airbnb/airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2015, 12, 1),
'email': ['airflow@example.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
'schedule_interval': '@hourly',
}
dag = DAG('tutorial', catchup=False, default_args=default_args)
nothing happens, the scheduler does not pick up anything. Now if I change the catchup parameter to
dag = DAG('tutorial', catchup=True, default_args=default_args)
I get the scheduler failing with
That None result annoys even the scheduler :) EDIT: quoting the documentation for expected behavior
To be complete: of course, if I add a DummyOperator task to the example DAG, it no longer fails, since it can work out an actual start_date thanks to the logic from jobs.py... which leads me to think that this "find start_date from tasks" logic is important to keep.
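The "find start_date from tasks" behaviour can be illustrated with a small stand-alone sketch. The `Task` class below is a hypothetical stand-in for an Airflow operator, and `earliest_task_start_date` is an illustrative name, not a function from the patch.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Task:
    """Hypothetical stand-in for an operator carrying its own start_date."""
    task_id: str
    start_date: datetime


def earliest_task_start_date(tasks: List[Task]) -> Optional[datetime]:
    """Return the earliest start_date among the tasks, or None without tasks."""
    task_start_dates = [t.start_date for t in tasks]
    if task_start_dates:
        return min(task_start_dates)
    # A DAG without tasks gives the scheduler nothing to derive a date from
    return None


empty_dag_tasks: List[Task] = []
one_task = [Task("dummy", datetime(2015, 12, 1))]

print(earliest_task_start_date(empty_dag_tasks))
print(earliest_task_start_date(one_task))
```

This mirrors the observation in the comment: with no tasks there is nothing to derive a date from, and adding a single task is enough to produce one.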
@ryw Well, as you can see, the least we can say is that I've been patient and faithful so far :) What I don't want is to put effort into this PR yet again and see it rot for another year, so if you give me your word that it will actually be merged and released, I'll gladly make the time to help the airflow community.
I'll personally see it through :)
@ryanahamilton maybe you can look at this a bit too - adding a column will stress an already stressed UI, and this is near the work you're doing now #10556
I wonder if this can be conflicted / redundant with the idea of #5787
@ryw thanks. Taking into account the last comment from @JeffryMAC in #10556 and what @eladkal is pointing out on #5787, I think that I should wait until you guys settle on what you want first, if that's okay with you @ryw? Otherwise I have the feeling that this PR will again stand in the middle of a debate, so I'd rather have it settled first ;)
@ultrabug @ryw FYI, I just made a PR to correct the "Start Date" tooltip to actually show the real start date (it was showing the execution date in RBAC UI) #10637, which fixed the redundancy issue. Also, maybe an alternative to a new column can be just a 2nd line in that tooltip to show the next execution? I have dealt with making changes to the json that provides the data for the Last Run column, so I am open to working on adding this in there or guiding someone on how to do it if needed (also the PR I just linked to clearly shows how to add fields to the json in both the
@ultrabug Going over this whole thread, I want some clarification before I dive deeper into developing this:
Also, regarding putting this in the "Last Run" column: if it should show the next execution date when the DAG has not run yet, then I think the "Last Run" column might not be the best place for this tooltip, because the column is currently empty in that case. Maybe it should go in the "Schedule" column, as discussed way back when this PR started?
Another snag I ran into is I tried to replicate this functionality on top of the latest master branch changes: I can't figure out how to access the I guess an alternative would be to make this asynchronous and not use the property directly in the template. |
Hello @alexbegg
As for your last comment, it's too long ago for me to get a proper answer. Did you find out something better since?
@ultrabug I am a bit too busy to look further into this today, but I will try and take a look this weekend. I have a WIP branch if you want to take a look. I am trying to add in your changes on top of the latest master, but just for testing purposes I am trying to show the two new properties below the schedule link (currently both blank due to issues mentioned in my last comment). When I get it working I'll move the date to its final place as a tooltip. master...alexbegg:add-next-run
@alexbegg I've tried running your fork locally but I failed to get anything displayed on the label... I tried to debug this without success, so I don't know if something changed in airflow 2 that prevents a simple property from being displayed on the webserver interface.
@ryw ping; 1 month later.
Hi @ryw; just checking in again to see if the moment is right. I promised to commit to this PR so here I am again.
@ryanahamilton maybe you can have a look
Let's target 2.2 with this. @ultrabug can you handle the merge conflicts?
mmm I think the reason behind this PR will no longer be valid after AIP-39 Richer scheduler_interval. In any case I suggest at least to wait until AIP-39 is completed and then revisit to see if this PR brings additional value.
Yeah, fixing this feature request (if not this PR) is on the cards for AIP-39 https://github.com/apache/airflow/projects/10#card-61511156 (quite literally)
Dear Airflow maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Tests
Tests are provided
Commits