Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-1424] make the next execution date of DAGs visible #2460

Closed
wants to merge 6 commits into from
Closed

[AIRFLOW-1424] make the next execution date of DAGs visible #2460

wants to merge 6 commits into from

Conversation

ultrabug
Copy link
Contributor

Dear Airflow maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots of any UI changes:

The scheduler's DAG run creation logic can be tricky and one is
easily confused with the start_date + interval
and period end scheduling way of thinking.

It would ease airflow's usage to add a next execution field to DAGs
so that we can very easily see the (un)famous period end after which
the scheduler will create a new DAG run for our workflows.

These patches are a simple way to implement this on the DAG model
and make use of this in the interface.

2017-07-20-221839_1160x357_scrot

2017-07-20-221854_1155x302_scrot

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Tests are provided

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

The scheduler's DAG run creation logic can be tricky and one is
easily confused with the start_date + interval
and period end scheduling way of thinking.

It would ease airflow's usage to add a *next execution* field to DAGs
so that we can very easily see the (un)famous *period end* after which
the scheduler will create a new DAG run for our workflows.

These patches are a simple way to implement this on the DAG model
and make use of this in the interface.
@mention-bot
Copy link

@ultrabug, thanks for your PR! By analyzing the history of the files in this pull request, we identified @mistercrunch, @bolkedebruin and @jlowin to be potential reviewers.

@ultrabug
Copy link
Contributor Author

This PR replaces #2457, thanks for your consideration.

@ultrabug
Copy link
Contributor Author

As a side note, if this PR is merged I will open another one to clean up the code of the create_dag_run method in the jobs.py file (which basically duplicates code at the wrong place).

@codecov-io
Copy link

codecov-io commented Jul 20, 2017

Codecov Report

Merging #2460 into master will increase coverage by 0.07%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2460      +/-   ##
==========================================
+ Coverage    69.4%   69.47%   +0.07%     
==========================================
  Files         146      146              
  Lines       11289    11306      +17     
==========================================
+ Hits         7835     7855      +20     
+ Misses       3454     3451       -3
Impacted Files Coverage Δ
airflow/models.py 87.29% <100%> (+0.05%) ⬆️
airflow/jobs.py 76.72% <0%> (+0.38%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3927723...18daed2. Read the comment docs.

@pdambrauskas
Copy link
Contributor

Good idea. However I'd prefer to avoid adding new column to DAGs table, it is already too wide for smaller screens. Maybe it would make sense to add next execution date as a tooltip for schedule column values instead?

@ultrabug
Copy link
Contributor Author

@pdambrauskas thanks.

Maybe it would make sense to add next execution date as a tooltip for schedule column values instead?

I think this is worth seeing without a tooltip.

However I'd prefer to avoid adding new column to DAGs table, it is already too wide for smaller screens

That's a responsive problem right ;) but still, maybe we could add this info in the "schedule" column. One label with the interval, another one under it when the next execution date ?

It would look like this
2017-07-21-092554_7040x2160_scrot

@ultrabug
Copy link
Contributor Author

Will wait for another opinion on the web UI before updating PR.

@Acehaidrey
Copy link
Contributor

I would find this really helpful for our operations as well (randomly chiming in after reading over this ticket). Though I'm not sure how to best add this information in terms of presentation.. Not any help but just reaffirming this feature request would benefit others too, namely me :D

astahlman added a commit to lyft/airflow that referenced this pull request Aug 9, 2017
This patch addresses AIRFLOW-1424, but hasn't yet been merged. See
apache#2460 for context
@otienoanyango
Copy link

Do we hope to see this any time soon?

@ultrabug
Copy link
Contributor Author

ultrabug commented Nov 9, 2017

I would love so too obviously :) friendly ping @bolkedebruin

@gfalcone
Copy link

This would be so useful for me too ! Can we have an update on this @bolkedebruin @pdambrauskas @mistercrunch please ?

@tobes
Copy link

tobes commented Mar 28, 2018

@ultrabug you have a merge conflict please rebase this sweet patch

@ron819
Copy link
Contributor

ron819 commented Oct 22, 2018

There is a CLI command merged to master by @XD-DENG that shows the next execution date
#3834
Maybe this PR can use the logic already merged to master to populate the values to the UI.

Also duplicate Jira ticket for this:
https://issues.apache.org/jira/browse/AIRFLOW-2618

Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the feature and it will be very useful.

As @ron819 mentioned there is (now) similar logic in the CLI that should be refactored to use these methods.

Additionally there is also probably similar logic in get_template_context that should also be re-worked to use these methods.

@property
def next_execution_date(self):
"""
Returns the next execution date at which the dag will be scheduled by
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me how these two methods differ. What's the intent behind these two methods?

Copy link
Contributor Author

@ultrabug ultrabug Nov 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code base and terminology makes a distinction between run dates and execution dates

I may be mistaken but run is related to the DAG while execution is related to the scheduler. So the actual execution date (which is what we want to know here) is derived from the next run date of the DAG.

That's why the logic here adds a property of next_run_date which is derived from the DAG's tasks while the following next_execution_date is using this next_run_date property to calculate its own execution time.

I just re-read the code and it still looks valid to me. If not, tell me and I'll adjust my mistakes!

Cheers

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a distinction between execution date an run date - it's not something I'm familiar with and to my mind they were two names for the same concept.

I think the overall logic is roughly this:

  • If there are any previous dag runs for for the dag, then the next date is following_schedule(latest_dag_run.execution_date) (possibly filtering out "adhoc" or backfilled dagruns. Not sure)
  • Otherwise it is start date

I think it's that simple. I'm not 100% familiar with all the details of the scheduler so I'm not 100% sure. This is what the normalie_scheduled function does.

This bit of jobs.py seems to be relevant https://github.com/apache/incubator-airflow/blob/e53182cf83501a789fbdd74ee472d29052a5e321/airflow/jobs.py#L840-L851

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exactly the logic I've followed and you can see that my proposed implementation is based on the scheduler's jobs.py code indeed

I can shrink the next_run_date function into one to make it simple tho indeed, gonna update

@XD-DENG as you see the scheduler does otherwise and having None as a result looks strange to me since there's always a period end at which the scheduler itself will execute the DAG (if it has a start_date that is).

So it's not because no previous execution has happened that there won't be any right. And the scheduler code above shows how it does it.

Still, @XD-DENG I think your link has a nice example and I'll make sure to validate the fact that this PR behaves exactly as the documentation says. Sound good to you?

@ultrabug
Copy link
Contributor Author

Well @ron819 @ashb thanks for your updates but I've never heard back from upstream and any interest in this so I will gladly rebase myself from the CLI implementation (which looks very close on a quick glance) if there's a chance for it to go in...

@ashb
Copy link
Member

ashb commented Oct 23, 2018

Yeah this would be great to have!

To be clear: the implementation should live in the model (as you have in this PR) and be called from multiple places.

@ultrabug
Copy link
Contributor Author

OK @ashb I understand your precision and since you're a member of the project I'll work on this and update this PR with the hope for it to be considered.

I hope this time we make it :)

@ultrabug
Copy link
Contributor Author

ultrabug commented Nov 4, 2018

Hello

I've rebased the branch to current master and modified the CLI to use the new property so that everything is clean.

LGTM now, tested OK

@Fokko
Copy link
Contributor

Fokko commented Nov 5, 2018

Thanks @ultrabug

Unfortunately Travis isn't entirely happy:

======================================================================
6) FAIL: test_next_execution (tests.cli.test_cli.TestCLI)
----------------------------------------------------------------------
   Traceback (most recent call last):
    tests/cli/test_cli.py line 267 in test_next_execution
      self.assertEqual(stdout[-1], "None")
   AssertionError: '2018-11-03 00:00:00+00:00' != 'None'

@ultrabug
Copy link
Contributor Author

ultrabug commented Nov 5, 2018

Thanks for pointing this out @Fokko, I totally forgot to check the cli test suite

next_run_date = None
if not self.latest_execution_date:
# First run
task_start_dates = [t.start_date for t in self.tasks]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this function will make sense when latest_execution_date is not available.

Understand you're trying to decide the next run using start_date. But catchup of the DAG can be either True or False, and it will affect the actual next execution date (ref: https://airflow.apache.org/scheduler.html?highlight=backfill#backfill-and-catchup).

@ultrabug
Copy link
Contributor Author

ultrabug commented Nov 5, 2018

Interesting to see this, the author of the CLI implementation assumed that there can't be a next schedule if there has not been any execution yet

            # `next_execution` function is inapplicable if no execution record found
            # It prints `None` in such cases
            self.assertEqual(stdout[-1], "None")

which contradicts my implementation where when this is the case, I derive the first planned run date from the tasks and normalize it to calculate the actual first execution date:

    @property
    def next_run_date(self):
        """
        Returns the next run date for which the dag will be scheduled
        """
        next_run_date = None
        if not self.latest_execution_date:
            # First run
            task_start_dates = [t.start_date for t in self.tasks]
            if task_start_dates:
                next_run_date = self.normalize_schedule(min(task_start_dates))

@ashb @Fokko and all others, I must admit that I'm not sure which point of view is the right one here since the CLI implementation has been accepted in the code base already, does that mean that it's right?

@codecov-io
Copy link

Codecov Report

Merging #2460 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2460      +/-   ##
==========================================
+ Coverage   76.67%   76.69%   +0.02%     
==========================================
  Files         199      199              
  Lines       16212    16233      +21     
==========================================
+ Hits        12430    12450      +20     
- Misses       3782     3783       +1
Impacted Files Coverage Δ
airflow/models.py 92.11% <100%> (+0.02%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 80a3d6a...da9b738. Read the comment docs.

@ultrabug
Copy link
Contributor Author

ultrabug commented Nov 6, 2018

@XD-DENG bad news, the example DAG in the documentation is breaking the scheduler on master so even the documentation is wrong.

Fresh installation, if I run the scheduler using the DAG:

"""
Code that goes along with the Airflow tutorial located at:
https://github.com/airbnb/airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 12, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'schedule_interval': '@hourly',
}

dag = DAG('tutorial', catchup=False, default_args=default_args)

nothing happens, the scheduler does not pick up anything

now if I change the catchup parameter to True

dag = DAG('tutorial', catchup=True, default_args=default_args)

I get the scheduler failing with

Process DagFileProcessor1-Process:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/home/alexys/github/incubator-airflow_numberly/airflow/jobs.py", line 395, in helper
    pickle_dags)
  File "/home/alexys/github/incubator-airflow_numberly/airflow/utils/db.py", line 74, in wrapper
    return func(*args, **kwargs)
  File "/home/alexys/github/incubator-airflow_numberly/airflow/jobs.py", line 1726, in process_file
    self._process_dags(dagbag, dags, ti_keys_to_schedule)
  File "/home/alexys/github/incubator-airflow_numberly/airflow/jobs.py", line 1426, in _process_dags
    dag_run = self.create_dag_run(dag)
  File "/home/alexys/github/incubator-airflow_numberly/airflow/utils/db.py", line 74, in wrapper
    return func(*args, **kwargs)
  File "/home/alexys/github/incubator-airflow_numberly/airflow/jobs.py", line 872, in create_dag_run
    if next_run_date > timezone.utcnow():
TypeError: can't compare datetime.datetime to NoneType

That None result is annoying even the scheduler :)

EDIT: quoting the documentation for expected behavior

In the example above, if the DAG is picked up by the scheduler daemon on 2016-01-02 at 6 AM, (or from the command line), a single DAG Run will be created, with an execution_date of 2016-01-01, and the next one will be created just after midnight on the morning of 2016-01-03 with an execution date of 2016-01-02.

If the dag.catchup value had been True instead, the scheduler would have created a DAG Run for each completed interval between 2015-12-01 and 2016-01-02 (but not yet one for 2016-01-02, as that interval hasn’t completed) and the scheduler will execute them sequentially. This behavior is great for atomic datasets that can easily be split into periods. Turning catchup off is great if your DAG Runs perform backfill internally.

@ultrabug
Copy link
Contributor Author

ultrabug commented Nov 6, 2018

To be complete: now ofc if I add a DummyOperator task to the example DAG, it does not fail any more since it can find out an actual start_date thanks to the logic from jobs.py... which leads me to think that this "find start_date from tasks" logic is important to keep.

@ultrabug
Copy link
Contributor Author

@ryw Well as you can see the least we can say is that I've been patient and faithful so far :)

What I don't want is to put yet again an effort in this PR and to see it rotting another year so if you give me your word on it actually being merged and released I'll spare the time to help the airflow community with pleasure.

@ryw
Copy link
Member

ryw commented Aug 25, 2020

I'll personally see it through :)

@ryw
Copy link
Member

ryw commented Aug 25, 2020

@ryanahamilton maybe you can look at this a bit too - adding a column will stress an already stressed UI, and this is near the work you're doing now #10556

@eladkal
Copy link
Contributor

eladkal commented Aug 25, 2020

I wonder if this can be conflicted / redundant with the idea of #5787
If user choose start then execution_date = "start date" thus understanding when the next run kicks in will be very simple.

@ultrabug
Copy link
Contributor Author

@ryw thanks.

Taking into account the last comment from @JeffryMAC in #10556 and what @eladkal is pointing out on #5787 I think that I should wait until you guys settle on what you want first if that's okay with you @ryw?

Else I have the feeling that this PR will again stand in the middle of a debate so I'd rather have it settled first ;)

@alexbegg
Copy link
Contributor

alexbegg commented Aug 28, 2020

@ultrabug @ryw FYI, I just made a PR to correct the "Start Date" tooltip to actually show the real start date (it was showing the execution date in RBAC UI) #10637 which fixed the redundancy issue

Also, maybe an alternative to a new column can be just a 2nd line in that tooltip to show the next execution? I have dealt with making changes to the json that provides the data for the Last Run column so I am open to working on adding this in there or guiding someone how to do it if needed (also the PR I just linked to clearly shows how to add fields to the json in both the views.py and the dags.html)

@ryw
Copy link
Member

ryw commented Aug 28, 2020

Yes, I think showing two dates in that pop up is better than a new column for now, and accomplishes the goal that @ultrabug set out to accomplish.

What do you think @ultrabug?

@ultrabug
Copy link
Contributor Author

@alexbegg @ryw I think that indeed that would accomplish the goal of my initial PR and would be awesome to add it there!

@alexbegg
Copy link
Contributor

alexbegg commented Aug 31, 2020

@ultrabug Going over this whole thread, I want some clarification before I dive deeper into development this:

  1. Will this show the next execution date for a DAG which has a future start_date, or the DAG has not ran? From a look of the code it seems it would fail to show the next_run_date if the date is in the future due to this check which would return None before the dry_run returns the next_run_date (this is a snippet from current master as this PR is a bit behind):
    # don't ever schedule in the future or if next_run_date is None
    if not next_run_date or next_run_date > timezone.utcnow():
    return None
    maybe wrap that in parentheses add and not dry_run to that line?
  2. Is the intention to still use the dry_run functionality to return the next expected execution date?
  3. I assume it makes no sense to show the next execution date if the DAG is paused? Or am I mistaken?

Also, regarding putting this in the "Last Run" column, if it should be showing the next execution date if the DAG has not run then I think the "Last Run" column might not be the best place to put this in a tooltip because the column currently is empty, maybe it should go in the "Schedule" column as discussed way back when this PR started?

@alexbegg
Copy link
Contributor

Another snag I ran into is I tried to replicate this functionality on top of the latest master branch changes:

I can't figure out how to access the schedulable_at and scheduled_in properties on the DAG model, because dags.html is only iterating on DagModels instead which is getting columns from the dag table of the db. I know I can access the DAG object by importing the model and doing DAG(dag_id=dag.dag_id) but I don't know how to do that in the template unless I somehow make a method of the DagModel to get the dag object. Sorry, my flask knowledge is a bit rusty.

I guess an alternative would be to make this asynchronous and not use the property directly in the template.

@ultrabug
Copy link
Contributor Author

ultrabug commented Sep 3, 2020

Hello @alexbegg

  1. Yes indeed. The fact that all the scheduling logic is embedded in the create_dag_run which I read as "let's create a dag run" is forcing us to use the dry run trick to actually get the date. We could argue that another suitable manner to implement this would be to move the logic in a "what would be the next execution of this dag" function. But maybe dry run trick is enough.

  2. intention is to get the next expected execution date

  3. sounds sane not to indeed

  4. as for column, I agree indeed

As for your last comment, it's too long ago for me to get a proper answer. Did you find out something better since?

@alexbegg
Copy link
Contributor

alexbegg commented Sep 3, 2020

@ultrabug I am a bit too busy to look further into this today, but I will try and take a look this weekend. I have a WIP branch if you want to take a look. I am trying to add in your changes on top of the latest master, but just for testing purposes I am trying to show the two new properties below the schedule link (currently both blank due to issues mentioned in my last comment), when I get it working I'll move the date to its final place as a tooltip. master...alexbegg:add-next-run

@ultrabug
Copy link
Contributor Author

@alexbegg I've tried running your fork on local but I failed to get anything displayed on the label... tried to debug this without success so I dunno if something changed on airflow 2 that prevents a simple property to be displayed on the webserver interface.

@ultrabug
Copy link
Contributor Author

@ryw ping; 1 month later.

@ryw
Copy link
Member

ryw commented Oct 19, 2020

The Airflow 2 release team decided not to push this into Airflow 2 (got this issue confused w/ another)
We'll target 2.1 and make sure we get it right.

@ryw ryw modified the milestones: Airflow 1.10.13, Airflow 2.1 Oct 19, 2020
@ultrabug
Copy link
Contributor Author

Hi @ryw ; just checking in again to see if the moment is right. I promised to commit to this PR so here I am again.

@ryw
Copy link
Member

ryw commented Mar 13, 2021

@ryanahamilton maybe you can have a look

@ashb ashb modified the milestones: Airflow 2.1, Airflow 2.1.1 May 18, 2021
@ashb ashb added the AIP-39 Timetables label May 20, 2021
@jhtimmins
Copy link
Contributor

Let's target 2.2 with this. @ultrabug can you handle the merge conflicts?

@eladkal
Copy link
Contributor

eladkal commented Jun 20, 2021

Let's target 2.2 with this. @ultrabug can you handle the merge conflicts?

mmm I think the reason behind this PR will no longer be valid after AIP-39 Richer scheduler_interval.
The reason behind adding this column is that execution_date is confusing thus exposing the next_execution_date in the UI will make things more clear to the users. I think AIP-39 will solve the problem from the root because we will have schedule_date that is seperated from execution_date/next_execution_date (data_interval_start/data_interval_end)

In any case I suggest at least to wait after AIP-39 is completed and then revisit to see if this PR brings additional value.

@ashb
Copy link
Member

ashb commented Jun 21, 2021

Yeah, fixing this feature request (if not this PR) is on the cards for AIP-39 https://github.com/apache/airflow/projects/10#card-61511156 (quite literally)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
AIP-39 Timetables pinned Protect from Stalebot auto closing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet