
[AIRFLOW-XXX] GSoD: Adding Task re-run documentation #6295

Merged: 41 commits merged into apache:master from KKcorps:master on Nov 27, 2019
Conversation

@KKcorps (Contributor) commented Oct 9, 2019

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
    • https://issues.apache.org/jira/browse/AIRFLOW-XXX
    • In case you are fixing a typo in the documentation you can prepend your commit with [AIRFLOW-XXX], code changes always need a Jira issue.
    • In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal (AIP).
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain docstrings that explain what they do
    • If you implement backwards-incompatible changes, please leave a note in UPDATING.md so we can assign it to an appropriate release
@@ -83,7 +83,11 @@ Content
ui
concepts
scheduler
<<<<<<< HEAD

@mik-laj (Member) commented Oct 9, 2019:

Something is broken here.

@KKcorps KKcorps force-pushed the KKcorps:master branch from 4f6d198 to 3fa4f52 Oct 10, 2019
@codecov-io commented Oct 10, 2019:

Codecov Report

Merging #6295 into master will decrease coverage by 0.07%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6295      +/-   ##
==========================================
- Coverage   80.41%   80.34%   -0.08%     
==========================================
  Files         612      616       +4     
  Lines       35473    35733     +260     
==========================================
+ Hits        28525    28709     +184     
- Misses       6948     7024      +76
Impacted Files Coverage Δ
airflow/executors/kubernetes_executor.py 58.89% <0%> (-6.34%) ⬇️
airflow/jobs/backfill_job.py 89.9% <0%> (-1.53%) ⬇️
airflow/gcp/operators/bigquery.py 86.6% <0%> (ø) ⬆️
...gle/marketing_platform/sensors/campaign_manager.py 100% <0%> (ø)
...oogle/marketing_platform/hooks/campaign_manager.py 100% <0%> (ø)
..._platform/example_dags/example_campaign_manager.py 0% <0%> (ø)
...e/marketing_platform/operators/campaign_manager.py 91.73% <0%> (ø)
airflow/gcp/operators/cloud_sql.py 88.98% <0%> (+0.72%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7c9b44d...52b0bb5.

@KKcorps KKcorps requested a review from mik-laj Oct 10, 2019
An Airflow DAG with a ``start_date``, possibly an ``end_date``, and a ``schedule_interval`` defines a
series of intervals which the scheduler turns into individual DAG Runs and executes. A key capability
of Airflow is that these DAG Runs are atomic and idempotent items. The scheduler, by default, will
kick off a DAG Run for any interval that has not been run (or has been cleared). This concept is called Catchup.
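The Catchup behaviour described above can be sketched in plain Python. This is an illustrative toy model only, not Airflow's actual scheduler logic; the function `catchup_runs` and its signature are invented for the example.

```python
from datetime import datetime, timedelta

def catchup_runs(start_date, now, schedule_interval, already_run=()):
    """Toy model: list the execution dates the scheduler would create
    DAG Runs for, skipping intervals that have already been run."""
    runs = []
    current = start_date
    # A DAG Run for an interval is created once that interval has fully elapsed.
    while current + schedule_interval <= now:
        if current not in already_run:
            runs.append(current)
        current += schedule_interval
    return runs

# A daily DAG started 2019-12-01, observed on 2019-12-04: three elapsed
# intervals, so catchup creates runs stamped 12-01, 12-02 and 12-03.
print(catchup_runs(datetime(2019, 12, 1), datetime(2019, 12, 4), timedelta(days=1)))
```

In this toy model, clearing a run corresponds to removing its date from `already_run`, after which the same loop would pick that interval up again.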

@ashb (Member) commented Oct 11, 2019:

This is mostly true, but slightly misleading in a few edge cases. I'm not sure if this is worth mentioning here or not.

Catchup, and the scheduler in general, will not "fill in gaps" - it will only look forward from the most recent DAG run.

For example:

  • I have a daily dag with catchup=False running. This is "d0"
  • I pause that dag for 3 days
  • I then start it again.

At this point we have dagruns for d0, d4, d5, ...

  • I edit the dag to set catchup=True

The scheduler will not go and "fill in" d1, d2, d3.
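ashb's scenario can be modeled in a few lines of Python. Again purely illustrative (the function `runs_after_unpause` is invented, not Airflow code): the scheduler walks forward from the latest existing run, so the d1-d3 gap is never revisited.

```python
from datetime import date, timedelta

def runs_after_unpause(existing_runs, today):
    """Toy model: with catchup enabled, new runs are created forward from
    the most recent existing run; earlier gaps are not filled in."""
    d = max(existing_runs) + timedelta(days=1)
    created = []
    while d < today:
        created.append(d)
        d += timedelta(days=1)
    return created

d0 = date(2019, 10, 1)
existing = [d0, d0 + timedelta(days=4), d0 + timedelta(days=5)]   # d0, d4, d5
print(runs_after_unpause(existing, d0 + timedelta(days=8)))       # d6 and d7; never d1-d3
```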

@KKcorps (Author) replied Oct 18, 2019:

Updated with changes.

.. code:: python

    default_args = {
        'owner': 'Airflow',
        'depends_on_past': False,
        'start_date': datetime(2015, 12, 1),

@ashb (Member) commented Oct 11, 2019:

I know this is what we do in most cases in our docs, but can you move start_date out of default_args and into just a DAG arg please?
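A sketch of the pattern ashb is asking for, shown as a DAG-definition fragment (a configuration-style sketch, not the PR's actual code). The dag_id ``example_dag`` and the ``schedule_interval`` value are invented for illustration; this assumes Airflow's ``DAG`` constructor, which accepts ``start_date`` directly.

```python
from datetime import datetime

from airflow import DAG

default_args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    # start_date deliberately NOT set here ...
}

# ... but passed to the DAG itself instead:
dag = DAG(
    'example_dag',                      # hypothetical dag_id
    default_args=default_args,
    start_date=datetime(2015, 12, 1),
    schedule_interval='@daily',
)
```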

**Note**: When clearing a set of tasks' state in the hope of getting them to re-run, it is important
to keep the DAG Run's state in mind too, as it defines whether the scheduler should look
into triggering tasks for that run.

@ashb (Member) commented Oct 11, 2019:

I think that if you clear tasks it will also set the dagrun state to running? But I'm not certain of that.

@ashb (Member) commented Oct 11, 2019:

Looking good overall though!

@mik-laj mik-laj added the area:docs label Oct 15, 2019
@KKcorps KKcorps changed the title [AIRFLOW-XXX] Adding Task re-run documentation [AIRFLOW-XXX] GSoD: Adding Task re-run documentation Oct 16, 2019
@KKcorps (Author) commented Oct 18, 2019:

@@ -84,6 +84,7 @@ Content
concepts
scheduler
executor/index
dag-run

@ashb (Member) commented Oct 18, 2019:

I think this probably makes more sense as a page under Concepts -- it doesn't fit with the rest of the top level items we have.

WDYT?

@KKcorps (Author) replied Oct 21, 2019:

@mik-laj and I were discussing the same idea: the concepts page needs to be broken down, with some subpages moved inside it. Since that would require changes to many pages, can we do that as part of another PR?

@kaxil (Contributor) commented Oct 25, 2019:

We can have a subpage like https://airflow.readthedocs.io/en/stable/howto/index.html and have different topics there

@ashb (Member) commented Nov 1, 2019:

@KKcorps What did we decide about this? Merge this PR as is, then you split up concepts into multiple pages afterwards?

@KKcorps (Author) replied Nov 1, 2019:

Yes, splitting up the page is a big effort. I think we should merge this and then split it up later.

@@ -32,161 +30,10 @@ Airflow production environment. To kick it off, all you need to do is
execute ``airflow scheduler``. It will use the configuration specified in
``airflow.cfg``.

Note that if you run a DAG on a ``schedule_interval`` of one day,

@ashb (Member) commented Oct 18, 2019:

This was quite an important point. I feel we should direct people to the new dag run page (where ever we decide it lives) from here too.

@ashb (Member) commented Nov 1, 2019:

This isn't done yet. A "see also" link is not strong enough.

Your DAG will be instantiated for each schedule, along with a corresponding
DAG Run entry in the backend.

**Note**: If you run a DAG on a schedule_interval of one day, the run stamped 2020-01-01

@ashb (Member) commented Oct 18, 2019:

I think this and the next paragraph should be in a .. note:: block to make them stand out more. What does that look like?

@KKcorps KKcorps force-pushed the KKcorps:master branch from cec7044 to 240e2d9 Oct 21, 2019
@yegeniy (Contributor) commented Oct 25, 2019:

This would be very helpful as a document to share with folks onboarding to Airflow. Even if it's imperfect, I think it would be great to get this into the codebase.

@KKcorps KKcorps force-pushed the KKcorps:master branch from 237a166 to 2840871 Nov 21, 2019

To start a scheduler, simply run the command:

.. code:: bash

    airflow scheduler

You can start executing a DAG once your scheduler has started running successfully.

@ashb (Member) commented Nov 22, 2019:

"Executing" a DAG doesn't quite make sense. Either we mean enable it and let the scheduler execute it, or we mean "trigger a manual run".

@KKcorps (Author) replied Nov 23, 2019:

I'll change it to 'your dags will start executing'

Re-run DAG
''''''''''
There can be cases where you will want to execute your DAG again. One such case is when the scheduled
DAG run fails. Another is when the scheduled DAG run wasn't executed due to low resources or the DAG being turned off.

@ashb (Member) commented Nov 22, 2019:

Low resources shouldn't stop a DAG run from being created, so let's remove that bit.

KKcorps and others added 2 commits Nov 23, 2019
Co-Authored-By: Ash Berlin-Taylor <ash_github@firemirror.com>

An Airflow DAG with a ``start_date``, possibly an ``end_date``, and a ``schedule_interval`` defines a
series of intervals which the scheduler turn into individual DAG Runs and execute. A key capability
of Airflow is that these DAG Runs are atomic and idempotent items. The scheduler, by default, will

@ashb (Member) commented Nov 25, 2019:

Suggested change:
of Airflow is that these DAG Runs are atomic and idempotent items. The scheduler, by default, will
of Airflow is that these DAG Runs should be atomic and idempotent items. The scheduler, by default, will

(should be, as Airflow doesn't do this magically, but it depends on how the tasks are written.)

@KKcorps (Author) replied Nov 25, 2019:

OK, then I guess I should just remove the line, as it doesn't make much sense.

@KKcorps (Author) replied Nov 25, 2019:

Or I can add: "The runs should be atomic and idempotent for catchup to function as expected."

Backfill
--------
There can be cases when you may want to run the DAG for a specified historical period, e.g. a data pipeline
that dumps data into a DFS every day, and another pipeline that requires the last month of data in the DFS.

@ashb (Member) commented Nov 25, 2019:

I'm not sure this is the best example of backfill.

Backfill's primary aim is running date periods before the start date of the task. (Otherwise the scheduler would see the date is before start_date and not do anything.)

.. code:: bash

    airflow backfill -s START_DATE -e END_DATE dag_id

The above command will re-run all the instances of ``dag_id`` for all the intervals within the start date and end date.


Click on the failed task in the Tree or Graph views and then click on **Clear**.
failed to ``None`` and the executor will re-run it.

@ashb (Member) commented Nov 25, 2019:

Odd line break here.


.. code:: bash

    airflow tasks clear -h

@ashb (Member) commented Nov 25, 2019:

Link to cli docs here too.

@KKcorps (Author) replied Nov 25, 2019:

Should I remove the example and just have the CLI doc link or put CLI docs link below the example?


There are multiple options you can select to re-run -

* Past - All the instances of the task in the runs before the current DAG's execution date

@kaxil (Contributor) commented Nov 25, 2019:

Suggested change:
* Past - All the instances of the task in the runs before the current DAG's execution date
* **Past** - All the instances of the task in the runs before the current DAG's execution date

We should probably bold all the options

KKcorps and others added 4 commits Nov 25, 2019
Co-Authored-By: Kaxil Naik <kaxilnaik@gmail.com>
@ashb approved these changes Nov 27, 2019
@ashb merged commit ac2d0be into apache:master Nov 27, 2019
1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)
7 participants