New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-3237] Refactor example DAGs #4071

Merged
merged 1 commit into from Oct 26, 2018

Conversation

Projects
None yet
8 participants
@BasPH
Copy link
Contributor

BasPH commented Oct 20, 2018

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"

AIRFLOW-3237

Description

  • Here are some details about my PR, including screenshots of any UI changes:

I think it's important to have the example DAGs reflect good programming standards and how to write clean and concise Airflow code. Therefore, refactor the example DAGs, which mostly involves:

  • Replace set_upstream() and set_downstream() by the bitshift operators, since they exist since Airflow 1.8 and everybody uses those now.
  • Optimise imports (grouping by stdlib/3rd party/own and alphabetic ordering).
  • Place function calls with arguments on single line if line is short enough, or put each function argument on a new line, but don't mix.
  • Passing list of tasks when setting dependencies instead of making multiple set_up/down-stream calls.
  • Switch upstream dependencies into downstream dependencies since these are more logical to read IMO.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Only refactored some code, all existing tests should pass.

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • When adding new operators/hooks/sensors, the autoclass documentation generation needs to be added.

No new functionality.

Code Quality

  • Passes flake8
'owner': 'airflow',
'start_date': airflow.utils.dates.days_ago(2)
}
args = {'owner': 'airflow', 'start_date': airflow.utils.dates.days_ago(2)}

This comment has been minimized.

@kaxil

kaxil Oct 20, 2018

Contributor

Don't understand why we need this refactor?

It looked better formatted.

This comment has been minimized.

@KimchaC

KimchaC Oct 22, 2018

Also, shouldn't the start_date be a fixed date?

This comment has been minimized.

@BasPH

BasPH Oct 22, 2018

Contributor

@kaxil will revert back to each pair on separate line.
@KimchaC I didn't refactor the start_date in any of the example DAGs. I imagine dynamic start_date was used to always have the example DAGs start from a recent date instead of a fixed date resulting in a large number of DAG runs when loading the examples.

'owner': 'airflow',
'start_date': airflow.utils.dates.days_ago(2)
}
args = {'owner': 'airflow', 'start_date': airflow.utils.dates.days_ago(2)}

This comment has been minimized.

@kaxil

kaxil Oct 20, 2018

Contributor

Same here

@BasPH

This comment has been minimized.

Copy link
Contributor

BasPH commented Oct 22, 2018

@KimchaC I generally see the 50/50 usage of passing dag object vs using dag context manager in Airflow code. All example DAGs pass the dag object to the operators. Is there a preference for either by the Airflow community?

@Fokko
Copy link
Contributor

Fokko left a comment

Nice work @BasPH

I have some small suggestion, would like to get your opinion.

Show resolved Hide resolved airflow/example_dags/example_branch_python_dop_operator_3.py Outdated
Show resolved Hide resolved airflow/example_dags/example_skip_dag.py Outdated
Show resolved Hide resolved airflow/example_dags/example_trigger_target_dag.py Outdated
@KimchaC

This comment has been minimized.

Copy link

KimchaC commented Oct 22, 2018

Sorry, github is wonky for me today. First it wouldn't let me post comments and then it ended up with many duplicates and when I tried to clean them up it deleted all of them.

I previously wrote:

Consider also changing the code to use the with context manager so that you don't have to repeat the dag=dag parameter on each task:

dag = DAG(
            'my_dag',
            start_date=datetime(2016, 1, 1))
with dag:
    op = DummyOperator('op')

op.dag is dag # True

To which @BasPH replied...

@KimchaC I generally see the 50/50 usage of passing dag object vs using dag context manager in Airflow code. All example DAGs pass the dag object to the operators. Is there a preference for either by the Airflow community?

I think the airflow team should decide on a preference for the community. One of the reasons that it is not used by everyone is probably because not everyone is aware of this feature. One of the reasons for that is that the examples are not using it :)

Personally I think the with statement is awesome, makes the DAG code much cleaner and reduces repetition.

I'd suggest adding a comment to the examples like...

# The with statement allows you to omit the dag parameter when initializing tasks.
with dag:
...

I also think the clode is clearer when the DAG is initiated separately and not inside the with statement.

@BasPH

This comment has been minimized.

Copy link
Contributor

BasPH commented Oct 22, 2018

I processed your changes except for @KimchaC's suggestion for using the DAG context manager, because I'm unsure what the Airflow team thinks about it?


pull.set_upstream([push1, push2])
pull << [push1, push2]

This comment has been minimized.

@kaxil

kaxil Oct 23, 2018

Contributor

Will this work on a list of tasks?

This comment has been minimized.

@BasPH

BasPH Oct 23, 2018

Contributor

Yes, this operation is implemented by __lshift__() which calls set_upstream().

This comment has been minimized.

@XD-DENG

XD-DENG Oct 23, 2018

Contributor

Personally I don't think << is better than .set_upstream. It's less explicit and introduces unnecessary new convention.

May you advise what's the advantage of of this change?

This comment has been minimized.

@Fokko

Fokko Oct 23, 2018

Contributor

So, for me the >> and << is much clearer to me. Bas raises a very valid point, that we should make a decision and make everything consistent (also the docs :-)

This comment has been minimized.

@BasPH

BasPH Oct 23, 2018

Contributor

The << and >> operators were introduced in Airflow 1.8 and I've never really heard or seen anybody prefer using the "old" set_upstream() and set_downstream(). Especially when chaining multiple tasks your code looks the same as in the GUI and I think it's very pleasant to read:

task1 >> task2 >> task3 >> task4

Since my experience is people use the >> and << operators much more, I figured it'd be useful to show that in the example DAGs, but correct me if wrong.

This comment has been minimized.

@BasPH

BasPH Oct 24, 2018

Contributor

@ron819 Setting dependencies with a tuple of tasks will also work because in _set_relatives the right-hand object is simply converted to a list (code).

This comment has been minimized.

@kaxil

kaxil Oct 24, 2018

Contributor

@XD-DENG Well, I was unsure of the task list usage with rshift and lshift operators😂 not >> or << (rshift or lshift) itself...

I think we already had good documentation over using them and I prefer using those as it resembles DAG + it is more clean. Many of our examples already do that.

But yes after @BasPH pointed o the implementation we can surely do that and should be documented.

@BasPH can you please add the docs. One liner with one or two examples is fine.

This comment has been minimized.

@kaxil

kaxil Oct 24, 2018

Contributor

Also I agree with Fokko and you that the behavior should be consistent. Lets use the bitshift composition across all tutorials and examples.

Regarding context managers, I still have a divided opinion. It makes sense to use them. But sometimes I tend to not use them just to make my code cleaner as I get an extra tab space to put my code in.

So let's leave that to users but maybe add a comment like you said earlier to show thw users that we can use with statement

This comment has been minimized.

@XD-DENG

XD-DENG Oct 24, 2018

Contributor

@kaxil I see I see. I still have so many more stuff to learn about Airflow ;-)

This comment has been minimized.

@ron819

ron819 Oct 24, 2018

Contributor

@ron819 Setting dependencies with a tuple of tasks will also work because in _set_relatives the right-hand object is simply converted to a list (code).

Maybe this should be added to the docs as well as it isn't written anywhere.


t2.set_upstream(t1)
t3.set_upstream(t1)
t1 >> [t2, t3]

This comment has been minimized.

@kaxil

kaxil Oct 23, 2018

Contributor

And this?

This comment has been minimized.

@BasPH

BasPH Oct 23, 2018

Contributor

Same here, but using __rshift__().

t3.set_upstream(t2)
t4.set_upstream(t3)
t5.set_upstream(t4)
sensor >> t1 >> t2 >> t3 >> t4 >> t5

This comment has been minimized.

@Fokko

Fokko Oct 23, 2018

Contributor

This is a clear example why I prefer sensor >> t1 >> t2 >> t3 >> t4 >> t5 over:

t1.set_upstream(sensor)
t2.set_upstream(t1)
t3.set_upstream(t2)
t4.set_upstream(t3)
t5.set_upstream(t4)
@Fokko

Fokko approved these changes Oct 24, 2018

@KimchaC

This comment has been minimized.

Copy link

KimchaC commented Oct 24, 2018

So, no love for the with operator?

Adding it to the examples would really go a long way to educate people about its existence. Perhaps one example dag can use it with comments and the others not. So that people can see that both are possible.

@feluelle

This comment has been minimized.

Copy link
Contributor

feluelle commented Oct 24, 2018

I like the with operator.

And i also would like to have mixed examples with and without the with statement.

@kaxil

This comment has been minimized.

Copy link
Contributor

kaxil commented Oct 25, 2018

Can you fix the failing flake8 test

@BasPH BasPH closed this Oct 26, 2018

@BasPH BasPH reopened this Oct 26, 2018


args = {
'owner': 'airflow',
'start_date': airflow.utils.dates.days_ago(2)
'start_date': airflow.utils.dates.days_ago(2),

This comment has been minimized.

@feluelle

feluelle Oct 26, 2018

Contributor

Is there a reason why you put a comma at the end when it is the last argument?

This comment has been minimized.

@feluelle

feluelle Oct 26, 2018

Contributor

I noticed that you did this very often but not always so..

This comment has been minimized.

@BasPH

This comment has been minimized.

@feluelle

feluelle Oct 26, 2018

Contributor

Interesting, thanks.

@BasPH

This comment has been minimized.

Copy link
Contributor

BasPH commented Oct 26, 2018

Can you fix the failing flake8 test

The flake8 test seems a bit flakey, it was checking files I didn't touch in my PR. Then re-ran it and all was good.

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Oct 26, 2018

Codecov Report

Merging #4071 into master will increase coverage by 0.76%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4071      +/-   ##
==========================================
+ Coverage   77.95%   78.71%   +0.76%     
==========================================
  Files         199      199              
  Lines       15974    17086    +1112     
==========================================
+ Hits        12452    13450     +998     
- Misses       3522     3636     +114
Impacted Files Coverage Δ
...low/example_dags/example_trigger_controller_dag.py 60% <100%> (ø) ⬆️
...low/example_dags/example_short_circuit_operator.py 100% <100%> (ø) ⬆️
airflow/example_dags/example_skip_dag.py 95% <100%> (ø) ⬆️
airflow/example_dags/tutorial.py 100% <100%> (ø) ⬆️
airflow/example_dags/example_python_operator.py 94.73% <100%> (ø) ⬆️
airflow/example_dags/example_xcom.py 62.5% <100%> (ø) ⬆️
...ample_dags/example_branch_python_dop_operator_3.py 73.33% <100%> (-1.67%) ⬇️
airflow/example_dags/example_subdag_operator.py 100% <100%> (ø) ⬆️
airflow/example_dags/example_trigger_target_dag.py 91.66% <100%> (ø) ⬆️
...le_dags/example_passing_params_via_test_command.py 100% <100%> (ø) ⬆️
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 334bc1e...f81feba. Read the comment docs.

@kaxil

kaxil approved these changes Oct 26, 2018

@kaxil kaxil merged commit 62b21d7 into apache:master Oct 26, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@kaxil

This comment has been minimized.

Copy link
Contributor

kaxil commented Oct 26, 2018

Thanks @BasPH

kaxil added a commit that referenced this pull request Oct 26, 2018

kaxil added a commit that referenced this pull request Oct 26, 2018

@KimchaC

This comment has been minimized.

Copy link

KimchaC commented Oct 29, 2018

One more thing to consider... I noticed the start date is set in the default args.

Isn’t that legacy? Since the start_date has no effect on the tasks, shouldn’t it be set on the DAG level?

galak75 added a commit to VilledeMontreal/incubator-airflow that referenced this pull request Nov 23, 2018

aliceabe pushed a commit to aliceabe/incubator-airflow that referenced this pull request Jan 3, 2019

cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Jan 23, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment