Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-3958] List tasks upstream work in chain #4779

Merged
merged 1 commit into from Jun 21, 2019

Conversation

zhongjiajie
Copy link
Member

Jira

Description

  • Here are some details about my PR, including screenshots of any UI changes:

For now on, airflow.utils.helpers.chain only support list as downstream, but not upstream. This ticket is to support list as upstream.

Only support

task_1 --> task_2 --> task_3 --> task_4

or
                  /--> task_3
task_1 --> task_2 
                  \--> task_4

This improvement to be support

                  /--> task_3 \
task_1 --> task_2              --> task_5
                  \--> task_4 /

Tests

  • My PR adds the following unit tests :
  • airflow/tests/utils/helpers/test_chain

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • When adding new operators/hooks/sensors, the autoclass documentation generation needs to be added.
    • All the public functions and the classes in the PR contain docstrings that explain what it does

Code Quality

  • Passes flake8

@zhongjiajie
Copy link
Member Author

@ashb @Fokko @feng-tao PLAT, Thanks.

@feluelle
Copy link
Member

feluelle commented Feb 26, 2019

t1 >> t2 >> [t3, t4] >> t5 works already. I mean what is chain() actually for?

@feluelle
Copy link
Member

feluelle commented Feb 26, 2019

I only found https://github.com/apache/airflow/blob/master/airflow/example_dags/example_short_circuit_operator.py#L47 using it of the entire project.

Therefore I would suggest we remove the chain() function completely and changing the example to use the shift >> operator.

@zhongjiajie
Copy link
Member Author

zhongjiajie commented Feb 27, 2019

@feluelle I don't think we should remove chain function. And we should and this featrue to master branch.

I think function in helper mean help you, and make it easy. Image some situation, we should make couple of similar task in ONLY upstream and downstream dependent. We could use code like

t1 = DummyOperator(task_id='t1', dag=dag)
t2 = DummyOperator(task_id='t2', dag=dag)
...
tn = DummyOperator(task_id='tn', dag=dag)
t1 >> t2 >> ... >> tn

or we just using function chain

list_for_task = [DummyOperator(task_id='t{}'.format(i), dag=dag) for i in range(1, n + 1)]
chain(*list_for_task)

I prefer the second way rather than the first one.

One classic situation in my daily use is in data house, I want to transfer couple of Hive table and LOAD to different output system. I do like

prepare_dw_table_for_ana_system = [
    'prepare_step_1',
    'prepare_step_2',
    'prepare_step_3',
    'prepare_step_4',
]
tasks_prepare_dw_table_for_ana_system = [
    HiveOperator(
        task='prepare_dw_table_for_ana_system_{}'.format(step),
        sql='{}.sql'.format(step),
        dag=dag
    ) for step in prepare_dw_table_for_ana_system
]

diff_system_adapter = [
    'system_1',
    'system_2',
    'system_3',
    'system_4',
]
tasks_diff_system_adapter = [
    HiveOperator(
        task='diff_system_adapter_{}'.format(system),
        sql='{}.sql'.format(system),
        dag=dag
    ) for system in diff_system_adapter
]

post_step = [DummyOperator(task_id='post_step_{}'.format(i), dag=dag) for i in range(3)]

chain(*tasks_prepare_dw_table_for_ana_system, tasks_diff_system_adapter, *post_step)

I think make task in list could make the job more significative, because sometime we use muliti step to do one thing, and I want I refactor DAG just have to know what group tasks tasks_prepare_dw_table_for_ana_system mean, rather than each single task like prepare_dw_table_for_ana_system_xxx mean

So, I don't think we should remove function chain. And we should let chain(t1, t2, [t3, t4, t5], t6) work in chain.

@zhongjiajie
Copy link
Member Author

cc @XD-DENG @ashb

Copy link
Member

@mik-laj mik-laj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is useful. This method is not available in the API reference documentation. We need to think about how to add it to this in future. Now the group of users of this function is very narrow.

Reference:
https://airflow.readthedocs.io/en/latest/code.html

@zhongjiajie
Copy link
Member Author

@mik-laj I have a plan to add set-dependencies.rst in docs/howto section after this PR be merge. I think we could ref from https://airflow.readthedocs.io/en/latest/tutorial.html#setting-up-dependencies to set-dependencies.rst and write some example let user know dependencies skill in daily job.

what do you think?

@mik-laj
Copy link
Member

mik-laj commented Mar 11, 2019

I think that now the documentation has one problem that should be solved first. The concepts.rst file is too large. We must think how to divide it and what new files to create. For example, the .airflowignore section is a rarely used feature and should not be included in this file. This file should contain the main concepts.

Creating new files in the howto directory is the easiest way, because it is not required for the file to be one with the rest of the documentation. If you do not have a lot of time, you can create a new guide. Otherwise, you should think about creating a new file or files in main directory, which will contain other elements from the concept file.

@feng-tao
Copy link
Member

Agree with @feluelle that t1 >> t2 >> [t3, t4] >> t5 is supported and should be the suggested or prefer way to set dependency.

@zhongjiajie
Copy link
Member Author

@mik-laj In this situation, I think it better to create new file in docs/howto first, due to we don't have clearly idea on how to divide exists docs

@zhongjiajie
Copy link
Member Author

@feng-tao But chain is helpful in some situation. do you think so.

@feng-tao
Copy link
Member

@zhongjiajie , I personally prefer one suggested approach instead of multiple different approaches. I am open to other committers' opinions. But IMO, I would prefer we stick with the t1 >> t2 >> [t3, t4] >> t5 approach and remove the chain method which was created 3 year ago and never get updated.

@zhongjiajie
Copy link
Member Author

@feng-tao I got your point. thanks

@mik-laj
Copy link
Member

mik-laj commented Mar 14, 2019

I think that there is no one solution to the problem. The same as the existing several arithmetic operations, when you only need to add. There are several arithmetic operations when you only need to add. Subtraction is the negation of addition. Multiplication is a shortened of the addition operation. Each operation has its purpose. Each arithmetic operation has its purpose and solves a real problem. In my opinion, we should not limit ourselves to just one solution. We should show a solution for a given category of problems.

In my opinion, this change is correct if documentation is added. It should explain the differences between one and the other way and show examples for each method.

Airflow is growing. Adding(bitwise operator) is not a solution for all problems.

@zhongjiajie
Copy link
Member Author

@ashb @XD-DENG @kaxil @Fokko I would like to hear yours opinion if you have time to take a look.

@Fokko
Copy link
Contributor

Fokko commented Mar 17, 2019

@zhongjiajie I'm on the same page as @feng-tao

I think having two ways of doing the same thing is bad. I find the >> more visual and easier to explain. It is only used in the short circuit example, so removing it is quite straightforward: https://github.com/apache/airflow/search?q=chain&unscoped_q=chain

@XD-DENG
Copy link
Member

XD-DENG commented Mar 18, 2019

Hi @zhongjiajie , I agree with @feng-tao and @Fokko .

Having multiple ways is a potential confusion. We are encouraging the usage of >>.

@zhongjiajie
Copy link
Member Author

@feng-tao @Fokko @XD-DENG I got this, I will close this and submit a new PR to remove airflow.utils.helpers.chain, thank for your advice.

@zhongjiajie zhongjiajie deleted the support_list_upstream_in_chain branch March 19, 2019 01:08
@zhongjiajie zhongjiajie restored the support_list_upstream_in_chain branch April 8, 2019 00:43
@zhongjiajie zhongjiajie reopened this Apr 8, 2019
@zhongjiajie
Copy link
Member Author

@codecov-io
Copy link

codecov-io commented Apr 8, 2019

Codecov Report

Merging #4779 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4779      +/-   ##
==========================================
+ Coverage   78.93%   78.94%   +<.01%     
==========================================
  Files         480      480              
  Lines       30129    30141      +12     
==========================================
+ Hits        23782    23794      +12     
  Misses       6347     6347
Impacted Files Coverage Δ
airflow/utils/helpers.py 83.66% <100%> (+1.39%) ⬆️
airflow/models/taskinstance.py 92.43% <0%> (-0.18%) ⬇️
airflow/models/baseoperator.py 94.24% <0%> (+0.27%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a8a4d32...37c27da. Read the comment docs.

@zhongjiajie
Copy link
Member Author

I will add some feature we discuss in Slack.

@zhongjiajie zhongjiajie changed the title [AIRFLOW-3958] Support list tasks as upstream in chain [WIP][AIRFLOW-3958] Support list tasks as upstream in chain Apr 8, 2019
@bcb
Copy link
Contributor

bcb commented Apr 8, 2019

Looks good @zhongjiajie however I wouldn't bother validating the input, it's not Pythonic and adds too much complexity.

@zhongjiajie
Copy link
Member Author

@bcb Thanks for review, I think some validate is nessarary, becase zip will use the shorter sequence as it base

>>> a = [1,2,3]
>>> b = [4,5]
>>> list(zip(a,b))
[(1, 4), (2, 5)]  # make user get the wrong dependence.

for a, b in zip(task, down_task):
a.set_downstream(b)
else:
raise AirflowException(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://docs.python.org/2/library/itertools.html#itertools.product is what you want I think?

# product('ABCD', 'xy') --> Ax Ay Bx By Cx Cy Dx Dy

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, that the cross mean, we already have cross_downstream in airflow.utils.helper.I am inspired in https://apache-airflow.slack.com/archives/CCPRP7943/p1554633832061000?thread_ts=1554374078.016800&cid=CCPRP7943 , thinking we should have a function to make parallel pipelines work

     /  -> t1 -> t3
t0
     \  -> t2 -> t4

If we do that, chain could do chain(t0, [t1, t2, t3], [t4, t5, t6]) like

     /  -> t1 -> t4
t0      -> t2 -> t5
     \  -> t3 -> t6

and could cross_downstream([t1, t2, t3], [t4, t5, t6])

t1   \     /  -> t4
        \ /
t2    -> X    -> t5
        / \
t3   /      \ -> t6

make our dependent more possibilities.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @zhongjiajie , it wouldn't make sense to do that.

However I still think you should just zip and join the lists, without checking they're the same size. Some tasks in the bigger list won't be joined to anything - that's fine, the user will realise their mistake.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, product is totally different behaviour.

I'm not so sure that it would be obvious - the "extra" tasks from the long list would instead end up with no deps set. That sounds like it would be more confusing than the behaviour in the PR: an error would show up in the UI.

But I am not a user of this fn so I can't say for certain.

@zhongjiajie
Copy link
Member Author

ci green , please take a look if you have time @potiuk

@potiuk
Copy link
Member

potiuk commented May 16, 2019

I still think adding docs in the same PR is necessary. I am in favour of adding documentation together with the relevant code that is modified.

Especially in this case, this is a new way of building the DAG relationships which is the "core" of Airflow really and everyone should know how to use it. Having some PR / slack discussion describing such feature is really bad way of documenting it. Especially that this discussion was long and had many twists so it is unclear what the final intention was without reading the whole thread.

Also several other people had their strong opinions here so maybe it's good if everyone who participated will state their opinion on the current code. I don't think we all have to fully agree but maybe just one last round of comments would be great, then documentation should be added to reflect it and then we merge it. @zhongjiajie @feng-tao @Fokko @feluelle @mik-laj @XD-DENG ?

Copy link
Member

@feluelle feluelle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code and tests LGTM 👍

+1 on adding docs to this PR. Docs should come always along with a new feature.
@zhongjiajie you want that the feature gets attention - docs would help on that :)

Copy link
Member

@kaxil kaxil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add some docs on it

@zhongjiajie
Copy link
Member Author

@potiuk @kaxil @feluelle Thanks for reviewing, I will add docs this week.

helpers.chain only support list as downstream
This PR make list as upstream work, also make
list parallel work, which like below

     / -> t2 -> t4 \
t1 ->               -> t6
     \ -> t3 -> t5 /
@zhongjiajie zhongjiajie force-pushed the support_list_upstream_in_chain branch from 91d1469 to 37c27da Compare May 28, 2019 13:12
@zhongjiajie
Copy link
Member Author

@potiuk @kaxil @feluelle Add some docs on PR, PTAL

Copy link
Member Author

@zhongjiajie zhongjiajie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain why I change old docs

@@ -185,6 +185,9 @@ Bitshift Composition

*Added in Airflow 1.8*

We recommend you setting operator relationships with bitshift operators rather than ``set_upstream()``
and ``set_downstream()``.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add this to let users know that we recommend set relationships using bitshift operators

@@ -248,21 +251,92 @@ Bitshift can also be used with lists. For example:

.. code:: python

op1 >> [op2, op3]
op1 >> [op2, op3] >> op4
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let users know >> could work on list[operatos] as upstream

@feluelle
Copy link
Member

feluelle commented Jun 3, 2019

LGTM 👍

@zhongjiajie
Copy link
Member Author

ping @potiuk @ashb again.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Looks like quite reasonable way of building more complex chains and it is quite well documented now. @ashb @feluelle @mik-laj @kaxil - what do you think - before we merge it - are you ok with it as well?

Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -150,19 +151,50 @@ def as_flattened_list(iterable):


def chain(*tasks):
"""
r"""
Copy link
Member

@kaxil kaxil Jun 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this r for? Is it a typo or a way to render something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both string and bytes literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and treat backslashes as literal characters

This makes the \ in the ascii diagram be a backslash. I imagine otherwise it would complain about \ being an invalid escape.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense 👍 Thanks @ashb

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks ash for the clarification

@kaxil kaxil merged commit 75bec88 into apache:master Jun 21, 2019
@zhongjiajie zhongjiajie deleted the support_list_upstream_in_chain branch June 22, 2019 01:40
@zhongjiajie
Copy link
Member Author

Thanks @kaxil for merging and everyone review and suggest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet