Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIP-31] Create XComArg model #8052

Closed
casassg opened this issue Apr 2, 2020 · 15 comments · Fixed by #8652
Closed

[AIP-31] Create XComArg model #8052

casassg opened this issue Apr 2, 2020 · 15 comments · Fixed by #8652
Assignees
Labels
AIP-31 Task Flow API for nicer DAG definition kind:feature Feature Requests

Comments

@casassg
Copy link
Contributor

casassg commented Apr 2, 2020

Description

XComArg is a class that references an XCom entry from an operator. Can be used to explicitly set XCom dependencies and explictly pass results between operators.

Class attributes:

  • operator: BaseOperator: Origin operator instance.
  • key: str: Stores key for XCom value. Defaults to airflow.models.xcom. XCOM_RETURN_KEY

Class methods

  • get(context: dict)-> Any: Resolves XCom value given operator and key.
  • __getitem__(key, str) -> XComArg: Easy method to create new XComArg using the same operator but a different key.
  • add_downstream_dependent(operator: BaseOperator)->None: Add an operator as a downstream dependency of operator. [Optional] Makes this task simpler.

Proposed implementation: https://github.com/casassg/corrent/blob/master/corrent/xcom_arg.py

Use case / motivation

  • Explicit way to pass around results between operators.
  • This object is a reference to an XCom value that has not been created and will need to be resolved in the future.
  • It can generate new XComArgs that point to the same operator but has a different key by using the key accessor. E.g: output['train'].
  • To do so, it includes as class attributes both the originating operator instance and a key (both needed to pull the XCom).
  • Has a property op that allows access to the origin operator.

Stretch goal

  • Store type class for the XCom and lie to MyPy when trying to check class signature.
@casassg casassg added the kind:feature Feature Requests label Apr 2, 2020
@boring-cyborg
Copy link

boring-cyborg bot commented Apr 2, 2020

Thanks for opening your first issue here! Be sure to follow the issue template!

@turbaszek
Copy link
Member

Hey @casassg any update on this issue? :)

@casassg
Copy link
Contributor Author

casassg commented Apr 18, 2020

I've been a bit overwhelmed with the work from home situation. Will try to get to this on the weekend or next week. Otherwise, please feel free to take it from me 😄

@jonathanshir
Copy link
Contributor

Hey guys, the databand team and I decided to implement [AIP-31] starting tomorrow.
Will be in touch with you soon for implementation discussions.
Thanks, Johnny :)

@casassg
Copy link
Contributor Author

casassg commented Apr 22, 2020

Sounds great to me! I can help review and give directions. Let's sync some time! I also do plan to contribute some part

@turbaszek turbaszek assigned jonathanshir and unassigned casassg Apr 24, 2020
@evgenyshulman
Copy link
Contributor

now the main purpose of this XComArgs is about being able to inject “dependency” on another operator + route to fetch that dependency from XCom. I assume generic class is required around jinja templates baking + defining dependencies in the Arg. Let’s create class ArgBase that will have these 3 interfaces defined :

  1. it can be “resolved”, it can be represented as Jinja template
  2. it can provide list of dependencies ( list of upstream operators)
  3. Jinja string that represents it

@turbaszek
Copy link
Member

Should "normal" operators be also XComArgs?

Once we have functional operators, should I as a user expect this DAG to work?

with DAG("example"):
    op1 = CreateClusterOperator("op", ...)
    op2 = RunJobOperator("op2", cluster_name=op1)

# Or with custom XCom key

with DAG("example"):
    op1 = CreateClusterOperator("op", ...)
    op2 = RunJobOperator("op2", cluster_name=op1["cluster_id"])

If no, why not?

@casassg
Copy link
Contributor Author

casassg commented Apr 25, 2020

it can be represented as Jinja template

I do not agree on this. Why do we need to stick to Jinja templates? Seems we are trying to fit the new solution to the existing workaround, when we should think of this as a more generic solution. I dont see why we need a ArgBase class. Could you illustrate with an example?

Should "normal" operators be also XComArgs?

No let's use #8055 instead:

with DAG("example"):
    op1 = CreateClusterOperator("op", ...)
    op2 = RunJobOperator("op2", cluster_name=op1.output)

# Or with custom XCom key

with DAG("example"):
    op1 = CreateClusterOperator("op", ...)
    op2 = RunJobOperator("op2", cluster_name=op1.output["cluster_id"])

@evgenyshulman
Copy link
Contributor

An example would be cases when people build command lines, text messages, SQL queries, and so on based on the previous operator's results
for example:
1. BashOperator with bash_command ( that is expected to be string)
BashOperator( "do_something --cluster-id %s" %..)
2. HiveOperator( hql = "... %s" % some_op_result["table_id")

I personally don't like using Jinja templating, but till now, this was the only option for Airflow. I would implement it for "backward compatibility," not only on the technical side but also on the philosophical side. Also, it worth mentioning, I would not do that as separate features, but everything else (Jinja, Jinja resolving, templated_params is already part of airflow, so we just need to implement "to string" representation for this feature.

@evgenyshulman
Copy link
Contributor

The reasoning for base class is only because there is a proposal about having different implementation for the class. So it worth defining what is expected fro the new implementation in the base class. (for example what if some implementation doesn't use XCom at all, why should it inherit from XcomArg class? These are non-critical question, I am ok with no base class at all.

@turbaszek
Copy link
Member

An example would be cases when people build command lines, text messages, SQL queries, and so on based on the previous operator's results
for example:

  1. BashOperator with bash_command ( that is expected to be string)
    BashOperator( "do_something --cluster-id %s" %..)
  2. HiveOperator( hql = "... %s" % some_op_result["table_id")

Imho here we should be opinionated and "purely functional". Thus I would suggest using factory function (operator) like prepare_bash_cmd or build_hql which result (XComArg) will be then passed as an input.

@evgenyshulman
Copy link
Contributor

it's definitely an option for people who wants a "purely function" approach from the beginning. Most of Airflow users are already very invested in the current implementation without functional operators.

  1. They will have to rewrite their current implementation ( extract into function with @task).
  2. In case it's part of internal operator implementation it's even harder to do
  3. It will affect an execution graph (extra node), while some Airflow users are trying to keep graph very clean.

IMHO: ( as I wrote at current PR), we should decide before we push implementation into Airflow/master and we should base our decision on some examples we wrote with end to end usage of AIP-31 implementation. I see a point in keeping everything as simple as possible, at the same time, I want all users to be happy if it doesn't affect implementation in a bad way :)

@casassg
Copy link
Contributor Author

casassg commented Apr 26, 2020

Regarding templating, I understand why you say that now. An option we implemented internally to do this was to add a map function to XComArg which is lazy evaluated and that enables transforming the value pulled before passing it to the operator.

My main worry is enabling the Jinja string is that it may cause some non-expected behaviour. For example, what happens when the value is not a string? I rather not enable it mostly because it still can be done using context['ti'].xcom_pull if users needed, while not causing errors for users that are just starting to use Airflow.

Regarding BaseArg class. I would say no as we are moving the extensibility piece to XCom itself. That should allow to inject custom serialization/deserialization behaviour consistently across Airflow and not only XComArg. Can't think of any other reason to overwrite XComArg apart from custom serial/deserial logic.

@evgenyshulman
Copy link
Contributor

Regarding templating, I understand why you say that now. An option we implemented internally to do this was to add a map function to XComArg which is lazy evaluated and that enables transforming the value pulled before passing it to the operator.

My main worry is enabling the Jinja string is that it may cause some non-expected behavior. For example, what happens when the value is not a string? I rather not enable it mostly because it still can be done using context['ti'].xcom_pull if users needed, while not causing errors for users that are just starting to use Airflow.

I agree that Jinja + XCom was always problematic, however, most of the time it does work. if object is nonstr -> Jinja will convert it into str as it was doing for any other object coming directly from xcom_pull. We can remove str implementation if we have any problems around it before we merge into apache-airflow

Regarding BaseArg class. I would say no as we are moving the extensibility piece to XCom itself. That should allow injecting custom serialization/deserialization behavior consistently across Airflow and not only XComArg. Can't think of any other reason to overwrite XComArg apart from custom serial/deserial logic.

My working assumption is that replacing XCom only is going to be not enough to implement "real" replacement for serialization/deserialization, but.. if I am wrong, we can easily implement that base class. So let's just wait till somebody will do that for real.

@casassg
Copy link
Contributor Author

casassg commented Apr 26, 2020

Sounds good. Regarding Jinja, maybe a way is to make it xcom_arg.jinja property instead of making it under str. But it's mostly a nitpick, so fine by me.

If we keep XComArg as a simple XCom wrapper, that should work. Want to make sure we keep XComArg as low complex as possible, and if we need to add some extra XCom logic add it to XCom class itself instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
AIP-31 Task Flow API for nicer DAG definition kind:feature Feature Requests
Projects
None yet
4 participants