Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce @task.bash TaskFlow decorator #30176

Merged
merged 2 commits into from Dec 19, 2023
Merged

Conversation

josh-fell
Copy link
Contributor

@josh-fell josh-fell commented Mar 18, 2023

Adding a @task.bash TaskFlow decorator to the collection of existing core TaskFlow decorators. This particular decorator will use the return value of the decorated callable as the Bash command to execute using the existing BashOperator functionality.

TODO:

  • Add docs
  • Add examples

@josh-fell
Copy link
Contributor Author

josh-fell commented Mar 18, 2023

A casual example of how this could work can be taken from this tutorial on using the BashOperator.

Current:

get_ISS_coordinates = BashOperator(
    task_id="get_ISS_coordinates",
    bash_command="node $AIRFLOW_HOME/include/my_java_script.js"
)

print_ISS_coordinates = BashOperator(
    task_id="print_ISS_coordinates",
    bash_command="Rscript $AIRFLOW_HOME/include/my_R_script.R $ISS_COORDINATES",
    env={
        "ISS_COORDINATES": "{{ task_instance.xcom_pull(task_ids='get_ISS_coordinates') }}"
    },
    append_env=True
)

get_ISS_coordinates >> print_ISS_coordinates

Using @task.bash:

@task.bash(append_env=True)
def get_ISS_coordinates():
    return "node $AIRFLOW_HOME/include/my_java_script.js"

@task.bash(append_env=True, push_op_kwargs_to_env=True)
def print_ISS_coordinates(coords):
    return f"Rscript $AIRFLOW_HOME/include/my_R_script.R $coords"

print_ISS_coordinates(coords=get_ISS_coordinates())

@uranusjr
Copy link
Member

I wonder that, instead of a decorator, if it would be better to have

task.bash(
    """... bash command ...""",
    **kwargs,
)

@potiuk
Copy link
Member

potiuk commented Mar 22, 2023

I agree with @uranusjr (I think at least why I think the comment was raised). it sounds strange to me to have a Python code to return bash commmand to execute as string. I see where it might be useful, but it has certain properties, that make it difficult to debug and reason about. While other decorators focus on the python to execue and simply change the context where they are executed (docker/k8s) - having bash decorator with returning a string that should be executed as bash command is kinda different beast altogether. I am not sure if it will be intuitive enough :)

@fritz-astronomer
Copy link
Contributor

fritz-astronomer commented Apr 21, 2023

Although @potiuk - I would say that although this does feel a little weird it does effectively match the pattern?

@task.bash
def fn(...):
   return "echo hi"

You are running echo hi, in bash. Handing off to the other system, in effect. Yes not the whole function, but 🤷

@uranusjr 's does look clean, but it also doesn't feel like it matches other existing patterns that we have (except maybe some astro-sdk stuff that has veered in that direction - e.g. load_file( ...)

@potiuk
Copy link
Member

potiuk commented Apr 27, 2023

Actually, yeah. why not. I think it's quite nice actuallly. Especially if you can make multi-line bash with """ that does fill like useful thing.

@sethwoodworth
Copy link
Contributor

I'm a big fan of the approach, and have been using a half-baked version of a @bash decorator. Are you setting the default of append_env to False to match the behavior of the BashOperator? In my use-cases, I could see wanting them always available.

@github-actions
Copy link

github-actions bot commented Jul 9, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Jul 9, 2023
@josh-fell josh-fell removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Jul 9, 2023
@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Aug 24, 2023
@josh-fell josh-fell removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Aug 25, 2023
@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Oct 10, 2023
@josh-fell josh-fell removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Oct 10, 2023
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Nov 25, 2023
@josh-fell
Copy link
Contributor Author

Reviving this after finally getting out of a large hole.

@fritz-astronomer Yes, this approach is very much inspired by the Astro SDK.

@potiuk I can confirm this works for multi-line strings.

@sethwoodworth Initially I was thinking about keeping the existing default for append_env. However, I think I will ultimately do away with the push_op_kwargs_to_env parameter. It feels a bit quirky and can lead to less explicit scripting. For an "introduction" of this decorator, it seems a bit more pragmatic to not gum-up this PR but rather discuss that particular feature in a future PR. WDYT?

@josh-fell josh-fell removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Nov 28, 2023
@potiuk
Copy link
Member

potiuk commented Nov 29, 2023

@potiuk I can confirm this works for multi-line strings.

My comment was rather about "why do we need method at all?" (following @uranusjr comment.

For me the proposed syntax:

task.bash(
    """
node $AIRFLOW_HOME/include/my_java_script.js"
""",
    **kwargs,
)

felt a bit better than:

@task.bash(append_env=True)
def get_ISS_coordinates():
    return """
node $AIRFLOW_HOME/include/my_java_script.js
"""

However ... I reconsidered....

I actually found a very good reason why the second approach is better.

It allows for much more dynamic construction of the bash command - with all the logic executed during task execution rather than during DAG parsing. The first one allows f-strings (rsolved during parsing) and JINJA (resolved during execution) - and pretty much nothing else.

When you return string from Python funtionm you kind of combine both Python code wiht logic to build command nd Bash operator to actually do it - all during task execution. This is really powerful - especially in case of mapped operators for example - much more powerful than simply passing string.

For example this one would be much more difficult to express nicely in case 1):

@task.bash(append_env=True)
def get_ISS_coordinates():
    base = "mypy "
    for file in get_all_possible_files_to_check():
        base += file if not file.startswith("test_")
    return base

(and that's only an example).

So - summarizing: I am all in for the proposal @josh-fell

@josh-fell
Copy link
Contributor Author

@potiuk Just about summed everything I was thinking! I completely agree being able to enrich Bash with Python is strong. Also, it's a more straightforward approach to parametrize the command string itself rather than the current approach of using env (which IMO is really for static attributes).

@josh-fell
Copy link
Contributor Author

josh-fell commented Nov 30, 2023

One question I do have is should the bash_command attribute (aka the return value from the callable) be templated? Clearly BashOperator has this now, but with TaskFlow historically the args passed in to the callable can be templated. So should the behavior for this particular decorator stay the same? I've been wavering a little so does anyone have any thoughts?

For example, with the BashOperator you can do:

BashOperator(task_id="bash", bash_command="echo 'ti={{ task_instance_key_str }}'")

But should this be allowed with the decorator?

@task.bash
def do_bash_things():
    return "echo 'ti={{ task_instance_key_str }}'"

Or not, and users continue with the current TaskFlow approach of:

@task.bash
def do_bash_things(task_instance_key_str=None):
    return f"echo 'ti={task_instance_key_str}'"

Allowing a templated string to be returned is consistent with classic BashOperator authoring, but disallowing it seems to align with TaskFlow authoring generally. I'm leaning towards handling a templated string to be consistent with the BashOperator.

@potiuk
Copy link
Member

potiuk commented Nov 30, 2023

Disabke it by default add ad parameter is_templated?

WHY?

I think the reason why we want it templated is to allow people to convert old operators easily but I think with python return value - f-string is more natural way and we should encourage that rather than Jinja

@josh-fell
Copy link
Contributor Author

Disabke it by default add ad parameter is_templated?

WHY?

I think the reason why we want it templated is to allow people to convert old operators easily but I think with python return value - f-string is more natural way and we should encourage that rather than Jinja

Add a parameter might add some confusion. The current implementation will be able to read a templated Bash script out of the box and there could be Jinja expressions in its contents. But would users be confused about needing to add is_templated_cmd parameter when using a script? Right now, no, but they would if they returned a string from the function that was templated. Maybe we just handle it and the authoring experience is less specialized and consistent between these use cases.

DAG authors can write the command/script they way they need it to run. It's possible DAG authors are implementing tasks from other folks who aren't involved in the Airflow implementation (e.g. a security engineer wanting to orchestrate a series of commands that a data engineer (the DAG author in this scenario) enriches to make idempotent and better fit into an Airflow context).

@josh-fell
Copy link
Contributor Author

josh-fell commented Dec 7, 2023

We can absolutely add a note to the docs that using the f-string approach is encouraged. I completely agree it's more "Pythonic" and fits better in the TaskFlow paradigm.

@potiuk
Copy link
Member

potiuk commented Dec 7, 2023

We can absolutely add a note to the docs that using the f-string approach is encouraged. I completely agree it's more "Pythonic" and fits better in the TaskFlow paradigm.

Fine with that :)

@uranusjr
Copy link
Member

uranusjr commented Dec 7, 2023

I would argue if you need logic it is better to create a shell script file and use templating to pass in the values instead. Trying to generate a shell command with Python code—or more generally, ad-hoc code generation by hand—is a very good way to shoot yourself in the foot.

@potiuk
Copy link
Member

potiuk commented Dec 7, 2023

I would argue if you need logic it is better to create a shell script file and use templating to pass in the values instead. Trying to generate a shell command with Python code—or more generally, ad-hoc code generation by hand—is a very good way to shoot yourself in the foot.

I do not think about much logic - I think it's good to give our users a versaitle set of tools. I find ot really easy to have some patterns usable. Simple if x return "this script" else return "other script" is much better if the logic is done in Python for example.

@josh-fell josh-fell force-pushed the bash-deco branch 2 times, most recently from c6d3cf7 to 3d1367e Compare December 15, 2023 20:05
@josh-fell josh-fell marked this pull request as ready for review December 15, 2023 20:05
@josh-fell josh-fell force-pushed the bash-deco branch 2 times, most recently from 7cd3612 to cb9ef21 Compare December 16, 2023 15:06
Adding a @task.bash TaskFlow decorator to the collection of existing core TaskFlow decorators. This particular decorator will use the return value of the decorated callable as the Bash command to execute using the existing BashOperator functionality.
@josh-fell josh-fell merged commit e920344 into apache:main Dec 19, 2023
52 checks passed
@josh-fell josh-fell deleted the bash-deco branch December 19, 2023 21:51
@ephraimbuddy ephraimbuddy added this to the Airflow 2.9.0 milestone Jan 10, 2024
@ephraimbuddy ephraimbuddy added the type:new-feature Changelog: New Features label Jan 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:new-feature Changelog: New Features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants