
Provide GCP credentials in Bash/Python operators #8432

Closed
mik-laj wants to merge 1 commit into master

Conversation

mik-laj (Member) commented Apr 18, 2020

I want to provide the ability to use GCP credentials in the BashOperator and PythonOperator. Unfortunately, some users have to use these operators in their workflows.

I care most about two things:

  • gcloud commands in the BashOperator (a usage sketch follows below)
  • GCP credentials in the PythonVirtualenvOperator, when system_site_packages=False
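
A rough sketch of the intended usage (the connection id and bucket name are illustrative; the gcp_conn_id parameter is what this PR proposes to add):

    from airflow.operators.bash import BashOperator

    # Sketch only: the operator would fetch credentials from the referenced
    # Airflow connection and expose them to the gcloud/gsutil CLI tools.
    upload_report = BashOperator(
        task_id='upload_report',
        bash_command='gsutil cp /tmp/report.csv gs://my-bucket/report.csv',
        gcp_conn_id='google_cloud_default',  # parameter proposed by this PR
    )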

Make sure to mark the boxes below before creating PR: [x]


In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

@boring-cyborg boring-cyborg bot added the area:docs and area:webserver (Webserver related Issues) labels Apr 18, 2020
codecov-io commented Apr 18, 2020

Codecov Report

Merging #8432 into master will decrease coverage by 0.00%.
The diff coverage is 0.00%.


@@            Coverage Diff            @@
##           master   #8432      +/-   ##
=========================================
- Coverage    6.23%   6.22%   -0.01%     
=========================================
  Files         946     950       +4     
  Lines       45661   45723      +62     
=========================================
  Hits         2846    2846              
- Misses      42815   42877      +62     
Impacted Files Coverage Δ
...rflow/example_dags/example_google_bash_operator.py 0.00% <0.00%> (ø)
...dags/example_google_bash_operator_custom_script.py 0.00% <0.00%> (ø)
...low/example_dags/example_google_python_operator.py 0.00% <0.00%> (ø)
airflow/operators/bash.py 0.00% <0.00%> (ø)
airflow/operators/python.py 0.00% <0.00%> (ø)
airflow/utils/documentation.py 0.00% <0.00%> (ø)
airflow/www/app.py 0.00% <0.00%> (ø)
... and 1 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 96df427...235382f. Read the comment docs.

potiuk (Member) commented Apr 18, 2020

I think it should be a gcp_bash_operator.py deriving from BashOperator, and it should live in providers/google.

mik-laj (Member, Author) commented Apr 18, 2020

@potiuk This contradicts the whole idea and the need for these operators. BashOperator and PythonOperator are useful precisely because they are universal. Bash and Python are also built by composition: new applications are installed on the system and can then be used by any tool. If we inherit and make a GCP-specific customization, we will limit its functionality. It will no longer be a universal operator; you will only be able to use it with one provider. I think this is a similar problem to Connection.get_hook.
https://github.com/apache/airflow/blob/master/airflow/models/connection.py#L301
This method is useful because it is universal and can be used regardless of the provider.
In the future, if we need to separate the core and providers, we can extend this class with a plugin that adds new parameters to the class. Like the get_hook method, it should use the plugin mechanism.

I hope that in the future new parameters will be added for other cloud providers, e.g. AWS.

    cross_platform_task = BashOperator(
        task_id='gcloud',
        bash_command=(
            'gsutil cp gs://bucket/a.txt a.txt && aws s3 cp test.txt s3://mybucket/test2.txt'
        ),
        gcp_conn_id=GCP_PROJECT_ID,
        aws_conn_id=AWS_PROJECT_ID,
    )

Then it will still be a universal operator and we will not create vendor lock-in for a single provider.

From an architectural point of view, using inheritance here would be bad; we should use composition. Inheritance will limit these operators too much.
I invite you to read the article: https://en.wikipedia.org/wiki/Composition_over_inheritance

I will cite only one fragment.

Note that multiple inheritance is dangerous if not implemented carefully, as it can lead to the diamond problem. One solution to avoid this is to create classes such as VisibleAndSolid, VisibleAndMovable, VisibleAndSolidAndMovable, etc. for every needed combination, though this leads to a large amount of repetitive code.

If we replace some words, we have our problem.

Note that multiple inheritance is dangerous if not implemented carefully, as it can lead to the diamond problem. One solution to avoid this is to create classes such as GoogleAndAws, GoogleAndAzure, AwsAndAzureAndGoogle, etc. for every needed combination, though this leads to a large amount of repetitive code.

mik-laj (Member, Author) commented Apr 18, 2020

Having thought about this for a while: we could create a get_subprocess_context_manager method on the hook and also use the get_hook method here. I'm afraid it might be over-engineering. However, if you agree with me that we should use composition, I can try to do it.
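
For illustration, a minimal sketch of that composition-based idea, assuming a hypothetical get_subprocess_context_manager method on hooks and a hypothetical auth_conn_ids operator parameter (neither exists today):

    import contextlib

    from airflow.hooks.base_hook import BaseHook
    from airflow.models.baseoperator import BaseOperator


    class AuthorizedBashOperator(BaseOperator):
        """Hypothetical sketch: the operator knows nothing about any provider."""

        def __init__(self, *, bash_command, auth_conn_ids=None, **kwargs):
            super().__init__(**kwargs)
            self.bash_command = bash_command
            self.auth_conn_ids = auth_conn_ids or []

        def execute(self, context):
            with contextlib.ExitStack() as stack:
                for conn_id in self.auth_conn_ids:
                    # Each provider's hook decides how to authorize a subprocess.
                    hook = BaseHook.get_connection(conn_id).get_hook()
                    stack.enter_context(hook.get_subprocess_context_manager())
                # ... run self.bash_command in a subprocess here ...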

potiuk (Member) commented Apr 18, 2020

I disagree. It's not a matter of inheritance vs. composition; I think we can find a good solution for that. It's a matter of bringing GCP-specific code into the shared BashOperator. Not everyone uses GCP, and having GCP-specific code in a general-purpose operator hurts my eyes.

BashOperator should know nothing about GCP. Maybe indeed this should be done via plugins or a similar solution.

I wonder what others think about it @turbaszek @kaxil @ashb? Should BashOperator contain GCP-specific code? WDYT?

ashb (Member) commented Apr 18, 2020

Yeah, I agree this feels very leaky.

Being able to pass any Airflow connection securely into the Bash (or SSH) operator feels like a useful ability. GCP, or any other special case, less so.
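
As a rough illustration of that more general idea (not what this PR implements), a connection's URI could be injected into the Bash environment; the connection id, variable name, and command below are made up:

    import os

    from airflow.hooks.base_hook import BaseHook
    from airflow.operators.bash import BashOperator

    # Hypothetical sketch: expose an arbitrary Airflow connection to the script
    # by passing its URI through an environment variable.
    conn_uri = BaseHook.get_connection('my_service').get_uri()

    use_connection = BashOperator(
        task_id='use_connection',
        bash_command='my_tool --connection "$MY_SERVICE_URI"',
        # BashOperator's env replaces the inherited environment, so merge os.environ in.
        env={**os.environ, 'MY_SERVICE_URI': conn_uri},
    )

(In a real DAG one would avoid fetching the connection at parse time, for example by resolving it inside the task instead.)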

ashb (Member) commented Apr 18, 2020

If we did this, we'd need an AWS one, plus one for most databases, then Spark plus all of them, etc.

kaxil (Member) commented Apr 18, 2020

I agree with Jarek here; any cloud-specific code in core operators would hurt my eyes.

Each cloud provider also offers a way to authenticate using an environment variable. In the case of GCP it is GOOGLE_APPLICATION_CREDENTIALS, which can be used for the Bash and Python operators.

And if the virtual machine runs on that cloud provider itself, you don't need to authenticate explicitly, as you can use the VM's credentials too.

We already allow injecting env vars into BashOperator.
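
For example, a minimal sketch of that approach (the key path and bucket name are illustrative):

    import os

    from airflow.operators.bash import BashOperator

    upload = BashOperator(
        task_id='upload',
        bash_command='gsutil cp /tmp/report.csv gs://my-bucket/report.csv',
        # Point the Google tooling at a service-account key file via the standard
        # environment variable; merge os.environ so PATH etc. remain available.
        env={**os.environ, 'GOOGLE_APPLICATION_CREDENTIALS': '/secrets/gcp-key.json'},
    )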

My main point is that users already have enough flexibility in Bash and Python. I would let them take care of this, rather than have us maintain this code.

turbaszek (Member) commented

I agree that having provider-specific code in the Bash / Python operators doesn't sound good. However, I think it would be nice to somehow help users authorize in those operators.

Another idea I have is to pass an authorization context manager to the operators:

class BashOperator:
    def __init__(self, ..., authctx=None):
        self.authctx = authctx

    def pre_execute(self, context):
        self.authctx.__enter__()

    def post_execute(self, context, result=None):
        self.authctx.__exit__(None, None, None)

and then users can do something like this:

bop = BashOperator(task_id="id", ..., authctx=gcp_auth(KEY_NAME))
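
A possible sketch of such a gcp_auth helper, assuming it returns an ordinary context manager that the operator enters and exits as in the pseudocode above; the implementation below is purely hypothetical and only sets GOOGLE_APPLICATION_CREDENTIALS, whereas a real one could delegate to a provider hook:

    import contextlib
    import os


    @contextlib.contextmanager
    def gcp_auth(key_path):
        # Hypothetical authorization context: temporarily point the Google SDKs
        # and CLI at the given service-account key file.
        previous = os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')
        os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = key_path
        try:
            yield
        finally:
            if previous is None:
                os.environ.pop('GOOGLE_APPLICATION_CREDENTIALS', None)
            else:
                os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = previous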

kaxil (Member) commented Apr 18, 2020

Yeah, I am fine if we can create something "generic".

mik-laj (Member, Author) commented Apr 18, 2020

I will present something generic in a moment, which will not require passing a complex object. It will be as easy to use as it is now, but more generic. It will be something similar to the code below.

    exit_stack.enter_context(  # pylint: disable=no-member
        BaseHook.get_connection(conn).get_hook(
            conn_id=self.conn_id,
        ).provide_authorization()
    )

New services will continue to be added by composition rather than inheritance.
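
A sketch of the hook side of that composition, using the provide_authorization name from the snippet above (the class is hypothetical; the GCP variant simply delegates to the provide_authorized_gcloud helper referenced later in this thread, and other providers would add their own equivalent):

    import contextlib

    from airflow.providers.google.common.hooks.base_google import GoogleBaseHook


    class GoogleHookWithAuth(GoogleBaseHook):
        # Hypothetical: each provider's hook contributes its own authorization
        # context, so the generic operator never needs provider-specific code.
        @contextlib.contextmanager
        def provide_authorization(self):
            with self.provide_authorized_gcloud():
                yield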

Thank you very much for the comments. This shows that this feature can be useful.

potiuk (Member) commented Apr 18, 2020

> I will present something generic in a moment, which will not require passing a complex object. It will be as easy to use as it is now, but more generic. It will be something similar to the code below.

That looks much better indeed. Let's see how this will look eventually :)

@ashb ashb marked this pull request as draft April 20, 2020 13:20
@ashb ashb added the area:providers label and removed the area:webserver (Webserver related Issues) and area:docs labels Apr 20, 2020
@mik-laj mik-laj mentioned this pull request Apr 25, 2020
evgenyshulman (Contributor) commented May 5, 2020

I think we can have a nice solution together with #8651: the user can define a global context via user_defined_execution_context (we should probably use a different name for the variable), with the context defined like this:

import contextlib
from tempfile import TemporaryDirectory

from airflow.configuration import conf
from airflow.exceptions import AirflowException


@contextlib.contextmanager
def gcp_context(context):
    # Some condition on config deciding which GCP project to use,
    # e.g. conf.get('gcp', 'project_id', fallback=None)
    if not conf.get('gcp', 'project_id', fallback=None):
        yield None
        return
    gcp_delegate_to = ...
    gcp_conn_id = ...
    with TemporaryDirectory(prefix='airflowtmp') as tmp_dir:
        try:
            from airflow.providers.google.common.hooks.base_google import GoogleBaseHook
        except ImportError:
            raise AirflowException(
                'Additional GCP packages are not installed. Please install them to use the '
                'gcp_conn_id parameter. For more information, please look at: '
                f'{DOC_BASE_URL}/installation'
            )
        hook = GoogleBaseHook(gcp_conn_id=gcp_conn_id, delegate_to=gcp_delegate_to)
        with hook.provide_authorized_gcloud() as authorized_context:
            yield authorized_context  # will not be in use

All operators (or only those selected by a function) will benefit from the change.
@mik-laj
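
A hypothetical usage, assuming #8651 adds a user_defined_execution_context argument to DAG (the parameter name follows the comment above and does not exist in Airflow today):

    from airflow import DAG
    from airflow.utils.dates import days_ago

    # Hypothetical: every task in this DAG would then run inside gcp_context().
    dag = DAG(
        dag_id='example_gcp_context',
        start_date=days_ago(1),
        user_defined_execution_context=gcp_context,
    )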

stale bot commented Jun 20, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale (Stale PRs per the .github/workflows/stale.yml policy file) label Jun 20, 2020
@stale stale bot closed this Jun 28, 2020
Labels: area:providers, stale (Stale PRs per the .github/workflows/stale.yml policy file)