
DockerHook: obtain credentials and login to Amazon ECR #26162

Closed

Conversation

@Taragolis (Contributor) commented Sep 5, 2022

Amazon ECR private registries do not have permanent credentials. Users must obtain credentials manually and renew them every 12 hours, or use amazon-ecr-credential-helper.

This PR allows DockerHook to obtain credentials not only from an Airflow Connection but also through an internal BaseDockerCredentialHelper class, for private registries which do not support permanent/service credentials.
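
As a rough illustration of the idea (a sketch only: the class and method names below are assumptions, not the exact API of this PR), a helper receives an optional Airflow Connection and returns short-lived registry credentials:

# Illustrative sketch only; names and signatures are assumptions, not the PR's exact API.
import base64
from typing import NamedTuple, Optional


class DockerRegistryCredentials(NamedTuple):
    username: str
    password: str
    registry: str
    reauth: bool = True


class BaseDockerCredentialHelper:
    """Base class: subclasses obtain (possibly short-lived) registry credentials."""

    def __init__(self, conn=None, **kwargs):
        self.conn = conn

    def get_credentials(self) -> Optional[DockerRegistryCredentials]:
        raise NotImplementedError()


class EcrDockerCredentialHelper(BaseDockerCredentialHelper):
    """Hypothetical ECR helper: exchanges an ECR authorization token for registry credentials."""

    def get_credentials(self) -> Optional[DockerRegistryCredentials]:
        import boto3

        token = boto3.client("ecr").get_authorization_token()["authorizationData"][0]
        username, password = base64.b64decode(token["authorizationToken"]).decode().split(":", 1)
        return DockerRegistryCredentials(username=username, password=password, registry=token["proxyEndpoint"])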

@potiuk (Member) commented Sep 5, 2022

I think this is a good idea, but the ECRDockerCredentialHelper should be part of the Amazon provider, and likely there should be an ECRDockerOperator that uses it by default. This is very similar to what we have, for example, in GKEStartPodOperator, which is based on the generic KubernetesPodOperator and pretty much only does GCP-specific authentication.

I think this is a very similar case.

    credential_helper = AirflowConnectionDockerCredentialHelper
    credential_helper_kwargs = {}
else:
    credential_helper = import_string(credential_helper)
Member (review comment on the diff above):

It is not secure. We should not load user-defined classes, as this makes our application vulnerable to the CWE-502: Deserialization of Untrusted Data weakness.

We should change the logic so that it is not needed, as @potiuk suggests, or add a list of allowed classes as is done during DAG deserialization. See:

if _operator_link_class_path in get_operator_extra_links():
    single_op_link_class = import_string(_operator_link_class_path)
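
Applied to this PR, the same allow-list pattern might look roughly like this (ALLOWED_CREDENTIAL_HELPERS and the helper path are hypothetical, shown only to illustrate the check):

# Hypothetical allow-list guard mirroring the DAG-deserialization pattern quoted above.
from airflow.exceptions import AirflowException
from airflow.utils.module_loading import import_string

# The entry below is a placeholder path, not a real setting shipped with Airflow.
ALLOWED_CREDENTIAL_HELPERS = {
    "airflow.providers.docker.hooks.docker.AirflowConnectionDockerCredentialHelper",
}


def load_credential_helper(class_path: str):
    if class_path not in ALLOWED_CREDENTIAL_HELPERS:
        raise AirflowException(f"Credential helper {class_path!r} is not allow-listed")
    return import_string(class_path)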


Contributor Author (@Taragolis):

As far as I know, Secrets Backends, REST API auth_backends and the Amazon Session Factory are also loaded via import_string without any validation.

Contributor Author (@Taragolis):

Also, I personally added importing retry handlers to SlackHook in #25852.
If it is a security issue it could be removed; no Slack provider has been released since this feature was added.

Member:

Secrets Backends, REST API auth_backends

In this case, the values are not controlled by the user (e.g. via the Web UI), but by the administrator: you must be able to change files on disk to change the secrets backend or auth backend.

The problem occurs when the value of the import_string parameter is controlled remotely, e.g. it is read from the database.

In the case of Slack, we should fix it so that only allowed/safe values can be imported.

@mik-laj (Member) commented Sep 6, 2022

There is no need for the class to be specially downloaded; the class/function may already exist in some package installed on the system, and we can misuse it.
I will have a hard time giving an example for Python right now, but here you can see what an attack looks like in Java:
https://wololo.net/2022/06/11/ps5-ps4-hacker-theflow-discloses-blu-ray-disc-exploit-toolchain-ps5-piracy-not-a-matter-of-if-but-when/

The class com.sony.gemstack.org.dvb.user.UserPreferenceManagerImpl deserializes the userprefs file under privileged context using readObject() which is insecure
The class com.oracle.security.Service contains a method newInstance which calls Class.forName on an arbitrary class name. This allows arbitrary classes, even restricted ones (for example in sun.), to be instantiated.
The class com.sony.gemstack.org.dvb.io.ixc.IxcProxy contains the protected method invokeMethod which can call methods under privileged context. Permission checks in methods can be bypassed

Member:

In this case, we would have to look for a class, function or method that takes a conn argument or ignores unknown keyword arguments in order to exploit it, but it can be any class in any package, so we have quite a large potential attack surface.

@Taragolis (Contributor, Author) commented Sep 6, 2022

I mean the most likely attack comes from the fact that both import_string and a plain import load the module first.

A sample:

# airflow.providers.exploit.hooks.some_db_api

def unsafe_code():
    """Grab fernet key from configs, nudes, decode, send to someone and post on Twitter."""
    ...

unsafe_code()

class SomeDbApiHook(DbApiHook):
    conn_type = "awesome_conn_type"
    ...

So it would be the same if the user calls:

  1. from airflow.providers.exploit.hooks.some_db_api import SomeDbApiHook
  2. import_string("airflow.providers.exploit.hooks.some_db_api.SomeDbApiHook")
  3. Or use airflow.providers.slack.transfers.sql_to_slack.SqlToSlackOperator in Airflow < 2.3 with a connection type referencing awesome_conn_type

Member:

A DAG file is considered trusted content, as it is recommended that it passes through code review. Changing a value in the database does not require code review, so there is a risk.

Contributor Author (@Taragolis):

conn_id is also part of the DAG, so it should pass review as well, together with all the components the DAG is expected to use.

Let's imagine that someone changes a connection after the DAG is deployed (someone with the Admin or Op role). That could happen mainly for two reasons:

  1. Someone with access to change connections has their account compromised, and the "bad guy" also has access to Airflow, which is mostly intended to live in private subnets with access via proxies, bastion/jump hosts and VPNs.
  2. Someone with the Admin or Op role is the "bad guy/impostor".

If some kind of exploit is already in the environment, there is no problem using option 2, or maybe option 3: it also uses import_string for DB API Hooks.

Also, a lot of operators are expected to render templates; BashOperator in particular might not be safe if templates are stored in Airflow Variables. 'bash_command' + 'env' = run whatever you want.
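
A minimal illustration of that pattern (the DAG id and Variable names are made up for the example):

# Minimal illustration: the command and environment come from Airflow Variables,
# so whoever can edit those Variables controls what the task executes.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="variable_templated_bash", start_date=datetime(2022, 1, 1), schedule_interval=None):
    run_from_variable = BashOperator(
        task_id="run_from_variable",
        bash_command="{{ var.value.my_script }}",  # made-up Variable names, for illustration only
        env={"API_TOKEN": "{{ var.value.api_token }}"},
    )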

One caveat with all of the above: it requires many steps and a lot of things have to happen, such as Docker being installed, BashOperator being used with Variables, Airflow credentials being leaked, and the Airflow webserver being exposed to the Internet.

Don't get me wrong, I also think that when it comes to security, too much is never enough.

But I believe there is a bigger chance that a user somehow exposes the cloud credentials used by Airflow (and quite often these credentials have access to almost everything), e.g. credential info stored in the extra field of a Connection.

Also, almost all providers that can send messages, e.g. Slack, Discord, Telegram and MS Teams (just a PR so far), accept a token/credentials directly in operator code (without connections) and do not mask them as secrets (strictly it is not required, since the user stores them directly in DAG code). There is a good chance they get exposed in logs, especially if HttpHook is used. Some "bad guy" could easily use these credentials to send messages in the corporate messenger, e.g. a nice link asking to change the domain user/password.

@mik-laj (Member) commented Sep 6, 2022

@potiuk FYI: The same problem exists in Postgres hook:

def get_iam_token(self, conn: Connection) -> Tuple[str, str, int]:

@Taragolis (Contributor, Author) commented:

I think this is a good idea, but the ECRDockerCredentialHelper should be part of the Amazon provider

Yeah, it could be. I put it into the docker provider to reduce dependency complexity.

and likely there should be an ECRDockerOperator that uses it by default

Why I moved this part inside the DockerHook:

  1. docker_conn_id is used not only in DockerOperator but also in DockerSwarmOperator and the @task.docker task-flow decorator, which would not work with an EcrDockerOperator
  2. This is an optional parameter

@Taragolis (Contributor, Author) commented:

@potiuk FYI: The same problem exists in Postgres hook:

And AWS RDS IAM auth also exists in MySqlHook. However, I don't think it even works in the MySQL case:

def get_iam_token(self, conn: Connection) -> Tuple[str, int]:
    """
    Uses AWSHook to retrieve a temporary password to connect to MySQL
    Port is required. If none is provided, default 3306 is used
    """
    from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook

@potiuk (Member) commented Sep 6, 2022

@potiuk FYI: The same problem exists in Postgres hook:

Yeah, it could be. I put it into the docker provider to reduce dependency complexity.
And AWS RDS IAM auth also exists in MySqlHook. However, I don't think it even works in the MySQL case

I think this should then also be changed in Postgres and MySQL :). The approach where "all AWS" goes to AWS actually makes dependencies less, not more, complex for the user. I think we should avoid any kind of circular dependencies, and we won't avoid them if the "mysql" provider refers to the "aws" one.

Both the Postgres and MySQL implementations with local imports make the dependencies equally (or even more) complex, but they are worse because they are "hidden". If you want to reason, for example, "which version of the AWS provider we should have in order to make Postgres work with AWS", it is impossible to answer, because by using local imports we hide the fact that there is a dependency (it is there, but implicit and hidden).

I think making an explicit dependency chain AWS -> POSTGRES, MYSQL -> COMMON.SQL is the only reasonable approach, and "AWS-specific Postgres-related classes" should be moved to AWS.

  • COMMON.SQL -> should have only generic SQL
  • POSTGRES -> should add postgres functionality on top
  • AWS -> should add AWS-specific postgres extensions

IMHO this is our ultimate goal, even if there are cases that do not follow it yet.

@potiuk (Member) commented Sep 6, 2022

Why I moved this part inside the DockerHook:

docker_conn_id is used not only in DockerOperator but also in DockerSwarmOperator and the @task.docker task-flow decorator, which would not work with an EcrDockerOperator
This is an optional parameter

It should be done in a discoverable way, the very same way that in "common.sql" we instantiate a Hook based on "connection_id" (without knowing the type of the hook upfront). The only requirement is that it extends DbApiHook. We should have a common authentication interface implemented by the Docker operator, and some functionality in the Hook to allow the "authentication" to be discovered and used.
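
For reference, a rough sketch of that discovery pattern, assuming the standard BaseHook.get_connection and Connection.get_hook APIs and the common.sql provider's DbApiHook:

# Sketch: the hook type is derived from the connection, not hard-coded by the caller.
from airflow.hooks.base import BaseHook
from airflow.providers.common.sql.hooks.sql import DbApiHook


def get_db_hook(conn_id: str) -> DbApiHook:
    hook = BaseHook.get_connection(conn_id).get_hook()
    if not isinstance(hook, DbApiHook):
        raise TypeError(f"Connection {conn_id!r} does not map to a DbApiHook, got {type(hook)}")
    return hook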

@Taragolis (Contributor, Author) commented:

I think this should then also be changed in Postgres and MySQL :).

Let's open Pandora's box. What should we do with these ones?

No dependencies declared in the provider (this has existed for a couple of years):

def _fetch_saml_assertion_using_http_spegno_auth(self, saml_config: Dict[str, Any]) -> str:
    # requests_gssapi will need paramiko > 2.6 since you'll need
    # 'gssapi' not 'python-gssapi' from PyPi.
    # https://github.com/paramiko/paramiko/pull/1311
    import requests_gssapi
    from lxml import etree

Local imports from the google provider:

def _get_google_identity_token_loader(self):
    from google.auth.transport import requests as requests_transport

    from airflow.providers.google.common.utils.id_token_credentials import (
        get_default_id_token_credentials,
    )

    audience = self.extra_config.get('assume_role_with_web_identity_federation_audience')

Those are only the ones I know about; similar stuff might exist elsewhere.

We should have a common authentication interface implemented by the Docker operator

I think we should have some common authentication interface for all Hooks that implement connection ability.

@potiuk (Member) commented Sep 6, 2022

Wrong box to open :). This is not part of it.

Both of the examples you mentioned, @Taragolis, are clearly AWS ones; they need no fix, as they are in the right place.

They are (and should be) in AWS. In both cases the "user" of the authentication (AWS) has a dependency on the functionality they use. Having such functionality for an optional AWS <> Google dependency is perfectly fine and it is even "blessed" in our package system: the amazon provider has a [google] extra and the google provider has an [amazon] extra. We even have an exception specially foreseen for that; it is not yet heavily used, but once we split providers into separate repos I was planning to apply it consistently in the places that do not have it.

class AirflowOptionalProviderFeatureException(AirflowException):
    """Raise by providers when imports are missing for optional provider features."""

Those two stories are different than this one, and the distinction is fairly easy to make (even if it is not "technical": this is more about looking at the landscape of our providers from the business side of things than about pure interface/API, and it is based more on the likelihood of the code being better maintained than anything else). The decision where to put code that sits in-between is bound more to whether there are clear stakeholders behind the provider who mostly maintain it and are interested in having this functionality in.

It is captured in a few places already. We have the release process for providers, but we also used the same line of thought when we decided where to put transfer operators. See AIP-21 Changes in import paths, also touched upon in AIP-8 Split Provider packages for Airflow 2.0. One of the decisions we made there is that "transfer" operators are put in the "target" of the transfer when there is a clear stakeholder behind the target. And in that case circular references are unavoidable.

To sum up the line of thought in one sentence: the idea is that the code should be put in the "provider" where there is a stakeholder that is most likely to maintain the code.

And 'Postgres' and 'MySQL' are different. They have no clear stakeholders behind them who are interested in maintaining them. Even though they are commercial databases with companies behind them, they are 'commodity' APIs and those companies are not interested in maintaining the services. Neither Postgres nor MySQL is interested in adding interfacing with AWS/GCP to the provider. But AWS and GCP are each interested in allowing AWS/GCP authentication WITH the Postgres/MySQL provider they expose in their own services offering.

This was discussed at length when AIP-21 was discussed: there are different kinds of providers. Simply speaking, SAML/GSSAPI are "protocols", there are also "databases" (like Postgres/MySQL), and then there are "services", which are a higher layer of abstraction. While "services" can use "protocols" and "databases", neither "protocols" nor "databases" should use "services"; they can only be "used" by services.

This will be much more visible and obvious when we split providers into separate repos. The ideal situation is that people from AWS are subscribed to that single "amazon" provider repo to get notifications about all the AWS-specific changes they are interested in. And this authentication code clearly falls into that "bucket".

The Google federated identity code is also a very good example here: while it is clearly GCP-related, it is the AWS people who are interested in getting it working and making sure they can connect to GCP (for example, to be able to pull data from it). They will be maintaining the code, not the Google people.

@Taragolis (Contributor, Author) commented Sep 6, 2022

Wrong box to open :). This is not part of it.

But it does almost the same thing as PostgresHook.get_iam_token and MySqlHook.get_iam_token: it grants access using the AWS IAM auth method. These methods have existed for a very long time (maybe even before providers existed).

In the case of airflow.providers.amazon.aws.hooks.base_aws.BaseSessionFactory._get_google_identity_token_loader, it uses GCP to auth into AWS. So, applying the same logic, this method should live in the Google provider rather than the Amazon provider.

@Taragolis (Contributor, Author) commented:

In an ideal world, auth for different hooks should be pluggable. In that case a user could authenticate to Postgres in different ways (explicit credentials, pgpass, GSSAPI, AWS IAM, etc.); the same could also work for Docker registries (explicit credentials, ECR Get Token) as well as additional methods to auth into the AWS cloud.

Just as an example: DataGrip, or the JetBrains Database Tools and SQL plugin. As soon as I installed the third-party AWS Toolkit plugin, it enabled additional auth methods for Postgres and MySQL.

[screenshot: additional auth methods shown in DataGrip]

The comparison might not be entirely fair: DataGrip, as well as the Pro versions of JetBrains IDEs, is a commercial product and Airflow is OSS.

@potiuk (Member) commented Sep 11, 2022

Wrong box to open :). This is not part of it.

But it does almost the same thing as PostgresHook.get_iam_token and MySqlHook.get_iam_token: it grants access using the AWS IAM auth method. These methods have existed for a very long time (maybe even before providers existed).

In the case of airflow.providers.amazon.aws.hooks.base_aws.BaseSessionFactory._get_google_identity_token_loader, it uses GCP to auth into AWS. So, applying the same logic, this method should live in the Google provider rather than the Amazon provider.

Yes. The difference is WHO is doing it. If this is code that is under the stewardship of AWS, they chose to add it and they maintain it. In Postgres there is no one to maintain it. This is the main criterion. You are using the wrong comparison. It's not what the code is, but who is interested in maintaining it and reacting when there is a problem.

To put it simply: if in the future a problem with a Google library causes a problem with this operator, the AWS people will have to fix it. And if, for example, it causes a conflict with other providers (say Google), we might choose not to release the AWS provider until the AWS people fix the conflict. Plain and simple. AWS is impacted, they fix it. With the splitting of providers, I want AWS people to subscribe to the aws repo and be aware of all the things they will need to fix.

When the Google provider uses AWS for authentication, it still holds: it's the Google team that should fix any problems that might undermine their way of interfacing with AWS.

With Postgres, the situation is different. If the same problem happens in the Postgres provider, it will hold the Postgres provider back from being released. So a problem with AWS would hold up a new release of the Postgres provider. This is not something we want.

Again, your parallels are not entirely correct; it's not "what code is needed". I agree it is similar code. But IMHO it does not matter. What matters is "who has the incentive to fix it if it is broken". This is a fundamental difference. Think "human", "team", "responsibility", "governance", not "code".

In an ideal world, auth for different hooks should be pluggable. In that case a user could authenticate to Postgres in different ways (explicit credentials, pgpass, GSSAPI, AWS IAM, etc.); the same could also work for Docker registries (explicit credentials, ECR Get Token) as well as additional methods to auth into the AWS cloud.

I don't think it's an "ideal world". It's just one of the choices. Right now in Airflow, authentication is very closely connected to the Connection and the corresponding Hook. Basically, auth is THE ONLY functionality that a Hook provides on top of most of the integrations. You give it a connection id and it gives you back some form of a client that you can use. Other than that, hooks are not very useful; they pass parameters through to the underlying "clients". Quite often a particular Hook directly returns you a Python client that you could instantiate and use yourself. The actual "benefit" of using Hooks is that you can pass them a connection id. That's pretty much it. And this is a big and important feature.

Currently, the way to implement different authentication is through inheritance (if we want to do the handling at the Hook level). If you need an extended Hook with new authentication, you extend one that is there and add the authentication. This fits Airflow's model very well, where a Hook is a class, it is extendable, and it can be passed a connection id. PostgresHook should get a postgres connection id, and so on. If you want to implement a Postgres Hook with a connection id from Google, you should implement a GooglePostgresHook "child" that can take a Google connection as an initialization param. Or if you want to use AWS authentication, implement an AwsPostgresHook "child" that can take an "AWS connection" as an initialization param. And this is the model used currently. This is a different model than "pluggable" authentication, where you "extend" rather than "inject" authentication.
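
A minimal sketch of that extension model, assuming a hypothetical AWS-side child hook and assuming the parent consults get_iam_token when IAM auth is enabled (as in the snippet quoted earlier); the class name and constructor parameter are illustrative:

# Illustrative only: a "child" hook that injects AWS RDS IAM credentials
# instead of plugging an auth object into the existing PostgresHook.
from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
from airflow.providers.postgres.hooks.postgres import PostgresHook


class AwsPostgresIamHook(PostgresHook):
    """Hypothetical extension: authenticates to Postgres with a short-lived RDS IAM token."""

    def __init__(self, *args, aws_conn_id: str = "aws_default", **kwargs):
        super().__init__(*args, **kwargs)
        self.aws_conn_id = aws_conn_id

    def get_iam_token(self, conn):
        rds_client = AwsBaseHook(aws_conn_id=self.aws_conn_id, client_type="rds").conn
        port = conn.port or 5432
        token = rds_client.generate_db_auth_token(DBHostname=conn.host, Port=port, DBUsername=conn.login)
        return conn.login, token, port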

We might have different views on which model is better. I personally think "pluggable authentication" is far more difficult to turn into a generic solution applicable to multiple types of hooks. I'd say Airflow's model, where you extend functionality and add authentication as you need it in a "child" class, is much better, as it saves you from defining a common "authentication" layer that would have to handle a lot of complexity. A good example is the impersonation chain from GCP, which is very specific to GCP; implementing a common "authentication" model which would handle GCP, Azure, AWS, Databricks and a number of other options, via HTTP, proprietary database connection types, OpenID, OAuth, etc., is far more complex and error-prone.

But no matter what our thinking on more/less complex is, the "extension" model is what is currently used in Airflow.

@Taragolis (Contributor, Author) commented:

Again, your parallels are not entirely correct; it's not "what code is needed". I agree it is similar code. But IMHO it does not matter. What matters is "who has the incentive to fix it if it is broken". This is a fundamental difference. Think "human", "team", "responsibility", "governance", not "code".

It depends which way you look at it. To me it looks the same: some additional method added years ago and not well maintained or documented.

The reason why IAM auth exists in the postgres provider and not in amazon, just as Google federation exists in the amazon provider and not in google, is clear to me: no other way existed to extend authentication.

To be clear, I agree that IAM auth should move out of the postgres/mysql providers. I'm just thinking about an appropriate replacement which would be easy to maintain and would make it easy to add integrations with other cloud providers.

Currently, the way to implement different authentication is through inheritance

The issue is that right now, if you extend a hook, you also need to re-implement all the operators.
So if we create an AwsPostgresHook, we would additionally need to implement all the existing operators and task-flow decorators, like an AwsPostgresToGCSOperator.

It could be done the way it is done for DbApiHook; however, DbApiHook is not only about auth, it is also about supporting the specific protocol of a specific database.

However, for Postgres, Docker, MySQL and maybe others, it is only about how to obtain credentials and pass them to a specific client. It might be expected to execute some class. In that case it might require some changes in core, so it would not be available in providers until the minimum supported version of Airflow meets the requirement.

And maybe it is more about creating an AIP and discussing the different approaches and benefits. WDYT?

@Taragolis (Contributor, Author) commented:

@mik-laj @potiuk Let's finalise the discussion. We are not ready to add auth to Amazon ECR the way it is described in this PR, for a couple of reasons:

  1. We need to move the AWS auth out of the postgres provider into amazon.
  2. We cannot use import_string for untrusted modules, but if we move EcrDockerCredentialHelper to aws I do not see any way to do that without specifying it somewhere in the hook.

So maybe there is some additional way to help users integrate with ECR if required? It could be a page in the documentation, like this snippet #9707 (comment), or an operator which could update a specific docker connection with ECR credentials.

WDYT guys?

@potiuk (Member) commented Sep 19, 2022

Why not use the opportunity to add the auth mechanism now? I do not think this needs a super-generic solution that will work for all hooks. There are a few operators that really need this kind of pluggable authentication, and I think this DockerHook change is a good opportunity to add it.

I would say we need something very similar to what the requests library uses, where you can pass an auth object (https://requests.readthedocs.io/en/latest/user/authentication/): you could add an "auth" parameter to DockerHook and the Docker operators and pass a "DockerAuth" Protocol / abstract class that would handle the authentication.

Once we get it implemented here, we could apply a similar approach to Postgres to fix the current behaviour, and to any other class that does it in a similar way. I think there could be various auth classes in "amazon": one for Docker, one for Postgres, etc., and they could use some common code if needed.
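
For illustration, such an auth object could look roughly like the sketch below; every name here (DockerRegistryAuth, EcrRegistryAuth, the apply method, the auth parameter) is an assumption, not an agreed API:

# Rough sketch of a requests-style pluggable auth object for DockerHook.
# All names here are illustrative assumptions, not an agreed API.
import base64
from typing import Optional, Protocol


class DockerRegistryAuth(Protocol):
    def apply(self, client) -> None:
        """Log the docker client in to the registry (e.g. by calling client.login(...))."""


class EcrRegistryAuth:
    """Hypothetical amazon-provider implementation using a temporary ECR authorization token."""

    def __init__(self, aws_conn_id: str = "aws_default", region_name: Optional[str] = None):
        self.aws_conn_id = aws_conn_id
        self.region_name = region_name

    def apply(self, client) -> None:
        from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook

        ecr = AwsBaseHook(aws_conn_id=self.aws_conn_id, client_type="ecr", region_name=self.region_name).conn
        data = ecr.get_authorization_token()["authorizationData"][0]
        username, password = base64.b64decode(data["authorizationToken"]).decode().split(":", 1)
        client.login(username=username, password=password, registry=data["proxyEndpoint"], reauth=True)


# Hypothetical usage: DockerHook(docker_conn_id=None, auth=EcrRegistryAuth(aws_conn_id="aws_default"))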

Do you see any drawbacks to that approach, @Taragolis?

@Taragolis (Contributor, Author) commented:

@potiuk Right now I think this is the only approach which could cover:

  • Auth lives in provider packages or user-defined scripts
  • Does not use import_string

It only requires small changes in the current operators and does not need to re-implement everything when only the auth method changes. This PR is also in draft for one reason: I do not have an idea of the best place to write the documentation on how to use the auth.

It might be:

  • In the docker provider: information on how to implement a custom class and what DockerHook actually passes to it
  • In the amazon provider: information about what the user should provide to the ECR-related class

I would also like to add some information about other hooks and things which might (or might not) be nice to implement in the future. This is not related to the current PR.


HttpHook

Right now my daily workload does not require it, but a couple of years ago we were using it very actively. I do not remember exactly; maybe in 1.10.4 HttpHook did not even have a parameter for requests auth (today I'm too lazy to check).

But the major issue currently is that you need to create a new hook and overwrite the get_conn method if you need to provide more than just the login and password from the connection. So maybe it is also a good idea to implement some generic way to grab credentials from a Connection and provide them to HttpHook, so it wouldn't require re-creating the hook and operators when only custom auth is required.

I found this code in an old project (note: the code was written for 1.10.x).

Custom Auth

from datetime import datetime, timedelta

import jwt
from requests.auth import AuthBase


class BearerAuth(AuthBase):
    """ Bearer Authorization implementation for requests """

    def __init__(self, token):
        self._token = token

    def __call__(self, r):
        r.headers["Authorization"] = "Bearer %s" % self._token
        return r


class SomeDumbAuth(AuthBase):
    """ Authorization for some private API """

    def __init__(self, key, secret):
        self._key = key
        self._secret = secret

    def __call__(self, r):
        if not r.body:
            r.body = ""
        if r.body:
            b = {x.split("=")[0]: x.split("=")[1] for x in r.body.split("&")}
        else:
            b = {}

        b.update({
            "some_key": self._key,
            "some_secret": self._secret,
        })
        r.body = "&".join("%s=%s" % (k, v) for k,v in b.items())

        return r


class AppStoreAuth(AuthBase):
    """ AppStore Authorization implementation for requests """

    def __init__(self, private_key, key_id, issuer_id):
        self._private_key = private_key
        self._key_id = key_id
        self._issuer_id = issuer_id

    def __call__(self, r):
        headers = {
            "alg": 'ES256',
            "kid": self._key_id,
            "typ": "JWT"
        }
        payload = {
            "iss": self._issuer_id,
            "exp": int((datetime.now() + timedelta(minutes=20)).strftime("%s")),
            "aud": "appstoreconnect-v1"
        }

        token = jwt.encode(
            payload=payload,
            key=self._private_key,
            algorithm='ES256',
            headers=headers
        ).decode(encoding="utf-8")

        r.headers["Authorization"] = "Bearer %s" % token
        return r

Hook which uses custom auth from the connection:

import requests

from airflow.hooks.http_hook import HttpHook  # Airflow 1.10.x import path


class AppStoreSalesHttpHook(HttpHook):
    """ HTTP Hook for AppStore connect API """

    def __init__(self, endpoint, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.endpoint = endpoint

    def get_conn(self, headers: dict = None):
        """ Returns http requests session for AppStore connect use with requests

        Args:
            headers: additional headers to be passed through as a dictionary
        """
        session = requests.Session()
        if self.http_conn_id:
            conn = self.get_connection(self.http_conn_id)

            if conn.password:
                private_key = conn.password.replace("\\n", "\n")
                key_id = conn.extra_dejson.get('KeyId')
                issuer_id = conn.extra_dejson.get('IssuerId')

                session.auth = AppStoreAuth(
                    private_key=private_key,
                    key_id=key_id,
                    issuer_id=issuer_id,
                )
            else:
                raise ValueError("Missing extra parameters for connection %r" % self.http_conn_id)

            if conn.host and "://" in conn.host:
                self.base_url = conn.host
            else:
                # schema defaults to HTTP
                schema = conn.schema if conn.schema else "http"
                host = conn.host if conn.host else ""
                self.base_url = schema + "://" + host

            if conn.port:
                self.base_url = self.base_url + ":" + str(conn.port)

        if headers:
            session.headers.update(headers)

        return session

PostgresHook

It is not a big deal to change the hook to use custom auth. One thing I would change is to drop Redshift support from PostgresHook:

if conn.extra_dejson.get('redshift', False):
    port = conn.port or 5439
    # Pull the custer-identifier from the beginning of the Redshift URL
    # ex. my-cluster.ccdre4hpd39h.us-east-1.redshift.amazonaws.com returns my-cluster
    cluster_identifier = conn.extra_dejson.get('cluster-identifier', conn.host.split('.')[0])
    redshift_client = AwsBaseHook(aws_conn_id=aws_conn_id, client_type="redshift").conn
    # https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/redshift.html#Redshift.Client.get_cluster_credentials
    cluster_creds = redshift_client.get_cluster_credentials(
        DbUser=login,
        DbName=self.schema or conn.schema,
        ClusterIdentifier=cluster_identifier,
        AutoCreate=False,
    )
    token = cluster_creds['DbPassword']
    login = cluster_creds['DbUser']

In the Amazon provider there are two different ways (and two different hooks) to interact with Redshift:

  1. DB-API via redshift-connector
  2. Via the AWS API and boto3

Even though Redshift uses the PostgreSQL protocol, I'm not sure whether psycopg2 officially supports Redshift, but psycopg (v3) does not officially support it: psycopg/psycopg#122.


MySQL

I'm still not sure that the current implementation actually allows using AWS IAM (need to check).
But also, for MySQL, the auth class would need the ability to specify which driver the user actually uses:

if client_name == 'mysqlclient':
    import MySQLdb

    conn_config = self._get_conn_config_mysql_client(conn)
    return MySQLdb.connect(**conn_config)
if client_name == 'mysql-connector-python':
    import mysql.connector

    conn_config = self._get_conn_config_mysql_connector_python(conn)
    return mysql.connector.connect(**conn_config)


Extensible Connection / Pluggable Auth

In my head right now this part is a bit more than just auth.
Currently an Airflow Connection might be used for different things:

  • For authentication
  • For configuring the client
  • As parameters for an API call (like the Amazon EMR Connection, which is actually not a connection and is used only in one place)

The main problem is how it is shown in the UI and how it is stored. A good example is the AWS Connection:

  • Quite a few different methods and extra parameters which actually won't work together
  • boto3 configurations

An extensible Connection might show different UI depending on the selected auth type; it would also be nice to have separate tabs for auth and configuration.

But again, it is just an idea which in my head might be brilliant but in the real world requires a lot of changes.
Also, I have not even tried to sketch on paper how it would work and integrate with Airflow and its components.

@potiuk (Member) commented Sep 20, 2022

Yep. I want to see what your proposal will look like after the rebase, but yeah, the smaller the set of changes, the better for now.

@Taragolis (Contributor, Author) commented:

@potiuk I think it would require only small changes compared to the current PR:

  1. Rename BaseDockerCredentialHelper to something like BaseDockerRegistryAuth; "credential helper" might confuse some users, since utilities with the same name already exist in the docker stack.
  2. Move the ECR-related code to the amazon provider.
  3. Assign it as a Hook parameter rather than using import_string.
  4. Some small changes in the operators to allow providing these classes.

But yeah, better to have a look after I make the changes. I hope it will happen within a couple of days.

@potiuk (Member) commented Sep 20, 2022

Fantastic!

github-actions bot commented Nov 5, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

github-actions bot added the stale label (per the .github/workflows/stale.yml policy) on Nov 5, 2022
@Taragolis (Contributor, Author) commented:

Will rebase and make changes. I totally forgot about this PR.

@eladkal removed the stale label on Nov 5, 2022
github-actions bot commented:

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

github-actions bot added the stale label on Dec 21, 2022
github-actions bot closed this pull request on Dec 26, 2022
@Taragolis mentioned this pull request on Jan 19, 2023
Labels: area:providers, kind:documentation, stale