Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RedshiftDataHook #19137

Merged
merged 2 commits into from
Mar 2, 2022
Merged

Add RedshiftDataHook #19137

merged 2 commits into from
Mar 2, 2022

Conversation

john-jac
Copy link
Contributor

Airflow users wishing to trigger Amazon Redshift queries today have to use a Postgres connection. While effective, this has the limitation that the Redshift cluster must expose a Postgres endpoint that is accessible to the Airflow environment.

As an alternative, the AWS Boto3 library exposes a client called redshift-data that allows users to query and retrieve data via the AWS API and avoid the necessity of a Postgres connection.

This PR adds a RedshiftDataHook to the existing Redshift provider, along with tests and an example, which wraps the "redshift-data" Boto3 client and allows users to run Redshift queries without a Postgres connection.

In the future, this capability could be added to any existing Redshift transfer operator, with an option to auto-detect whether to use Postgres or Boto3 based upon the connection type.

@boring-cyborg boring-cyborg bot added area:providers provider:amazon-aws AWS/Amazon - related issues labels Oct 21, 2021
Copy link
Contributor

@ephraimbuddy ephraimbuddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we use RedshiftSQLHook or RedshiftHook? Why do we need a new hook? It's not clear to me.

I suggest we add these methods to the existing RedshiftHook and create operator(s) that would do what we want

@john-jac
Copy link
Contributor Author

john-jac commented Oct 22, 2021

Why can't we use RedshiftSQLHook or RedshiftHook? Why do we need a new hook? It's not clear to me.

RedshiftSQLHook extends DbApiHook, whereas RedshiftDataHook is based on AwsBaseHook. While they have similar capabilities, the underlying methods are completely different. If we changed RedshiftSQLHook it would break anything based on at Postgres connection.

I suggest we add these methods to the existing RedshiftHook and create operator(s) that would do what we want

RedshiftHook is based on the boto3 redshift client, which is a set of functions for managing the Redshift cluster and does not provide any data access.

RedshiftDataHook is based on the boto3 redshift-data client, which is a set of functions for accessing Redshift data. I don't believe we have any other examples where a single AWS hook wraps two different boto3 clients, so that would dictate having two different hooks. Further, the Redshift Data API has different IAM policy definitions, so someone with access to data may not have access to cluster management, and vice-versa.

@ephraimbuddy
Copy link
Contributor

In that case, to avoid confusion, we should separate this to a different file e.g redshift_data.py just like we have glue.py, glue_crawler.py and glue_catalog.py. In another PR, we can refactor the other two hooks to be in separate files and create operators from them

Let's create operators for each of those uses where we used the python operator and also add documentation on how to use the hooks/operators. These will improve user experience

@john-jac
Copy link
Contributor Author

In that case, to avoid confusion, we should separate this to a different file e.g redshift_data.py just like we have glue.py, glue_crawler.py and glue_catalog.py. In another PR, we can refactor the other two hooks to be in separate files and create operators from them

Let's create operators for each of those uses where we used the python operator and also add documentation on how to use the hooks/operators. These will improve user experience

OK let me get working on that. Thanks!

Copy link
Contributor

@ephraimbuddy ephraimbuddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing that's required here is a system test. That would help us to verify that the example works as expected.

See https://github.com/apache/airflow/blob/main/tests/providers/amazon/aws/operators/test_ecs_system.py for an example system test.

Here too is more information about a system test: https://github.com/apache/airflow/blob/main/TESTING.rst#airflow-system-tests

airflow/providers/amazon/aws/hooks/redshift_data.py Outdated Show resolved Hide resolved
airflow/providers/amazon/aws/hooks/redshift_data.py Outdated Show resolved Hide resolved
airflow/providers/amazon/aws/hooks/redshift_data.py Outdated Show resolved Hide resolved
:type with_event: bool

"""
"""only provide parameter argument if it is valid"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""only provide parameter argument if it is valid"""
# only provide parameter argument if it is valid

We may not need this comment though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks not done yet

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. likely missed during rebase.

@potiuk
Copy link
Member

potiuk commented Nov 7, 2021

Is this something taht you are still working on @john-jac ?

@john-jac
Copy link
Contributor Author

john-jac commented Nov 7, 2021

Is this something taht you are still working on @john-jac ?

Yes it’s still on my list. There were a number of additions requested. Some I’ve addressed in my comments, the rest I hope to add this week.

airflow/providers/amazon/aws/operators/redshift_data.py Outdated Show resolved Hide resolved
:type autocommit: bool
"""

template_fields = ('sql',)
Copy link
Contributor

@josh-fell josh-fell Nov 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any other fields to add here that users might want to have flexibility to dynamically generate a value for? Maybe parameters, cluster_identifier, database, and/or db_user. The latter 3 could maybe be added to a Connection Extra and accessed with Jinja now that the Connection object is accessible in the template context like "{{ conn.conn_id.extra_dejson.<attr> }}" when calling the operator. No strong opinions though just something to think about.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. The more template_fields the better. It has pretty much no side effects but makes your operators much more flxible and allow to build much more sophissticated cases.

@john-jac
Copy link
Contributor Author

john-jac commented Nov 8, 2021

Another thing that's required here is a system test. That would help us to verify that the example works as expected.

See https://github.com/apache/airflow/blob/main/tests/providers/amazon/aws/operators/test_ecs_system.py for an example system test.

Here too is more information about a system test: https://github.com/apache/airflow/blob/main/TESTING.rst#airflow-system-tests

@ephraimbuddy I don't see other similar operators, such as sagemaker or glue, using system tests. Are we sure this is necessary for redshift-data?

@ephraimbuddy
Copy link
Contributor

@ephraimbuddy I don't see other similar operators, such as sagemaker or glue, using system tests. Are we sure this is necessary for redshift-data?

Agree that some other operators don't have it but this will really help us to validate your example dag and the operator. No strong opinion though

@john-jac
Copy link
Contributor Author

@ephraimbuddy I don't see other similar operators, such as sagemaker or glue, using system tests. Are we sure this is necessary for redshift-data?

Agree that some other operators don't have it but this will really help us to validate your example dag and the operator. No strong opinion though

OK as a best practice I will add it. Thanks!

@eladkal
Copy link
Contributor

eladkal commented Dec 28, 2021

After #20276
is this PR still needed?
Using RedshiftDataHook can be confusing, isn't it?
cc @dstandish

@john-jac
Copy link
Contributor Author

After #20276
is this PR still needed?
Using RedshiftDataHook can be confusing, isn't it?
cc @dstandish

It is different. See description in the comments above.

Very sorry for the delay. The additional tests requested are still to be coded.

@ephraimbuddy ephraimbuddy dismissed their stale review January 11, 2022 16:03

issue addressed

@potiuk potiuk closed this Feb 1, 2022
@potiuk potiuk reopened this Feb 1, 2022
@potiuk
Copy link
Member

potiuk commented Feb 1, 2022

I tihnk this one needs to be rebased. It's 73 commits behind.

@john-jac
Copy link
Contributor Author

john-jac commented Feb 1, 2022

I tihnk this one needs to be rebased. It's 73 commits behind.

Should be up to date now

@josh-fell
Copy link
Contributor

FYI - Now that #20951 is merged, you don't need to specify the Sphinx :type: directives anymore in docstrings. The datatypes can now be inferred from the typing annotations directly when building the Python API documentation.

@dstandish
Copy link
Contributor

before this is merged can we get an agreement on authorship?

it was opened by @john-jac but @vincbeck is taking over the work. only one can be author and one must be co-author. i am thinking that author should be @vincbeck and co-author @john-jac because it seems that probably more than 50% of the effort will have landed on @vincbeck by the time this is merged (and seeing it through to the end can be somewhat harder than getting it started).

if you both agree, @vincbeck please rebase / squash as necessary to ensure that you have the first commit in this branch this way when we finally squash and merge we can ensure appropriate attribution.

unless you don't care in which case you may leave it and you'll end up as co-author.

thanks

@john-jac
Copy link
Contributor Author

I agree @vincbeck should be the author as he is pushing it to completion.

@vincbeck vincbeck force-pushed the redshift-data branch 2 times, most recently from 3a3c542 to 64b9b64 Compare February 28, 2022 23:07
@vincbeck
Copy link
Contributor

I stashed all the commits into one commit and added @john-jac as co-author. Everything should be fine

@potiuk
Copy link
Member

potiuk commented Mar 1, 2022

There are some static check failures to fix.

@vincbeck vincbeck force-pushed the redshift-data branch 5 times, most recently from c821485 to 8afcea5 Compare March 1, 2022 21:43
@vincbeck vincbeck force-pushed the redshift-data branch 2 times, most recently from b5bed88 to 5f10eba Compare March 2, 2022 15:27
Use the AWS Boto3 library to query and retrieve data via the AWS API and avoid the necessity of a Postgres connection.

Co-authored-by: john-jac <jacnjoh@amazon.com>
@vincbeck
Copy link
Contributor

vincbeck commented Mar 2, 2022

My bad. It's now fixed

@dstandish
Copy link
Contributor

dstandish commented Mar 2, 2022

My bad. It's now fixed

uh-may-zing

thanks for the effort @john-jac and @vincbeck

better merge it now before it breaks again!

@john-jac
Copy link
Contributor Author

john-jac commented Mar 2, 2022

better merge it now before it breaks again!

😄 thanks @dstandish !

@dstandish
Copy link
Contributor

spoke too soon apparently, had to make one small rename (done it, don't worry)
will wait for it to build green once more though before merging 🤞

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) provider:amazon-aws AWS/Amazon - related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants