Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RedshiftSQLHook, RedshiftSQLOperator #18447

Merged
merged 42 commits into from
Oct 8, 2021

Conversation

Brooke-white
Copy link
Contributor

@Brooke-white Brooke-white commented Sep 22, 2021

closes: #16355 #16544

Adds RedshiftSQLHook and RedshiftSQLOperator, which allow Airflow users to execute Redshift operations.

RedshiftSQLHook is supported by the redshift_connector database driver which supports authenticating via IAM or Identity Provider. Integration with sqlalchemy is supported via sqlalchemy-redshift.

Unit tests have been added to:

  • Test Hook -- ensuring connection parameters flow through to redshift_connector as expected
  • Test Operator -- ensuring statements flow through to redshift_connector.execute method as expected

Docs have been added to:

  • explain how to build a connection
  • provide an example DAG with create/insert/select statements
  • provide reference to redshift_connector docs for details about supported connection options

Use case / motivation

There is now a redshift_connector library which we can use instead of psycopg2. The redshift connector library provides support for IAM & Identity provider authentication as well as Redshift specific datatype support (e.g. GEOMETRY, SUPER, etc.).

See #16355 for more info.


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@boring-cyborg
Copy link

boring-cyborg bot commented Sep 22, 2021

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it’s a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

Copy link
Contributor

@dstandish dstandish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small things... looking good though 👍

One bigger thing is ... should we take this opportunity to replace the existing RedshiftHook (which is about managing clusters) with the one added in this PR (since this by far the more useful and needed one, and more deserving of the name RedshiftHook)

airflow/providers/amazon/aws/hooks/redshift_statement.py Outdated Show resolved Hide resolved
airflow/providers/amazon/aws/hooks/redshift_statement.py Outdated Show resolved Hide resolved
airflow/providers/amazon/aws/hooks/redshift_statement.py Outdated Show resolved Hide resolved
airflow/providers/amazon/aws/hooks/redshift_statement.py Outdated Show resolved Hide resolved
airflow/providers/amazon/aws/hooks/redshift_statement.py Outdated Show resolved Hide resolved
@Brooke-white Brooke-white force-pushed the amazon-redshift-statement-executor branch from 1efe5a9 to 7205a8c Compare September 28, 2021 18:11
@eladkal eladkal linked an issue Sep 29, 2021 that may be closed by this pull request
@dstandish dstandish changed the title add RedshiftStatementHook, RedshiftOperator Add RedshiftSqlHook, RedshiftSqlOperator Sep 29, 2021
@Brooke-white Brooke-white force-pushed the amazon-redshift-statement-executor branch from ddd9d49 to a5e8786 Compare September 29, 2021 18:09
@Brooke-white Brooke-white changed the title Add RedshiftSqlHook, RedshiftSqlOperator Add RedshiftSQLHook, RedshiftSQLOperator Sep 29, 2021
ashb
ashb previously requested changes Sep 30, 2021
Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need a whole new hook for this -- the existing Postgres hook will already connect to Redshift

https://airflow.apache.org/docs/apache-airflow-providers-postgres/stable/_api/airflow/providers/postgres/hooks/postgres/index.html#airflow.providers.postgres.hooks.postgres.PostgresHook

_If we decide we do want a hook for this anyway then it should probably subclass that postgres hook.

@ashb
Copy link
Member

ashb commented Sep 30, 2021

as well as Redshift specific datatype support (e.g. GEOMETRY, SUPER, etc.).

Geometry and Super are returned as strings -- isn't that the default behaviour for pscopg already?

@Brooke-white
Copy link
Contributor Author

@ashb, PostgresHook does not support authenticating via identity provider. That functionality is a big part of #16355.

IMO It would make more sense to have a separate hook for Redshift as it's a different product than Postgres. PostgresHook is using a driver designed for Postgres. RedshiftSQLHook would use a driver designed for Redshift.

Geometry and Super are returned as strings -- isn't that the default behaviour for pscopg already?

Geometry can also be hex encoded under OID 3999, and requires manipulation by the driver before being returned to the user.

@Brooke-white
Copy link
Contributor Author

Brooke-white commented Sep 30, 2021

Some info on authenticating via identity provider using redshift_connector 1 2

The datatypes redshift_connector supports 3

…ault redshift_conn_id value"

This reverts commit a02cdbc14871416c1449acd8c49103d518c4f279.
This reverts commit 76a6e93083a5a9ca8cb6f7f57c5a1a7c46f40da7.
@kaxil kaxil force-pushed the amazon-redshift-statement-executor branch 2 times, most recently from 32832d3 to c3b4e48 Compare October 7, 2021 19:18
@potiuk potiuk merged commit 1df9a51 into apache:main Oct 8, 2021
@boring-cyborg
Copy link

boring-cyborg bot commented Oct 8, 2021

Awesome work, congrats on your first merged pull request!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) full tests needed We need to run full set of tests for this PR to merge kind:documentation provider:amazon-aws AWS/Amazon - related issues
Projects
None yet