
Add Amazon Redshift-data to S3<>RS Transfer Operators #27947

Merged
eladkal merged 17 commits into apache:main from yehoshuadimarsky:redshift-data-refactor on Feb 20, 2023

Conversation

@yehoshuadimarsky
Contributor

Refactored the Amazon Redshift Data API hook and operator to move the core logic into the hook instead of the operator. This will allow me in the future to implement a Redshift Data API version of the S3 to Redshift transfer operator without having to duplicate the core logic.
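
For illustration, a minimal sketch of the refactor's shape, assuming illustrative names and simplified signatures (not the exact code merged in this PR): the hook owns the Data API call and polling, and the operator becomes a thin wrapper around it.

```python
# Sketch only -- names and signatures are illustrative, not the exact code
# merged in this PR. The point: core Data API logic lives in the hook so
# transfer operators can reuse it.
from airflow.models import BaseOperator
from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook


class RedshiftDataHook(AwsBaseHook):
    def __init__(self, *args, **kwargs):
        kwargs["client_type"] = "redshift-data"  # boto3 redshift-data client
        super().__init__(*args, **kwargs)

    def execute_query(self, database, sql, cluster_identifier, db_user=None,
                      wait_for_completion=True):
        """Run a statement via the Data API; optionally poll until done."""
        params = {"Database": database, "Sql": sql,
                  "ClusterIdentifier": cluster_identifier}
        if db_user:
            params["DbUser"] = db_user
        statement_id = self.conn.execute_statement(**params)["Id"]
        if wait_for_completion:
            self.wait_for_results(statement_id)  # hypothetical polling helper
        return statement_id


class RedshiftDataOperator(BaseOperator):
    """Thin wrapper: all the operator does now is call the hook."""

    def __init__(self, *, database, sql, cluster_identifier, **kwargs):
        super().__init__(**kwargs)
        self.database = database
        self.sql = sql
        self.cluster_identifier = cluster_identifier

    def execute(self, context):
        hook = RedshiftDataHook()
        return hook.execute_query(self.database, self.sql, self.cluster_identifier)
```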



@boring-cyborg boring-cyborg bot added the area:providers and provider:amazon (AWS/Amazon - related issues) labels Nov 27, 2022
@Taragolis
Contributor

To avoid any further regression errors, I would suggest moving the tests for these methods from the operator tests (tests/providers/amazon/aws/operators/test_redshift_data.py) to the hook tests (tests/providers/amazon/aws/hooks/test_redshift_data.py), and testing that the operator calls the expected hook methods.
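
Roughly what the suggested operator test would look like (a sketch; the patch target and constructor arguments are assumptions, not quoted from the PR):

```python
# Sketch of the suggested operator test: mock the hook class where the
# operator imports it and assert delegation -- the hook's own behavior is
# covered by the hook tests.
from unittest import mock

from airflow.providers.amazon.aws.operators.redshift_data import RedshiftDataOperator


@mock.patch("airflow.providers.amazon.aws.operators.redshift_data.RedshiftDataHook")
def test_execute_delegates_to_hook(mock_hook_cls):
    op = RedshiftDataOperator(
        task_id="run_query",
        database="dev",
        sql="SELECT 1;",
        cluster_identifier="my-cluster",
    )
    op.execute(None)
    # Only assert the operator called the hook; don't re-test the hook here.
    mock_hook_cls.return_value.execute_query.assert_called_once()
```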

@yehoshuadimarsky
Contributor Author

To avoid any further regression errors, I would suggest moving the tests for these methods from the operator tests (tests/providers/amazon/aws/operators/test_redshift_data.py) to the hook tests (tests/providers/amazon/aws/hooks/test_redshift_data.py), and testing that the operator calls the expected hook methods.

Thanks, yes, I copied the tests from the operator to the hook, and left the operator tests in place while we still have the deprecated public methods there.

@yehoshuadimarsky yehoshuadimarsky requested review from Taragolis and removed request for eladkal December 1, 2022 19:41
@vincbeck
Contributor

vincbeck commented Dec 2, 2022

To avoid any further regression errors, I would suggest moving the tests for these methods from the operator tests (tests/providers/amazon/aws/operators/test_redshift_data.py) to the hook tests (tests/providers/amazon/aws/hooks/test_redshift_data.py), and testing that the operator calls the expected hook methods.

Thanks, yes, I copied the tests from the operator to the hook, and left the operator tests in place while we still have the deprecated public methods there.

Although this is fine, as mentioned by @Taragolis I would update the operator tests to only check that the hook is correctly called. We should not test the hook's implementation in the operator tests.

Contributor

@vincbeck left a comment

Good changes overall! Just some minor comments

secret_arn: str | None = None,
statement_name: str | None = None,
with_event: bool = False,
await_result: bool = True,
Contributor

By convention, wait_for_completion is usually used as the name for this kind of flag.

Contributor Author

This was the flag name in the existing code, should we really change it?

Contributor

Oh yeah, good point. We can either rename it, but it has to go through the deprecation pattern first (since it is a breaking change), or leave it as is. I am fine keeping it as is.

Contributor

This is net new code (as far as the hook is concerned) so we can rename it here to wait_for_completion without any back compat issues and then leave the name in the Operator as await_completion for back compat (since that was publicly available before). The names will be inconsistent between the two classes, but this could be seen as a step in the right direction.
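
A common shape for that compromise (a simplified sketch, not the exact merged code): the hook uses the new name, and the operator keeps accepting the old one behind a deprecation warning.

```python
# Sketch: the operator translates the old `await_result` argument to the
# hook's new `wait_for_completion`, warning on the deprecated name.
import warnings

from airflow.models import BaseOperator


class RedshiftDataOperator(BaseOperator):  # simplified
    def __init__(self, *, wait_for_completion=True, await_result=None, **kwargs):
        if await_result is not None:
            warnings.warn(
                "`await_result` is deprecated; use `wait_for_completion`.",
                DeprecationWarning,
                stacklevel=2,
            )
            wait_for_completion = await_result
        self.wait_for_completion = wait_for_completion
        super().__init__(**kwargs)
```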

Contributor

Yes, please let's rename; better to stay consistent.

@yehoshuadimarsky yehoshuadimarsky requested review from vincbeck and removed request for Taragolis December 6, 2022 03:24
@dstandish
Contributor

one thing to keep in mind....

we more or less have a convention, especially with aws since we have the helpful boto3 stubs library, that we should not add "alias methods" to hooks... i.e. when the underlying client method works perfectly well by itself, we should not add a method to the hook that simply forwards the kwargs to boto3.

that just adds maintenance and a layer that users may need to decipher

these aws hooks are designed in such a way that autocomplete works for boto3 client methods

e.g. see here

[screenshot: IDE autocomplete suggesting boto3 client methods on the hook's client]

which means, in many cases, no need to add the hook method -- just use the client. and that's what this operator does and that's a good thing. more code sharing isn't always better because it means it's tougher to make changes.
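
Concretely, what the "no alias method" pattern looks like (a sketch; the conn id and identifiers are made up):

```python
# Sketch of using the boto3 client directly through the hook, with no alias
# method in between.
from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook

hook = AwsBaseHook(aws_conn_id="aws_default", client_type="redshift-data")
# `hook.conn` is the plain boto3 client, so the boto3 stubs give
# autocomplete for execute_statement, describe_statement, etc.
response = hook.conn.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT 1;",
)
statement_id = response["Id"]
```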

if we add meaningful functionality to the hook then sure, add a method. but i think in this case execute_query doesn't quite cross that threshold.

i can see wait_for_result potentially being useful but, if you would not mind, i would recommend simply going straight to your transfer operator and then doing any necessary refactors as part of that, rather than saying "this is needed for this other pr" because... verifying that is easier to do when we have the actual PR.

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Jan 21, 2023
@o-nikolas
Contributor

Hey @yehoshuadimarsky,

Any updates on this one, specifically regarding the feedback from Daniel?

@yehoshuadimarsky
Contributor Author

Will take a look

@eladkal eladkal requested review from eladkal and removed request for vincbeck January 25, 2023 17:19
@yehoshuadimarsky
Contributor Author

@dstandish
I think the Data API of Redshift is a bit different some of the other AWS services. The regulars SQL-based hook for Redshift requires a redshift conn_id that contains all of the information needed to connect to a given database. But the Data API requires you to specify the cluster identifier, database name, and database user in the operation itself when making a call, such as in the execute_statement method. This means that if we just delegate any methods down to the underlying boto3 client, every upstream user of the Redshift Data API hook would have to tediously repeat all of these connection parameters in every method the create/invoke.
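
To illustrate the repetition (a sketch with made-up identifiers): with a bare client, every Data API call must carry the full set of connection parameters.

```python
import boto3

client = boto3.client("redshift-data")

client.execute_statement(
    ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser",
    Sql="CREATE TABLE t (x int);",
)
client.execute_statement(
    ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser",  # same three, again
    Sql="INSERT INTO t VALUES (1);",
)
```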

I think there are two approaches we can do here.

  1. Add more parameters to the hook itself, so that when creating a hook you only need to specify these connection args once. Yes, that adds another client layer of the kind you were discouraging. Also, a downside of this is that a single hook instance can only be used for a single Redshift destination.
  2. Do nothing, and indeed expect the end user to pass these params around through each function call. I guess a similar service to this would be S3, where each call can specify a different bucket and/or key prefix.

Not sure what the optimal approach is here going forward. As you suggested, I indeed started the work on the actual transfer operators with S3, and quickly ran into this question of how to model these "extra" connection params that the Data API needs.

What do you (and the greater Airflow community) think or suggest?

@Taragolis
Contributor

Also, a downside of this is that a single hook instance can only be used for a single Redshift destination.

Not a problem at all; that is how it is intended to be used: Single Operator -> Single Hook -> Single Credentials.

every upstream user of the Redshift Data API hook would have to tediously repeat all of these connection parameters in every method they create/invoke.

And actually, users could use default_args to propagate the same arguments to tasks within the DAG/TaskGroup.
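
A sketch of that default_args suggestion (parameter names are illustrative): the Data API connection parameters are set once at the DAG level and inherited by every task.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.redshift_data import RedshiftDataOperator

with DAG(
    dag_id="redshift_data_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    default_args={
        "cluster_identifier": "my-cluster",
        "database": "dev",
        "db_user": "awsuser",
    },
):
    # Each task picks up cluster/database/user from default_args.
    RedshiftDataOperator(task_id="count_rows", sql="SELECT COUNT(*) FROM t;")
```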

@yehoshuadimarsky
Contributor Author

So does that mean you think that it actually makes sense to add this new thin client wrapper on top of the hook?

@github-actions github-actions bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Jan 26, 2023
@Taragolis
Contributor

A thin boto3 client wrapper is a hook based on AwsBaseHook, created without extending the positional and keyword arguments.

IMHO and personal thought: the last thing we want is to add additional arguments to the hook and turn it into a thick wrapper. There are a couple of hooks which accept additional arguments, and it turned into hell to support, test, and use them.

You could check Redshift.Client and find that not every method requires ClusterIdentifier or the other fields you described.

@yehoshuadimarsky
Contributor Author

OK, per the comments above, I added the RS Data API to the S3 -> RS transfer operator.
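
A hypothetical usage sketch of that option; the parameter name redshift_data_api_kwargs is an assumption about the new API, not confirmed verbatim in this thread. The idea: route the COPY through the Data API instead of a SQL connection.

```python
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

transfer = S3ToRedshiftOperator(
    task_id="s3_to_redshift",
    s3_bucket="my-bucket",
    s3_key="data/part-0000.csv",
    schema="public",
    table="my_table",
    redshift_data_api_kwargs={  # assumed: forwarded to the Data API hook
        "database": "dev",
        "cluster_identifier": "my-cluster",
        "db_user": "awsuser",
    },
)
```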

@yehoshuadimarsky
Contributor Author

Also updated the reverse transfer, RS -> S3, with an option to use the RS Data API.

@yehoshuadimarsky
Contributor Author

yehoshuadimarsky commented Feb 3, 2023

@Taragolis @o-nikolas @dstandish @eladkal @vincbeck anyone able to review the new changes for the full PR of adding the Redshift Data API to all the transfer operators? Thanks

Contributor

@o-nikolas left a comment

Took a quick look, looks good. @vincbeck any concerns?


@yehoshuadimarsky yehoshuadimarsky requested review from vincbeck and removed request for eladkal February 10, 2023 19:23
@yehoshuadimarsky
Contributor Author

@eladkal @Taragolis @dstandish anyone able to merge this?


@yehoshuadimarsky
Contributor Author

Updated with the requested changes.

@o-nikolas
Contributor

@eladkal are you satisfied with the changes in response to your request?

@yehoshuadimarsky yehoshuadimarsky changed the title refactored Amazon Redshift-data functionality into the hook Add Amazon Redshift-data to S3<>RS Transfer Operators Feb 20, 2023