
Add Amazon Redshift-data to S3<>RS Transfer Operators #27947

Merged
eladkal merged 17 commits into apache:main from yehoshuadimarsky:redshift-data-refactor on Feb 20, 2023

Conversation

@yehoshuadimarsky
Contributor

Refactored the Amazon Redshift Data API hook and operator to move the core logic into the hook instead of the operator. This will allow me in the future to implement a Redshift Data API version of the S3 to Redshift transfer operator without having to duplicate the core logic.
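
For illustration, a minimal sketch of the refactor's shape, assuming illustrative names and simplified signatures (not the exact code merged in this PR): the hook owns the Data API call and polling, and the operator becomes a thin wrapper around it.

```python
# Sketch only -- names and signatures are illustrative, not the exact code
# merged in this PR. The point: core Data API logic lives in the hook so
# transfer operators can reuse it.
from airflow.models import BaseOperator
from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook


class RedshiftDataHook(AwsBaseHook):
    def __init__(self, *args, **kwargs):
        kwargs["client_type"] = "redshift-data"  # boto3 redshift-data client
        super().__init__(*args, **kwargs)

    def execute_query(self, database, sql, cluster_identifier, db_user=None,
                      wait_for_completion=True):
        """Run a statement via the Data API; optionally poll until done."""
        params = {"Database": database, "Sql": sql,
                  "ClusterIdentifier": cluster_identifier}
        if db_user:
            params["DbUser"] = db_user
        statement_id = self.conn.execute_statement(**params)["Id"]
        if wait_for_completion:
            self.wait_for_results(statement_id)  # hypothetical polling helper
        return statement_id


class RedshiftDataOperator(BaseOperator):
    """Thin wrapper: all the operator does now is call the hook."""

    def __init__(self, *, database, sql, cluster_identifier, **kwargs):
        super().__init__(**kwargs)
        self.database = database
        self.sql = sql
        self.cluster_identifier = cluster_identifier

    def execute(self, context):
        hook = RedshiftDataHook()
        return hook.execute_query(self.database, self.sql, self.cluster_identifier)
```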



@boring-cyborg boring-cyborg bot added the area:providers and provider:amazon (AWS/Amazon - related issues) labels Nov 27, 2022
@Taragolis
Contributor

To avoid any further regression errors, I would suggest moving the tests for these methods from the operator tests (tests/providers/amazon/aws/operators/test_redshift_data.py) to the hook tests (tests/providers/amazon/aws/hooks/test_redshift_data.py), and testing that the operator calls the expected hook methods.
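
Roughly what the suggested operator test would look like (a sketch; the patch target and constructor arguments are assumptions, not quoted from the PR):

```python
# Sketch of the suggested operator test: mock the hook class where the
# operator imports it and assert delegation -- the hook's own behavior is
# covered by the hook tests.
from unittest import mock

from airflow.providers.amazon.aws.operators.redshift_data import RedshiftDataOperator


@mock.patch("airflow.providers.amazon.aws.operators.redshift_data.RedshiftDataHook")
def test_execute_delegates_to_hook(mock_hook_cls):
    op = RedshiftDataOperator(
        task_id="run_query",
        database="dev",
        sql="SELECT 1;",
        cluster_identifier="my-cluster",
    )
    op.execute(None)
    # Only assert the operator called the hook; don't re-test the hook here.
    mock_hook_cls.return_value.execute_query.assert_called_once()
```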

@yehoshuadimarsky
Contributor Author

To avoid any further regression errors, I would suggest moving the tests for these methods from the operator tests (tests/providers/amazon/aws/operators/test_redshift_data.py) to the hook tests (tests/providers/amazon/aws/hooks/test_redshift_data.py), and testing that the operator calls the expected hook methods.

Thanks, yes, I copied the tests from the operator to the hook, and left the operator tests in place while we still have the deprecated public methods there.

@yehoshuadimarsky yehoshuadimarsky requested review from Taragolis and removed request for eladkal December 1, 2022 19:41
@vincbeck
Contributor

vincbeck commented Dec 2, 2022

To avoid any further regression errors, I would suggest moving the tests for these methods from the operator tests (tests/providers/amazon/aws/operators/test_redshift_data.py) to the hook tests (tests/providers/amazon/aws/hooks/test_redshift_data.py), and testing that the operator calls the expected hook methods.

Thanks, yes, I copied the tests from the operator to the hook, and left the operator tests in place while we still have the deprecated public methods there.

Although this is fine, as mentioned by @Taragolis I would update the operator tests to only check that the hook is correctly called. We should not test the hook's implementation in the operator tests.

Contributor

@vincbeck left a comment

Good changes overall! Just some minor comments

secret_arn: str | None = None,
statement_name: str | None = None,
with_event: bool = False,
await_result: bool = True,
Contributor

By convention, wait_for_completion is usually used as the name for this kind of flag.

Contributor Author

This was the flag name in the existing code, should we really change it?

Contributor

Oh yeah, good point. We can either rename it, but it has to go through the deprecation pattern first (since it is a breaking change), or leave it as is. I am fine keeping it as is.

Contributor

This is net new code (as far as the hook is concerned) so we can rename it here to wait_for_completion without any back compat issues and then leave the name in the Operator as await_completion for back compat (since that was publicly available before). The names will be inconsistent between the two classes, but this could be seen as a step in the right direction.
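
A common shape for that compromise (a simplified sketch, not the exact merged code): the hook uses the new name, and the operator keeps accepting the old one behind a deprecation warning.

```python
# Sketch: the operator translates the old `await_result` argument to the
# hook's new `wait_for_completion`, warning on the deprecated name.
import warnings

from airflow.models import BaseOperator


class RedshiftDataOperator(BaseOperator):  # simplified
    def __init__(self, *, wait_for_completion=True, await_result=None, **kwargs):
        if await_result is not None:
            warnings.warn(
                "`await_result` is deprecated; use `wait_for_completion`.",
                DeprecationWarning,
                stacklevel=2,
            )
            wait_for_completion = await_result
        self.wait_for_completion = wait_for_completion
        super().__init__(**kwargs)
```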

Contributor

Yes, please let's rename; better to stay consistent.

@yehoshuadimarsky yehoshuadimarsky requested review from vincbeck and removed request for Taragolis December 6, 2022 03:24
@dstandish
Contributor

one thing to keep in mind....

we more or less have a convention, especially with aws since we have the helpful boto3 stubs library, that we should not add "alias methods" to hooks... i.e. when the underlying client method works perfectly well by itself, we should not add a method to the hook that simply forwards the kwargs to boto3.

that just adds maintenance and a layer that users may need to decipher

these aws hooks are designed in such a way that autocomplete works for boto3 client methods

e.g. see here

[screenshot: IDE autocomplete suggesting boto3 client methods on the hook's client]

which means, in many cases, no need to add the hook method -- just use the client. and that's what this operator does and that's a good thing. more code sharing isn't always better because it means it's tougher to make changes.
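
Concretely, what the "no alias method" pattern looks like (a sketch; the conn id and identifiers are made up):

```python
# Sketch of using the boto3 client directly through the hook, with no alias
# method in between.
from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook

hook = AwsBaseHook(aws_conn_id="aws_default", client_type="redshift-data")
# `hook.conn` is the plain boto3 client, so the boto3 stubs give
# autocomplete for execute_statement, describe_statement, etc.
response = hook.conn.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT 1;",
)
statement_id = response["Id"]
```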

if we add meaningful functionality to the hook then sure, add a method. but i think in this case execute_query doesn't quite cross that threshold.

i can see wait_for_result potentially being useful but, if you would not mind, i would recommend simply going straight to your transfer operator and then doing any necessary refactors as part of that, rather than saying "this is needed for this other pr" because... verifying that is easier to do when we have the actual PR.

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Jan 21, 2023
@o-nikolas
Contributor

Hey @yehoshuadimarsky,

Any updates on this one, specifically regarding the feedback from Daniel?

@yehoshuadimarsky
Contributor Author

Will take a look

@eladkal eladkal requested review from eladkal and removed request for vincbeck January 25, 2023 17:19
@yehoshuadimarsky
Contributor Author

@dstandish
I think the Data API of Redshift is a bit different some of the other AWS services. The regulars SQL-based hook for Redshift requires a redshift conn_id that contains all of the information needed to connect to a given database. But the Data API requires you to specify the cluster identifier, database name, and database user in the operation itself when making a call, such as in the execute_statement method. This means that if we just delegate any methods down to the underlying boto3 client, every upstream user of the Redshift Data API hook would have to tediously repeat all of these connection parameters in every method the create/invoke.
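
To illustrate the repetition (a sketch with made-up identifiers): with a bare client, every Data API call must carry the full set of connection parameters.

```python
import boto3

client = boto3.client("redshift-data")

client.execute_statement(
    ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser",
    Sql="CREATE TABLE t (x int);",
)
client.execute_statement(
    ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser",  # same three, again
    Sql="INSERT INTO t VALUES (1);",
)
```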

I think there are two approaches we can do here.

  1. Add more parameters to the hook itself, so that when creating a hook you only need to specify these connection args once. Yes, that adds another client layer of the kind you were discouraging. Also, a downside of this is that a single hook instance can only be used for a single Redshift destination.
  2. Do nothing, and indeed expect the end user to pass these params around through each function call. I guess a similar service to this would be S3, where each call can specify a different bucket and/or key prefix.

Not sure what the optimal approach is here going forward. As you suggested, I indeed started the work on the actual transfer operators with S3, and quickly ran into this question of how to model these "extra" connection params that the Data API needs.

What do you (and the greater Airflow community) think or suggest?

@Taragolis
Contributor

Also, a downside of this is that a single hook instance can only be used for a single Redshift destination.

Not a problem at all; that is how it is intended to be used: Single Operator -> Single Hook -> Single Credentials.

every upstream user of the Redshift Data API hook would have to tediously repeat all of these connection parameters in every method they create/invoke.

And actually, users could use default_args to propagate the same arguments to tasks within the DAG/TaskGroup.
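
A sketch of that default_args suggestion (parameter names are illustrative): the Data API connection parameters are set once at the DAG level and inherited by every task.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.redshift_data import RedshiftDataOperator

with DAG(
    dag_id="redshift_data_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    default_args={
        "cluster_identifier": "my-cluster",
        "database": "dev",
        "db_user": "awsuser",
    },
):
    # Each task picks up cluster/database/user from default_args.
    RedshiftDataOperator(task_id="count_rows", sql="SELECT COUNT(*) FROM t;")
```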

@yehoshuadimarsky
Contributor Author

So does that mean you think that it actually makes sense to add this new thin client wrapper on top of the hook?

@github-actions github-actions bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Jan 26, 2023
@Taragolis
Contributor

A thin boto3 client wrapper is a hook based on AwsBaseHook, created without extending the positional and keyword arguments.

IMHO and personal thought: the last thing we want is to add additional arguments to the hook and turn it into a thick wrapper. There are a couple of hooks which accept additional arguments, and it turned into hell to support, test, and use them.

You could check Redshift.Client and find that not every method requires ClusterIdentifier or the other fields you described.

@yehoshuadimarsky
Contributor Author

OK, per the comments above, I added the RS Data API to the S3 -> RS transfer operator.
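
A hypothetical usage sketch of that option; the parameter name redshift_data_api_kwargs is an assumption about the new API, not confirmed verbatim in this thread. The idea: route the COPY through the Data API instead of a SQL connection.

```python
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

transfer = S3ToRedshiftOperator(
    task_id="s3_to_redshift",
    s3_bucket="my-bucket",
    s3_key="data/part-0000.csv",
    schema="public",
    table="my_table",
    redshift_data_api_kwargs={  # assumed: forwarded to the Data API hook
        "database": "dev",
        "cluster_identifier": "my-cluster",
        "db_user": "awsuser",
    },
)
```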

@yehoshuadimarsky
Contributor Author

Also updated the reverse transfer, RS -> S3, with an option to use the RS Data API.

@yehoshuadimarsky
Contributor Author

yehoshuadimarsky commented Feb 3, 2023

@Taragolis @o-nikolas @dstandish @eladkal @vincbeck anyone able to review the new changes for the full PR of adding the Redshift Data API to all the transfer operators? Thanks

Contributor

@o-nikolas left a comment

Took a quick look, looks good. @vincbeck any concerns?


@yehoshuadimarsky yehoshuadimarsky requested review from vincbeck and removed request for eladkal February 10, 2023 19:23
@yehoshuadimarsky
Contributor Author

@eladkal @Taragolis @dstandish anyone able to merge this?


@yehoshuadimarsky
Contributor Author

Updated with the requested changes.

@o-nikolas
Contributor

@eladkal are you satisfied with the changes in response to your request?

@yehoshuadimarsky yehoshuadimarsky changed the title refactored Amazon Redshift-data functionality into the hook Add Amazon Redshift-data to S3<>RS Transfer Operators Feb 20, 2023