Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not check for S3 key before attempting download #19504

Merged
merged 2 commits into from
Nov 10, 2021

Conversation

dstandish
Copy link
Contributor

When you download a key that exists, notice that it retrieves creds twice:

[2021-11-09 22:25:18,736] {s3.py:807} INFO - Downloading source S3 file from Bucket oss-test-xcom with path test-dag/test-task/2021-01-01T00:00:00+00:00/hellodf
[2021-11-09 22:25:18,736] {base_aws.py:401} INFO - Airflow Connection: aws_conn_id=aws_default
[2021-11-09 22:25:18,752] {credentials.py:1224} INFO - Found credentials in shared credentials file: ~/.aws/credentials
[2021-11-09 22:25:19,049] {base_aws.py:424} WARNING - Unable to use Airflow Connection for credentials.
[2021-11-09 22:25:19,049] {base_aws.py:425} INFO - Fallback on boto3 credential strategy
[2021-11-09 22:25:19,049] {base_aws.py:428} INFO - Creating session using boto3 credential strategy region_name=None
/Users/dstandish/code/airflow/airflow/providers/amazon/aws/hooks/base_aws.py:494 DeprecationWarning: client_type is deprecated. Set client_type from class attribute.
[2021-11-09 22:25:19,066] {credentials.py:1224} INFO - Found credentials in shared credentials file: ~/.aws/credentials
[2021-11-09 22:25:19,526] {base_aws.py:401} INFO - Airflow Connection: aws_conn_id=aws_default
[2021-11-09 22:25:19,594] {base_aws.py:424} WARNING - Unable to use Airflow Connection for credentials.
[2021-11-09 22:25:19,594] {base_aws.py:425} INFO - Fallback on boto3 credential strategy
[2021-11-09 22:25:19,594] {base_aws.py:428} INFO - Creating session using boto3 credential strategy region_name=None
/Users/dstandish/code/airflow/airflow/providers/amazon/aws/hooks/s3.py:343 DeprecationWarning: resource_type is deprecated. Set resource_type from class attribute.
[2021-11-09 22:25:19,628] {credentials.py:1224} INFO - Found credentials in shared credentials file: ~/.aws/credentials

The first is for checking existence and the second is retrieving the object.

We don't need to check for existence. We can just ask for the object and if it's not there, the api will let is know. And when the object is there, we'll only have retrieved creds once.

@boring-cyborg boring-cyborg bot added area:providers provider:amazon-aws AWS/Amazon - related issues labels Nov 10, 2021
s3_obj = self.get_key(key, bucket_name)
try:
s3_obj = self.get_key(key, bucket_name)
except ClientError as e:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my preference would be to not catch this error at all, but catching it keeps this consistent with the existing behavior

S3Hook.download_file first checks object existence then downloads.
This resolves creds 2 times.  We don't need to check existence.
Just ask for the key and if it's not there you'll know from the error.
And if it is there, you'll only have resolved creds once.
@ashb ashb changed the title Do not check for key before attempting download Do not check for S3 key before attempting download Nov 10, 2021
@github-actions
Copy link

The PR is likely OK to be merged with just subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions github-actions bot added the okay to merge It's ok to merge this PR as it does not require more tests label Nov 10, 2021
@feluelle
Copy link
Member

But the main issue is that it does not reuse the connection, right? Not that we do two requests.

@feluelle
Copy link
Member

So I would suggest we reuse the connection instead.

@ashb
Copy link
Member

ashb commented Nov 10, 2021

I suspect it's to do with get_conn vs get_resource, and Boto itself treats those as two separate connections

@feluelle
Copy link
Member

Correct. This is also my assumption @ashb.

@ashb
Copy link
Member

ashb commented Nov 10, 2021

Oh yeah, so get_client_type and get_resource_type bot call _get_credentials -- that should be cached.

This change makes sense anyway though -- there's no point in making two requests.

@feluelle
Copy link
Member

But the first one is only doing a HEAD operation. So technically it is not the same. 😅 But yes, we can get rid of this request if you want. 👍

Copy link
Member

@kaxil kaxil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- one nit

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
@dstandish
Copy link
Contributor Author

But the main issue is that it does not reuse the connection, right? Not that we do two requests.
But the first one is only doing a HEAD operation. So technically it is not the same. 😅 But yes, we can get rid of this request if you want. 👍

it's true that the non-reuse of creds is also a problem but yeah why do two calls when you can do one

easier to beg forgiveness than to ask permission :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers okay to merge It's ok to merge this PR as it does not require more tests provider:amazon-aws AWS/Amazon - related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants