
[AIRFLOW-4255] Replace Discovery based api with client based for GCS #5054

Merged — 12 commits merged into apache:master on Apr 9, 2019

Conversation

kaxil (Member) commented Apr 7, 2019:

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"

Description

Google Cloud Client Libraries use our latest client library model and are our recommended option for accessing Cloud APIs programmatically, where available.

The https://pypi.org/project/google-cloud-storage/ library is available, and we should be using it.

This is Part 1 of probably 3 parts. I am trying not to introduce any breaking changes in this PR and to keep it backwards compatible, so that we can include it in a patch or minor version release. (A simplified sketch of the switch is shown below.)

The 2nd & 3rd PRs will contain some breaking changes, with notes in UPDATING.md.
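For illustration, a minimal sketch of the switch this PR makes (simplified; the real change lives in the hook's get_conn(), and credentials handling is elided here):

# Old: Discovery-based API
#
#     from googleapiclient.discovery import build
#     service = build('storage', 'v1', http=http_authorized,
#                     cache_discovery=False)
#
# New: dedicated client library
from google.cloud import storage

client = storage.Client()  # credentials and project come from the environment
for bucket in client.list_buckets():
    print(bucket.name)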

Tests

  • [ ] My PR adds the following unit tests OR does not need testing for this extremely good reason:
    The current tests already cover some of the changes; I will add a few more tests.

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain docstrings that explain what they do
    • If you implement backwards incompatible changes, please leave a note in UPDATING.md so we can assign it to an appropriate release

Code Quality

  • Passes flake8

cc @fenglu-g

kaxil requested a review from Fokko on April 7, 2019 16:06
kaxil force-pushed the replace-gcs-client-library branch from 6786d39 to 707e6ba on April 7, 2019 16:11
kaxil force-pushed the replace-gcs-client-library branch from 707e6ba to dc16fc5 on April 7, 2019 16:13
kaxil changed the title from "[AIRFLOW-4255] Replaces Discovery based api with client based for GCS" to "[AIRFLOW-4255] Replace Discovery based api with client based for GCS" on Apr 7, 2019
mik-laj (Member) commented Apr 7, 2019:

Does this change require a note in UPDATING.md? This hook is used by many custom operators.

-            'storage', 'v1', http=http_authorized, cache_discovery=False)
+        if not self._conn:
+            self._conn = storage.Client(credentials=self._get_credentials(),
+                                        project=self.project_id)
Member commented:

In other hooks, project_id is a method parameter. In this implementation, the user can only pass project_id via the connection configuration. This introduces inconsistencies. What steps should we take to unify this across all GCP operators?

We have 3 options:

  1. Specifying project_id in the connection configuration.
  2. Specifying project_id as a method parameter, with fallback to the connection configuration.
  3. Specifying project_id as a hook constructor parameter, with fallback to the connection configuration.

The third variant does not appear anywhere yet, but it seems the most natural to me: initialization parameters are not mixed with execution-time parameters. project_id is a parameter that initializes the client library; it does not execute an API call.
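For illustration, a minimal sketch of the third variant (hypothetical code, not from this PR; _get_field is the base hook's accessor for the connection extras):

from airflow.contrib.hooks.gcp_api_base_hook import GoogleCloudBaseHook
from google.cloud import storage


class GoogleCloudStorageHook(GoogleCloudBaseHook):
    def __init__(self, gcp_conn_id='google_cloud_default', delegate_to=None,
                 project_id=None):
        super(GoogleCloudStorageHook, self).__init__(gcp_conn_id, delegate_to)
        # The constructor parameter wins; otherwise fall back to the
        # project id stored in the connection configuration.
        self.project_id = project_id or self._get_field('project')

    def get_conn(self):
        return storage.Client(credentials=self._get_credentials(),
                              project=self.project_id)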

Probably the wrong place for this discussion, but we should take steps to make every GCP operator and hook consistent.

CC: @potiuk @antonimaciej

kaxil (Member, Author) replied:

Ya, let's discuss this and decide on the mailing list.

Member replied:

Did we discuss this? region_name on various AWS hooks/operators follows the same pattern (some take it as a kwarg, some just from the connection).

-                pageToken=pageToken,
+        blobs = bucket.list_blobs(
+            max_results=maxResults,
+            page_token=pageToken,
Member commented:

This parameter is deprecated. Could you use the new way?

page_token (str) – (Optional) If present, return the next batch of blobs, using the value, which must correspond to the nextPageToken value returned in the previous response. Deprecated: use the pages property of the returned iterator instead of manually passing the token.
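A sketch of the non-deprecated pattern, per the google-cloud-storage docs (bucket name is a placeholder):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('instance-mb-test-1')
iterator = bucket.list_blobs(max_results=100)
# Iterate pages instead of threading page_token through by hand; the
# iterator fetches each nextPageToken internally.
for page in iterator.pages:
    for blob in page:
        print(blob.name)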

kaxil (Member, Author) replied:

Yes, that is on my todo list. As I wrote in the description of this PR, I want to keep this PR as backwards-compatible as possible, hence there was no note in UPDATING.md. I will add a note, however: even though the function inputs and outputs are the same, the change still adds an extra dependency on google-cloud-storage.

I will take care of the deprecated nextPageToken in the upcoming PR.

        raise ValueError('Object Not Found')
    client = self.get_conn()
    bucket = client.get_bucket(bucket)
    blob = bucket.blob(blob_name=object)
mik-laj (Member) commented Apr 7, 2019:

This code looks like the get_blob method code. Is this duplication intentional?
https://github.com/googleapis/google-cloud-python/blob/master/storage/google/cloud/storage/bucket.py#L691-L706

But in this case it is not efficient: it makes 2 calls to the external API.

I wrote a sample script:

client = storage.Client()

bucket = client.get_bucket("instance-mb-test-1")
blob = bucket.get_blob('file-1.bin')
print("Blob size: ", blob.size)

On the screen I got these messages:

DEBUG:urllib3.util.retry:Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG:google.auth.transport.requests:Making request: POST https://oauth2.googleapis.com/token
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): oauth2.googleapis.com:443
DEBUG:urllib3.connectionpool:https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.googleapis.com:443
DEBUG:urllib3.connectionpool:https://www.googleapis.com:443 "GET /storage/v1/b/instance-mb-test-1?projection=noAcl HTTP/1.1" 200 447
DEBUG:urllib3.connectionpool:https://www.googleapis.com:443 "GET /storage/v1/b/instance-mb-test-1/o/file-1.bin HTTP/1.1" 200 753
Blob size:  104960000

This confirms that your implementation makes two API calls (plus one call for authorization).
I propose using this code instead:

client = storage.Client()
bucket = storage.Bucket(client, "instance-mb-test-1")
blob = bucket.get_blob('file-1.bin')
print("Blob size: ", blob.size)

It makes one API call:

DEBUG:urllib3.util.retry:Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG:google.auth.transport.requests:Making request: POST https://oauth2.googleapis.com/token
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): oauth2.googleapis.com:443
DEBUG:urllib3.connectionpool:https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.googleapis.com:443
DEBUG:urllib3.connectionpool:https://www.googleapis.com:443 "GET /storage/v1/b/instance-mb-test-1/o/file-1.bin HTTP/1.1" 200 753
Blob size:  104960000

It is important to optimize this method, because it is often used in a loop, and therefore the number of requests is significant.

kaxil (Member, Author) replied:

This is inspired by Google's own code and examples:

https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/storage/cloud-client/snippets.py

I don't get why there are 2 calls for one and not the other; maybe I am missing something. Either way you first obtain a bucket object and then get or create a blob, which looks the same to me.
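(For context, a sketch of the difference, matching the request logs above: get_bucket() fetches the bucket metadata with an API request, while the Bucket constructor only builds a local handle.)

from google.cloud import storage

client = storage.Client()

# One API request: GET /storage/v1/b/<bucket>; raises NotFound if the
# bucket does not exist.
bucket_fetched = client.get_bucket('instance-mb-test-1')

# No API request: a purely local handle to the same bucket.
bucket_local = storage.Bucket(client, 'instance-mb-test-1')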

kaxil (Member, Author) replied:

I have changed it in a few places, though.

kaxil (Member, Author) replied:

There is an additional advantage in using both get_bucket and get_blob: the get_bucket method raises an error if the bucket does not exist, while get_blob just returns None if the object doesn't exist.

I like this:

client = storage.Client()

bucket = client.get_bucket("instance-mb-test-1")
blob = bucket.get_blob('file-1.bin')
print("Blob size: ", blob.size)

Let me know what you think.

kaxil (Member, Author) replied:

Case (1): [screenshot]

Case (2): [screenshot]

I would like to stick with get_bucket so that we get a meaningful error, rather than just None.
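(A sketch of the two cases, assuming the screenshots showed the get_bucket error and the get_blob None result; the exception class is google.cloud.exceptions.NotFound:)

from google.cloud import storage
from google.cloud.exceptions import NotFound

client = storage.Client()

try:
    bucket = client.get_bucket('no-such-bucket')  # case (1): raises NotFound
except NotFound as exc:
    print('Meaningful error:', exc)

bucket = storage.Bucket(client, 'instance-mb-test-1')
blob = bucket.get_blob('no-such-object')          # case (2): silently None
print(blob)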

kaxil (Member, Author) commented Apr 8, 2019:

Also, get_blob() doesn't contain blob.reload() in the latest stable google-cloud-storage release (1.14.0):

https://github.com/googleapis/google-cloud-python/blob/storage-1.14.0/storage/google/cloud/storage/bucket.py#L642

The link you pasted is for the master branch and hasn't yet made it into a release :) Hopefully they release it soon and we can remove blob.reload() from our code.
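(For illustration, why the explicit reload is there, a sketch with placeholder names:)

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('instance-mb-test-1')

blob = bucket.blob(blob_name='file-1.bin')  # local handle, no metadata fetched
assert blob.crc32c is None                  # properties are unset so far
blob.reload()                               # GET the object metadata explicitly
print(blob.crc32c)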

        blob.reload()
        blob_crc32c = blob.crc32c
        self.log.info('The crc32c checksum of %s is %s', object, blob_crc32c)
        return blob_crc32c
kaxil (Member, Author) replied:

Changed.

@@ -193,16 +187,7 @@ def upload(self, bucket, object, filename,
     :type mime_type: str
     :param gzip: Option to compress file for upload
     :type gzip: bool
-    :param multipart: If True, the upload will be split into multiple HTTP requests. The
Contributor commented:

Does this mean that multipart support is gone?

kaxil (Member, Author) replied:

I can add that to UPDATING.md as well, if you think other users might wonder the same.

Member replied:

It's a change in the API of the operator, so it should go in UPDATING.md, yes.

kaxil (Member, Author) replied:

Done.

kaxil (Member, Author) replied:

@Fokko Added a comment on multipart in UPDATING.md.
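(For context on the removed multipart flag, a sketch of how the client library handles large uploads on its own; the chunk size and file names are illustrative assumptions:)

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('instance-mb-test-1')

# With chunk_size set, the library performs a chunked, resumable upload
# itself, so the hook no longer needs an explicit multipart option.
blob = bucket.blob('large-file.bin', chunk_size=256 * 1024)
blob.upload_from_filename('/tmp/large-file.bin')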

@@ -82,7 +83,7 @@ def execute(self, context):
             object=self.object,
             filename=self.filename)
         if self.store_to_xcom_key:
-            if sys.getsizeof(file_bytes) < 48000:
+            if sys.getsizeof(file_bytes) < MAX_XCOM_SIZE:
Contributor commented:

Nice one

codecov-io commented Apr 8, 2019:

Codecov Report

Merging #5054 into master will decrease coverage by 0.07%.
The diff coverage is 43.6%.


@@            Coverage Diff            @@
##           master   #5054      +/-   ##
=========================================
- Coverage   76.98%   76.9%   -0.08%     
=========================================
  Files         463     455       -8     
  Lines       29806   29667     -139     
=========================================
- Hits        22945   22816     -129     
+ Misses       6861    6851      -10
Impacted Files Coverage Δ
airflow/models/xcom.py 80% <100%> (ø) ⬆️
airflow/contrib/hooks/gcs_hook.py 53.64% <43.07%> (-1.02%) ⬇️
airflow/contrib/operators/gcs_download_operator.py 88.46% <50%> (+0.46%) ⬆️
airflow/lineage/backend/atlas/__init__.py 72.41% <0%> (-15.09%) ⬇️
airflow/models/__init__.py 93% <0%> (-7%) ⬇️
airflow/operators/check_operator.py 91.79% <0%> (-0.86%) ⬇️
airflow/models/connection.py 65.53% <0%> (-0.2%) ⬇️
airflow/settings.py 84.25% <0%> (-0.13%) ⬇️
airflow/jobs.py 78.77% <0%> (-0.04%) ⬇️
... and 16 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da1be99...d0739e1.

potiuk (Member) left a comment:

Two small comments

        if blob_update_time > ts:
            return True
        else:
            return False
Member commented:

A 'return False' is missing for the case where blob_update_time is None.
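(A minimal sketch of the fixed check; the helper name, and blob.updated as the source of blob_update_time, are assumptions based on the hunk above:)

def is_updated_after(blob, ts):
    blob_update_time = blob.updated  # None if the metadata was never loaded
    if blob_update_time is None:
        return False
    return blob_update_time > ts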

kaxil (Member, Author) replied:

Good call.

            return True
        if gzip:
            os.remove(filename)
        self.log.info('File %s uploaded to %s in %s bucket', filename, object, bucket)

    # pylint:disable=redefined-builtin
    def exists(self, bucket, object):
Member commented:

Suggestion: maybe we can change the "object" name in the function signatures. Since we are introducing backwards-incompatible changes anyway, this might be a good time to get rid of the "object" redefinition and remove the pylint disable warnings.
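(A hypothetical sketch of the suggested rename, with illustrative names:)

def exists(self, bucket_name, object_name):
    """Check whether object_name exists in bucket_name."""
    client = self.get_conn()
    bucket = client.bucket(bucket_name)
    # No shadowing of the object built-in, so the pylint disable can go.
    return bucket.blob(blob_name=object_name).exists()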

kaxil (Member, Author) replied:

Ya, I have that PR ready. I am trying to keep the changes in this PR on the backwards-compatible side.

The next PR will contain some breaking changes, including these name changes.

mik-laj (Member) commented Apr 8, 2019:

Can you explain the intention behind splitting one refactoring across a few PRs? This makes the changes much more difficult to review. I would see a reason if this change were backwards compatible, but it is not. We have a note in UPDATING.md.

kaxil (Member, Author) replied Apr 8, 2019:

The main intention is so that we can cherry-pick this one into 1.10.4.

If you look at this PR and check for breaking changes, the ones that are there are not widely used (and are noted in UPDATING.md).

I wouldn't want to change the name of something like the object parameter (or even bucket) and just put a note in UPDATING.md. We won't cherry-pick the 2nd PR into 1.10.4 and will target 2.0 instead.

They are fundamentally 2 separate pieces: this PR focuses on replacing the Discovery-based API with the client API, not on updating parameter names. It also reads better in the changelog.

None of the changes in this PR remove or change any required parameter of any method.

Member replied:

Makes perfect sense @kaxil 👍. Thanks for the explanation.

        else:
            return False

        return False
Member commented:

👍

kaxil (Member, Author) replied:

Can you approve this PR if you are OK with it? :)

Member replied:

Sure! Done. I like it :)

@kaxil kaxil merged commit ec7c67f into apache:master Apr 9, 2019
ashb (Member) commented Apr 12, 2019:

@kaxil We probably shouldn't pull this into 1.10.4 since it changes function signatures, should we?

cthenderson pushed a commit to cthenderson/apache-airflow that referenced this pull request Apr 16, 2019
andriisoldatenko pushed a commit to andriisoldatenko/airflow that referenced this pull request Jul 26, 2019
wmorris75 pushed a commit to modmed/incubator-airflow that referenced this pull request Jul 29, 2019
dharamsk pushed a commit to postmates/airflow that referenced this pull request Aug 8, 2019