Add gcloud-based GCSHook #1119
Conversation
follows #1118

Could you make it so we can specify the service account from the JSON? We actually run Airflow with multiple service accounts per hook to get fine-grained permissions; setting it at the environment-variable level doesn't work well with this.
    ]
    gcp_api = [
        'httplib2',
        'google-api-python-client',
@criccomini I think that for compatibility this might be better as google-api-python-client<1.5 but I'm not sure. As far as I can tell 1.5 is the first version that depends on oauth2client 2.0.
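Concretely, the suggested pin would look roughly like this in setup.py's extras (an illustrative fragment; the surrounding `extras_require` layout is assumed here, not quoted from Airflow's actual setup.py):

```python
# Sketch of the suggested dependency pin; layout is illustrative.
gcp_api = [
    'httplib2',
    'google-api-python-client<1.5',  # 1.5+ pulls in oauth2client 2.0
]

extras_require = {'gcp_api': gcp_api}
```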
What do you think about migrating the GCS operators to use the new hook as part of this?
@criccomini regarding service accounts, it should work: this hook looks for a "keyfile_path" in the Airflow connection to identify a JSON service-account key. I think there's a similar extra parameter for the other GCS hook, but this one uses JSON and, if I remember correctly, the other uses p12. I am going to make a change so you can pass either a JSON or a p12 file to this hook, so you could reuse the same GCS connections.
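The JSON-or-p12 dispatch being described could be as simple as branching on the keyfile extension. A hedged sketch, not the hook's actual code; `credential_kind` is a hypothetical helper, and only the `keyfile_path` name comes from the thread:

```python
import os

def credential_kind(keyfile_path):
    """Decide how to load a service-account key based on its extension.

    Hypothetical helper illustrating the JSON-or-p12 dispatch; the real
    hook's internals may differ.
    """
    ext = os.path.splitext(keyfile_path)[1].lower()
    if ext == '.json':
        return 'json'   # modern JSON service-account key
    if ext == '.p12':
        return 'p12'    # legacy PKCS#12 key
    raise ValueError('Unrecognized keyfile type: %s' % keyfile_path)
```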
Ah, got it. Missed that. Awesome. Motivation to switch to JSON. :) We're still on the deprecated p12s for the most part.
No problem, will support both! :) One sec.
@criccomini I think this will now allow you to reuse your existing Airflow connections, but please give it a shot to be sure.

Thought: perhaps add backwards-compatible APIs so that the existing GCS operators can work with your new hooks?
I think that's the cleanest approach; let me see how easily I can make the signatures match.
@criccomini I'm feeling pretty good about this! The latest one should have full signature compatibility with the previous hook, both at initialization and for the individual methods.
...except for "delegate_to". I'm not familiar with this, unfortunately. You can pass the argument, but it's ignored. With regard to other operators, at this time our GCP use is focused on GCS/GCE/GKE/CloudSQL/Dataproc, so I don't think I have the expertise to migrate them. However, we are starting to use BigQuery more, so maybe that'll change as our workflow evolves.
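A minimal sketch of the kind of signature-compatible shim being discussed (all names here are illustrative stand-ins, not the actual hook code): accept the legacy constructor arguments, and warn that `delegate_to` is accepted but ignored.

```python
import warnings

class GCSHook:
    """Illustrative stand-in for the gcloud-based hook discussed above."""

    def __init__(self,
                 google_cloud_storage_conn_id='google_cloud_storage_default',
                 delegate_to=None):
        # Same parameters as the legacy GoogleCloudStorageHook, so existing
        # operators could construct either hook interchangeably.
        self.conn_id = google_cloud_storage_conn_id
        if delegate_to is not None:
            # Accepted for signature compatibility but not implemented here.
            warnings.warn('delegate_to is ignored by this hook')
        self.delegate_to = delegate_to
```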
@jlowin is this ready for another round of reviews? |
@criccomini yes |
        relabeling: {'host': 'Connection URL'},
    },
    google_cloud_platform: {
        hidden_fields: ['host', 'schema', 'login', 'password', 'port', 'extra'],
Don't we still need 'extra'?
No, new connections should be set up using "project", "key_path", etc. fields that are displayed in the Airflow UI, not extras. I think "extras" is opaque for users who haven't read the code. However, if someone uses a connection that predates this change and DOES have an "extra" field, the code here will recognize that and handle it properly.
As far as I know, the connection type is largely irrelevant to Airflow -- you could create an "FTP" connection and pass it to GCSHook, and as long as it had the correct attributes it would work properly. The only thing that the type does affect is the fields that are displayed/handled by the UI (which are put in the "extras" field anyway behind the scenes)
So bottom line: let's say you have a connection of type google_cloud_storage with key_path in extras, and you've been using it with GoogleCloudStorageHook. You should be able to pass that same connection to a GCSHook, no problem.
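The fallback being described might look roughly like this (a sketch with hypothetical names; `fields` stands in for the per-type UI fields and `extra_json` for the raw legacy "extra" blob, neither of which is Airflow's actual attribute name):

```python
import json

def resolve_key_path(fields, extra_json=None):
    """Prefer the dedicated UI field, fall back to a legacy 'extra' blob.

    Illustrative only: shows how a pre-existing connection that stored
    key_path in extras could still be recognized.
    """
    if fields.get('key_path'):
        return fields['key_path']
    if extra_json:
        extra = json.loads(extra_json)
        # Older google_cloud_storage connections stored the key here.
        return extra.get('key_path') or extra.get('keyfile_path')
    return None
```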
Cool. See note about connection type, above, though. Data profiler does use it.
I'm going to merge this in. I think it will be safe to assume that a GCP connection type should be treated as BQ in the data profiler.
A GCSHook based on the gcloud library, analogous to the S3Hook in the main airflow package. See conversation in #1109.

The file itself lives in contrib/gcloud/ to clearly identify it as different from the current GCSHook.