
Conversation

@jlowin (Member) commented Mar 4, 2016

A GCSHook based on the gcloud library, analogous to the S3Hook in the main airflow package.

see conversation in #1109

The file itself lives in contrib/gcloud/ to clearly identify it as different from the current GCSHook.
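
For readers following along, usage might look like the sketch below; the import path simply mirrors the contrib/gcloud/ location described above, and both the module path and the gcs_conn_id parameter are assumptions, not confirmed names.

    # Hypothetical import path and connection-id parameter, inferred from
    # the contrib/gcloud/ location mentioned above; not confirmed names.
    from airflow.contrib.hooks.gcloud.gcs_hook import GCSHook

    hook = GCSHook(gcs_conn_id='my_gcs_connection')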

@jlowin (Member, Author) commented Mar 4, 2016

follows #1118

@criccomini (Contributor) commented:

Could you make it so we can specify the service account from the JSON? We actually run Airflow with multiple service accounts per hook to get fine-grained permissions; setting the account at the environment-variable level doesn't work well with this.

Review comment on the gcp_api extras in setup.py:

    ]
    gcp_api = [
        'httplib2',
        'google-api-python-client',
@jlowin (Member, Author) commented:

@criccomini I think that for compatibility this might be better as google-api-python-client<1.5 but I'm not sure. As far as I can tell 1.5 is the first version that depends on oauth2client 2.0.
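
For reference, a sketch of what that pin would look like in setup.py's gcp_api extras; the <1.5 bound is the suggestion above, not a settled requirement:

    # setup.py (sketch): upper-bound the client until the oauth2client 2.0
    # compatibility question is settled; "<1.5" is a suggestion, not final.
    gcp_api = [
        'httplib2',
        'google-api-python-client<1.5',  # 1.5 is the first release to depend on oauth2client 2.0
    ]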

@criccomini (Contributor) commented:

What do you think about migrating the GCS operators to use the new hook as part of this?

@jlowin (Member, Author) commented Mar 5, 2016

@criccomini Regarding service accounts, it should work: this hook looks for a "keyfile_path" in the Airflow connection to identify a JSON service account key. I think there's a similar extra parameter for the other GCS hook; however, this one uses JSON and, if I remember correctly, the other one uses p12. I'm going to make a change so you can pass either a JSON or a p12 file to this hook, so you can reuse the same GCS connections.
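
As a rough illustration of the JSON-or-p12 support described above, here is a minimal sketch using oauth2client's ServiceAccountCredentials; the names key_path, scopes, and service_account are illustrative, not necessarily the hook's actual attributes:

    from oauth2client.service_account import ServiceAccountCredentials

    def get_service_account_credentials(key_path, scopes, service_account=None):
        # Sketch: dispatch on the key file extension. The real hook may pull
        # these values out of the Airflow connection's extras instead.
        if key_path.endswith('.json'):
            # JSON keys carry the client email internally.
            return ServiceAccountCredentials.from_json_keyfile_name(
                key_path, scopes=scopes)
        elif key_path.endswith('.p12'):
            # Legacy p12 keys additionally need the service account email.
            return ServiceAccountCredentials.from_p12_keyfile(
                service_account, key_path, scopes=scopes)
        raise ValueError('Unrecognized keyfile format: %s' % key_path)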

@criccomini (Contributor) commented:

Ah, got it. Missed that. Awesome. Motivation to switch to JSON. :) We're still on the deprecated p12s for the most part.

@jlowin (Member, Author) commented Mar 5, 2016

No problem, will support both! :) One sec.

@jlowin (Member, Author) commented Mar 5, 2016

@criccomini I think this will now allow you to reuse your existing Airflow connections, but please give it a shot to be sure.

@criccomini (Contributor) commented:

> @criccomini I think this will now allow you to reuse your existing Airflow connections, but please give it a shot to be sure.

Thought: perhaps add backwards-compatible APIs so that the existing GCS operators can work with your new hooks?

@jlowin (Member, Author) commented Mar 5, 2016

I think that's the cleanest approach -- let me see how easily I can make the signatures match.

@jlowin (Member, Author) commented Mar 5, 2016

@criccomini I'm feeling pretty good about this! The latest version should have full signature compatibility with the previous hook, both at initialization and for the download and upload methods. There's also full support for scopes, which we didn't have before (though note that scopes require passing key_path either in the connection or at initialization).
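
For concreteness, a sketch of what signature compatibility means here; the argument names below follow the existing GoogleCloudStorageHook as reflected in this discussion, and are illustrative rather than final:

    class CompatibleGCSHookSketch(object):
        # Sketch: the new hook mirrors the old hook's download/upload
        # signatures so existing call sites keep working unchanged.
        # ("object" shadows the builtin, matching the old parameter name.)

        def download(self, bucket, object, filename=None):
            """Fetch bucket/object; if filename is given, also write it to disk."""
            raise NotImplementedError

        def upload(self, bucket, object, filename,
                   mime_type='application/octet-stream'):
            """Upload the local file at filename to bucket/object."""
            raise NotImplementedError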

@jlowin (Member, Author) commented Mar 5, 2016

...except for "delegate_to". I'm not familiar with this, unfortunately. You can pass the argument, but it's ignored.

With regard to other operators, at this time our GCP use is focused on GCS/GCE/GKE/CloudSQL/DataProc so I don't think I have the expertise to migrate them. However we are starting to use BigQuery more so maybe that'll change as our workflow evolves.

jlowin mentioned this pull request on Mar 5, 2016
@criccomini (Contributor) commented:

@jlowin is this ready for another round of reviews?

@jlowin (Member, Author) commented Mar 7, 2016

@criccomini yes

Review comment on the connection form configuration:

        relabeling: {'host': 'Connection URL'},
    },
    google_cloud_platform: {
        hidden_fields: ['host', 'schema', 'login', 'password', 'port', 'extra'],
@criccomini (Contributor) commented:

Don't we still need 'extra'?

@jlowin (Member, Author) replied:

No, new connections should be set up using the "project", "key_path", etc. fields that are displayed in the Airflow UI, not extras. I think "extras" is opaque to users who haven't read the code. However, if someone uses a connection that predates this change and DOES have an "extra" field, the code here will recognize that and handle it properly.

As far as I know, the connection type is largely irrelevant to Airflow: you could create an "FTP" connection and pass it to GCSHook, and as long as it had the correct attributes it would work properly. The only thing the type affects is which fields are displayed/handled by the UI (and those are put in the "extras" field behind the scenes anyway).

So, bottom line: if you have a connection of type google_cloud_storage with key_path in extras and you've been using it with GoogleCloudStorageHook, you should be able to pass that same connection to a GCSHook, no problem.
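
A sketch of the extras fallback described here, assuming Airflow's Connection.extra_dejson accessor; the field name key_path is illustrative:

    def resolve_connection_field(connection, name, default=None):
        # Sketch: extra_dejson parses the legacy "extra" JSON blob and
        # returns {} when nothing is set, so connections that predate this
        # change and new UI-configured connections both resolve here.
        return connection.extra_dejson.get(name, default)

    # e.g. key_path = resolve_connection_field(conn, 'key_path')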

@criccomini (Contributor) replied:

Cool. See the note about connection type above, though; the data profiler does use it.

@criccomini (Contributor) commented:

I'm going to merge this in. I think it's safe to assume that a GCP connection type should be treated as BigQuery in the data profiler.

criccomini added a commit that referenced this pull request on Mar 7, 2016: "Add gcloud-based GCSHook"

criccomini merged commit c6ddcb1 into apache:master on Mar 7, 2016