Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-895] Allow empty GCP credential for pipelines that only access to public data. #1280

Closed
wants to merge 6 commits into from

Conversation

peihe
Copy link
Contributor

@peihe peihe commented Nov 4, 2016

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

  • Make sure the PR title is formatted like:
    [BEAM-<Jira issue #>] Description of pull request
  • Make sure tests pass via mvn clean verify. (Even better, enable
    Travis-CI on your fork and ensure the whole test matrix passes).
  • Replace <Jira issue #> in the title with the actual Jira issue
    number, if there is one.
  • If this contribution is large, please file an Apache
    Individual Contributor License Agreement.

@peihe
Copy link
Contributor Author

peihe commented Nov 4, 2016

R: @davorbonaci

@@ -0,0 +1,53 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For cases when the user doesn't have credentials and is trying to run a pipeline, how does there error message now change for an unauthorized user. Before we would fail on creation of the client, but now we might try to stage files or contact BQ to see if datasets exist. Is the message as to why it failed same, better or worse off for various GCP service use cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Users will see the same error message: NULL_CREDENTIAL_REASON.
  2. There will be few lines differences about where the exception is thrown.
    We typically do:
    someClient = Transport.newSomeClient(); // previous throws in here
    // few lines between
    someClient.list/insert(); // Now throws in here.

I think this changes should apply to GCP data services: BigQuery, PubSub, Storage, but not CloudResourceManager. Updated to get around this.

+ "for details on how to specify credentials. This version of the SDK is "
+ "dependent on the gcloud core component version 2015.02.05 or newer to "
+ "be able to get credentials from the currently authorized user via gcloud auth.", e);
// Ignore the exception
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to leave the contract of this method the same but migrate GcpOptions to ignore failures in getting the credential?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to fail immediate if users provide a bad key file (line 112).
So, ignoring in here could differentiate the two cases.

+ "be able to get credentials from the currently authorized user via gcloud auth.", e);
// Ignore the exception
// Pipelines that only access to public data should be able to run without credentials.
return null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update Javadoc of this method to document this new contact, and possibly Javadoc of any calling method that has changed as a consequence?

*
* <p>When the access is denied, it throws {@link IOException} with a detailed error message.
*/
public class NullCredential implements HttpRequestInitializer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better name? SpecificExceptionAfterNonauthorizedResponse? something better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to keep NullCredential.

SpecificExceptionAfterNonauthorizedResponse is describing the implementation. If we need to do more for a "NullCredential", we will have to change the name. And, NullCredential follows the current pattern, and reads better in Transport as following:
if (credential == null) {
return new ChainingHttpRequestInitializer(new NullCredential(), httpRequestInitializer);
} else {
return new ChainingHttpRequestInitializer(credential, httpRequestInitializer);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NullCredentialInitializer?

@davorbonaci
Copy link
Member

Also, per Luke's comment -- we should do a manual inspection could any callers be negatively impacted.

@peihe peihe force-pushed the validation branch 2 times, most recently from de23f50 to 356ae15 Compare November 7, 2016 22:29
@peihe
Copy link
Contributor Author

peihe commented Nov 7, 2016

Updated and merged with the gcp credentials change.

@lukecwik could you take another look?
Looks like we no longer accept key files and secrets files, right?

@lukecwik
Copy link
Member

lukecwik commented Nov 7, 2016

@peihe, thats correct. The user can call GoogleCredentials.fromStream(InputStream) to load any location they want if they want to use custom credential locations. The credentials library supports many more ways to load/create credentials beyond the few that were supported by Dataflow.

@peihe
Copy link
Contributor Author

peihe commented Nov 9, 2016

One issue I found is that gcs writes could be buffered in the client, and the exception could be deferred until the file is closed. The result is one exception per bundle in TextIO.Write and BigQueryIO.Write.

I have verified following cases with DirectRunner:
(There is no public folder in gcp, so only one test case for Write.)

  1. TextIO.Read from public gcp input files works.
  2. TextIO.Read from private gcp input files failed with expected exception.
  3. BigQueryIO.Read from public/private dataset failed. Looks to me, BigQuery required authenticated users. Exceptions are expected.
    4.TextIO.Write (and, BigQuery.Write which write to gcs) failed. There is one exception for each write bundle. The exception itself is expected.

@peihe
Copy link
Contributor Author

peihe commented Nov 9, 2016

Verified TextIO.Write works with a public buckets.

@davorbonaci
Copy link
Member

Where does this PR stand? Can we move forward?

@lukecwik
Copy link
Member

@davorbonaci My understanding is that your the primary reviewer. Has that changed?

@davorbonaci
Copy link
Member

LGTM from a review standpoint.

I know @peihe was investigating a few corner cases, but that seems to be done now. So, we are good to go, I think.

@peihe peihe force-pushed the validation branch 2 times, most recently from 1e0104e to a9392d0 Compare November 14, 2016 18:29
@peihe
Copy link
Contributor Author

peihe commented Nov 14, 2016

PTAL

done rename

@peihe peihe force-pushed the validation branch 2 times, most recently from e83217c to 7f6686a Compare November 15, 2016 20:02
@lukecwik
Copy link
Member

I don't think you need the NoopCredentialFactory anymore since the default is to return null which implies to do the service calls with no credentials which is effectively what NoopCredentials is already doing.

@peihe
Copy link
Contributor Author

peihe commented Nov 16, 2016

Creating a DataflowClient with null credential is considered as a error, since nothing is public with dataflow requests.

NoopCredentialFactory is used by DataflowRunnerTest to create a non-null fake Credentials.

@asfgit asfgit closed this in 479c19a Nov 17, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants