[FLINK-25790][flink-gs-fs-hadoop] Support authentication via core-site.xml in GCS FileSystem plugin #18489
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community review your pull request.
Automated Checks: last check on commit 6d23f67 (Mon Jan 24 23:11:42 UTC 2022). Warnings:
Mention the bot in a comment to re-run the automated checks.
Review Progress: please see the Pull Request Review Guide for a full explanation of the review process. The bot is tracking the review progress through labels, which are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Bot commands: the @flinkbot bot supports the following commands:
Thanks @galenwarren. The changes look good in general. It might be better to have some test cases guarding our changes. I can think of two things that need to be verified: the configuration fallback logic, and the merging of Hadoop config from the Flink config and core-site.
// follow the same rules as for the Hadoop connector, i.e.
// 1) only use service credentials at all if Hadoop
//    "google.cloud.auth.service.account.enable" is true (default: true)
// 2) use GOOGLE_APPLICATION_CREDENTIALS as location of credentials, if supplied
// 3) use Hadoop "google.cloud.auth.service.account.json.keyfile" as location of
//    credentials, if supplied
// 4) use no credentials
Just trying to understand: what happens if a user configures none of these options? I.e., the service account is enabled by default, but no credential is provided.
As it stands now, if the service account were enabled but no credentials were supplied via either GOOGLE_APPLICATION_CREDENTIALS or google.cloud.auth.service.account.json.keyfile, it would create a Storage instance with no credentials. If you were writing to a publicly writable GCS bucket, this would work, but it would fail if the bucket required credentials.
This is similar to what would happen (as far as I understand) with the Hadoop config; even if service credentials are enabled (which they are by default), you still have to specify a credential of some kind or else it won't use one.
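To illustrate, the fallback order described above might be sketched roughly as follows; the helper name and shape are hypothetical, not the PR's actual code (imports omitted):

// Hypothetical sketch of the fallback order described above; names are illustrative.
static Optional<String> resolveCredentialsPath(
        org.apache.hadoop.conf.Configuration hadoopConfig, Map<String, String> env) {
    // 1) only consider service credentials if they are enabled (default: true)
    if (!hadoopConfig.getBoolean("google.cloud.auth.service.account.enable", true)) {
        return Optional.empty();
    }
    // 2) prefer the GOOGLE_APPLICATION_CREDENTIALS environment variable
    String envPath = env.get("GOOGLE_APPLICATION_CREDENTIALS");
    if (envPath != null) {
        return Optional.of(envPath);
    }
    // 3) fall back to the Hadoop keyfile option
    String keyfile = hadoopConfig.get("google.cloud.auth.service.account.json.keyfile");
    if (keyfile != null) {
        return Optional.of(keyfile);
    }
    // 4) otherwise, use no credentials
    return Optional.empty();
}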
Also, it's worth noting that there are more authentication options supported for Hadoop than just those two:
- Private key and id
- P12 certificate
- Short-lived service-account impersonation
But the only ones that have been documented as supported in Flink so far are the two that are directly mentioned in the new docs, GOOGLE_APPLICATION_CREDENTIALS and google.cloud.auth.service.account.json.keyfile.
Do you think it's OK just to support those?
I think it's fine to only support those mentioned in the documentation, as that aligns with the previous behavior.
That leads to another question: shall we load the entire core-site/default.xml and merge it with configurations from the Flink config? Alternatively, we may only look for the configs we support.
Please check me on this, but my thought was that someone who was using the gcs-connector before -- i.e. just as a Hadoop-backed FileSystem -- would have been able to supply arbitrary config options in core-site/default.xml and expect them to be applied. So I was trying to preserve that behavior.
But, yes, if we support arbitrary config options in core-site/default.xml except for certain authentication-related options, that does seem a bit counterintuitive.
I suppose one option could be to continue to parse and pass through all the options, but to document that the only authentication options that will yield the proper behavior for all FileSystem operations are the two documented ones, and not the others (P12, etc.).
Looking into this a bit more, I think it's probably fine as is.
The module currently consists of two parts leveraging different underlying libraries: the FileSystem, which uses gcs-connector, and the RecoverableWriter, which uses google-cloud-storage. Hadoop configurations (core-site/default.xml) can be applied directly to gcs-connector but not to google-cloud-storage.
In that sense, it makes sense to me that the RecoverableWriter/google-cloud-storage side only supports selected Hadoop configurations. As a first step, the supported configurations include only account.enable and account.json.keyfile. We can add more if new demands emerge later.
WDYT?
Yes, I agree.
I'll add a note to the docs (on the other PR) to call this out, and I'll work on some unit tests.
Storage storage;
if (credentialsPath.isPresent()) {
    LOGGER.info(
            "Creating GSRecoverableWriter using credentials from {}",
            credentialsPath.get());
    try (FileInputStream credentialsStream = new FileInputStream(credentialsPath.get())) {
        GoogleCredentials credentials = GoogleCredentials.fromStream(credentialsStream);
        storage =
                StorageOptions.newBuilder()
                        .setCredentials(credentials)
                        .build()
                        .getService();
    }
} else {
    LOGGER.info("Creating GSRecoverableWriter using no credentials");
    storage = StorageOptions.newBuilder().build().getService();
}
I'd suggest minimizing what we do in the if-else branches, as follows:
// construct the storage instance, using credentials if provided
StorageOptions.Builder storageOptionBuilder = StorageOptions.newBuilder();
if (credentialsPath.isPresent()) {
    LOGGER.info(
            "Creating GSRecoverableWriter using credentials from {}",
            credentialsPath.get());
    try (FileInputStream credentialsStream = new FileInputStream(credentialsPath.get())) {
        GoogleCredentials credentials = GoogleCredentials.fromStream(credentialsStream);
        storageOptionBuilder.setCredentials(credentials);
    }
} else {
    LOGGER.info("Creating GSRecoverableWriter using no credentials");
}

// create the GS blob storage wrapper
GSBlobStorageImpl blobStorage =
        new GSBlobStorageImpl(storageOptionBuilder.build().getService());
Good suggestion, I'll make that change.
Done in 6d137f7 (and moved to ConfigUtils).
Please let me know if there's anything else you would suggest I change in terms of the implementation, besides what you've already suggested, and then I'll look at unit tests as the last piece.
…FileSystem to support unit tests; add unit tests
I've added these in the latest commit, 6d137f7. I had to refactor things a bit to make them easily testable; now, all the interesting code is in ConfigUtils. I also consolidated things a bit. There was really no good reason for the configuration operations to be spread between GSFileSystemFactory and GSFileSystem. Please let me know if you have any other feedback, or let me know if it looks good and I'll squash. Thanks.
 * Interface that provides context-specific config helper functions, factored out to support
 * unit testing.
 */
public interface ConfigContext {
I find it a bit hard to understand the responsibility of this interface. It seems several things are mixed together:
1. It serves as a provider of context-related inputs: environment variables and files.
2. It somehow also decides how context-related inputs are applied: overwriting the given config / storage options.
3. In the tests, it's also used for recording which files the util is reading.
I think 1) alone should be good enough for providing different context-related inputs in production / tests. 2) is probably not a big deal, as the logic is as simple as passing things around. However, I'm not sure about 3), as it feels like we are checking that ConfigUtils is reading the correct input rather than providing the correct output.
Another thing that doesn't feel right is having to provide different ConfigContext implementations for various test cases. And the UnsupportedOperationException indicates that we are assuming how ConfigContext is used by ConfigUtils internally, rather than treating the latter as a black box.
I would suggest the following:
- Change the ConfigContext protocol to a mere input provider. Something like:

public interface ConfigContext {
    Optional<String> getenv(String name);
    org.apache.hadoop.conf.Configuration loadHadoopConfigFromDir(String configDir);
    GoogleCredentials loadStorageCredentialsFromFile(String credentialsPath);
}
- We can have a testing implementation like:

public class TestingConfigContext implements ConfigContext {
    Map<String, String> envs;
    Map<String, org.apache.hadoop.conf.Configuration> hadoopConfigs;
    Map<String, GoogleCredentials> credentials;

    public Optional<String> getenv(String name) {
        return Optional.ofNullable(envs.get(name));
    }

    public org.apache.hadoop.conf.Configuration loadHadoopConfigFromDir(String configDir) {
        return hadoopConfigs.get(configDir);
    }

    public GoogleCredentials loadStorageCredentialsFromFile(String credentialsPath) {
        return credentials.get(credentialsPath);
    }
}
In this way we can reuse the same TestingConfigContext implementation in different test cases, constructed with different parameters.
- The test cases can simply verify the outcomes of ConfigUtils#getHadoopConfiguration and ConfigUtils#getStorageOptions (StorageOptions#getCredentials).
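For illustration, the production-side counterpart of such an interface could be fairly thin; the class name and file-loading details below are assumptions for the sketch, not necessarily what the module ends up doing (imports omitted):

public class RuntimeConfigContext implements ConfigContext {

    @Override
    public Optional<String> getenv(String name) {
        return Optional.ofNullable(System.getenv(name));
    }

    @Override
    public org.apache.hadoop.conf.Configuration loadHadoopConfigFromDir(String configDir) {
        // load the standard Hadoop config files from the given directory
        org.apache.hadoop.conf.Configuration config = new org.apache.hadoop.conf.Configuration();
        config.addResource(new org.apache.hadoop.fs.Path(configDir, "core-default.xml"));
        config.addResource(new org.apache.hadoop.fs.Path(configDir, "core-site.xml"));
        return config;
    }

    @Override
    public GoogleCredentials loadStorageCredentialsFromFile(String credentialsPath) {
        // read service-account credentials from the given JSON keyfile
        try (FileInputStream credentialsStream = new FileInputStream(credentialsPath)) {
            return GoogleCredentials.fromStream(credentialsStream);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}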
Good suggestions, done in c258789.
I had to make one more change. It turns out that if one creates a StorageOptions instance via StorageOptions.Builder#build without having provided credentials to the builder, the builder will use credentials defined via GOOGLE_APPLICATION_CREDENTIALS -- and return them via getCredentials -- and there doesn't seem to be a way to prevent this. This isn't a problem at runtime, but it does create problems for the tests if GOOGLE_APPLICATION_CREDENTIALS is defined in the environment; specifically, StorageOptions instances that we would expect to have no credentials via getCredentials in fact do have credentials defined.
So, I changed ConfigUtils#getStorageOptions to be ConfigUtils#getStorageCredentials instead, i.e.:
public static Optional<GoogleCredentials> getStorageCredentials(
        org.apache.hadoop.conf.Configuration hadoopConfig, ConfigContext configContext) {
... which allows me to properly validate in unit tests whether credentials were created or not, avoiding the need to read them back out of a StorageOptions instance.
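As an illustration of that validation, a test along these lines becomes possible; the TestingConfigContext constructor and assertion helpers are assumed for this sketch, but the getStorageCredentials signature matches the one above:

@Test
public void testNoCredentialsWhenServiceAccountDisabled() {
    // disable service accounts in the Hadoop config
    org.apache.hadoop.conf.Configuration hadoopConfig =
            new org.apache.hadoop.conf.Configuration();
    hadoopConfig.setBoolean("google.cloud.auth.service.account.enable", false);

    // the testing context exposes no environment variables and no credential files
    // (assumes a TestingConfigContext constructor taking the three maps)
    ConfigContext configContext =
            new TestingConfigContext(
                    Collections.emptyMap(), Collections.emptyMap(), Collections.emptyMap());

    Optional<GoogleCredentials> credentials =
            ConfigUtils.getStorageCredentials(hadoopConfig, configContext);

    // even if GOOGLE_APPLICATION_CREDENTIALS is set in the real environment,
    // it is invisible through the testing context, so no credentials are expected
    assertFalse(credentials.isPresent());
}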
…sociated unit tests
Thanks for addressing my comments, @galenwarren. LGTM.
I'm taking over from here and will merge both this and #18430. There's only one minor comment, which I'll address myself while merging, as well as squashing the commits.
@VisibleForTesting
static final String HADOOP_OPTION_ENABLE_SERVICE_ACCOUNT =
        "google.cloud.auth.service.account.enable";

@VisibleForTesting
static final String HADOOP_OPTION_SERVICE_ACCOUNT_JSON_KEYFILE =
        "google.cloud.auth.service.account.json.keyfile";
Suggested change:
-    @VisibleForTesting
-    static final String HADOOP_OPTION_ENABLE_SERVICE_ACCOUNT =
-            "google.cloud.auth.service.account.enable";
-    @VisibleForTesting
-    static final String HADOOP_OPTION_SERVICE_ACCOUNT_JSON_KEYFILE =
-            "google.cloud.auth.service.account.json.keyfile";
+    private static final String HADOOP_OPTION_ENABLE_SERVICE_ACCOUNT =
+            "google.cloud.auth.service.account.enable";
+    private static final String HADOOP_OPTION_SERVICE_ACCOUNT_JSON_KEYFILE =
+            "google.cloud.auth.service.account.json.keyfile";
@xintongsong Sounds great! With this and the associated docs, we're all done, right? I just want to make sure I'm not forgetting anything. Thanks again for all your help on this.
…Hadoop config This closes apache#18489
…Hadoop config This closes apache#18489 (cherry picked from commit fb634aa)
What is the purpose of the change
For the GCS FileSystem plugin, use the same authentication options for the RecoverableWriter portion as is done for the normal FileSystem portion. This means that it will use GOOGLE_APPLICATION_CREDENTIALS, if it exists, but will also use the google.cloud.auth.service.account.json.keyfile property from the Hadoop config.
To have both portions of the plugin use the same rules, each of them will only consider using service credentials if the Hadoop property google.cloud.auth.service.account.enable is true or unspecified (i.e. the default value is true).
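For example, a user might then authenticate via a core-site.xml along these lines, placed in the Hadoop config directory (the keyfile path below is just a placeholder):

<configuration>
  <property>
    <name>google.cloud.auth.service.account.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>google.cloud.auth.service.account.json.keyfile</name>
    <value>/path/to/service-account-keyfile.json</value>
  </property>
</configuration>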
Brief change log
- Changed GSFileSystemFactory to read Hadoop config from the location specified in CoreOptions.FLINK_HADOOP_CONF_DIR or in the HADOOP_CONF_DIR environment variable, and to combine it with Hadoop config values from the Flink config (a rough sketch of this flow follows after this list)
- Changed GSFileSystem to look for credentials in either GOOGLE_APPLICATION_CREDENTIALS or google.cloud.auth.service.account.json.keyfile, if google.cloud.auth.service.account.enable is not false, when constructing the Storage instance for the RecoverableWriter
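A rough sketch of that config-loading flow, assuming hypothetical helper names (hadoopKeysFromFlinkConfig in particular is illustrative, not the PR's actual code; imports omitted):

static org.apache.hadoop.conf.Configuration getHadoopConfiguration(
        org.apache.flink.configuration.Configuration flinkConfig, ConfigContext configContext) {

    // 1) locate the Hadoop config dir: the Flink option first, then the environment variable
    String configDir =
            flinkConfig.getString(org.apache.flink.configuration.CoreOptions.FLINK_HADOOP_CONF_DIR);
    if (configDir == null) {
        configDir = configContext.getenv("HADOOP_CONF_DIR").orElse(null);
    }

    // 2) load core-site.xml (and friends) from that directory, if one was found
    org.apache.hadoop.conf.Configuration hadoopConfig =
            configDir != null
                    ? configContext.loadHadoopConfigFromDir(configDir)
                    : new org.apache.hadoop.conf.Configuration();

    // 3) overlay Hadoop-related values taken from the Flink config; how those
    //    key/value pairs are extracted is elided here (hypothetical helper)
    for (Map.Entry<String, String> entry : hadoopKeysFromFlinkConfig(flinkConfig).entrySet()) {
        hadoopConfig.set(entry.getKey(), entry.getValue());
    }

    return hadoopConfig;
}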
Verifying this change
Please make sure both new and modified tests in this PR follow the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing
This change is a trivial rework / code cleanup without any test coverage.
Does this pull request potentially affect one of the following parts:
- The public API, i.e., is any changed class annotated with @Public(Evolving): no
: (yes / no) NoDocumentation