-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core, Hive: Support pluggable ClientPool #6698
Conversation
@szehon-ho @pvary @nastra @flyrain Could you please have a look? Thanks~ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @lirui-apache for the change! Left some comments
public static <C, E extends Exception> ClientPool<C, E> loadClientPool( | ||
String impl, Map<String, String> properties, Object conf) { | ||
Preconditions.checkNotNull( | ||
impl, "Cannot initialize custom ClientPool, impl class name is null"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, do you think we can also have a better word in the error message than 'impl', ie use the actual property constant name, and we can pass it to Preconditions via the .checkNonNull(errorMsgTemplate, errorMsgArgs)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the util method here shouldn't care about where the impl class name come from. Maybe it's better to change the message like Cannot initialize custom ClientPool, impl class name is null. Please check the value of CatalogProperties.CLIENT_POOL_IMPL.
Let me know if you think otherwise
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java
Outdated
Show resolved
Hide resolved
@szehon-ho Thanks for reviewing! I've updated the PR |
@@ -119,6 +119,8 @@ private CatalogProperties() {} | |||
"client.pool.cache.eviction-interval-ms"; | |||
public static final long CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS_DEFAULT = | |||
TimeUnit.MINUTES.toMillis(5); | |||
/** Name of the custom {@link ClientPool} implementation class. */ | |||
public static final String CLIENT_POOL_IMPL = "client-pool-impl"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it strictly necessary to inject this using reflection? If this is for customization in a certain environment, then maybe it would be better to allow customization in a different way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @rdblue , I think the context is from #6175 (more issues scattered). The general problem being the CachedClientPool that's always on in Iceberg Hive, which is a global cache that doesnt work in all environments, because the current key (metastoreUri) collides too much, leading to the wrong client used for a lot of use cases.
It seems simpler to me to just make cache pluggable, but I am not sure if there's a better solution I'm not seeing to make it non-dynamic? If we dont want user to use a client-pool other than CachedClientPool, at least we need to have a pluggable conf => key generator? Or maybe just allow them to use the toggle between CachedClientPool and the raw ClientPool as an option (as a way to turn off the global cache)? cc @pvary @flyrain as well if any thoughts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps another way is to allow configurable cache keys? We can have pre-defined key elements and users can use these elements to compose cache key used in CachedClientPool
. Some key elements I can think of: HMS URI (this is probably mandatory), UGI, user name, specific configurations. When HiveCatalog
creates CahcedClientPool
, it can check the configured key and pass a key supplier to CahcedClientPool
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @lirui-apache yes I am ok with that , if you want to give a try? Wasnt sure if hive property alone addresses your use-case, but addresses ours (putting metastore.catalog.default to key). Maybe it will be easier for user, as they can just set a list of properties, versus implement a new class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'll have a try and update the PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @szehon-ho @rdblue , I have updated the PR to demonstrate the idea. Let me know if you think it makes sense. Thanks~
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi sorry, I realized I let this slip, I will try to look at this soon.
} | ||
configureHadoopConf(clientPool, conf); | ||
clientPool.initialize(properties); | ||
return clientPool; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style: This code needs more whitespace between code blocks and statements.
For the reference, the previous discussions:
|
ab2748c
to
4847c4c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @lirui-apache , I spent some time reading this patch and left my thoughts here.
@@ -53,12 +70,14 @@ public class CachedClientPool implements ClientPool<IMetaStoreClient, TException | |||
properties, | |||
CatalogProperties.CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS, | |||
CatalogProperties.CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS_DEFAULT); | |||
this.keySuppliers = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I like the keySupplier idea, wouldn't it be much simpler if we just use a static map of suppliers, like:
static final Map<String, Supplier> keySuppliers = ImmutableMap.of(
"uri", () -> {
try {
return UserGroupInformation.getCurrentUser();
} catch (IOException e) {
throw new UncheckedIOException(e);
}
}),
"user_name" -> {
() -> {
try {
String userName = UserGroupInformation.getCurrentUser().getUserName();
return UserNameElement.of(userName);
} catch (IOException e) {
throw new UncheckedIOException(e);
}
});
}
(as i dont see any need to generate this on the fly.) And then here in CachedClientPool CTOR, we can just apply keySuppliers on the conf to make a Key instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why I use Supplier is to get the current user whenever CachedClientPool::clientPool
is called. If a CachedClientPool
is not meant to be shared among different users, I think we don't need to keep the suppliers at all, and just generate the Key instance in CTOR. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea this trips me up every time, but I see, one HiveCatalog making one CachedClientPool. So I think, we are safe and can make a key in the CTOR, and will be much simpler. @pvary for sanity check here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, my thought is, lets keep it simple until we have some use case where a process start sharing the HiveCatalog. I dont see any huge problem just creating one on the fly.
* <p>The following elements are supported: | ||
* | ||
* <ul> | ||
* <li>URI - as specified by {@link CatalogProperties#URI}. URI will be the only element when |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think URI should be removed here as it shouldn't be configurable, as it is always there.
Maybe we could mention in the javadoc that uri is always used , no matter what configurable keys the user passes in? Something like:
A comma separated list of elements used, in addition to the hive metastore uri, to compose the key of the client pool cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually , I am now thinking I don't see any use-case where you would turn off any of these from making a key. What do you think? From user point of view, probably simpler is better and we should have reasonable defaults.
Maybe we could have a configurable key but just limit it to to add additional hive conf if we missed something, as a safety valve. In any case, without dynamic loading, the user cant add additional code suppliers like ugi, and thus is limited to just setting config values here.
cc @pvary @RussellSpitzer @flyrain for thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or alternatively, it is just too much of a niche-configuration to be useful, and we go back to get #6175 committed , and Iceberg community will just add different suppliers to the key as necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we should always include URI in the key. For UGI and USER_NAME, they have different behaviors regarding proxy users so I think it's useful to keep it configurable. There're some discussions here: #6175 (comment), #6175 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lirui-apache I went through the discussion, couldn't find the relevant information, would you mind summarizing what situation do we not want to use UGI and USER_NAME as part of the key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@szehon-ho A common way to achieve impersonation is to create proxy users using the engine's principal. But different UGI instances cannot be equivalent, even though they actually represent the same user. E.g. the following two proxy users are different:
UserGroupInformation foo1 = UserGroupInformation.createProxyUser("foo", current);
UserGroupInformation foo2 = UserGroupInformation.createProxyUser("foo", current);
And if we use the above UGIs in the cache key, foo1
and foo2
won't be able to share the underlying connection pool. This may or may not be desirable, given how an engine/service manages UGI instances. That's why I think both UGI and USER_NAME can be useful. But I'm also OK if we just keep UGI and add other options as we have new requirements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, curious, for spark-server what will you use? (I assume that's your use-case?). Just wanted to simplify it a bit for the user, but makes sense though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we're using spark-server and we use UGI in the key (as suggested by our spark team). I suppose spark maintains a HiveCatalog for each user session, which means different sessions won't share the underlying pool, even though they are for the same end user.
return suppliers.build(); | ||
} | ||
|
||
@Value.Immutable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm strugling to see the value of adding these immutable value, although I may be missing something. Why not just use String?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's meant to give different types for different keys. E.g. both URI and USER_NAME are essentially just a wrapper of a string, but they are not comparable and can never be considered equal to each other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it's a bit weird to have multiple wrapper classes for holding a single string or a single list. I was actually able to rewrite the code without using any of those wrapper classes and the tests passed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The strings and lists are used in cache keys and need to be compared. I think we can either assume "how could a user name have the same value as an HMS URI", or we can wrap them in different classes to make sure they won't be equivalent. Personally I prefer the latter. But since we generally agree URI should be mandatory, I guess we can remove these two wrappers for now.
public static <C, E extends Exception> ClientPool<C, E> loadClientPool( | ||
String impl, Map<String, String> properties, Object conf) { | ||
LOG.info("Loading custom client pool implementation: {}", impl); | ||
Preconditions.checkNotNull( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think it would be good to be slightly more precise here and change this to Preconditions.checkArgument(null != impl, ...)
* @return initialized ClientPool object | ||
* @throws IllegalArgumentException if no-arg constructor not found or error during initialization | ||
*/ | ||
public static <C, E extends Exception> ClientPool<C, E> loadClientPool( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests for this should go into TestCatalogUtil
CatalogUtil.ICEBERG_CATALOG_TYPE_HIVE, | ||
properties, | ||
metastore.hiveConf()); | ||
Assert.assertTrue(hiveCatalog.clientPool() instanceof CachedClientPoolWrapper); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assert.assertTrue(hiveCatalog.clientPool() instanceof CachedClientPoolWrapper); | |
Assertions.assertThat(hiveCatalog.clientPool()).isInstanceOf(CachedClientPoolWrapper.class); |
hive-metastore/src/test/java/org/apache/iceberg/hive/TestCachedClientPool.java
Outdated
Show resolved
Hide resolved
Assert.assertTrue( | ||
CachedClientPool.clientPoolCache() | ||
.getIfPresent(CachedClientPool.toKey(Collections.singletonList(uri))) | ||
== clientPool1); | ||
TimeUnit.MILLISECONDS.sleep(EVICTION_INTERVAL - TimeUnit.SECONDS.toMillis(2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test methods takes 23+ seconds, which is too long imo for a simple unit test. We might want to decrease the eviction interval for testing (can be a separate PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think that can be done separately.
hive-metastore/src/test/java/org/apache/iceberg/hive/TestCachedClientPool.java
Outdated
Show resolved
Hide resolved
return suppliers.build(); | ||
} | ||
|
||
@Value.Immutable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it's a bit weird to have multiple wrapper classes for holding a single string or a single list. I was actually able to rewrite the code without using any of those wrapper classes and the tests passed.
} | ||
|
||
@VisibleForTesting | ||
static List<Supplier<Object>> extractKeySuppliers(String cacheKeys, Configuration conf) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH I find the code difficult to reason about, especially given the fact that the Cache key is now essentially a List<Object>
.
I was wondering whether it would be possible to build up a String that includes all of the relevant items in string form.
Something like uri:<...>_ugi:<...>_username:<...>_conf:<...>
but you'd probably need to use delimiters that are unique (and also I don't know if a string representation of UserGroupInformation
would be unique)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to represent UGI as a string while maintain the same equals/hashCode semantics. The UserGroupInformation::toString
method won't do it because it just returns the user names (including both real and proxy user).
0884eb5
to
699cbb5
Compare
Thanks @szehon-ho @nastra for your comments. I've updated the PR. Let me know if I missed anything |
} | ||
|
||
private synchronized void init() { | ||
if (clientPoolCache == null) { | ||
clientPoolCache = | ||
Caffeine.newBuilder() | ||
.expireAfterAccess(evictionInterval, TimeUnit.MILLISECONDS) | ||
.removalListener((key, value, cause) -> ((HiveClientPool) value).close()) | ||
.removalListener((ignored, value, cause) -> ((HiveClientPool) value).close()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think we don't need this change particularly and can revert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually required, otherwise checkstyle fails because the key
here now hides a class member.
hive-metastore/src/main/java/org/apache/iceberg/hive/CachedClientPool.java
Show resolved
Hide resolved
String key = trimmed.substring(CONF_ELEMENT_PREFIX.length()); | ||
ValidationException.check( | ||
!confElements.containsKey(key), "Conf key element %s already specified", key); | ||
confElements.put(key, conf.get(key)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question, how we are sorting conf elements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
confElements
is a TreeMap so that the conf keys are sorted
// generate key elements in a certain order, so that the Key instances are comparable | ||
List<Object> elements = Lists.newArrayList(); | ||
elements.add(conf.get(HiveConf.ConfVars.METASTOREURIS.varname, "")); | ||
if (cacheKeys == null || cacheKeys.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love to add 'default.catalog' to here as well, as I'm not exactly sure any use case where Iceberg re-uses HMS Client with different catalogs. (As we never allow user to pass in catalog explicitly to HMSClient). But I'm ok to do it in another pr which we can contribute, for readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK let's leave it to another PR.
024d56c
to
5b764f3
Compare
@szehon-ho Since we're making the cache more likely to grow, do you think we should put a limit on the cache size? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it looks good to me. ping @pvary if interested to take a look
I didnt see an immediate need, but not sure what you mean? I dont see this flag being used extensively except necessary, to add new dimensions to cache. What would be the behavior if the cache is full? |
I meant the static cache in CachedClientPool. It's created as private synchronized void init() {
if (clientPoolCache == null) {
clientPoolCache =
Caffeine.newBuilder()
.expireAfterAccess(evictionInterval, TimeUnit.MILLISECONDS)
.removalListener((ignored, value, cause) -> ((HiveClientPool) value).close())
.build();
}
} We do have TTL on the entries which by default expire in 5min. Not sure if we also want to set a max size of the cache. E.g. suppose we add UGI to the key, and there're lots of short-lived user sessions, then the cache may hold pools that are no longer needed until they hit TTL. |
@lirui-apache we can consider it, but I'd probably split it to another pr. Also, was re-reading and had a question on the location of the new property key, as it seems Hive-specific, not sure what you think about that. |
@@ -119,6 +119,26 @@ private CatalogProperties() {} | |||
"client.pool.cache.eviction-interval-ms"; | |||
public static final long CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS_DEFAULT = | |||
TimeUnit.MINUTES.toMillis(5); | |||
/** | |||
* A comma separated list of elements used, in addition to the hive metastore uri, to compose the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I just realized this list of properties is for all Catalog, but we only want this property for HiveCatalog. There's a lot of other catalogs that can't can use this flag, given how specific it is.
Do you think we should move this to "CachedClientPool", as it only makes sense there?
I do realize that there's other cached flags in this file currently only used by Hive, but they seem more generic to me and could be re-purposed by other catalog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree CatalogProperties
may not be the best place for the config, but we also have CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS
here which is also specific to CachedClientPool
.
I think it's more consistent and intuitive to keep such configs in the same place because they both configures the cache behavior. And I don't see why TTL is more generic than the key when you're using a cache. So I prefer to leave it here for now. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea as I mentioned, I also saw other cached flags currently used by Hive in this file, but I feel they are more generic.
The reason I felt its more specific than ttl is its javadoc, it looks quite Hive-specific (mention hive metastore uri, ugi, conf). Maybe we can fix that instead then, though was not sure how, so hence the suggestion to move. Maybe if you feel strongly, we can put the note "for Hive Catalog, the following keys are supported.."?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think key is generic enough for a cache, but what can be used to compose the key is implementation dependent. How about we leave the config in CatalogProperties
like this:
/**
* A comma separated list of elements used, in addition to the {@link #URI}, to compose the
* key of the client pool cache.
*
* <p>Supported key elements in a Catalog are implementation-dependent.
*/
public static final String CLIENT_POOL_CACHE_KEYS = "client-pool-cache-keys";
Then we move the rest of the javadoc to CachedClientPool
. Will that be clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that'd be great, thanks @lirui-apache !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@szehon-ho PR updated. Please let me know if you have further comments
I like the final approach you come up with! |
Merged, thanks @lirui-apache for persistence, and thanks @pvary @rdblue @nastra for reviews |
Thanks guys for the reviews! |
To address issue #6697.
This PR allows users to specify custom client pools via a catalog property. Then catalogs can check this property and create the client pool accordingly. An
initialize
API is added toClientPool
so that catalogs can pass in the required configurations.