Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core, Hive: Support pluggable ClientPool #6698

Merged
merged 8 commits into from
Mar 24, 2023

Conversation

lirui-apache
Copy link
Contributor

To address issue #6697.
This PR allows users to specify custom client pools via a catalog property. Then catalogs can check this property and create the client pool accordingly. An initialize API is added to ClientPool so that catalogs can pass in the required configurations.

@lirui-apache
Copy link
Contributor Author

@szehon-ho @pvary @nastra @flyrain Could you please have a look? Thanks~

Copy link
Collaborator

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lirui-apache for the change! Left some comments

core/src/main/java/org/apache/iceberg/CatalogUtil.java Outdated Show resolved Hide resolved
public static <C, E extends Exception> ClientPool<C, E> loadClientPool(
String impl, Map<String, String> properties, Object conf) {
Preconditions.checkNotNull(
impl, "Cannot initialize custom ClientPool, impl class name is null");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, do you think we can also have a better word in the error message than 'impl', ie use the actual property constant name, and we can pass it to Preconditions via the .checkNonNull(errorMsgTemplate, errorMsgArgs)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the util method here shouldn't care about where the impl class name come from. Maybe it's better to change the message like Cannot initialize custom ClientPool, impl class name is null. Please check the value of CatalogProperties.CLIENT_POOL_IMPL. Let me know if you think otherwise

core/src/main/java/org/apache/iceberg/CatalogUtil.java Outdated Show resolved Hide resolved
core/src/main/java/org/apache/iceberg/ClientPool.java Outdated Show resolved Hide resolved
core/src/main/java/org/apache/iceberg/ClientPool.java Outdated Show resolved Hide resolved
@lirui-apache
Copy link
Contributor Author

@szehon-ho Thanks for reviewing! I've updated the PR

@@ -119,6 +119,8 @@ private CatalogProperties() {}
"client.pool.cache.eviction-interval-ms";
public static final long CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS_DEFAULT =
TimeUnit.MINUTES.toMillis(5);
/** Name of the custom {@link ClientPool} implementation class. */
public static final String CLIENT_POOL_IMPL = "client-pool-impl";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it strictly necessary to inject this using reflection? If this is for customization in a certain environment, then maybe it would be better to allow customization in a different way?

Copy link
Collaborator

@szehon-ho szehon-ho Feb 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @rdblue , I think the context is from #6175 (more issues scattered). The general problem being the CachedClientPool that's always on in Iceberg Hive, which is a global cache that doesnt work in all environments, because the current key (metastoreUri) collides too much, leading to the wrong client used for a lot of use cases.

It seems simpler to me to just make cache pluggable, but I am not sure if there's a better solution I'm not seeing to make it non-dynamic? If we dont want user to use a client-pool other than CachedClientPool, at least we need to have a pluggable conf => key generator? Or maybe just allow them to use the toggle between CachedClientPool and the raw ClientPool as an option (as a way to turn off the global cache)? cc @pvary @flyrain as well if any thoughts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps another way is to allow configurable cache keys? We can have pre-defined key elements and users can use these elements to compose cache key used in CachedClientPool. Some key elements I can think of: HMS URI (this is probably mandatory), UGI, user name, specific configurations. When HiveCatalog creates CahcedClientPool, it can check the configured key and pass a key supplier to CahcedClientPool.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lirui-apache yes I am ok with that , if you want to give a try? Wasnt sure if hive property alone addresses your use-case, but addresses ours (putting metastore.catalog.default to key). Maybe it will be easier for user, as they can just set a list of properties, versus implement a new class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'll have a try and update the PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @szehon-ho @rdblue , I have updated the PR to demonstrate the idea. Let me know if you think it makes sense. Thanks~

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi sorry, I realized I let this slip, I will try to look at this soon.

}
configureHadoopConf(clientPool, conf);
clientPool.initialize(properties);
return clientPool;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: This code needs more whitespace between code blocks and statements.

@pvary
Copy link
Contributor

pvary commented Feb 2, 2023

For the reference, the previous discussions:

Copy link
Collaborator

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lirui-apache , I spent some time reading this patch and left my thoughts here.

@@ -53,12 +70,14 @@ public class CachedClientPool implements ClientPool<IMetaStoreClient, TException
properties,
CatalogProperties.CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS,
CatalogProperties.CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS_DEFAULT);
this.keySuppliers =
Copy link
Collaborator

@szehon-ho szehon-ho Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I like the keySupplier idea, wouldn't it be much simpler if we just use a static map of suppliers, like:

static final Map<String, Supplier> keySuppliers = ImmutableMap.of(
   "uri", () -> {
                try {
                  return UserGroupInformation.getCurrentUser();
                } catch (IOException e) {
                  throw new UncheckedIOException(e);
                }
              }),
   "user_name" -> {
            () -> {
                try {
                  String userName = UserGroupInformation.getCurrentUser().getUserName();
                  return UserNameElement.of(userName);
                } catch (IOException e) {
                  throw new UncheckedIOException(e);
                }
              });
}

(as i dont see any need to generate this on the fly.) And then here in CachedClientPool CTOR, we can just apply keySuppliers on the conf to make a Key instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why I use Supplier is to get the current user whenever CachedClientPool::clientPool is called. If a CachedClientPool is not meant to be shared among different users, I think we don't need to keep the suppliers at all, and just generate the Key instance in CTOR. WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea this trips me up every time, but I see, one HiveCatalog making one CachedClientPool. So I think, we are safe and can make a key in the CTOR, and will be much simpler. @pvary for sanity check here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, my thought is, lets keep it simple until we have some use case where a process start sharing the HiveCatalog. I dont see any huge problem just creating one on the fly.

* <p>The following elements are supported:
*
* <ul>
* <li>URI - as specified by {@link CatalogProperties#URI}. URI will be the only element when
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think URI should be removed here as it shouldn't be configurable, as it is always there.

Maybe we could mention in the javadoc that uri is always used , no matter what configurable keys the user passes in? Something like:

A comma separated list of elements used, in addition to the hive metastore uri, to compose the key of the client pool cache.

Copy link
Collaborator

@szehon-ho szehon-ho Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually , I am now thinking I don't see any use-case where you would turn off any of these from making a key. What do you think? From user point of view, probably simpler is better and we should have reasonable defaults.

Maybe we could have a configurable key but just limit it to to add additional hive conf if we missed something, as a safety valve. In any case, without dynamic loading, the user cant add additional code suppliers like ugi, and thus is limited to just setting config values here.

cc @pvary @RussellSpitzer @flyrain for thoughts.

Copy link
Collaborator

@szehon-ho szehon-ho Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or alternatively, it is just too much of a niche-configuration to be useful, and we go back to get #6175 committed , and Iceberg community will just add different suppliers to the key as necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we should always include URI in the key. For UGI and USER_NAME, they have different behaviors regarding proxy users so I think it's useful to keep it configurable. There're some discussions here: #6175 (comment), #6175 (comment)

Copy link
Collaborator

@szehon-ho szehon-ho Mar 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lirui-apache I went through the discussion, couldn't find the relevant information, would you mind summarizing what situation do we not want to use UGI and USER_NAME as part of the key?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szehon-ho A common way to achieve impersonation is to create proxy users using the engine's principal. But different UGI instances cannot be equivalent, even though they actually represent the same user. E.g. the following two proxy users are different:

    UserGroupInformation foo1 = UserGroupInformation.createProxyUser("foo", current);
    UserGroupInformation foo2 = UserGroupInformation.createProxyUser("foo", current);

And if we use the above UGIs in the cache key, foo1 and foo2 won't be able to share the underlying connection pool. This may or may not be desirable, given how an engine/service manages UGI instances. That's why I think both UGI and USER_NAME can be useful. But I'm also OK if we just keep UGI and add other options as we have new requirements.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, curious, for spark-server what will you use? (I assume that's your use-case?). Just wanted to simplify it a bit for the user, but makes sense though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we're using spark-server and we use UGI in the key (as suggested by our spark team). I suppose spark maintains a HiveCatalog for each user session, which means different sessions won't share the underlying pool, even though they are for the same end user.

return suppliers.build();
}

@Value.Immutable
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm strugling to see the value of adding these immutable value, although I may be missing something. Why not just use String?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's meant to give different types for different keys. E.g. both URI and USER_NAME are essentially just a wrapper of a string, but they are not comparable and can never be considered equal to each other.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's a bit weird to have multiple wrapper classes for holding a single string or a single list. I was actually able to rewrite the code without using any of those wrapper classes and the tests passed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The strings and lists are used in cache keys and need to be compared. I think we can either assume "how could a user name have the same value as an HMS URI", or we can wrap them in different classes to make sure they won't be equivalent. Personally I prefer the latter. But since we generally agree URI should be mandatory, I guess we can remove these two wrappers for now.

public static <C, E extends Exception> ClientPool<C, E> loadClientPool(
String impl, Map<String, String> properties, Object conf) {
LOG.info("Loading custom client pool implementation: {}", impl);
Preconditions.checkNotNull(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think it would be good to be slightly more precise here and change this to Preconditions.checkArgument(null != impl, ...)

* @return initialized ClientPool object
* @throws IllegalArgumentException if no-arg constructor not found or error during initialization
*/
public static <C, E extends Exception> ClientPool<C, E> loadClientPool(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests for this should go into TestCatalogUtil

CatalogUtil.ICEBERG_CATALOG_TYPE_HIVE,
properties,
metastore.hiveConf());
Assert.assertTrue(hiveCatalog.clientPool() instanceof CachedClientPoolWrapper);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Assert.assertTrue(hiveCatalog.clientPool() instanceof CachedClientPoolWrapper);
Assertions.assertThat(hiveCatalog.clientPool()).isInstanceOf(CachedClientPoolWrapper.class);

Assert.assertTrue(
CachedClientPool.clientPoolCache()
.getIfPresent(CachedClientPool.toKey(Collections.singletonList(uri)))
== clientPool1);
TimeUnit.MILLISECONDS.sleep(EVICTION_INTERVAL - TimeUnit.SECONDS.toMillis(2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test methods takes 23+ seconds, which is too long imo for a simple unit test. We might want to decrease the eviction interval for testing (can be a separate PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think that can be done separately.

return suppliers.build();
}

@Value.Immutable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's a bit weird to have multiple wrapper classes for holding a single string or a single list. I was actually able to rewrite the code without using any of those wrapper classes and the tests passed.

}

@VisibleForTesting
static List<Supplier<Object>> extractKeySuppliers(String cacheKeys, Configuration conf) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I find the code difficult to reason about, especially given the fact that the Cache key is now essentially a List<Object>.

I was wondering whether it would be possible to build up a String that includes all of the relevant items in string form.
Something like uri:<...>_ugi:<...>_username:<...>_conf:<...> but you'd probably need to use delimiters that are unique (and also I don't know if a string representation of UserGroupInformation would be unique)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to represent UGI as a string while maintain the same equals/hashCode semantics. The UserGroupInformation::toString method won't do it because it just returns the user names (including both real and proxy user).

@lirui-apache
Copy link
Contributor Author

Thanks @szehon-ho @nastra for your comments. I've updated the PR. Let me know if I missed anything

}

private synchronized void init() {
if (clientPoolCache == null) {
clientPoolCache =
Caffeine.newBuilder()
.expireAfterAccess(evictionInterval, TimeUnit.MILLISECONDS)
.removalListener((key, value, cause) -> ((HiveClientPool) value).close())
.removalListener((ignored, value, cause) -> ((HiveClientPool) value).close())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think we don't need this change particularly and can revert.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually required, otherwise checkstyle fails because the key here now hides a class member.

String key = trimmed.substring(CONF_ELEMENT_PREFIX.length());
ValidationException.check(
!confElements.containsKey(key), "Conf key element %s already specified", key);
confElements.put(key, conf.get(key));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question, how we are sorting conf elements?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confElements is a TreeMap so that the conf keys are sorted

// generate key elements in a certain order, so that the Key instances are comparable
List<Object> elements = Lists.newArrayList();
elements.add(conf.get(HiveConf.ConfVars.METASTOREURIS.varname, ""));
if (cacheKeys == null || cacheKeys.isEmpty()) {
Copy link
Collaborator

@szehon-ho szehon-ho Mar 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love to add 'default.catalog' to here as well, as I'm not exactly sure any use case where Iceberg re-uses HMS Client with different catalogs. (As we never allow user to pass in catalog explicitly to HMSClient). But I'm ok to do it in another pr which we can contribute, for readability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK let's leave it to another PR.

@lirui-apache
Copy link
Contributor Author

@szehon-ho Since we're making the cache more likely to grow, do you think we should put a limit on the cache size?

Copy link
Collaborator

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it looks good to me. ping @pvary if interested to take a look

@szehon-ho
Copy link
Collaborator

szehon-ho commented Mar 17, 2023

@szehon-ho Since we're making the cache more likely to grow, do you think we should put a limit on the cache size?

I didnt see an immediate need, but not sure what you mean? I dont see this flag being used extensively except necessary, to add new dimensions to cache. What would be the behavior if the cache is full?

@lirui-apache
Copy link
Contributor Author

@szehon-ho Since we're making the cache more likely to grow, do you think we should put a limit on the cache size?

I didnt see an immediate need, but not sure what you mean? I dont see this flag being used extensively except necessary, to add new dimensions to cache. What would be the behavior if the cache is full?

I meant the static cache in CachedClientPool. It's created as

  private synchronized void init() {
    if (clientPoolCache == null) {
      clientPoolCache =
          Caffeine.newBuilder()
              .expireAfterAccess(evictionInterval, TimeUnit.MILLISECONDS)
              .removalListener((ignored, value, cause) -> ((HiveClientPool) value).close())
              .build();
    }
  }

We do have TTL on the entries which by default expire in 5min. Not sure if we also want to set a max size of the cache. E.g. suppose we add UGI to the key, and there're lots of short-lived user sessions, then the cache may hold pools that are no longer needed until they hit TTL.

@szehon-ho
Copy link
Collaborator

szehon-ho commented Mar 21, 2023

@szehon-ho Since we're making the cache more likely to grow, do you think we should put a limit on the cache size?

@lirui-apache we can consider it, but I'd probably split it to another pr.

Also, was re-reading and had a question on the location of the new property key, as it seems Hive-specific, not sure what you think about that.

@@ -119,6 +119,26 @@ private CatalogProperties() {}
"client.pool.cache.eviction-interval-ms";
public static final long CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS_DEFAULT =
TimeUnit.MINUTES.toMillis(5);
/**
* A comma separated list of elements used, in addition to the hive metastore uri, to compose the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I just realized this list of properties is for all Catalog, but we only want this property for HiveCatalog. There's a lot of other catalogs that can't can use this flag, given how specific it is.

Do you think we should move this to "CachedClientPool", as it only makes sense there?

I do realize that there's other cached flags in this file currently only used by Hive, but they seem more generic to me and could be re-purposed by other catalog.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree CatalogProperties may not be the best place for the config, but we also have CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS here which is also specific to CachedClientPool.

I think it's more consistent and intuitive to keep such configs in the same place because they both configures the cache behavior. And I don't see why TTL is more generic than the key when you're using a cache. So I prefer to leave it here for now. What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea as I mentioned, I also saw other cached flags currently used by Hive in this file, but I feel they are more generic.

The reason I felt its more specific than ttl is its javadoc, it looks quite Hive-specific (mention hive metastore uri, ugi, conf). Maybe we can fix that instead then, though was not sure how, so hence the suggestion to move. Maybe if you feel strongly, we can put the note "for Hive Catalog, the following keys are supported.."?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think key is generic enough for a cache, but what can be used to compose the key is implementation dependent. How about we leave the config in CatalogProperties like this:

  /**
   * A comma separated list of elements used, in addition to the {@link #URI}, to compose the
   * key of the client pool cache.
   * 
   * <p>Supported key elements in a Catalog are implementation-dependent.
   */
  public static final String CLIENT_POOL_CACHE_KEYS = "client-pool-cache-keys";

Then we move the rest of the javadoc to CachedClientPool. Will that be clearer?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that'd be great, thanks @lirui-apache !

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szehon-ho PR updated. Please let me know if you have further comments

@pvary
Copy link
Contributor

pvary commented Mar 24, 2023

I like the final approach you come up with!
LGTM +1

@szehon-ho szehon-ho merged commit ef5c731 into apache:master Mar 24, 2023
@szehon-ho
Copy link
Collaborator

Merged, thanks @lirui-apache for persistence, and thanks @pvary @rdblue @nastra for reviews

@lirui-apache lirui-apache deleted the pluggable-client-pool branch March 25, 2023 05:27
@lirui-apache
Copy link
Contributor Author

Thanks guys for the reviews!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants