Core, Hive: Support pluggable ClientPool #6698

lirui-apache · 2023-01-30T08:04:08Z

To address issue #6697.
This PR allows users to specify custom client pools via a catalog property. Then catalogs can check this property and create the client pool accordingly. An initialize API is added to ClientPool so that catalogs can pass in the required configurations.

lirui-apache · 2023-01-30T10:44:33Z

@szehon-ho @pvary @nastra @flyrain Could you please have a look? Thanks~

szehon-ho

Thanks @lirui-apache for the change! Left some comments

core/src/main/java/org/apache/iceberg/CatalogProperties.java

core/src/main/java/org/apache/iceberg/CatalogUtil.java

szehon-ho · 2023-01-31T06:08:53Z

core/src/main/java/org/apache/iceberg/CatalogUtil.java

+  public static <C, E extends Exception> ClientPool<C, E> loadClientPool(
+      String impl, Map<String, String> properties, Object conf) {
+    Preconditions.checkNotNull(
+        impl, "Cannot initialize custom ClientPool, impl class name is null");


nit, do you think we can also have a better word in the error message than 'impl', i.e. use the actual property constant name, and we can pass it to Preconditions via .checkNotNull(errorMsgTemplate, errorMsgArgs)?

IMO the util method here shouldn't care about where the impl class name comes from. Maybe it's better to change the message to something like "Cannot initialize custom ClientPool, impl class name is null. Please check the value of CatalogProperties.CLIENT_POOL_IMPL". Let me know if you think otherwise.

core/src/main/java/org/apache/iceberg/CatalogUtil.java

core/src/main/java/org/apache/iceberg/ClientPool.java

hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java

lirui-apache · 2023-01-31T07:59:26Z

@szehon-ho Thanks for reviewing! I've updated the PR

rdblue · 2023-01-31T19:40:52Z

core/src/main/java/org/apache/iceberg/CatalogProperties.java

@@ -119,6 +119,8 @@ private CatalogProperties() {}
      "client.pool.cache.eviction-interval-ms";
  public static final long CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS_DEFAULT =
      TimeUnit.MINUTES.toMillis(5);
+  /** Name of the custom {@link ClientPool} implementation class. */
+  public static final String CLIENT_POOL_IMPL = "client-pool-impl";


Is it strictly necessary to inject this using reflection? If this is for customization in a certain environment, then maybe it would be better to allow customization in a different way?

Hi @rdblue, I think the context is from #6175 (with more issues scattered elsewhere). The general problem is the CachedClientPool that's always on in Iceberg Hive. It is a global cache that doesn't work in all environments, because the current key (metastoreUri) collides too often, leading to the wrong client being used in a lot of use cases.

It seems simpler to me to just make the cache pluggable, but I am not sure if there's a better, non-dynamic solution that I'm not seeing. If we don't want users to use a client pool other than CachedClientPool, at least we need to have a pluggable conf => key generator? Or maybe just allow them to toggle between CachedClientPool and the raw ClientPool as an option (as a way to turn off the global cache)? cc @pvary @flyrain as well if any thoughts

Perhaps another way is to allow configurable cache keys? We can have pre-defined key elements, and users can combine these elements to compose the cache key used in CachedClientPool. Some key elements I can think of: HMS URI (this is probably mandatory), UGI, user name, specific configurations. When HiveCatalog creates a CachedClientPool, it can check the configured key and pass a key supplier to the CachedClientPool.

Hi @lirui-apache yes I am ok with that , if you want to give a try? Wasnt sure if hive property alone addresses your use-case, but addresses ours (putting metastore.catalog.default to key). Maybe it will be easier for user, as they can just set a list of properties, versus implement a new class.

Yeah I'll have a try and update the PR

Hi @szehon-ho @rdblue , I have updated the PR to demonstrate the idea. Let me know if you think it makes sense. Thanks~

Hi sorry, I realized I let this slip, I will try to look at this soon.

rdblue · 2023-01-31T19:41:32Z

core/src/main/java/org/apache/iceberg/CatalogUtil.java

+    }
+    configureHadoopConf(clientPool, conf);
+    clientPool.initialize(properties);
+    return clientPool;


Style: This code needs more whitespace between code blocks and statements.

pvary · 2023-02-02T06:02:51Z

For the reference, the previous discussions:

Hive: Add UGI to the key in CachedClientPool #6175
Hive: More distinctive cached client pool key to avoid conflict #5378

szehon-ho

Thanks @lirui-apache , I spent some time reading this patch and left my thoughts here.

szehon-ho · 2023-03-10T00:33:57Z

hive-metastore/src/main/java/org/apache/iceberg/hive/CachedClientPool.java

@@ -53,12 +70,14 @@ public class CachedClientPool implements ClientPool<IMetaStoreClient, TException
            properties,
            CatalogProperties.CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS,
            CatalogProperties.CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS_DEFAULT);
+    this.keySuppliers =


While I like the keySupplier idea, wouldn't it be much simpler if we just use a static map of suppliers, like:

    static final Map<String, Supplier<Object>> keySuppliers =
        ImmutableMap.of(
            "ugi",
            () -> {
              try {
                return UserGroupInformation.getCurrentUser();
              } catch (IOException e) {
                throw new UncheckedIOException(e);
              }
            },
            "user_name",
            () -> {
              try {
                String userName = UserGroupInformation.getCurrentUser().getUserName();
                return UserNameElement.of(userName);
              } catch (IOException e) {
                throw new UncheckedIOException(e);
              }
            });

(as i dont see any need to generate this on the fly.) And then here in CachedClientPool CTOR, we can just apply keySuppliers on the conf to make a Key instead?

The reason why I use Supplier is to get the current user whenever CachedClientPool::clientPool is called. If a CachedClientPool is not meant to be shared among different users, I think we don't need to keep the suppliers at all, and just generate the Key instance in CTOR. WDYT?

Yea this trips me up every time, but I see, one HiveCatalog making one CachedClientPool. So I think, we are safe and can make a key in the CTOR, and will be much simpler. @pvary for sanity check here.

Thanks, my thought is, lets keep it simple until we have some use case where a process start sharing the HiveCatalog. I dont see any huge problem just creating one on the fly.
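The trade-off discussed in this thread (resolving the user through a Supplier on every lookup versus building the key once in the constructor) can be illustrated with a minimal sketch. Everything here is hypothetical: `KeyTiming`, `LazyKeyPool`, `EagerKeyPool`, and the mutable `currentUser` field merely stand in for the real classes and for a call such as `UserGroupInformation.getCurrentUser()`.

```java
import java.util.function.Supplier;

class KeyTiming {
  // Hypothetical stand-in for "the current user" of the calling thread or session.
  static String currentUser = "alice";

  /** Resolves the user on every lookup, so one pool object could serve several users. */
  static final class LazyKeyPool {
    private final Supplier<String> userSupplier = () -> currentUser;

    String key(String uri) {
      return uri + "|" + userSupplier.get();
    }
  }

  /** Resolves the user once, in the constructor; simpler when a pool is never shared. */
  static final class EagerKeyPool {
    private final String key;

    EagerKeyPool(String uri) {
      this.key = uri + "|" + currentUser;
    }

    String key() {
      return key;
    }
  }

  public static void main(String[] args) {
    LazyKeyPool lazy = new LazyKeyPool();
    EagerKeyPool eager = new EagerKeyPool("thrift://hms:9083");

    currentUser = "bob"; // the user changes after both pools were created

    System.out.println(lazy.key("thrift://hms:9083")); // thrift://hms:9083|bob
    System.out.println(eager.key()); // thrift://hms:9083|alice
  }
}
```

If each catalog creates its own pool and is used by a single user, as assumed in the comments above, the eager variant gives the same keys with less machinery.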

szehon-ho · 2023-03-10T00:36:25Z

core/src/main/java/org/apache/iceberg/CatalogProperties.java

+   * <p>The following elements are supported:
+   *
+   * <ul>
+   *   <li>URI - as specified by {@link CatalogProperties#URI}. URI will be the only element when


I think URI should be removed here as it shouldn't be configurable, as it is always there.

Maybe we could mention in the javadoc that uri is always used , no matter what configurable keys the user passes in? Something like:

A comma separated list of elements used, in addition to the hive metastore uri, to compose the key of the client pool cache.

Actually , I am now thinking I don't see any use-case where you would turn off any of these from making a key. What do you think? From user point of view, probably simpler is better and we should have reasonable defaults.

Maybe we could have a configurable key but limit it to adding additional hive conf in case we missed something, as a safety valve. In any case, without dynamic loading, the user can't add additional code suppliers like ugi, and thus is limited to just setting config values here.

cc @pvary @RussellSpitzer @flyrain for thoughts.

Or alternatively, it is just too much of a niche-configuration to be useful, and we go back to get #6175 committed , and Iceberg community will just add different suppliers to the key as necessary.

I agree we should always include URI in the key. For UGI and USER_NAME, they have different behaviors regarding proxy users so I think it's useful to keep it configurable. There're some discussions here: #6175 (comment), #6175 (comment)

@lirui-apache I went through the discussion, couldn't find the relevant information, would you mind summarizing what situation do we not want to use UGI and USER_NAME as part of the key?

@szehon-ho A common way to achieve impersonation is to create proxy users using the engine's principal. But different UGI instances cannot be equivalent, even though they actually represent the same user. E.g. the following two proxy users are different:

    UserGroupInformation foo1 = UserGroupInformation.createProxyUser("foo", current);
    UserGroupInformation foo2 = UserGroupInformation.createProxyUser("foo", current);

And if we use the above UGIs in the cache key, foo1 and foo2 won't be able to share the underlying connection pool. This may or may not be desirable, given how an engine/service manages UGI instances. That's why I think both UGI and USER_NAME can be useful. But I'm also OK if we just keep UGI and add other options as we have new requirements.
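A small sketch of the difference described above, without depending on Hadoop. `FakeUgi` is a hypothetical stand-in that keeps identity-based `equals`/`hashCode`, which models the behaviour the comment attributes to separately created proxy users; it is not the real `UserGroupInformation` class.

```java
import java.util.HashMap;
import java.util.Map;

class ProxyUserKeys {
  /** Hypothetical UGI-like object: it inherits Object's identity-based equals and hashCode. */
  static final class FakeUgi {
    final String userName;

    FakeUgi(String userName) {
      this.userName = userName;
    }
  }

  /** Number of cache entries when the object itself is part of the key (the "UGI" option). */
  static int poolsKeyedByInstance(FakeUgi... users) {
    Map<Object, String> cache = new HashMap<>();
    for (FakeUgi user : users) {
      cache.putIfAbsent(user, "pool");
    }
    return cache.size();
  }

  /** Number of cache entries when only the name is part of the key (the "USER_NAME" option). */
  static int poolsKeyedByName(FakeUgi... users) {
    Map<Object, String> cache = new HashMap<>();
    for (FakeUgi user : users) {
      cache.putIfAbsent(user.userName, "pool");
    }
    return cache.size();
  }

  public static void main(String[] args) {
    FakeUgi foo1 = new FakeUgi("foo");
    FakeUgi foo2 = new FakeUgi("foo");
    System.out.println(poolsKeyedByInstance(foo1, foo2)); // 2: no sharing between instances
    System.out.println(poolsKeyedByName(foo1, foo2)); // 1: same user name shares a pool
  }
}
```

Which result is preferable depends on how the engine manages its user objects, which is why the thread keeps both options configurable.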

OK, curious, for spark-server what will you use? (I assume that's your use-case?). Just wanted to simplify it a bit for the user, but makes sense though.

Yeah we're using spark-server and we use UGI in the key (as suggested by our spark team). I suppose spark maintains a HiveCatalog for each user session, which means different sessions won't share the underlying pool, even though they are for the same end user.

core/src/main/java/org/apache/iceberg/CatalogProperties.java

szehon-ho · 2023-03-10T00:44:15Z

hive-metastore/src/main/java/org/apache/iceberg/hive/CachedClientPool.java

+    return suppliers.build();
+  }
+
+  @Value.Immutable


I'm struggling to see the value of adding these immutable values, although I may be missing something. Why not just use String?

It's meant to give different types for different keys. E.g. both URI and USER_NAME are essentially just a wrapper of a string, but they are not comparable and can never be considered equal to each other.

I agree that it's a bit weird to have multiple wrapper classes for holding a single string or a single list. I was actually able to rewrite the code without using any of those wrapper classes and the tests passed.

The strings and lists are used in cache keys and need to be compared. I think we can either assume "how could a user name have the same value as an HMS URI", or we can wrap them in different classes to make sure they won't be equivalent. Personally I prefer the latter. But since we generally agree URI should be mandatory, I guess we can remove these two wrappers for now.
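The argument for typed wrappers can be sketched as follows. The class names are hypothetical and hand-written for illustration; the PR itself generated its wrappers with `@Value.Immutable`. The point is only that two elements holding the same string stay unequal when their types differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/** Hypothetical base class: equality requires the same concrete type and the same value. */
abstract class KeyElement {
  private final String value;

  KeyElement(String value) {
    this.value = Objects.requireNonNull(value, "value");
  }

  @Override
  public boolean equals(Object other) {
    return other != null
        && getClass() == other.getClass()
        && value.equals(((KeyElement) other).value);
  }

  @Override
  public int hashCode() {
    return Objects.hash(getClass(), value);
  }
}

final class UriElement extends KeyElement {
  UriElement(String value) {
    super(value);
  }
}

final class UserNameElement extends KeyElement {
  UserNameElement(String value) {
    super(value);
  }
}

class TypedKeyDemo {
  public static void main(String[] args) {
    List<Object> rawA = Arrays.asList("foo");
    List<Object> rawB = Arrays.asList("foo");
    System.out.println(rawA.equals(rawB)); // true: raw strings cannot tell URI from user name

    List<Object> typedA = Arrays.asList(new UriElement("foo"));
    List<Object> typedB = Arrays.asList(new UserNameElement("foo"));
    System.out.println(typedA.equals(typedB)); // false: the element type is part of equality
  }
}
```

When every key has the same fixed layout (URI always first, as agreed above), positional order already disambiguates the elements, which is the case for dropping the wrappers.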

nastra · 2023-02-02T17:02:03Z

core/src/main/java/org/apache/iceberg/CatalogUtil.java

+  public static <C, E extends Exception> ClientPool<C, E> loadClientPool(
+      String impl, Map<String, String> properties, Object conf) {
+    LOG.info("Loading custom client pool implementation: {}", impl);
+    Preconditions.checkNotNull(


nit: I think it would be good to be slightly more precise here and change this to Preconditions.checkArgument(null != impl, ...)
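One likely reason for this suggestion is the exception type: Guava-style `checkNotNull` throws `NullPointerException`, while `checkArgument` throws `IllegalArgumentException`, which is what the method's Javadoc declares. The sketch below uses hand-rolled helpers with the same names to show the difference without a Guava dependency; `NullChecks` and `describeFailure` are hypothetical.

```java
class NullChecks {
  /** Mirrors the Guava-style contract: a null reference fails with NullPointerException. */
  static <T> T checkNotNull(T reference, String message) {
    if (reference == null) {
      throw new NullPointerException(message);
    }
    return reference;
  }

  /** Mirrors the Guava-style contract: a false condition fails with IllegalArgumentException. */
  static void checkArgument(boolean expression, String message) {
    if (!expression) {
      throw new IllegalArgumentException(message);
    }
  }

  /** Runs a check and reports which exception type, if any, it raised. */
  static String describeFailure(Runnable check) {
    try {
      check.run();
      return "no exception";
    } catch (RuntimeException e) {
      return e.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) {
    String impl = null;
    String message = "Cannot initialize custom ClientPool, impl class name is null";

    System.out.println(describeFailure(() -> checkNotNull(impl, message)));
    // NullPointerException
    System.out.println(describeFailure(() -> checkArgument(null != impl, message)));
    // IllegalArgumentException
  }
}
```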

nastra · 2023-02-02T17:03:23Z

core/src/main/java/org/apache/iceberg/CatalogUtil.java

+   * @return initialized ClientPool object
+   * @throws IllegalArgumentException if no-arg constructor not found or error during initialization
+   */
+  public static <C, E extends Exception> ClientPool<C, E> loadClientPool(


tests for this should go into TestCatalogUtil

nastra · 2023-02-02T17:04:35Z

hive-metastore/src/test/java/org/apache/iceberg/hive/TestLoadHiveCatalog.java

+                CatalogUtil.ICEBERG_CATALOG_TYPE_HIVE,
+                properties,
+                metastore.hiveConf());
+    Assert.assertTrue(hiveCatalog.clientPool() instanceof CachedClientPoolWrapper);


Suggested change

-    Assert.assertTrue(hiveCatalog.clientPool() instanceof CachedClientPoolWrapper);
+    Assertions.assertThat(hiveCatalog.clientPool()).isInstanceOf(CachedClientPoolWrapper.class);

hive-metastore/src/test/java/org/apache/iceberg/hive/TestCachedClientPool.java

nastra · 2023-03-10T09:22:00Z

hive-metastore/src/test/java/org/apache/iceberg/hive/TestCachedClientPool.java

+    Assert.assertTrue(
+        CachedClientPool.clientPoolCache()
+                .getIfPresent(CachedClientPool.toKey(Collections.singletonList(uri)))
+            == clientPool1);
    TimeUnit.MILLISECONDS.sleep(EVICTION_INTERVAL - TimeUnit.SECONDS.toMillis(2));


This test method takes 23+ seconds, which is too long IMO for a simple unit test. We might want to decrease the eviction interval for testing (can be a separate PR).

Yeah I think that can be done separately.

hive-metastore/src/test/java/org/apache/iceberg/hive/TestCachedClientPool.java

nastra · 2023-03-10T09:26:32Z

hive-metastore/src/main/java/org/apache/iceberg/hive/CachedClientPool.java

+    return suppliers.build();
+  }
+
+  @Value.Immutable


I agree that it's a bit weird to have multiple wrapper classes for holding a single string or a single list. I was actually able to rewrite the code without using any of those wrapper classes and the tests passed.

nastra · 2023-03-10T09:38:57Z

hive-metastore/src/main/java/org/apache/iceberg/hive/CachedClientPool.java

+  }
+
+  @VisibleForTesting
+  static List<Supplier<Object>> extractKeySuppliers(String cacheKeys, Configuration conf) {


TBH I find the code difficult to reason about, especially given the fact that the Cache key is now essentially a List<Object>.

I was wondering whether it would be possible to build up a String that includes all of the relevant items in string form.
Something like uri:<...>_ugi:<...>_username:<...>_conf:<...> but you'd probably need to use delimiters that are unique (and also I don't know if a string representation of UserGroupInformation would be unique)

I'm not sure how to represent UGI as a string while maintaining the same equals/hashCode semantics. The UserGroupInformation::toString method won't do it because it just returns the user names (including both real and proxy user).

lirui-apache · 2023-03-13T12:15:27Z

Thanks @szehon-ho @nastra for your comments. I've updated the PR. Let me know if I missed anything

szehon-ho · 2023-03-14T22:22:46Z

hive-metastore/src/main/java/org/apache/iceberg/hive/CachedClientPool.java

  }

  private synchronized void init() {
    if (clientPoolCache == null) {
      clientPoolCache =
          Caffeine.newBuilder()
              .expireAfterAccess(evictionInterval, TimeUnit.MILLISECONDS)
-              .removalListener((key, value, cause) -> ((HiveClientPool) value).close())
+              .removalListener((ignored, value, cause) -> ((HiveClientPool) value).close())


Nit: I think we don't need this change particularly and can revert.

This is actually required, otherwise checkstyle fails because the key here now hides a class member.

hive-metastore/src/main/java/org/apache/iceberg/hive/CachedClientPool.java

szehon-ho · 2023-03-14T22:28:16Z

hive-metastore/src/main/java/org/apache/iceberg/hive/CachedClientPool.java

+        String key = trimmed.substring(CONF_ELEMENT_PREFIX.length());
+        ValidationException.check(
+            !confElements.containsKey(key), "Conf key element %s already specified", key);
+        confElements.put(key, conf.get(key));


Question, how we are sorting conf elements?

confElements is a TreeMap so that the conf keys are sorted
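A rough sketch of why the sorted map matters: two configurations that list the same `conf:` elements in a different order should produce equal keys. The sketch uses a plain `Map` instead of Hadoop's `Configuration`, handles only a `user_name` element and `conf:` elements, and throws `IllegalArgumentException` rather than Iceberg's `ValidationException`. The element spellings and the `CacheKeySketch.toKey` signature are assumptions for illustration, not the PR's actual code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CacheKeySketch {
  static final String CONF_ELEMENT_PREFIX = "conf:";

  /** Builds a key as [uri, optional user name, optional sorted conf map]. */
  static List<Object> toKey(
      String cacheKeys, Map<String, String> conf, String uri, String userName) {
    List<Object> elements = new ArrayList<>();
    elements.add(uri); // the metastore URI is always the first element
    if (cacheKeys == null || cacheKeys.isEmpty()) {
      return elements;
    }

    boolean includeUserName = false;
    Map<String, String> confElements = new TreeMap<>(); // sorted by conf key
    for (String element : cacheKeys.split(",")) {
      String trimmed = element.trim();
      if (trimmed.equalsIgnoreCase("user_name")) {
        if (includeUserName) {
          throw new IllegalArgumentException("user_name key element already specified");
        }
        includeUserName = true;
      } else if (trimmed.startsWith(CONF_ELEMENT_PREFIX)) {
        String key = trimmed.substring(CONF_ELEMENT_PREFIX.length());
        if (confElements.containsKey(key)) {
          throw new IllegalArgumentException("Conf key element " + key + " already specified");
        }
        confElements.put(key, conf.get(key));
      } else {
        throw new IllegalArgumentException("Unknown key element " + trimmed);
      }
    }

    // Emit elements in a fixed order so that equal configurations yield equal keys.
    if (includeUserName) {
      elements.add(userName);
    }
    if (!confElements.isEmpty()) {
      elements.add(confElements);
    }
    return elements;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("a", "1");
    conf.put("b", "2");
    String uri = "thrift://hms:9083";

    List<Object> first = toKey("conf:b, conf:a", conf, uri, "alice");
    List<Object> second = toKey("conf:a,conf:b", conf, uri, "alice");
    System.out.println(first.equals(second)); // true
  }
}
```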

szehon-ho · 2023-03-14T22:32:01Z

hive-metastore/src/main/java/org/apache/iceberg/hive/CachedClientPool.java

+    // generate key elements in a certain order, so that the Key instances are comparable
+    List<Object> elements = Lists.newArrayList();
+    elements.add(conf.get(HiveConf.ConfVars.METASTOREURIS.varname, ""));
+    if (cacheKeys == null || cacheKeys.isEmpty()) {


I would love to add 'default.catalog' here as well, as I'm not aware of any use case where Iceberg re-uses an HMS client with different catalogs (we never allow the user to pass in a catalog explicitly to HMSClient). But I'm OK to do it in another PR, which we can contribute, for readability.

OK let's leave it to another PR.

lirui-apache · 2023-03-17T11:05:23Z

@szehon-ho Since we're making the cache more likely to grow, do you think we should put a limit on the cache size?

szehon-ho

I think it looks good to me. ping @pvary if interested to take a look

szehon-ho · 2023-03-17T18:20:12Z

> @szehon-ho Since we're making the cache more likely to grow, do you think we should put a limit on the cache size?

I didn't see an immediate need, but I'm not sure what you mean. I don't see this flag being used extensively except when necessary, to add new dimensions to the cache. What would be the behavior if the cache is full?

lirui-apache · 2023-03-20T12:52:19Z

> > @szehon-ho Since we're making the cache more likely to grow, do you think we should put a limit on the cache size?

> I didn't see an immediate need, but I'm not sure what you mean. I don't see this flag being used extensively except when necessary, to add new dimensions to the cache. What would be the behavior if the cache is full?

I meant the static cache in CachedClientPool. It's created as

  private synchronized void init() {
    if (clientPoolCache == null) {
      clientPoolCache =
          Caffeine.newBuilder()
              .expireAfterAccess(evictionInterval, TimeUnit.MILLISECONDS)
              .removalListener((ignored, value, cause) -> ((HiveClientPool) value).close())
              .build();
    }
  }

We do have TTL on the entries which by default expire in 5min. Not sure if we also want to set a max size of the cache. E.g. suppose we add UGI to the key, and there're lots of short-lived user sessions, then the cache may hold pools that are no longer needed until they hit TTL.
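To make the question concrete, here is a sketch of what a size bound adds on top of a TTL. It deliberately uses a plain `LinkedHashMap` in access order rather than Caffeine, so it is not how `CachedClientPool` is implemented; with Caffeine the equivalent knob would be `maximumSize` on the builder. `BoundedPoolCache` and `FakePool` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BoundedPoolCache {
  /** Hypothetical stand-in for a client pool that must be closed when evicted. */
  static final class FakePool {
    final String name;
    boolean closed;

    FakePool(String name) {
      this.name = name;
    }

    void close() {
      closed = true;
    }
  }

  private final Map<String, FakePool> pools;

  BoundedPoolCache(final int maxSize) {
    // accessOrder=true keeps the least recently accessed entry at the head of the map.
    this.pools =
        new LinkedHashMap<String, FakePool>(16, 0.75f, true) {
          @Override
          protected boolean removeEldestEntry(Map.Entry<String, FakePool> eldest) {
            if (size() > maxSize) {
              eldest.getValue().close(); // plays the role of the removal listener
              return true;
            }
            return false;
          }
        };
  }

  /** Returns the pool for the key, creating it (and possibly evicting another) if absent. */
  synchronized FakePool get(String key) {
    FakePool pool = pools.get(key);
    if (pool == null) {
      pool = new FakePool(key);
      pools.put(key, pool);
    }
    return pool;
  }

  synchronized int entryCount() {
    return pools.size();
  }

  public static void main(String[] args) {
    BoundedPoolCache cache = new BoundedPoolCache(2);
    FakePool a = cache.get("a");
    FakePool b = cache.get("b");
    cache.get("a"); // touch "a" so that "b" becomes the eviction candidate
    cache.get("c"); // exceeds the bound and evicts "b"
    System.out.println(a.closed + " " + b.closed + " " + cache.entryCount()); // false true 2
  }
}
```

A bound like this answers the "what happens when the cache is full" question with eviction of the least recently used pool. Note that a pool evicted this way may still be referenced by a live catalog, which is one reason to treat the limit as a separate change.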

szehon-ho · 2023-03-21T17:17:59Z

> @szehon-ho Since we're making the cache more likely to grow, do you think we should put a limit on the cache size?

@lirui-apache we can consider it, but I'd probably split it to another pr.

Also, was re-reading and had a question on the location of the new property key, as it seems Hive-specific, not sure what you think about that.

szehon-ho · 2023-03-21T17:15:28Z

core/src/main/java/org/apache/iceberg/CatalogProperties.java

@@ -119,6 +119,26 @@ private CatalogProperties() {}
      "client.pool.cache.eviction-interval-ms";
  public static final long CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS_DEFAULT =
      TimeUnit.MINUTES.toMillis(5);
+  /**
+   * A comma separated list of elements used, in addition to the hive metastore uri, to compose the


Actually, I just realized this list of properties is for all catalogs, but we only want this property for HiveCatalog. There are a lot of other catalogs that can't use this flag, given how specific it is.

Do you think we should move this to "CachedClientPool", as it only makes sense there?

I do realize that there's other cached flags in this file currently only used by Hive, but they seem more generic to me and could be re-purposed by other catalog.

I agree CatalogProperties may not be the best place for the config, but we also have CLIENT_POOL_CACHE_EVICTION_INTERVAL_MS here which is also specific to CachedClientPool.

I think it's more consistent and intuitive to keep such configs in the same place because they both configure the cache behavior. And I don't see why TTL is more generic than the key when you're using a cache. So I prefer to leave it here for now. What do you think?

Yea as I mentioned, I also saw other cached flags currently used by Hive in this file, but I feel they are more generic.

The reason I felt it's more specific than the TTL is its javadoc, which looks quite Hive-specific (it mentions hive metastore uri, ugi, conf). Maybe we can fix that instead, though I was not sure how, hence the suggestion to move it. If you feel strongly, maybe we can add the note "for Hive Catalog, the following keys are supported.."?

Hmm, I think key is generic enough for a cache, but what can be used to compose the key is implementation dependent. How about we leave the config in CatalogProperties like this:

  /**
   * A comma separated list of elements used, in addition to the {@link #URI}, to compose the
   * key of the client pool cache.
   *
   * <p>Supported key elements in a Catalog are implementation-dependent.
   */
  public static final String CLIENT_POOL_CACHE_KEYS = "client-pool-cache-keys";

Then we move the rest of the javadoc to CachedClientPool. Will that be clearer?

I think that'd be great, thanks @lirui-apache !

@szehon-ho PR updated. Please let me know if you have further comments

pvary · 2023-03-24T08:47:16Z

I like the final approach you came up with!
LGTM +1

szehon-ho · 2023-03-24T16:56:23Z

Merged, thanks @lirui-apache for persistence, and thanks @pvary @rdblue @nastra for reviews

lirui-apache · 2023-03-25T05:33:35Z

Thanks guys for the reviews!

github-actions bot added core hive labels Jan 30, 2023

szehon-ho reviewed Jan 31, 2023

rdblue reviewed Jan 31, 2023

lirui-apache mentioned this pull request Feb 9, 2023

Failed to get table info from metastore using impersonation #6750

Closed

lirui-apache force-pushed the pluggable-client-pool branch from ab2748c to 4847c4c Compare February 21, 2023 10:11

github-actions bot added the build label Feb 21, 2023

szehon-ho reviewed Mar 10, 2023

nastra requested changes Mar 10, 2023

lirui-apache force-pushed the pluggable-client-pool branch from 0884eb5 to 699cbb5 Compare March 13, 2023 10:45

szehon-ho reviewed Mar 14, 2023

lirui-apache added 7 commits March 16, 2023 11:09

Core, Hive: Support pluggable ClientPool

2c27f15

address comments

57f658c

support custom cache keys via configurations

96e2e55

address comments

e552fea

fix doc & checkstyle

f175e70

fix findbugs

b6cf4cb

remove side effect from validation check

5b764f3

lirui-apache force-pushed the pluggable-client-pool branch from 024d56c to 5b764f3 Compare March 16, 2023 03:35

szehon-ho approved these changes Mar 17, 2023

szehon-ho reviewed Mar 21, 2023

improve java docs

5f8ff1d

pvary approved these changes Mar 24, 2023

szehon-ho merged commit ef5c731 into apache:master Mar 24, 2023

lirui-apache deleted the pluggable-client-pool branch March 25, 2023 05:27

lirui-apache mentioned this pull request Mar 27, 2023

Hive: Add UGI to the key in CachedClientPool #6175

Closed

szehon-ho mentioned this pull request Apr 26, 2023

Hive: Support connecting to multiple Hive-Catalog #7441

Merged

flyrain mentioned this pull request May 19, 2023

Hive: More distinctive cached client pool key to avoid conflict #5378

Closed
