HIVE-12679: Allow users to be able to specify an implementation of IMetaStoreClient via HiveConf #4444

Closed
wants to merge 4 commits into from

Conversation

okumin
Contributor

@okumin okumin commented Jun 21, 2023

What changes were proposed in this pull request?

Make it possible to replace the default IMetaStoreClient with a custom one.

This is the third attempt to land this patch. Austin originally opened HIVE-12679 in 2016, @moomindani then submitted a patch in #1402, and I have now taken it over.

Why are the changes needed?

In some environments, we want to connect to a data catalog other than the native Hive Metastore. AWS Glue Data Catalog appears to be one such case. We also have a similar case for historical reasons.

As we can see, more than 40 people are watching HIVE-12679, and some of them have asked us about the progress of this ticket. I'm sure it is worth maintaining this feature.

Does this PR introduce any user-facing change?

This doesn't change the original behavior at all.

Is the change a dependency upgrade?

No

How was this patch tested?

I verified that HiveServer2 still works with this patch.

@sonarcloud

sonarcloud bot commented Jun 22, 2023

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
13 Code Smells (rating A)

No coverage information
No duplication information

    } else {
-     return RetryingMetaStoreClient.getProxy(conf, hookLoader, metaCallTimeMap,
-         SessionHiveMetaStoreClient.class.getName(), allowEmbedded);
+     return createMetaStoreClientFactory(conf)
Member

@deniskuzZ deniskuzZ Jun 22, 2023

could we avoid new factory object creation on every new client request?
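
For illustration only, one way to avoid re-creating the factory on every client request is to cache the instance by class name, so that reflection runs once per configured class. This is a minimal sketch under assumptions: `ClientFactory`, `CountingFactory`, and `FactoryCache` are hypothetical stand-ins, not Hive APIs, and the real factory would take a configuration object.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo; none of these types exist in Hive.
class FactoryCacheDemo {
  public static void main(String[] args) {
    ClientFactory first = FactoryCache.get(CountingFactory.class.getName());
    ClientFactory second = FactoryCache.get(CountingFactory.class.getName());
    System.out.println(first == second);                 // same cached instance
    System.out.println(CountingFactory.INSTANCES.get()); // number of instantiations
  }
}

/** Stand-in for the factory interface discussed in this PR. */
interface ClientFactory {
  Object createClient();
}

/** Example factory that counts how many times it has been instantiated. */
class CountingFactory implements ClientFactory {
  static final AtomicInteger INSTANCES = new AtomicInteger();

  CountingFactory() {
    INSTANCES.incrementAndGet();
  }

  @Override
  public Object createClient() {
    return new Object();
  }
}

/** Caches one factory instance per class name. */
final class FactoryCache {
  private static final Map<String, ClientFactory> CACHE = new ConcurrentHashMap<>();

  private FactoryCache() {}

  static ClientFactory get(String className) {
    // computeIfAbsent runs the reflective instantiation at most once per key.
    return CACHE.computeIfAbsent(className, FactoryCache::instantiate);
  }

  private static ClientFactory instantiate(String className) {
    try {
      return (ClientFactory) Class.forName(className)
          .getDeclaredConstructor()
          .newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot create factory " + className, e);
    }
  }
}
```

The trade-off of such a cache is that a factory must be stateless or thread-safe, since the same instance would serve every session.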

private static HiveMetaStoreClientFactory createMetaStoreClientFactory(HiveConf conf) throws
MetaException {
String metaStoreClientFactoryClassName = MetastoreConf.getVar(conf,
MetastoreConf.ConfVars.METASTORE_CLIENT_FACTORY_CLASS);
Member

@deniskuzZ deniskuzZ Jun 22, 2023

I think instead of providing a custom factory, you need to supply the custom client impl (SessionHiveMetaStoreClient, RetryingMetaStoreClient, etc). Please take a look at https://github.com/apache/hive/pull/4257/files#diff-6561f3987ba0c11e6a8998efcdc862d3d3340d4babbe003ae8da98b1e4020faf
cc @wecharyu

Contributor

Agreed! The code will be clearer if we instantiate the client impl in a single factory, maybe just named HiveMetaStoreClientFactory.

Contributor Author

@okumin okumin Jun 22, 2023

@deniskuzZ @wecharyu Thanks for your opinions. Let me clarify your suggestions first. This is my understanding.

  • IHMSHandler: The essential interface. HMSHandler is the primary and only implementation for now
  • AbstractHMSHandlerProxy: The interface of the dynamic proxy for IHMSHandler. RetryingHMSHandler is the primary implementation, and it became configurable thanks to HIVE-27284: Make HMSHandler proxy pluggable #4257
  • HMSHandlerProxyFactory: It has one static method, getProxy(Configuration, IHMSHandler, boolean), to wrap the given IHMSHandler with the configured AbstractHMSHandlerProxy

I would also say the relationship between IHMSHandler and RetryingHMSHandler is similar to the one between IMetaStoreClient and RetryingMetaStoreClient. But the purpose is a bit different. Here we'd like to make IMetaStoreClient pluggable, not RetryingMetaStoreClient. Also, it is not evident that a custom IMetaStoreClient needs a dynamic proxy (e.g. I think at least the one for AWS Glue Data Catalog may not need RetryingMetaStoreClient or another proxy because the AWS SDK can provide more purpose-built helpers).
So, if we want to satisfy our requirements, we have to make both IMetaStoreClient and the proxy (e.g. a NopProxy for AWS Glue Data Catalog) configurable. I think that is acceptable, but I wonder whether we should configure two parameters, maybe metastore.client.class and metastore.client.proxy.class, in order to generate one IMetaStoreClient.
On another note, the current patch and hive.metastore.client.factory.class have already, and unfortunately, been adopted by several services... It looks like it is used in AWS Glue, Amazon EMR, and Databricks. My company is also using it. We might not need to care about those too much, but they and their users might be a little confused.
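
To make the "client plus optional proxy" idea concrete, here is a minimal sketch using `java.lang.reflect.Proxy`. It is an assumption-laden illustration: `CatalogClient`, `FlakyClient`, and `ClientProxies` are hypothetical names, the retry policy is deliberately naive (retry on any failure, no backoff), and it does not reproduce what RetryingMetaStoreClient actually does. The "no-op proxy" case is modelled by simply returning the client unwrapped.

```java
import java.io.IOException;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

// Hypothetical demo; none of these types exist in Hive.
class ProxyChoiceDemo {
  public static void main(String[] args) throws Exception {
    // A client that fails twice before succeeding: wrap it with retries.
    CatalogClient retrying = ClientProxies.wrap(new FlakyClient(2), true, 3);
    System.out.println(retrying.getTable("t1"));

    // A client whose SDK already retries internally: use it unwrapped.
    CatalogClient plain = ClientProxies.wrap(new FlakyClient(0), false, 3);
    System.out.println(plain.getTable("t2"));
  }
}

/** Stand-in for the metastore client interface. */
interface CatalogClient {
  String getTable(String name) throws Exception;
}

/** Test double that throws a configurable number of transient failures. */
class FlakyClient implements CatalogClient {
  private int failuresLeft;

  FlakyClient(int failures) {
    this.failuresLeft = failures;
  }

  @Override
  public String getTable(String name) throws Exception {
    if (failuresLeft > 0) {
      failuresLeft--;
      throw new IOException("transient failure");
    }
    return "table:" + name;
  }
}

final class ClientProxies {
  private ClientProxies() {}

  /** Returns the client as-is, or a dynamic proxy that retries failed calls. */
  static CatalogClient wrap(CatalogClient base, boolean retry, int maxAttempts) {
    if (!retry) {
      return base; // the "no-op proxy" case
    }
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    InvocationHandler handler = (proxy, method, methodArgs) -> {
      Throwable last = null;
      for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
          return method.invoke(base, methodArgs);
        } catch (InvocationTargetException e) {
          last = e.getCause(); // unwrap the exception thrown by the client
        }
      }
      throw last;
    };
    return (CatalogClient) Proxy.newProxyInstance(
        CatalogClient.class.getClassLoader(),
        new Class<?>[] {CatalogClient.class},
        handler);
  }
}
```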

I don't have strong opinions, and to be honest, I could be misunderstanding your suggestion. Please feel free to give me your thoughts. I am willing to follow that advice!

Contributor

I think we just need metastore.client.class parameter, and make a static method like newClient(Configuration conf) in HiveMetaStoreClientFactory.
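
A rough sketch of what such a static method could look like is shown below. It is hypothetical throughout: `java.util.Properties` stands in for Hadoop's `Configuration`, `MetaClient` stands in for `IMetaStoreClient`, and the single-argument constructor convention is an assumption. Only the parameter name metastore.client.class comes from this thread.

```java
import java.util.Properties;

// Hypothetical demo; Properties stands in for Hadoop's Configuration.
class NewClientDemo {
  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println(SimpleClientFactory.newClient(conf).name());

    conf.setProperty(SimpleClientFactory.CLIENT_CLASS_KEY, CustomClient.class.getName());
    System.out.println(SimpleClientFactory.newClient(conf).name());
  }
}

/** Stand-in for IMetaStoreClient. */
interface MetaClient {
  String name();
}

class DefaultClient implements MetaClient {
  DefaultClient(Properties conf) {}

  @Override
  public String name() {
    return "default";
  }
}

class CustomClient implements MetaClient {
  CustomClient(Properties conf) {}

  @Override
  public String name() {
    return "custom";
  }
}

final class SimpleClientFactory {
  // Parameter name proposed in this thread; not an existing Hive setting.
  static final String CLIENT_CLASS_KEY = "metastore.client.class";

  private SimpleClientFactory() {}

  /** Instantiates the configured client class, falling back to the default. */
  static MetaClient newClient(Properties conf) {
    String className = conf.getProperty(CLIENT_CLASS_KEY, DefaultClient.class.getName());
    try {
      Class<? extends MetaClient> clazz =
          Class.forName(className).asSubclass(MetaClient.class);
      // Assumed convention: every client exposes a constructor taking the conf.
      return clazz.getDeclaredConstructor(Properties.class).newInstance(conf);
    } catch (ReflectiveOperationException | ClassCastException e) {
      throw new IllegalArgumentException(
          "Invalid " + CLIENT_CLASS_KEY + ": " + className, e);
    }
  }
}
```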

Contributor Author

Thanks. I think there are 3 existing patterns.

  • hive.metastore.fastpath=true and metastore.client.class is default
    • We use SessionHiveMetaStoreClient without RetryingMetaStoreClient
  • hive.metastore.fastpath=false and metastore.client.class is default
    • We use SessionHiveMetaStoreClient with RetryingMetaStoreClient
  • hive.metastore.fastpath=false and metastore.client.class is a custom one
    • We use the custom client without RetryingMetaStoreClient. RetryingMetaStoreClient is tightly coupled with Thrift-based SessionHiveMetaStoreClient

I have not come up with a very smart way to resolve the combinations... It would probably look like the following, but I suspect there is a better way.

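String className = MetastoreConf.getVar(conf, MetastoreConf.ConfVars.METASTORE_CLIENT_FACTORY_CLASS);
// Compare strings with equals(); != would compare object references in Java
boolean isDefaultClient = SessionHiveMetaStoreClient.class.getName().equals(className);
if (conf.getBoolVar(ConfVars.METASTORE_FASTPATH) || !isDefaultClient) {
  // Fastpath or a custom client: skip the retrying proxy
  return HiveMetaStoreClientFactory.create(...);
} else {
  // Default client without fastpath: wrap it with RetryingMetaStoreClient
  return RetryingMetaStoreClient.getProxy(conf, hookLoader, metaCallTimeMap,
      SessionHiveMetaStoreClient.class.getName(), allowEmbedded);
}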

Contributor

Can we simplify the match pattern as follows?

String className = MetastoreConf.getVar(conf, MetastoreConf.ConfVars.METASTORE_CLIENT_CLASS);
if (conf.getBoolVar(ConfVars.METASTORE_FASTPATH)) {
  return HiveMetaStoreClientFactory.create(...);
} else {
  return RetryingMetaStoreClient.getProxy(conf, hookLoader, metaCallTimeMap, className, allowEmbedded);
}

Contributor Author

I have two thoughts.

  • I think we have to explicitly disable hive.metastore.fastpath when configuring metastore.client.class since RetryingMetaStoreClient is tightly coupled with SessionHiveMetaStoreClient
  • I guess people in the world already depend on hive.metastore.client.factory.class since we have failed to merge the patch for 7 years... This is the biggest headache

@dengzhhu653
Member

We have a data connector for bridging over different resources for this purpose, do we consider using this feature?
https://issues.apache.org/jira/browse/HIVE-24396

@okumin
Contributor Author

okumin commented Jun 22, 2023

We have a data connector for bridging over different resources for this purpose, do we consider using this feature? https://issues.apache.org/jira/browse/HIVE-24396

If I understand correctly, it requires us to have a Hive Metastore as a single source. The expected environment of HIVE-12679 doesn't have a true Hive Metastore but does have another data catalog whose protocol is incompatible with Hive Metastore.

@ganeshashree
Contributor

@okumin Could you please address the comment in the original pull request? The limitation with this approach is we don't get the features implemented in SessionHiveMetastoreClient (Example: handling of temp tables). This will lead to the issue of temp tables not getting cleaned up when the session is closed. Please see the comments from @thejasmn and @alanfgates on HIVE-12679.

@okumin
Contributor Author

okumin commented Jun 28, 2023

@okumin Could you please address the comment in the original pull request? The limitation with this approach is we don't get the features implemented in SessionHiveMetastoreClient (Example: handling of temp tables). This will lead to the issue of temp tables not getting cleaned up when the session is closed. Please see the comments from @thejasmn and @alanfgates on HIVE-12679.

@ganeshashree From a quick look at SessionHiveMetaStoreClient, it has the following additional features.

  • Temporary table management in most methods
  • Query cache management
  • TX id management

It sounds like a good idea to make it easy to compose these features with a custom metastore client. Personally, I think we can work on it in a separate ticket, and I will definitely try it. What I have in mind is the following set of entities.

  • IMetaStoreClient
    • It is the interface of any metastore client.
  • HiveMetaStoreClient
    • It is the implementation of IMetaStoreClient that talks to Hive Metastore
  • SessionHiveMetaStoreClient
    • It is a proxy that adds the above three features to any IMetaStoreClient
  • RetryingMetaStoreClient
    • It is a dynamic proxy that adds retry capability to an IMetaStoreClient relying on HiveMetaStoreClient. Practically, it is dedicated to SessionHiveMetaStoreClient on top of HiveMetaStoreClient

As proposed in the JIRA ticket, I expect we can implement SessionHiveMetaStoreClient by composition rather than inheritance like below.

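// Sketch only: assumes create_table(CreateTableRequest) is reachable on the wrapped client
class SessionHiveMetaStoreClient implements IMetaStoreClient {
  // The wrapped client; it can be any IMetaStoreClient implementation
  private IMetaStoreClient underlying;

  @Override
  public void create_table(CreateTableRequest request) throws
      InvalidObjectException, MetaException, NoSuchObjectException, TException {
    org.apache.hadoop.hive.metastore.api.Table tbl = request.getTable();
    if (tbl.isTemporary()) {
      // Temporary tables are handled in the session and never reach the underlying client
      createTempTable(tbl);
      return;
    }
    // Everything else is delegated
    underlying.create_table(request);
  }
}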

If we follow the current API design using HiveMetaStoreClientFactory, the responsibility of the factory will be just "create an IMetaStoreClient". Users may wrap their custom clients with SessionHiveMetaStoreClient if they need to support temp tables, or with RetryingMetaStoreClient if the custom clients depend on Thrift. It is up to the users. I guess it would give us the best flexibility (I guess none of the past authors have needed the features provided by SessionHiveMetaStoreClient, and someone may want to customize such session-related features for the requirements of their platform).
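
The composition idea can be tried out in isolation with stand-in types. The following is a minimal, runnable sketch, not Hive code: `TableClient`, `InMemoryCatalog`, and `TempTableClient` are hypothetical, and temp-table handling is reduced to a name set. It shows a wrapper that intercepts temporary tables and delegates everything else to whichever client it is given.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo; none of these types exist in Hive.
class TempTableDecoratorDemo {
  public static void main(String[] args) {
    InMemoryCatalog catalog = new InMemoryCatalog();
    TableClient client = new TempTableClient(catalog);
    client.createTable("orders", false);
    client.createTable("scratch", true);
    System.out.println(catalog.created);               // tables that reached the catalog
    System.out.println(client.tableExists("scratch")); // temp table visible in the session
  }
}

/** Stand-in for the metastore client interface. */
interface TableClient {
  void createTable(String name, boolean temporary);

  boolean tableExists(String name);
}

/** Stand-in for a custom client backed by some external catalog. */
class InMemoryCatalog implements TableClient {
  final List<String> created = new ArrayList<>();

  @Override
  public void createTable(String name, boolean temporary) {
    created.add(name);
  }

  @Override
  public boolean tableExists(String name) {
    return created.contains(name);
  }
}

/** Adds session-scoped temp tables to any TableClient by delegation. */
class TempTableClient implements TableClient {
  private final TableClient underlying;
  private final Set<String> tempTables = new HashSet<>();

  TempTableClient(TableClient underlying) {
    this.underlying = underlying;
  }

  @Override
  public void createTable(String name, boolean temporary) {
    if (temporary) {
      tempTables.add(name); // kept in the session, never sent to the catalog
      return;
    }
    underlying.createTable(name, temporary);
  }

  @Override
  public boolean tableExists(String name) {
    return tempTables.contains(name) || underlying.tableExists(name);
  }

  /** Drops all session-scoped tables, e.g. when the session closes. */
  void close() {
    tempTables.clear();
  }
}
```

With this shape, a factory only has to return a `TableClient`; whether it is wrapped is decided by whoever builds it.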

@wecharyu and @deniskuzZ may have different opinions about how to generate an instance. So, this is just my opinion, though.

@ganeshashree
Contributor

@okumin Thank you for addressing this! I vote for users to specify an implementation of IMetaStoreClient via HiveConf and use that in SessionHiveMetastoreClient instead of specifying a factory class. This way, it will be easy for users to just implement their version of IMetaStoreClient and also get all the features implemented in SessionHiveMetastoreClient along with RetryingMetaStoreClient. I am also thinking of using the specified implementation of IMetaStoreClient for delegating the calls in SessionHiveMetastoreClient (same as you mentioned). I would let senior community members review this approach and give their feedback.

@okumin
Contributor Author

okumin commented Jun 29, 2023

@ganeshashree Thanks! I think this point is still a bit controversial.

I vote for users to specify an implementation of IMetaStoreClient via HiveConf and use that in SessionHiveMetastoreClient instead of specifying a factory class

As the third owner of this patch, to be honest, I currently prefer to keep the current design using a factory class for the following reasons.

  • Is it useful to always add the traits of RetryingMetaStoreClient, SessionHiveMetastoreClient, and HiveMetaStoreClientWithLocalCache?
    • RetryingMetaStoreClient is meaningless or even harmful to a custom client since it is coupled with Thrift or Kerberos. I am not 100% sure the features of SessionHiveMetastoreClient will never conflict with any custom client, now or in the future. HiveMetaStoreClientWithLocalCache seems to be tightly coupled with HiveMetaStoreClient. The design using a factory class enables us to handle any situation since users can decorate their custom clients as they like
  • Who are the target users we want to help?
    • Existing users have depended on the current design for years. If we adopt a new design, they have to keep maintaining the unmerged patch. Otherwise, they have to ask their customers to take action when upgrading to Hive 4. Of course, we could instead choose what is best for new users rather than existing ones. I believe existing users have ported the unapproved patch at their own risk

Anyway, I hope we agree with the following points.

  • Some of us are really willing to have a feature to replace IMetaStoreClient as we see many watchers on HIVE-12679
  • It seems to be useful if we can reuse traits of SessionHiveMetastoreClient. I started working on it on HIVE-27473

As for whether we should configure our custom clients via a factory or not, I would also say it is ultimately up to the community members. Sorry for my long opinions 🙇

@okumin
Contributor Author

okumin commented Jul 1, 2023

I'm checking the HiveMetaStoreClient family, and I found that this is not trivial. HiveMetaStoreClient, HiveMetaStoreClientWithLocalCache, and SessionHiveMetaStoreClient are tightly coupled by dynamic dispatch through APIs defined not in IMetaStoreClient but in HiveMetaStoreClient. It means they depend on each other in a bi-directional manner, and we cannot simply replace the dependencies with composition.
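
The kind of coupling described here can be reproduced with a toy model. The sketch below is a simplified illustration with hypothetical types (`Client`, `BaseClient`, `SessionClient`, `SessionWrapper`), not the actual Hive classes: the base class routes its public API through a protected hook, a subclass changes behavior by overriding that hook, and a wrapper that only sees the interface has no way to intercept the same internal call.

```java
// Hypothetical demo; a simplified model, not the actual Hive class hierarchy.
class DispatchCouplingDemo {
  public static void main(String[] args) {
    // Inheritance: the subclass intercepts the hook invoked inside the base class.
    System.out.println(new SessionClient().createTable("t"));

    // Composition: the wrapper cannot reach the hook, so base behavior leaks through.
    System.out.println(new SessionWrapper(new BaseClient()).createTable("t"));
  }
}

/** Stand-in for the public client interface. */
interface Client {
  String createTable(String name);
}

class BaseClient implements Client {
  @Override
  public String createTable(String name) {
    // The public API funnels into a protected hook that is not on the interface.
    return doCreate(name);
  }

  protected String doCreate(String name) {
    return "base:" + name;
  }
}

class SessionClient extends BaseClient {
  @Override
  protected String doCreate(String name) {
    // Overriding the hook changes what the inherited public method does.
    return "session:" + name;
  }
}

class SessionWrapper implements Client {
  private final Client delegate;

  SessionWrapper(Client delegate) {
    this.delegate = delegate;
  }

  @Override
  public String createTable(String name) {
    // Only interface methods are visible here; doCreate is out of reach.
    return delegate.createTable(name);
  }
}
```

Moving from the first form to the second therefore requires either promoting such hooks to the interface or re-implementing the interception at the public-method level.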

@github-actions

github-actions bot commented Sep 1, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

@github-actions github-actions bot added the stale label Sep 1, 2023
@okumin
Contributor Author

okumin commented Sep 4, 2023

Let me implement the alternative idea and ask the community to conclude that later

@github-actions github-actions bot removed the stale label Sep 5, 2023
@okumin
Contributor Author

okumin commented Oct 5, 2023

@deniskuzZ @wecharyu @dengzhhu653 @ganeshashree Sorry for the late response.

I tried to implement the suggested option and summarize the current points. To be honest, I still prefer the original option from the point of view of both usability and convenience of existing users.
https://gist.github.com/okumin/30b058b14db1b099ba37ba7dc257fe8e

Should we ask more users about their opinions using the mailing list?

@deniskuzZ
Member

@deniskuzZ @wecharyu @dengzhhu653 @ganeshashree Sorry for the late response.

I tried to implement the suggested option and summarize the current points. To be honest, I still prefer the original option from the point of view of both usability and convenience of existing users. https://gist.github.com/okumin/30b058b14db1b099ba37ba7dc257fe8e

Should we ask more users about their opinions using the mailing list?

hi @okumin, sure, why not. Also, it would be great to include HMS folks as reviewers: @dengzhhu653 , @nrg4878 , @saihemanth-cloudera

@okumin
Contributor Author

okumin commented Oct 13, 2023

Thanks. I sent an e-mail to the dev ML for visibility.
https://lists.apache.org/thread/607swrj6pq4g6q052tyo4l304vb091m2

@dengzhhu653
Member

dengzhhu653 commented Oct 18, 2023

Thanks @okumin for the great work! some ideas:

  1. Can we just expose SessionHiveMetaStoreClient in the Hive class?
    https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L5839
    I think the HiveMetaStoreClientWithLocalCache can also benefit other data catalogs and it won't hurt the temp table by design.

I vote for users to specify an implementation of IMetaStoreClient via HiveConf and use that in SessionHiveMetastoreClient instead of specifying a factory class.

This makes sense to me, the SessionHiveMetastoreClient acts as a wrapper for the IMetaStoreClient implementation.

  2. IDataConnectorProvider in HMS can plug in different data sources nowadays, for example:
    https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/dataconnector/jdbc/PostgreSQLConnectorProvider.java. This possibly makes HMS a central metadata repository across the organization; we could add a Glue connector so that HMS talks to the Glue Data Catalog.

Contributor

@zabetak zabetak left a comment

I am definitely in favor of making the client pluggable. However, I am not sure if the proposed patch here covers every instantiation of a client in the project.

What is a bit worrisome is that there are lots of calls to RetryingMetaStoreClient.getProxy methods outside the Hive class itself. It seems that the factory will not have any effect on those. Isn't this a problem?

@okumin
Contributor Author

okumin commented Oct 19, 2023

@dengzhhu653 Thanks for the meaningful insight!

  1. Can we just expose SessionHiveMetaStoreClient in the Hive class?
  2. This makes sense to me, the SessionHiveMetastoreClient acts as a wrapper for the IMetaStoreClient implementation

If I understand correctly, that is the problem HIVE-27473 aims to resolve. I agree users might want to use adaptors to add some capabilities. In addition, they can be optional in my opinion. I mean, if a user wants to use the convenient adaptors, they can use them in the following way.

IMetaStoreClient customMetaStoreClient = ...;
IMetaStoreClient metaStoreClientWithCache = new MetaStoreClientWithCache(customMetaStoreClient);
IMetaStoreClient metaStoreClientWithTmpTable = new MetaStoreClientWithTmpTable(metaStoreClientWithCache);

If the user doesn't need any or all of the additional capabilities or features, they don't have to wrap their client. Everything, e.g. what to support or how to implement it, is up to the user as long as their plugin is integrated with Hive via the primitive interface, IMetaStoreClient. I expect this doesn't sacrifice much convenience while keeping things explicit and flexible. If we don't prefer it, I guess we should resolve HIVE-27473 first (and it would not be trivial, as SessionHiveMetastoreClient and HiveMetaStoreClientWithLocalCache are tightly coupled with HiveMetaStoreClient).

  1. IDataConnectorProvider in HMS can plugin different datasources nowadays

It seems to be a really lovely feature. I also guess it may not fit into our use case. That's because we would need to verify the security, availability, and performance of both HMS and our metadata service. To be fair, I am very confident with the capabilities of HMS. But we still always need to keep all endpoints well-tested. Anyway, I think it is a strong option and I listed it as an alternative. Thanks!

@zabetak

What is a bit worrisome is that there are lots of calls to RetryingMetaStoreClient.getProxy methods outside the Hive class itself. It seems that the factory will not have any effect on those. Isn't this a problem?

I didn't notice that point, and I am listing use cases...

  • RetryingMetaStoreClient.getProxy
    • UpgradeTool
    • HiveClientCache of HCatalog
  • Constructor of HiveMetaStoreClient, including inheritance
    • HiveClientCache
    • MsckOperation & Msck
    • HiveStrictManagedMigration
    • PartitionManagementTask
    • SmokeTest
  • Constructor of SessionHiveMetaStoreClient
    • Hive.java

Some might not be needed for a 3rd party metastore, but I feel some could be valuable. Now, I guess it would be better to consider usages outside Hive.java, too. I can build a PoC once we agree on the gluing API...

@dengzhhu653
Member

@okumin,

their plugin is integrated with Hive via the primitive interface, IMetaStoreClient

That's right, their plugin client is wrapped by the cache and session clients, for example:

IMetaStoreClient client = new SessionHiveMetaStoreClient(new RetryingMetaStoreClient(new HiveMetaStoreClient(....)));

We can put the user client under the SessionHiveMetaStoreClient, e.g., new SessionHiveMetaStoreClient(new CatalogMetaClient(...))
In SessionHiveMetaStoreClient we need to change the call super.create_table(request) to delegate.create_table(request), which I think is worthwhile and wouldn't introduce too much complexity.

@okumin
Contributor Author

okumin commented Oct 20, 2023

@dengzhhu653 Thanks. I hope we are on the same page, meaning the purpose of HIVE-12679 is to make it possible to integrate another metastore with Hive via IMetaStoreClient. And we can decouple how to make it easy to implement IMetaStoreClient. Please let us know if this point is wrong.
I'd like to move to HIVE-27473 if we want to discuss the second problem and we agree that HIVE-12679 doesn't depend on HIVE-27473. That's because we might be confused when we talk about two complex problems in a single place.

@dengzhhu653
Member

dengzhhu653 commented Oct 21, 2023

Hi @okumin, sorry for that! What I mean is that no matter what the underlying client is, we should expose SessionHiveMetaStoreClient to the HiveServer2, as this would not break the temporary table or the meta cache in my opinion.

For the gluing API, I have no strong reason to oppose it. I'm wondering:

  1. How do we load it at runtime and separate it from other pluggable clients?
  2. Where can we put these pluggable clients?

@okumin
Contributor Author

okumin commented Oct 21, 2023

@dengzhhu653 OK. In your opinion, we should always wrap any IMetaStoreClient with SessionHiveMetaStoreClient. I personally thought it could be up to the owner of a custom client. But I also understand your point.
In that case, HIVE-27473 will be a blocker of this PR. Should we work on it first and then revisit here?

PS: I'd be glad if someone could join the discussion of HIVE-27473 since I have not found a very smart way to make them integrated not by inheritance but by composition. Thanks.

@dengzhhu653
Member

@dengzhhu653 OK. In your opinion, we should always wrap any IMetaStoreClient with SessionHiveMetaStoreClient. I personally thought it could be up to the owner of a custom client. But I also understand your point. In that case, HIVE-27473 will be a blocker of this PR. Should we work on it first and then revisit here?

I think so, let's hear other opinions. Thanks.

@nrg4878
Contributor

nrg4878 commented Oct 24, 2023

Just thinking out loud. Would it be a better model to federate these external catalogs through HMS instead of swapping out the IMetaStoreClient implementation? Changing the implementation makes it work only with that source, whereas if we build a connector for Glue (see HIVE-24396), we can get HMS to pull metadata from Glue and present it to HS2 as if it were local.

@okumin
Contributor Author

okumin commented Oct 25, 2023

@nrg4878 Thanks. I think you are mentioning an alternative in this list and it mostly sounds great. In HIVE-12679, I assume an environment where users need to maintain a single source of truth other than HMS.

@aturoczy

aturoczy commented Nov 8, 2023

@okumin this thread is pretty long. Could you please share a summary about the current status? Do you need any support?

@okumin
Contributor Author

okumin commented Nov 13, 2023

@aturoczy

  • The base discussion is here
  • @dengzhhu653 suggests we always wrap an IMetaStoreClient with the composable version of SessionHiveMetaStoreClient. If we second the idea, HIVE-27473 is a blocker of HIVE-12679 (or we can merge both into one ticket if we agree it is a hard requirement)
  • @zabetak wonders if IMetaStoreClient would be used in other places than Hive.java. It is true and I think we can resolve that point just by brute-forcing all references

So, we can make progress if we find a solution for HIVE-27473, if we prove HIVE-27473 is not practically feasible, or if we agree to go without HIVE-27473. I'm trying to tackle HIVE-27473 as it sounds useful anyway, but I have not found an elegant way yet.


@github-actions github-actions bot added the stale label Jan 13, 2024
@github-actions github-actions bot closed this Jan 21, 2024
@okumin
Contributor Author

okumin commented Jan 22, 2024

Just a note. I am working on HIVE-27473 and will revive this PR once we've agreed or disagreed with the feasibility of HIVE-27473


github-actions bot commented Oct 9, 2024


@github-actions github-actions bot added the stale label Oct 9, 2024
@github-actions github-actions bot closed this Oct 16, 2024