-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-12679: Allow users to be able to specify an implementation of IMetaStoreClient via HiveConf #4444
Conversation
Kudos, SonarCloud Quality Gate passed! |
} else { | ||
return RetryingMetaStoreClient.getProxy(conf, hookLoader, metaCallTimeMap, | ||
SessionHiveMetaStoreClient.class.getName(), allowEmbedded); | ||
return createMetaStoreClientFactory(conf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we avoid new factory object creation on every new client request?
private static HiveMetaStoreClientFactory createMetaStoreClientFactory(HiveConf conf) throws | ||
MetaException { | ||
String metaStoreClientFactoryClassName = MetastoreConf.getVar(conf, | ||
MetastoreConf.ConfVars.METASTORE_CLIENT_FACTORY_CLASS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of providing a custom factory, you need to supply the custom client impl (SessionHiveMetaStoreClient
, RetryingMetaStoreClient
, etc). Please take a look at https://github.com/apache/hive/pull/4257/files#diff-6561f3987ba0c11e6a8998efcdc862d3d3340d4babbe003ae8da98b1e4020faf
cc @wecharyu
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed! The code will be clearer if we instantiate the client impl in a single factory, maybe just named HiveMetaStoreClientFactory
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deniskuzZ @wecharyu Thanks for your opinions. Let me clarify your suggestions first. This is my understanding.
IHMSHandler
: The essential interface.HMSHandler
is the primary and only implementation for nowAbstractHMSHandlerProxy
: The interface of dynamic proxy forIHMHandler
.RetryingHMSHandler
is the primary implementation and it becomes configurable thanks to HIVE-27284: Make HMSHandler proxy pluggable #4257HMSHandlerProxyFactory
: It has one static method,getProxy(Configuration, IMSHandler, boolean)
, to wrap the givenIHMSHandler
with the configuredAbstractHMSHandlerProxy
I would also say the relation between IHMSHandler
and RetryingHMSHandler
is similar to between IMetaStoreClient
and RetryingMetaStoreClient
. But the purpose is a bit different. We'd like to make pluggable not RetryingMetaStoreClient
but IMetaStoreClient
here. Also, it is not evident that a custom IMetaStoreClient
needs a dynamic proxy(e.g. I think at least the one for AWS Data Catelog may not need RetryingMetaStoreClient or another proxy because AWS SDK can provide more purpose-built helpers).
So, if we want to satisfy our requirements, we have to make both IMetaStoreClient
and the proxy(e.g. NopProxy for AWS Glue Catalog) configurable. I think it is acceptable but I wonder if we should configure two parameters, maybe metastore.client.class
and metastore.client.proxy.class
, in order to generate one IMetaStoreClient
.
As to another aspect, the current patch and hive.metastore.client.factory.class
are already and unfortunately accepted by several services... Looks like it is used in AWS Glue, Amazon EMR, and Databricks. My company is also using it. We might not need to take care of those ones too much, but they and their users might be a little confused.
I don't have strong opinions, and to be honest, I could be misunderstanding your suggestion. Please feel free to give me your thoughts. I am willing to follow that advice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we just need metastore.client.class
parameter, and make a static method like newClient(Configuration conf)
in HiveMetaStoreClientFactory
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I think there are 3 existing patterns.
hive.metastore.fastpath=true
andmetastore.client.class
is default- We use
SessionHiveMetaStoreClient
withoutRetryingMetaStoreClient
- We use
hive.metastore.fastpath=false
andmetastore.client.class
is default- We use
SessionHiveMetaStoreClient
withRetryingMetaStoreClient
- We use
hive.metastore.fastpath=false
andmetastore.client.class
is a custom one- We use the custom client without
RetryingMetaStoreClient
.RetryingMetaStoreClient
is tightly coupled with Thrift-basedSessionHiveMetaStoreClient
- We use the custom client without
I have not come up with a very smart way to resolve the combination... It could be likely to be like this but I'd say there is a better way.
String className = MetastoreConf.getVar(conf, MetastoreConf.ConfVars.METASTORE_CLIENT_FACTORY_CLASS);
if (conf.getBoolVar(ConfVars.METASTORE_FASTPATH) || className != "SessionHiveMetaStoreClient") {
return HiveMetaStoreClientFactory.create(...);
} else {
return RetryingMetaStoreClient.getProxy(conf, hookLoader, metaCallTimeMap,
SessionHiveMetaStoreClient.class.getName(), allowEmbedded);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simplify the match pattern as follows?
String className = MetastoreConf.getVar(conf, MetastoreConf.ConfVars.METASTORE_CLIENT_CLASS);
if (conf.getBoolVar(ConfVars.METASTORE_FASTPATH)) {
return HiveMetaStoreClientFactory.create(...);
} else {
return RetryingMetaStoreClient.getProxy(conf, hookLoader, metaCallTimeMap, className, allowEmbedded);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have two thoughts.
- I think we have to explicitly disable
hive.metastore.fastpath
when configuringmetastore.client.class
since RetryingMetaStoreClient is tightly coupled withSessionHiveMetaStoreClient
- It could be acceptable if we change the description of hive.metastore.fastpath
- I guess people in the world already depend on
hive.metastore.client.factory.class
since we have failed to merge the patch for 7 years... This is the biggest headache
We have a data connector for bridging over different resources for this purpose, do we consider using this feature? |
If I understand correctly, it requires us to have a Hive Metastore as a single source. The expected environment of HIVE-12679 doesn't have a true Hive Metastore but does have another data catalog whose protocol is incompatible with Hive Metastore. |
@okumin Could you please address the comment in the original pull request? The limitation with this approach is we don't get the features implemented in SessionHiveMetastoreClient (Example: handling of temp tables). This will lead to the issue of temp tables not getting cleaned up when the session is closed. Please see the comments from @thejasmn and @alanfgates on HIVE-12679. |
@ganeshashree Quickly checking SessionHiveMetastoreClient, it has the following additional features.
It sounds like a good idea to make it easy to compose the features with a custom metastore client. I am personally thinking we can work on it in another ticket and I will definitely try it. In my mind, I wonder how about having the following entities.
As proposed in the JIRA ticket, I expect we can implement
If we follow the current API design using @wecharyu and @deniskuzZ may have different opinions about how to generate an instance. So, this is just my opinion, though. |
@okumin Thank you for addressing this! I vote for users to specify an implementation of |
@ganeshashree Thanks! I think this point is still a bit controversial.
As the third owner of this patch, to be honest, I currently prefer to keep the current design using a factory class for the following reasons.
Anyway, I hope we agree with the following points.
As for if we should configure our custom clients via a factory or not, I would also say it is finally up to the community members. Sorry for my long opinions 🙇 |
I'm checking the
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
Let me implement the alternative idea and ask the community to conclude that later |
@deniskuzZ @wecharyu @dengzhhu653 @ganeshashree Sorry for the late response. I tried to implement the suggested option and summarize the current points. To be honest, I still prefer the original option from the point of view of both usability and convenience of existing users. Should we ask more users about their opinions using the mailing list? |
hi @okumin, sure, why not. Also, it would be great to include HMS folks as reviewers: @dengzhhu653 , @nrg4878 , @saihemanth-cloudera |
Thanks. I sent an e-mail to the dev ML for visibility. |
Thanks @okumin for the great work! some ideas:
This makes sense to me, the SessionHiveMetastoreClient acts as a wrapper for the IMetaStoreClient implementation.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am definitely in favor of making the client pluggable. However, I am not sure if the proposed patch here covers every instantiation of a client in the project.
What is a bit worrisome is that there are lots of calls to RetryingMetaStoreClient.getProxy
methods outside the Hive class itself. It seems that the factory will not have any effect on those. Isn't this a problem?
@dengzhhu653 Thanks for the meaningful insight!
If I understand correctly, it is a problem resolved by HIVE-27473. I agree users might want to use adaptors to add some capabilities. In addition, they can be optional in my opinion. I mean if a user wants to use the convenient adaptors, they can use it in the following way.
If the user doesn't need any or all of the additional capabilities or features, they don't have to wrap their client. Everything, e.g. what to support or how to implement, is up to the user as long as their plugin is integrated with Hive via the primitive interface, IMetaStoreClient. I expect this doesn't sacrifice convenience so much, keeping things explicit and flexible. If we don't prefer it, I guess we should resolve HIVE-27473 first(and it would not be trivial as SessionHiveMetastoreClient or HiveMetaStoreClientWithLocalCache are tightly coupled with HiveMetaStoreClient).
It seems to be a really lovely feature. I also guess it may not fit into our use case. That's because we would need to verify the security, availability, and performance of both HMS and our metadata service. To be fair, I am very confident with the capabilities of HMS. But we still always need to keep all endpoints well-tested. Anyway, I think it is a strong option and I listed it as an alternative. Thanks!
I didn't notice that point, and I am listing use cases...
Some might not be needed for a 3rd party metastore but I feel some could be valuable. Now, I guess it could be better to think of usages outside Hive.java, too. I can PoC once we can agree with the gluing API... |
That's right, there plugin client is wrapped by cache and session client, example:
We can put the user client under the SessionHiveMetaStoreClient, e.g, new SessionHiveMetaStoreClient(new CatalogMetaClient(...)) |
@dengzhhu653 Thanks. I hope we are on the same page, meaning the purpose of HIVE-12679 is to make it possible to integrate another metastore with Hive via |
Hi @okumin, sorry for that! What I mean is that no matter what the underlying client is, we should expose For the gluing API, I have no strong reason to against it, I'm thinking:
|
@dengzhhu653 OK. In your opinion, we should always wrap any IMetaStoreClient with SessionHiveMetaStoreClient. I personally thought it could be up to the owner of a custom client. But I also understand your point. PS: I'd be glad if someone could join the discussion of HIVE-27473 since I have not found a very smart way to make them integrated not by inheritance but by composition. Thanks. |
I think so, let's hear other opinions. Thanks. |
Just thinking out loud. Would it be a better model to federate these external catalogs thru HMS instead of swapping out the IMetastoreClient implementation? Changing the implementation makes it only work with that source where as if we build a Connector for Glue (see HIVE-24396), we can get HMS to pull metadata from Glue and present it to HS2 as if it were local. |
@nrg4878 Thanks. I think you are mentioning an alternative in this list and it mostly sounds great. In HIVE-12679, I assume the environment where they need to maintain a single source of truth other than HMS. |
@okumin this thread is pretty long. Could you please share a summary about the current status? Do you need any support? |
So, we can make an advance if we find a solution for HIVE-27473, if we prove HIVE-27473 is not practically impossible, or if we agree that we will go without HIVE-27473. I'm trying to tackle HIVE-27473 as it sounds useful anyway, but I have not found an elegant way. |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
Just a note. I am working on HIVE-27473 and will revive this PR once we've agreed or disagreed with the feasibility of HIVE-27473 |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
What changes were proposed in this pull request?
Make it possible to replace the default IMetaStoreClient with a custom one.
It is the third time to try to send this patch. Austin originally opened HIVE-12679 in 2016, and @moomindani tried to send his patch in #1402, and then I took over it again.
Why are the changes needed?
In some environments, we want to connect to another data catalog rather than the native Hive Metastore. Looks like, AWS Glue Data Catalog is one of the cases. We also have a similar case for some historical reasons.
As we can see, more than 40 people are watching HIVE-12679, and some of them have asked us about the progress of this ticket. I'm sure it is worth maintaining this feature.
Does this PR introduce any user-facing change?
This doesn't change the original behavior at all.
Is the change a dependency upgrade?
No
How was this patch tested?
I tested HiveServer2 still works with this patch.