Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-27984: Support backward compatibility of hms thrift struct about column stats #4984

Merged
merged 2 commits into from
Feb 1, 2024

Conversation

zhangbutao
Copy link
Contributor

@zhangbutao zhangbutao commented Jan 8, 2024

What changes were proposed in this pull request?

Why are the changes needed?

Please see details from HIVE-27984.
HIVE-22046 indroduced the must required engine field in hms thrift api, which broken the backward compatibility. This lead to hive3 hms thrift client can not call the hive4 column stats api.

This PR try to make engine field optional and then this will reduce the incompatibility issues.

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

Test with local cluster. And this change has no negative effect to Hive.

@@ -725,7 +725,7 @@ struct SetPartitionsStatsRequest {
2: optional bool needMerge, //stats need to be merged with the existing stats
3: optional i64 writeId=-1, // writeId for the current query that updates the stats
4: optional string validWriteIdList, // valid write id list for the table for which this struct is being sent
5: required string engine //engine creating the current request
5: optional string engine = "hive" //engine creating the current request
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make the engine field optional, and give it a default value hive which is consistent with the current existing code logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we set the engine to default value hive without changing it to optional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't work. required will enforce check engine field in construction method though it has a default value. For some others framework(Trino) will use this class SetPartitionsStatsRequest but don't want to care the engine field, and they just reuse hive stats.

optional has no any problem in current hive code. only need partial trival changes in HiveMetaStoreClient.java. I think it is safe. :)

Copy link
Contributor

@okumin okumin Jan 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though I am not familiar with Thrift, I expect the following behaviors.

Client protocol HMS version Behavior
Hive 3 (no engine is set) Hive 4 w/o HIVE-27984 Error
Hive 4 w/o HIVE-27984 Hive 4 w/o HIVE-27984 The engine type specified by the client is used
Hive 4 w/ HIVE-27984 Hive 4 w/o HIVE-27984 The engine type specified by the client is used. Error is a client nullified engine explicitly?
Hive 3 (no engine is set) Hive 4 w/ HIVE-27984 engine=hive
Hive 4 w/o HIVE-27984 Hive 4 w/ HIVE-27984 The engine type specified by the client is used
Hive 4 w/ HIVE-27984 Hive 4 w/ HIVE-27984 The engine type specified by the client is used. engine-hive if a client nullifies engine explicitly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are almost right. To be more specific, for example, Trino will get hms table column statistics by its thrift hms client which is compatible with hive3.
https://github.com/trinodb/trino/blob/057f2d435d6461855ff0f904dbd675b39db97d1e/plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/thrift/ThriftHiveMetastoreClient.java#L331-L336

But TableStatsRequest in Hive4 need pass engine through constructor function, and the engine field is required, therefore:

  1. for hive3 thrift client, this api call will fail due to the missed engine field if HMS server doesn't have this PR, As there is no place for hive3 thrift client to pass engine field. But this api call will sucessfully if HMS has this PR(engine field is optional and has the defalut value).
  2. for hive4 thrift client without this PR, you also must pass engine field in TableStatsRequest to call this api correctly no matter if HMS server has this PR. Because engine field is needed in TableStatsRequest constructor function.
  3. for hive4 thrift client with this PR. Whether HMS server have this PR or not
    , you can call the api sucessfully without pass engine as the engine field is not mandatory in TableStatsRequest constructor function, and you can explicitly reset the engine fileld by TableStatsRequest::setEngine if you don't want the default hive engine column statistics. But i think most framework(like trino) will want reuse hive engine column statistics.

Sounds a little verbose.. :) In one word, this PR can make hive3 thrift hms client worked well with hive4 hms when calling statistics related api. I have tested this PR with Trino, and i think it is good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for elaborating the behaviors. I'd say it is flexible and less surprising for people.

@pan3793
Copy link
Member

pan3793 commented Jan 9, 2024

@zhangbutao hi, have you observed backward compatibility issues with Hive 2.3 client and Hive 4 HMS? also cc @sunchao

@zhangbutao
Copy link
Contributor Author

@zhangbutao hi, have you observed backward compatibility issues with Hive 2.3 client and Hive 4 HMS? also cc @sunchao

@pan3793 I think it depends your use case, for this change, TrinoDB&StarRocks will use its customized thrift client and call ThriftHiveMetastore$Client.recv_get_table_statistics_re to fetch hive tables column stats, so it will occur compatibility issues. But as far as i know, Spark won't call this api to get hive tables column stats, so it is safe for Spark to integrate Hive4 using its embedded Hive2.3 client.
BTW, we have used Spark3 to query hive4 tables for a long time, and we have not found any compatibility issues.

@pan3793
Copy link
Member

pan3793 commented Jan 9, 2024

@zhangbutao great! thanks for such information

@zhangbutao
Copy link
Contributor Author

Can i request your review? @ayushtkn @dengzhhu653 @deniskuzZ
Thx.

Copy link

sonarcloud bot commented Jan 22, 2024

Quality Gate Passed Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

6 New issues
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

Copy link
Contributor

@okumin okumin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for handling this issue!

@@ -725,7 +725,7 @@ struct SetPartitionsStatsRequest {
2: optional bool needMerge, //stats need to be merged with the existing stats
3: optional i64 writeId=-1, // writeId for the current query that updates the stats
4: optional string validWriteIdList, // valid write id list for the table for which this struct is being sent
5: required string engine //engine creating the current request
5: optional string engine = "hive" //engine creating the current request
Copy link
Contributor

@okumin okumin Jan 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though I am not familiar with Thrift, I expect the following behaviors.

Client protocol HMS version Behavior
Hive 3 (no engine is set) Hive 4 w/o HIVE-27984 Error
Hive 4 w/o HIVE-27984 Hive 4 w/o HIVE-27984 The engine type specified by the client is used
Hive 4 w/ HIVE-27984 Hive 4 w/o HIVE-27984 The engine type specified by the client is used. Error is a client nullified engine explicitly?
Hive 3 (no engine is set) Hive 4 w/ HIVE-27984 engine=hive
Hive 4 w/o HIVE-27984 Hive 4 w/ HIVE-27984 The engine type specified by the client is used
Hive 4 w/ HIVE-27984 Hive 4 w/ HIVE-27984 The engine type specified by the client is used. engine-hive if a client nullifies engine explicitly?

@zhangbutao zhangbutao merged commit 6ab9b14 into apache:master Feb 1, 2024
5 checks passed
tarak271 pushed a commit to tarak271/hive-1 that referenced this pull request Feb 9, 2024
… column stats (apache#4984)(Butao Zhang, reviewed by okumin, Zhihua Deng)
dengzhhu653 pushed a commit to dengzhhu653/hive that referenced this pull request Mar 7, 2024
… column stats (apache#4984)(Butao Zhang, reviewed by okumin, Zhihua Deng)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants