New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-27994: Optimize renaming the partitioned table #4995
Conversation
771d30c
to
9b9832d
Compare
9b9832d
to
d2eee3d
Compare
@@ -181,7 +182,7 @@ private void populateInsertUpdateMap(Map<PartitionInfo, ColumnStatistics> statsP | |||
e -> e.partitionId).collect(Collectors.toList() | |||
); | |||
|
|||
prefix.append("select \"PART_ID\", \"COLUMN_NAME\" from \"PART_COL_STATS\" WHERE "); | |||
prefix.append("select \"PART_ID\", \"COLUMN_NAME\", \"ENGINE\" from \"PART_COL_STATS\" WHERE "); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need ENGINE
field here?
Isn't the PART_ID
field enough to identify the specific partition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be possible,
https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L10420
in the above example, we use different engines to fetch the statistics with the same partitions and columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense. if so, Shouldn't we add the ENGINE
field in the WHERE
sub-statement?
Current direct SQL query:
select "PART_ID", "COLUMN_NAME" from "PART_COL_STATS" WHERE ("PART_ID" in (141))
Maybe can be changed to:
select "PART_ID", "COLUMN_NAME" from "PART_COL_STATS" WHERE ("PART_ID" in (141)) and ENGINE = hive
BTW, we also have another field CAT_NAME
in PART_COL_STATS
to differentiate column stas between multi catalog. Should we also consider it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Morning @zhangbutao!
In this method we want to get the insert or the update statistics from Map<PartitionInfo, ColumnStatistics> statsPartInfoMap
, there is no guarantee that all of the statsPartInfoMap
are for the engine hive or the same engine, so PartColNameInfo needs to feed with the engine info when compared with the stats in statsPartInfoMap
.
BTW, we also have another field CAT_NAME in PART_COL_STATS to differentiate column stas between multi catalog. Should we also consider it here?
I think we don't need to, the PART_ID
here has the same effect for clarifying the catalog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explanation!
there is no guarantee that all of the statsPartInfoMap are for the engine hive or the same engine,
Is this code snippet only for insert
statement?
If so, the one insert
statement is must from the one same ENGINE
(like hive or other compute engine), so the statsPartInfoMap
will have the same ENGINE
value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this code snippet only for insert statement?
No, as you see, this PR aims for optimizing the partitioned table rename. Even the statsPartInfoMap
has the same ENGINE
value, we must be careful not to override the stats(for other engines) for the same partition column on the database.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some code debug. Only found that insert
into statement gone through this code block(method populateInsertUpdateMap
), but alter table rename
not. Maybe i missed some configuration?
Since statsPartInfoMap
has the same ENGINE
value, why not we use this direct sql to filter the PART_ID based on the ENGINE
(e.g. hive
)?
select "PART_ID", "COLUMN_NAME" from "PART_COL_STATS" WHERE ("PART_ID" in (141)) and ENGINE = hive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhangbutao, you need to apply this change to see populateInsertUpdateMap
is invoked by the rename operation, there is a batch rename for stats.
In my idea, there is no document that all the stats in statsPartInfoMap
must be the same engine, so we cannot simply using the ENGINE = hive
for all stats, besides ENGINE = hive
in the filter is almost the same as in select \"ENGINE\"
when there has limited numbers of engines on this partition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok,i didn't apply your change to do test. No need too much consideration about this comment, i just want to explore more details about stats usage in Hive.
Thanks. 😃
@dengzhhu653 There are some minor issues found by SonarCloud related to readability and maintainability: https://sonarcloud.io/project/issues?cleanCodeAttributeCategories=INTENTIONAL&resolved=false&pullRequest=4995&id=apache_hive&open=AYz34ifC28xE78Z9jBVO Fixing these are optional since these are very minor but good to have! :) |
throw new MetaException("Invalid state of PART_COL_STATS for PART_ID " + partId); | ||
StringBuilder update = new StringBuilder("UPDATE \"PART_COL_STATS\" SET ") | ||
.append(StatObjectConverter.getUpdatedColumnSql(mPartitionColumnStatistics)) | ||
.append(" WHERE \"PART_ID\" = ? AND \"COLUMN_NAME\" = ? AND \"ENGINE\" = ?"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above. Should we also consider addding CAT_NAME
field here?
Thank you @soumyakanti3578, will fix them. |
Quality Gate passedThe SonarCloud Quality Gate passed, but some issues were introduced. 4 New issues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes looks good to me. +1
Will wait for @zhangbutao's approval.
@dengzhhu653 -- Can you also create a follow-up jira to optimize this further where we directly update the db_name and table_name in the 'tab_col_stats' and 'part_col_stats' tables instead of fetching the stats, updating db/table names and then persisting it to the DB? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
Batch commit is a good way to improve the performance of alter partitioned tables.
Let's go ahead!
Created HIVE-28011 for tracking this |
Thank you @zhangbutao @soumyakanti3578 and @saihemanth-cloudera for the comment and review! |
…ihua Deng, reviewed by Butao Zhang, Sai Hemanth Gantasala)
…ihua Deng, reviewed by Butao Zhang, Sai Hemanth Gantasala)
What changes were proposed in this pull request?
Why are the changes needed?
Does this PR introduce any user-facing change?
Is the change a dependency upgrade?
How was this patch tested?