HIVE-27994: Optimize renaming the partitioned table #4995

dengzhhu653 · 2024-01-10T15:21:27Z

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

Is the change a dependency upgrade?

How was this patch tested?

zhangbutao · 2024-01-11T10:07:44Z

...ore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/DirectSqlUpdatePart.java

@@ -181,7 +182,7 @@ private void populateInsertUpdateMap(Map<PartitionInfo, ColumnStatistics> statsP
            e -> e.partitionId).collect(Collectors.toList()
    );

-    prefix.append("select \"PART_ID\", \"COLUMN_NAME\" from \"PART_COL_STATS\" WHERE ");
+    prefix.append("select \"PART_ID\", \"COLUMN_NAME\", \"ENGINE\" from \"PART_COL_STATS\" WHERE ");


Why do we need the ENGINE field here?
Isn't the PART_ID field enough to identify the specific partition?

Not necessarily. See
https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L10420
In the above example, we use different engines to fetch the statistics for the same partitions and columns.
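
To illustrate the point above, here is a minimal sketch. It assumes a hypothetical `StatsKey` class (not Hive's `PartColNameInfo`) and uses `hive` and `impala` purely as example engine values. When two engines store statistics for the same partition and column, a key built from PART_ID and COLUMN_NAME alone cannot tell the two rows apart.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical key for one PART_COL_STATS row; illustrative only, not Hive's PartColNameInfo.
final class StatsKey {
    final long partId;
    final String columnName;
    final String engine; // null means "engine ignored", used here to show the collision

    StatsKey(long partId, String columnName, String engine) {
        this.partId = partId;
        this.columnName = columnName;
        this.engine = engine;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof StatsKey)) return false;
        StatsKey other = (StatsKey) o;
        return partId == other.partId
            && columnName.equals(other.columnName)
            && Objects.equals(engine, other.engine);
    }

    @Override
    public int hashCode() {
        return Objects.hash(partId, columnName, engine);
    }
}

final class EngineKeySketch {
    // Number of distinct keys for two rows that differ only in ENGINE.
    static int distinctKeys(boolean includeEngine) {
        Set<StatsKey> keys = new HashSet<>();
        keys.add(new StatsKey(141L, "id", includeEngine ? "hive" : null));
        keys.add(new StatsKey(141L, "id", includeEngine ? "impala" : null));
        return keys.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctKeys(false)); // engine ignored: the two rows share one key
        System.out.println(distinctKeys(true));  // engine included: the two rows stay distinct
    }
}
```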

Makes sense. If so, shouldn't we add the ENGINE field to the WHERE clause?
Current direct SQL query:
select "PART_ID", "COLUMN_NAME" from "PART_COL_STATS" WHERE ("PART_ID" in (141))

Maybe it can be changed to:
select "PART_ID", "COLUMN_NAME" from "PART_COL_STATS" WHERE ("PART_ID" in (141)) and "ENGINE" = 'hive'

BTW, we also have another field, CAT_NAME, in PART_COL_STATS to differentiate column stats between multiple catalogs. Should we also consider it here?

Morning @zhangbutao!
In this method we want to split the statistics in Map<PartitionInfo, ColumnStatistics> statsPartInfoMap into inserts and updates. However, there is no guarantee that all entries in statsPartInfoMap are for the hive engine, or even for the same engine, so PartColNameInfo needs to carry the engine info when it is compared with the stats in statsPartInfoMap.

BTW, we also have another field, CAT_NAME, in PART_COL_STATS to differentiate column stats between multiple catalogs. Should we also consider it here?

I don't think we need to; the PART_ID here already identifies the catalog.
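
The insert-or-update split described in this reply can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: keys are plain `List<String>` triples of (PART_ID, COLUMN_NAME, ENGINE), `existing` stands in for the rows returned by the direct SQL select, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class InsertUpdateSplitSketch {
    // Builds a (PART_ID, COLUMN_NAME, ENGINE) key; a hypothetical helper.
    static List<String> key(long partId, String column, String engine) {
        return Arrays.asList(Long.toString(partId), column, engine);
    }

    // Returns the incoming keys that need an INSERT; every other incoming key is an UPDATE.
    static List<List<String>> toInsert(Set<List<String>> existing, List<List<String>> incoming) {
        List<List<String>> inserts = new ArrayList<>();
        for (List<String> k : incoming) {
            if (!existing.contains(k)) {
                inserts.add(k);
            }
        }
        return inserts;
    }

    public static void main(String[] args) {
        // The database already holds stats for (141, id) written by the "hive" engine only.
        Set<List<String>> existing = new HashSet<>();
        existing.add(key(141L, "id", "hive"));

        // Incoming stats mix two engines for the same partition and column.
        List<List<String>> incoming =
            Arrays.asList(key(141L, "id", "hive"), key(141L, "id", "impala"));

        // Only the key for the second engine is reported as an insert.
        System.out.println(toInsert(existing, incoming));
    }
}
```

Without the engine in the key, the second entry would be treated as an update and would overwrite the row written by the first engine.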

Thanks for the explanation!

there is no guarantee that all entries in statsPartInfoMap are for the hive engine, or even for the same engine,

Is this code snippet only for insert statements?

If so, a single insert statement must come from one engine (hive or another compute engine), so statsPartInfoMap will have the same ENGINE value throughout.

Is this code snippet only for insert statements?

No. As you can see, this PR aims to optimize renaming partitioned tables. Even if statsPartInfoMap has the same ENGINE value throughout, we must be careful not to overwrite the stats stored for other engines for the same partition column in the database.

I did some debugging and only found the insert into statement going through this code block (method populateInsertUpdateMap); alter table rename did not. Maybe I missed some configuration?

Since statsPartInfoMap has the same ENGINE value, why don't we use this direct SQL to filter the PART_ID rows by ENGINE (e.g. hive)?
select "PART_ID", "COLUMN_NAME" from "PART_COL_STATS" WHERE ("PART_ID" in (141)) and "ENGINE" = 'hive'

@zhangbutao, you need to apply this change to see populateInsertUpdateMap invoked by the rename operation; there is a batch rename for stats.
As I see it, nothing documents that all the stats in statsPartInfoMap must be for the same engine, so we cannot simply use "ENGINE" = 'hive' for all stats. Besides, putting "ENGINE" = 'hive' in the filter is almost the same as selecting \"ENGINE\" when there are only a limited number of engines on this partition.

OK, I didn't apply your change when testing. No need to give this comment too much consideration; I just wanted to explore more details about stats usage in Hive.
Thanks. 😃

soumyakanti3578 · 2024-01-11T21:20:01Z

@dengzhhu653 There are some minor issues found by SonarCloud related to readability and maintainability: https://sonarcloud.io/project/issues?cleanCodeAttributeCategories=INTENTIONAL&resolved=false&pullRequest=4995&id=apache_hive&open=AYz34ifC28xE78Z9jBVO

Fixing these is optional since they are very minor, but good to have! :)

zhangbutao · 2024-01-12T01:26:04Z

...ore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/DirectSqlUpdatePart.java

-          throw new MetaException("Invalid state of  PART_COL_STATS for PART_ID " + partId);
+      StringBuilder update = new StringBuilder("UPDATE \"PART_COL_STATS\" SET ")
+          .append(StatObjectConverter.getUpdatedColumnSql(mPartitionColumnStatistics))
+          .append(" WHERE \"PART_ID\" = ? AND \"COLUMN_NAME\" = ? AND \"ENGINE\" = ?");


Same as above. Should we also consider adding the CAT_NAME field here?
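
For context, the UPDATE statement in the hunk above lends itself to JDBC batching. The following is a minimal sketch under stated assumptions, not the PR's implementation: the SET clause is passed in as a string (in the PR it comes from `StatObjectConverter.getUpdatedColumnSql`, whose output is not shown here), `"NUM_NULLS" = ?` is only an example, and `UpdateBatchSketch`, `buildUpdateSql`, and `updateAll` are hypothetical names.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class UpdateBatchSketch {
    // Appends the three-column WHERE clause from the diff to a caller-supplied SET clause.
    static String buildUpdateSql(String setClause) {
        return "UPDATE \"PART_COL_STATS\" SET " + setClause
            + " WHERE \"PART_ID\" = ? AND \"COLUMN_NAME\" = ? AND \"ENGINE\" = ?";
    }

    // Each row holds the SET-clause bind values followed by PART_ID, COLUMN_NAME, ENGINE.
    static int[] updateAll(Connection conn, String setClause, List<Object[]> rows)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(buildUpdateSql(setClause))) {
            for (Object[] row : rows) {
                for (int i = 0; i < row.length; i++) {
                    ps.setObject(i + 1, row[i]); // JDBC parameters are 1-indexed
                }
                ps.addBatch(); // queue this row's parameters
            }
            return ps.executeBatch(); // run all queued rows as one batch
        }
    }

    public static void main(String[] args) {
        System.out.println(buildUpdateSql("\"NUM_NULLS\" = ?"));
    }
}
```

Because ENGINE is part of the WHERE clause, each batched row touches only the statistics written by that engine.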

dengzhhu653 · 2024-01-12T01:27:22Z

@dengzhhu653 There are some minor issues found by SonarCloud related to readability and maintainability: https://sonarcloud.io/project/issues?cleanCodeAttributeCategories=INTENTIONAL&resolved=false&pullRequest=4995&id=apache_hive&open=AYz34ifC28xE78Z9jBVO

Fixing these is optional since they are very minor, but good to have! :)

Thank you @soumyakanti3578, will fix them.
@nrg4878 @saihemanth-cloudera @henrib @deniskuzZ @ayushtkn could you take a look as well if you have cycles, please?

sonarcloud · 2024-01-12T09:36:03Z

Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

4 New issues
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

saihemanth-cloudera

Changes look good to me. +1
Will wait for @zhangbutao's approval.

saihemanth-cloudera · 2024-01-19T01:14:47Z

@dengzhhu653 -- Can you also create a follow-up jira to optimize this further where we directly update the db_name and table_name in the 'tab_col_stats' and 'part_col_stats' tables instead of fetching the stats, updating db/table names and then persisting it to the DB?
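
A rough sketch of what such a direct update might look like, assuming both stats tables carry CAT_NAME, DB_NAME, and TABLE_NAME columns. The class and method names are hypothetical, and this is not the implementation tracked in any follow-up ticket.

```java
final class DirectRenameSketch {
    // Builds an UPDATE that rewrites the db/table names in place for one stats table.
    // Bind order: new DB_NAME, new TABLE_NAME, CAT_NAME, old DB_NAME, old TABLE_NAME.
    static String renameSql(String statsTable) {
        return "UPDATE \"" + statsTable + "\" SET \"DB_NAME\" = ?, \"TABLE_NAME\" = ?"
            + " WHERE \"CAT_NAME\" = ? AND \"DB_NAME\" = ? AND \"TABLE_NAME\" = ?";
    }

    public static void main(String[] args) {
        System.out.println(renameSql("TAB_COL_STATS"));
        System.out.println(renameSql("PART_COL_STATS"));
    }
}
```

The idea is one statement per stats table, instead of fetching every stats row, changing the names in memory, and writing the rows back.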

zhangbutao

LGTM +1
Batch commit is a good way to improve the performance of altering partitioned tables.
Let's go ahead!

dengzhhu653 · 2024-01-19T06:50:21Z

@dengzhhu653 -- Can you also create a follow-up jira to optimize this further where we directly update the db_name and table_name in the 'tab_col_stats' and 'part_col_stats' tables instead of fetching the stats, updating db/table names and then persisting it to the DB?

Created HIVE-28011 for tracking this

dengzhhu653 · 2024-01-20T02:12:23Z

Thank you @zhangbutao @soumyakanti3578 and @saihemanth-cloudera for the comment and review!


asf-ci-hive added tests pending tests unstable and removed tests pending labels Jan 10, 2024

dengzhhu653 force-pushed the HIVE-27994 branch from 771d30c to 9b9832d Compare January 11, 2024 03:35

asf-ci-hive added tests pending and removed tests unstable labels Jan 11, 2024

HIVE-27994: Optimize renaming the partitioned table

d2eee3d

dengzhhu653 force-pushed the HIVE-27994 branch from 9b9832d to d2eee3d Compare January 11, 2024 05:49

asf-ci-hive added tests unstable tests pending and removed tests pending tests unstable labels Jan 11, 2024

zhangbutao reviewed Jan 11, 2024


asf-ci-hive added tests unstable tests pending tests passed and removed tests pending tests unstable labels Jan 11, 2024

zhangbutao reviewed Jan 12, 2024


code smell

40e3a13

asf-ci-hive added tests pending and removed tests passed labels Jan 12, 2024

asf-ci-hive added tests passed and removed tests pending labels Jan 12, 2024

saihemanth-cloudera reviewed Jan 19, 2024


zhangbutao approved these changes Jan 19, 2024


saihemanth-cloudera approved these changes Jan 19, 2024


dengzhhu653 merged commit 72fd26d into apache:master Jan 20, 2024
5 checks passed

tarak271 pushed a commit to tarak271/hive-1 that referenced this pull request Feb 9, 2024

HIVE-27994: Optimize renaming the partitioned table (apache#4995) (Zh…

dc43c79

…ihua Deng, reviewed by Butao Zhang, Sai Hemanth Gantasala)

dengzhhu653 added a commit to dengzhhu653/hive that referenced this pull request Mar 7, 2024

HIVE-27994: Optimize renaming the partitioned table (apache#4995) (Zh…

3fc31d5

…ihua Deng, reviewed by Butao Zhang, Sai Hemanth Gantasala)
