Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-27994: Optimize renaming the partitioned table #4995

Merged
merged 2 commits into from Jan 20, 2024

Conversation

dengzhhu653
Copy link
Member

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

Is the change a dependency upgrade?

How was this patch tested?

@@ -181,7 +182,7 @@ private void populateInsertUpdateMap(Map<PartitionInfo, ColumnStatistics> statsP
e -> e.partitionId).collect(Collectors.toList()
);

prefix.append("select \"PART_ID\", \"COLUMN_NAME\" from \"PART_COL_STATS\" WHERE ");
prefix.append("select \"PART_ID\", \"COLUMN_NAME\", \"ENGINE\" from \"PART_COL_STATS\" WHERE ");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need ENGINE field here?
Isn't the PART_ID field enough to identify the specific partition?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be possible,
https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L10420
in the above example, we use different engines to fetch the statistics with the same partitions and columns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. if so, Shouldn't we add the ENGINE field in the WHERE sub-statement?
Current direct SQL query:
select "PART_ID", "COLUMN_NAME" from "PART_COL_STATS" WHERE ("PART_ID" in (141))

Maybe can be changed to:
select "PART_ID", "COLUMN_NAME" from "PART_COL_STATS" WHERE ("PART_ID" in (141)) and ENGINE = hive

BTW, we also have another field CAT_NAME in PART_COL_STATS to differentiate column stas between multi catalog. Should we also consider it here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Morning @zhangbutao!
In this method we want to get the insert or the update statistics from Map<PartitionInfo, ColumnStatistics> statsPartInfoMap, there is no guarantee that all of the statsPartInfoMap are for the engine hive or the same engine, so PartColNameInfo needs to feed with the engine info when compared with the stats in statsPartInfoMap.

BTW, we also have another field CAT_NAME in PART_COL_STATS to differentiate column stas between multi catalog. Should we also consider it here?

I think we don't need to, the PART_ID here has the same effect for clarifying the catalog.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explanation!

there is no guarantee that all of the statsPartInfoMap are for the engine hive or the same engine,

Is this code snippet only for insert statement?

If so, the one insert statement is must from the one same ENGINE (like hive or other compute engine), so the statsPartInfoMap will have the same ENGINE value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this code snippet only for insert statement?

No, as you see, this PR aims for optimizing the partitioned table rename. Even the statsPartInfoMap has the same ENGINE value, we must be careful not to override the stats(for other engines) for the same partition column on the database.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some code debug. Only found that insert into statement gone through this code block(method populateInsertUpdateMap), but alter table rename not. Maybe i missed some configuration?

Since statsPartInfoMap has the same ENGINE value, why not we use this direct sql to filter the PART_ID based on the ENGINE (e.g. hive)?
select "PART_ID", "COLUMN_NAME" from "PART_COL_STATS" WHERE ("PART_ID" in (141)) and ENGINE = hive

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhangbutao, you need to apply this change to see populateInsertUpdateMap is invoked by the rename operation, there is a batch rename for stats.
In my idea, there is no document that all the stats in statsPartInfoMap must be the same engine, so we cannot simply using the ENGINE = hive for all stats, besides ENGINE = hive in the filter is almost the same as in select \"ENGINE\" when there has limited numbers of engines on this partition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok,i didn't apply your change to do test. No need too much consideration about this comment, i just want to explore more details about stats usage in Hive.
Thanks. 😃

@soumyakanti3578
Copy link
Contributor

@dengzhhu653 There are some minor issues found by SonarCloud related to readability and maintainability: https://sonarcloud.io/project/issues?cleanCodeAttributeCategories=INTENTIONAL&resolved=false&pullRequest=4995&id=apache_hive&open=AYz34ifC28xE78Z9jBVO

Fixing these are optional since these are very minor but good to have! :)

throw new MetaException("Invalid state of PART_COL_STATS for PART_ID " + partId);
StringBuilder update = new StringBuilder("UPDATE \"PART_COL_STATS\" SET ")
.append(StatObjectConverter.getUpdatedColumnSql(mPartitionColumnStatistics))
.append(" WHERE \"PART_ID\" = ? AND \"COLUMN_NAME\" = ? AND \"ENGINE\" = ?");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. Should we also consider addding CAT_NAME field here?

@dengzhhu653
Copy link
Member Author

dengzhhu653 commented Jan 12, 2024

@dengzhhu653 There are some minor issues found by SonarCloud related to readability and maintainability: https://sonarcloud.io/project/issues?cleanCodeAttributeCategories=INTENTIONAL&resolved=false&pullRequest=4995&id=apache_hive&open=AYz34ifC28xE78Z9jBVO

Fixing these are optional since these are very minor but good to have! :)

Thank you @soumyakanti3578, will fix them.
@nrg4878 @saihemanth-cloudera @henrib @deniskuzZ @ayushtkn cloud you take a look as well if have cycles please?

Copy link

sonarcloud bot commented Jan 12, 2024

Quality Gate Passed Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

4 New issues
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

Copy link
Contributor

@saihemanth-cloudera saihemanth-cloudera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes looks good to me. +1
Will wait for @zhangbutao's approval.

@saihemanth-cloudera
Copy link
Contributor

@dengzhhu653 -- Can you also create a follow-up jira to optimize this further where we directly update the db_name and table_name in the 'tab_col_stats' and 'part_col_stats' tables instead of fetching the stats, updating db/table names and then persisting it to the DB?

Copy link
Contributor

@zhangbutao zhangbutao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM +1
Batch commit is a good way to improve the performance of alter partitioned tables.
Let's go ahead!

@dengzhhu653
Copy link
Member Author

@dengzhhu653 -- Can you also create a follow-up jira to optimize this further where we directly update the db_name and table_name in the 'tab_col_stats' and 'part_col_stats' tables instead of fetching the stats, updating db/table names and then persisting it to the DB?

Created HIVE-28011 for tracking this

@dengzhhu653 dengzhhu653 merged commit 72fd26d into apache:master Jan 20, 2024
5 checks passed
@dengzhhu653
Copy link
Member Author

Thank you @zhangbutao @soumyakanti3578 and @saihemanth-cloudera for the comment and review!

tarak271 pushed a commit to tarak271/hive-1 that referenced this pull request Feb 9, 2024
…ihua Deng, reviewed by Butao Zhang, Sai Hemanth Gantasala)
dengzhhu653 added a commit to dengzhhu653/hive that referenced this pull request Mar 7, 2024
…ihua Deng, reviewed by Butao Zhang, Sai Hemanth Gantasala)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants