[SUPPORT]Sync hive lost some partitions when submit multiple commits at the same time #7570

Closed
perfectcw opened this issue Dec 28, 2022 · 6 comments
Labels
meta-sync multi-writer priority:critical production down; pipelines stalled; Need help asap.

perfectcw commented Dec 28, 2022

Describe the problem you faced

Issue:
Some partitions are lost when syncing to Hive.

Background:
We have a data ingestion pipeline that ingests about 500 partitions per day. The pipeline submits multiple commits at the same time, each inserting into a different partition, and Hive sync is enabled for every commit.

After all of the commits succeeded, we found that some partitions were missing from the Hive table.

The following is an analysis of the log and the .hoodie files:
The .hoodie file listing shows six of the commits. Only two of them, 20221227042858342 and 20221227042906103, were synced to Hive; the partitions written by the other commits did not appear in the Hive table.

I think the root cause is the Hive sync mechanism. When Hudi syncs to Hive after a commit succeeds, it first fetches the latest synced commit and uses that commit's timestamp as a watermark: it checks only the commits after that timestamp for new columns and partitions, and syncs whatever it finds to Hive.
So if commit A is submitted before the latest synced commit B but succeeds after B, commit A is never synced to Hive. Because A's timestamp is earlier than B's, A is not detected.

Here is the log of commit 20221227042859357. The latest synced commit it found is 20221227042906103, which was submitted after 20221227042859357 itself. As a result, the partition inserted by commit 20221227042859357 was not detected, and the number of partitions to sync was 0.

log of commit 20221227042859357:
2022-12-27 04:30:16,449 INFO hive.metastore: Opened a connection to metastore, current connections: 1
2022-12-27 04:30:16,465 INFO hive.metastore: Connected to metastore.
2022-12-27 04:30:16,676 INFO hive.HiveSyncTool: Syncing target hoodie table with hive table(forecast_agg_hoover_multi_publish). Hive metastore URL :jdbc:hive2://hs2.presto.stg.aws.fwmrm.net:10000/;auth=noSasl, basePath :s3a://fw1-stg-af-dip/hudi/forecast_agg_hoover_multi_publish
2022-12-27 04:30:16,676 INFO hive.HiveSyncTool: Trying to sync hoodie table forecast_agg_hoover_multi_publish with base path s3a://fw1-stg-af-dip/hudi/forecast_agg_hoover_multi_publish of type COPY_ON_WRITE
2022-12-27 04:30:16,815 INFO table.TableSchemaResolver: Reading schema from s3a://fw1-stg-af-dip/hudi/forecast_agg_hoover_multi_publish/20221227/0/20230108/9820ce59-03a8-4efa-8978-3c3cf61298d8-0_1-11-3890_20221227042906103.parquet
2022-12-27 04:30:16,904 INFO s3a.S3AInputStream: Switching to Random IO seek policy
2022-12-27 04:30:17,477 INFO hive.HiveSyncTool: No Schema difference for forecast_agg_hoover_multi_publish
2022-12-27 04:30:17,477 INFO hive.HiveSyncTool: Schema sync complete. Syncing partitions for forecast_agg_hoover_multi_publish
2022-12-27 04:30:17,525 INFO hive.HiveSyncTool: Last commit time synced was found to be 20221227042906103
2022-12-27 04:30:17,525 INFO common.AbstractSyncHoodieClient: Last commit time synced is 20221227042906103, Getting commits since then
2022-12-27 04:30:17,527 INFO hive.HiveSyncTool: Storage partitions scan complete. Found 0
2022-12-27 04:30:17,697 INFO hive.HiveSyncTool: Sync complete for forecast_agg_hoover_multi_publish
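
The symptom in this log can be checked mechanically. The following is a minimal sketch, not part of Hudi: the function name check_sync_log is hypothetical, the two regular expressions are taken from the log lines quoted above, and the instant time of the commit that triggered the sync has to be passed in by the caller because it does not appear in the log.

```python
import re

# Patterns copied from the HiveSyncTool log lines quoted in this issue.
LAST_SYNCED = re.compile(r"Last commit time synced was found to be (\d+)")
FOUND = re.compile(r"Storage partitions scan complete\. Found (\d+)")


def check_sync_log(log_text, own_instant):
    """Return (last_synced, partitions_found, skipped) for one sync run.

    own_instant is the instant time of the commit that triggered the sync.
    It is not present in the log, so the caller must supply it.
    """
    last = LAST_SYNCED.search(log_text)
    found = FOUND.search(log_text)
    if last is None or found is None:
        raise ValueError("expected HiveSyncTool lines not found in log")
    last_synced = last.group(1)
    partitions_found = int(found.group(1))
    # Instant times are fixed-width digit strings, so plain string
    # comparison orders them chronologically.
    skipped = own_instant < last_synced and partitions_found == 0
    return last_synced, partitions_found, skipped


log = """\
2022-12-27 04:30:17,525 INFO hive.HiveSyncTool: Last commit time synced was found to be 20221227042906103
2022-12-27 04:30:17,527 INFO hive.HiveSyncTool: Storage partitions scan complete. Found 0
"""
result = check_sync_log(log, "20221227042859357")
print(result)  # ('20221227042906103', 0, True)
```

A skipped value of True means the sync ran with a watermark newer than the commit that triggered it and found nothing to sync, which is the situation described above.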

    
.hoodie files (ordered by last-modified time):
 name                                type       last modified               partition            exists in Hive
 20221227042855832.commit.requested  requested  2022-12-27 12:28:59 PM CST  20221227/0/20230101  no
 20221227042858342.commit.requested  requested  2022-12-27 12:29:00 PM CST  20221227/0/20230106  yes
 20221227042858801.commit.requested  requested  2022-12-27 12:29:01 PM CST  20221227/0/20230107  no
 20221227042859357.commit.requested  requested  2022-12-27 12:29:01 PM CST  20221227/0/20221229  no
 20221227042901993.commit.requested  requested  2022-12-27 12:29:04 PM CST  20221227/0/20230103  no
 20221227042906103.commit.requested  requested  2022-12-27 12:29:08 PM CST  20221227/0/20230108  yes
 ...
 20221227042855832.inflight          inflight   2022-12-27 12:29:16 PM CST
 20221227042858342.inflight          inflight   2022-12-27 12:29:16 PM CST
 20221227042858801.inflight          inflight   2022-12-27 12:29:17 PM CST
 20221227042859357.inflight          inflight   2022-12-27 12:29:19 PM CST
 20221227042906103.inflight          inflight   2022-12-27 12:29:19 PM CST
 20221227042901993.inflight          inflight   2022-12-27 12:29:20 PM CST
 ...
 20221227042858342.commit            commit     2022-12-27 12:29:46 PM CST  20221227/0/20230106
 20221227042906103.commit            commit     2022-12-27 12:29:54 PM CST  20221227/0/20230108
 20221227042858801.commit            commit     2022-12-27 12:30:04 PM CST  20221227/0/20230107
 20221227042859357.commit            commit     2022-12-27 12:30:14 PM CST
 20221227042855832.commit            commit     2022-12-27 12:30:23 PM CST
 20221227042901993.commit            commit     2022-12-27 12:30:33 PM CST
 ...
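
To check the analysis against this listing, the six commits can be replayed through a simplified model of watermark-based sync. This is only a sketch under stated assumptions, not Hudi's actual code: it assumes that one sync runs right after each commit completes, that a sync only sees completed commits whose instant time is greater than the last synced instant, and that nothing relevant had been synced before the first of these commits. The instant times, partitions, and completion times are copied from the listing.

```python
# (instant time, partition, completion time of the .commit file)
COMMITS = [
    ("20221227042855832", "20221227/0/20230101", "12:30:23"),
    ("20221227042858342", "20221227/0/20230106", "12:29:46"),
    ("20221227042858801", "20221227/0/20230107", "12:30:04"),
    ("20221227042859357", "20221227/0/20221229", "12:30:14"),
    ("20221227042901993", "20221227/0/20230103", "12:30:33"),
    ("20221227042906103", "20221227/0/20230108", "12:29:54"),
]


def replay(commits):
    """Simulate one watermark-based sync per commit, in completion order."""
    last_synced = ""      # assumption: no earlier sync covers these commits
    completed = {}        # instant -> partition, for finished commits only
    hive_partitions = set()
    for instant, partition, _ in sorted(commits, key=lambda c: c[2]):
        completed[instant] = partition
        # Only completed commits newer than the watermark are scanned.
        newer = {i: p for i, p in completed.items() if i > last_synced}
        hive_partitions.update(newer.values())
        if newer:
            last_synced = max(newer)  # advance the watermark
    return hive_partitions


synced = replay(COMMITS)
missing = {partition for _, partition, _ in COMMITS} - synced
print(sorted(synced))
print(sorted(missing))
```

Under these assumptions the model syncs only 20221227/0/20230106 and 20221227/0/20230108, which matches the two partitions that were observed in Hive. The other four commits completed after the watermark had already moved past their instant times.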

To Reproduce

Steps to reproduce the behavior:

1. Submit multiple commits at the same time, each inserting into a different partition, with Hive sync enabled for each commit.
2. Make sure the commits succeed in a different order from the order in which they were submitted.
3. Check whether the Hive table has a partition for every insert.

Expected behavior

Each commit should be able to specify the partitions to sync, namely the partitions it has just inserted.
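
The expected behavior can be illustrated with a small sketch. It is hypothetical and does not use any Hudi API: the commit names, partition values, and the functions sync_own_partition and final_partitions are made up for illustration. The idea is that if each commit registers the partition it wrote, and registration is an idempotent set insertion, then the final result does not depend on the order in which the commits complete.

```python
from itertools import permutations

# Hypothetical commits: instant time -> partition written by that commit.
WRITES = {
    "001": "p=a",
    "002": "p=b",
    "003": "p=c",
}


def sync_own_partition(hive_partitions, instant, writes):
    """Register the partition written by `instant` (idempotent set add)."""
    hive_partitions.add(writes[instant])


def final_partitions(completion_order, writes):
    """Run one per-commit sync for each commit in the given order."""
    hive = set()
    for instant in completion_order:
        sync_own_partition(hive, instant, writes)
    return hive


# Every possible completion order ends with all partitions registered.
all_orders_ok = all(
    final_partitions(order, WRITES) == set(WRITES.values())
    for order in permutations(WRITES)
)
print(all_orders_ok)  # True
```

This only demonstrates the order-independence property being asked for. It says nothing about how Hudi implements or should implement it.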

Environment Description

  • Hudi version :0.11.1

  • Spark version :3.2.1

  • Hive version :XXX

  • Hadoop version :3.3.2

  • Storage (HDFS/S3/GCS..) :S3

  • Running on Docker? (yes/no) :no


@perfectcw perfectcw reopened this Dec 28, 2022
@fengjian428
Contributor

This should be fixed by PR #6662.

@perfectcw
Author

Thanks for your reply. Could you explain the specific reason? Is it because some commits were archived and therefore cannot be synced to Hive?

@fengjian428
Contributor

The sync logic is: read last_update_time from the Hive table properties, get all commits since that time, then update last_update_time. This does not work with multiple writers.

@perfectcw
Author

Does that mean that when 20221227042855832.commit goes to sync Hive and last_update_time in the Hive table properties is already 20221227042906103, commit 20221227042855832 will not be synced to Hive?

@fengjian428
Contributor

Yes.

@perfectcw
Author

Thanks!

@yihua yihua added meta-sync priority:critical production down; pipelines stalled; Need help asap. multi-writer labels Jan 1, 2023
@xushiyan xushiyan closed this as completed Jan 7, 2023