[SUPPORT] HiveSyncTool: missing partitions #6277
Comments
so, when a new device's data comes in, is there a chance multiple hive sync processes are running at the same time?
Seems pretty unlikely, there's a single process writing updates, and after a commit it checks whether a configured number of commits since the last sync has passed before syncing again. I think I also experienced the issue when syncing manually, but I have to add some more historic data, so I'll try some more then.
@matthiasdg I have met a similar issue but it was caused by multiple writers.
Yes, I have also noticed the "last commit time synced is..." and "getting commits since then" log messages. Based on the logs, it seems the timestamp of the last synced commit is used (not the sync time itself). I even experience issues with a single writer and only manual syncing.
oh, you remind me maybe it was caused by the commits retain strategy, let me check and fix it
@matthiasdg do you have
One possible reason why you are seeing this. For example, let's say you have 10 commits and ran hive sync to sync everything, and things are in good shape. Is there a chance this is happening in your case?
Yup, that's more or less what I thought was the case. (Not sure what Hudi setting exactly manages it; I mentioned commit retention in one of the earlier comments, but it could be something else.)
Describe the problem you faced
We have some IoT data tables with a few thousand partitions, typically laid out as `deviceId/year/month/day`. We do not sync to Hive on every commit, but at regular intervals.
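To make the partition layout concrete, here is a minimal sketch of how a relative path such as `deviceId/year/month/day` maps to one value per partition field. It is only an illustration of the idea behind a multi-part key extractor, not Hudi's implementation; the function name `partition_values`, the device id and the date are assumptions made up for the example.

```python
from __future__ import annotations


def partition_values(partition_path: str) -> list[str]:
    """Split a relative partition path into one value per partition field.

    Accepts plain segments ("dev42/2022/08/02") as well as Hive-style
    segments ("deviceId=dev42/year=2022/month=08/day=02").
    """
    values = []
    for segment in partition_path.strip("/").split("/"):
        # For Hive-style segments keep only the part after the first '='.
        values.append(segment.split("=", 1)[-1])
    return values


# Field names follow the layout described above; the path itself is hypothetical.
FIELDS = ["deviceId", "year", "month", "day"]
example_path = "dev42/2022/08/02"
print(dict(zip(FIELDS, partition_values(example_path))))
```

With a few thousand devices and daily partitions, every new day adds one such path per device, which is why the number of partitions to register in Hive grows quickly.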
For one of these tables I added a few months of historic data for an additional set of devices, as opposed to the daily updates for the existing set. Syncing with HiveSyncTool afterwards must have gone wrong, because not all of these partitions are present in Hive. Unfortunately I do not have logs, so I am not sure whether the sync failed or passed silently without detecting some partitions (I suspect the latter). If I now run HiveSyncTool again, I just get e.g. `Last commit time synced is 20220802000054258, Getting commits since then`, which is what it does: it picks up partitions added since that commit, but the ones that were not synced before are never added.
My current workaround is to drop the Hive table and rerun HiveSyncTool from scratch. This adds all the partitions.
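The behavior described above can be modelled with a toy sketch. This is not HiveSyncTool's code: the `timeline` dictionary, the function `partitions_to_add` and all partition names are assumptions for illustration. It only shows why a partition that was missed once stays missing as long as the sync looks exclusively at commits after the last synced commit time, and why a sync from scratch finds it again.

```python
from __future__ import annotations


def partitions_to_add(
    timeline: dict[str, set[str]],
    last_synced: str | None,
    in_hive: set[str],
) -> set[str]:
    """Return the partitions a sync would register in the metastore.

    timeline    -- commit time -> partitions written by that commit
    last_synced -- last commit time synced, or None for a sync from scratch
    in_hive     -- partitions already registered in the metastore
    """
    if last_synced is None:
        # Sync from scratch: consider every partition of every commit.
        candidates = set().union(*timeline.values())
    else:
        # Incremental sync: only commits strictly after the last synced one.
        # Commit times are fixed-width strings, so string comparison orders them.
        newer = (parts for ts, parts in timeline.items() if ts > last_synced)
        candidates = set().union(*newer)
    return candidates - in_hive


# Hypothetical timeline: the first commit backfills historic data for devB.
timeline = {
    "20220801000000000": {"devA/2022/08/01", "devB/2022/05/01"},
    "20220802000054258": {"devA/2022/08/02"},
    "20220803000000000": {"devA/2022/08/03"},
}
# Assume devB/2022/05/01 was somehow skipped by an earlier sync.
in_hive = {"devA/2022/08/01", "devA/2022/08/02"}

incremental = partitions_to_add(timeline, "20220802000054258", in_hive)
from_scratch = partitions_to_add(timeline, None, in_hive)
print(sorted(incremental))
print(sorted(from_scratch))
```

In this model the incremental run never revisits the backfill commit, so only a sync from scratch (here `last_synced=None`, in practice dropping the Hive table) registers the skipped partition.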
Steps to reproduce the behavior:
`deviceId/year/month/day` (`MultiPartKeysValueExtractor`), sync to hive the first time. All is fine, though it may take a long time.
Expected behavior
I would expect either to get an error when partitions are not synced, so that the last commit time synced is not updated, or to have all partitions detected immediately.
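The first of these expectations could look roughly like the sketch below. It is a hypothetical check, not an existing HiveSyncTool feature: `MissingPartitionsError` and `next_sync_marker` are made-up names, and the partition sets are assumed to have been listed from storage and from the metastore by some other means.

```python
from __future__ import annotations


class MissingPartitionsError(RuntimeError):
    """Raised when partitions on storage are absent from the metastore after a sync."""


def next_sync_marker(
    on_storage: set[str],
    in_hive_after_sync: set[str],
    latest_commit: str,
) -> str:
    """Return the commit time to record as 'last commit time synced'.

    The marker only advances when every partition found on storage is also
    registered in the metastore; otherwise an error is raised and the caller
    keeps the previous marker.
    """
    missing = on_storage - in_hive_after_sync
    if missing:
        raise MissingPartitionsError(
            f"{len(missing)} partition(s) not synced: {sorted(missing)}"
        )
    return latest_commit


# Hypothetical state after a sync that skipped one backfilled partition.
on_storage = {"devA/2022/08/01", "devB/2022/05/01"}
in_hive_after_sync = {"devA/2022/08/01"}
try:
    next_sync_marker(on_storage, in_hive_after_sync, "20220802000054258")
except MissingPartitionsError as err:
    print(f"sync marker not advanced: {err}")
```

Comparing the full partition listings costs more than an incremental sync on a table with thousands of partitions, so a check like this would more plausibly run occasionally than after every commit.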
Environment Description
Hudi version : 0.10.0
Spark version : 3.1.2
Hive version : client side: 2.3.7 through hudi, standalone metastore 3.0
Hadoop version : 3.2.0
Storage (HDFS/S3/GCS..) : Azure Data Lake Gen 2
Running on Docker? (yes/no) : yes (k8s)