
[SUPPORT] HiveSyncTool: missing partitions #6277

Closed
matthiasdg opened this issue Aug 2, 2022 · 8 comments · Fixed by #6662
Labels: meta-sync, priority:critical (production down; pipelines stalled; need help asap)

Comments


matthiasdg commented Aug 2, 2022

Describe the problem you faced

We have some IoT data tables with a few thousand partitions, typically deviceId/year/month/day.
We do not sync to hive on every commit, but at regular intervals.
For one of these tables I added a few months of historic data for an additional set of devices, on top of the daily updates for the existing set. Hive syncing with HiveSyncTool afterwards must have gone wrong somehow (unfortunately I do not have logs, so I am not sure whether it failed or passed silently without detecting some partitions; I suspect the latter), because not all of these partitions are present in hive. If I now run HiveSyncTool again, I just get e.g. `Last commit time synced is 20220802000054258, Getting commits since then`, which is what it does: it picks up partitions added since that commit, but the ones that were not synced before are never added.

My current way of solving this is to drop the hive table and rerun HiveSyncTool from scratch. This adds all the partitions.

Steps to reproduce the behavior:

  1. Have a dataset with a large number of partitions deviceId/year/month/day (MultiPartKeysValueExtractor) and sync to hive for the first time. All is fine, though it may take a long time.
  2. Add data to the existing partitions (new months/days will be added); syncing to hive still works.
  3. Add a large amount of data for devices that were not in the set before and sync again -> in my case there are partitions for every new device, but lots of the underlying date partitions are missing.
  4. Drop the hive table and resync from scratch -> all partitions are there.

Expected behavior
I would expect either to get an error when partitions are not synced (so the last commit time synced is not updated), or to have all partitions detected immediately.

Environment Description

  • Hudi version : 0.10.0

  • Spark version : 3.1.2

  • Hive version : client side: 2.3.7 through hudi, standalone metastore 3.0

  • Hadoop version : 3.2.0

  • Storage (HDFS/S3/GCS..) : Azure Data Lake Gen 2

  • Running on Docker? (yes/no) : yes (k8s)


fengjian428 commented Aug 2, 2022

> We do not sync to hive every commit, but at regular intervals.

So, when a new device's data comes in, is there a chance multiple hive sync processes are running at the same time?

matthiasdg (Author)

Seems pretty unlikely: there is a single process writing updates, and after a commit it checks whether a configured number of commits has passed since the last sync before syncing again. I think I also experienced the issue when syncing manually, but I have to add some more historic data anyway, so I will try some more then.

fengjian428 (Contributor)

@matthiasdg I have met a similar issue, but it was caused by multiple writers.
If the hive sync succeeds, Hudi writes the sync time into the table properties. When the next hive sync begins, it scans the commit meta files after the last sync time and collects all changed partitions. I feel there shouldn't be any partition lost with this logic.

matthiasdg (Author)

Yes, I have also noticed the "last commit time synced is..." and "getting commits since then" log messages. Based on the logs, it seems it is the timestamp of the last synced commit that is used (not the sync time itself).
Wondering how that interacts with commit retention.

I even experience issues with a single writer and only manual syncing.
A test I did just now: I start with a fresh hive table from my existing data; I run HiveSyncTool and it says it adds 8753 partitions. After that I ingest the new data (this spans a lot of commits, e.g. 60) and run HiveSyncTool again; 1506 partitions are added (total = 10259).
Now I drop the table and rerun HiveSyncTool on all data at once: 11055 partitions are added. So I am not sure why there is a difference. We do this kind of thing rarely (most of the time it is just adding data for existing devices, so only a day and/or month partition will be added), but it is a bit troubling that there are no warnings/errors.
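The gap is easy to quantify from the counts reported in this comment; a throwaway arithmetic sketch (nothing here is a Hudi API, just the numbers above):

```python
# Partition counts reported in the comment above.
first_sync = 8753        # fresh table, initial sync
incremental_sync = 1506  # second sync, after ingesting the new devices' data
full_resync = 11055      # drop table, resync everything at once

missing = full_resync - (first_sync + incremental_sync)
print(missing)  # -> 796 partitions that the incremental path never registered
```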

fengjian428 (Contributor)

Oh, you remind me: maybe it was caused by the commit retention strategy. Let me check and fix it.

@yihua yihua added meta-sync priority:critical production down; pipelines stalled; Need help asap. labels Aug 8, 2022
@yihua yihua added this to Awaiting Triage in GI Tracker Board via automation Aug 8, 2022

parisni commented Aug 16, 2022

@matthiasdg do you have hoodie.datasource.hive_sync.ignore_exceptions enabled? That would silently ignore metastore trouble.
BTW, I guess hive_sync errors don't lead to rolling back the hudi commit anyway.
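For context, that flag sits next to the other hive-sync options passed to the writer. A minimal sketch of the option map for a Hudi Spark datasource write, assuming the standard Hudi config keys; the database and table values are made up for illustration:

```python
# Hive-sync options for a Hudi Spark datasource write (values are examples).
hive_sync_opts = {
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "iot_db",        # hypothetical name
    "hoodie.datasource.hive_sync.table": "device_readings",  # hypothetical name
    "hoodie.datasource.hive_sync.partition_fields": "deviceId,year,month,day",
    "hoodie.datasource.hive_sync.partition_extractor_class":
        "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    # "true" swallows metastore errors instead of failing the job, which can
    # hide exactly the kind of silent partition loss discussed in this issue.
    "hoodie.datasource.hive_sync.ignore_exceptions": "false",
}

# df.write.format("hudi").options(**hive_sync_opts)...  # writer call elided
```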

nsivabalan (Contributor)

One possible reason why you are seeing this:
I assume you are running hive sync as a standalone job and not along with your regular writes. In such cases, hive sync will only consider commits in the active timeline.

For example, let's say you have 10 commits and ran hive sync to sync everything, and things are in good shape. Now you add 100 more commits, and your cleaner and archival configs are such that only the last 20 commits remain in the active timeline. If you run hive sync again, hudi might sync only the partitions added in those last 20 commits, not all 100.

Is there a chance this is happening in your case?

@nsivabalan nsivabalan self-assigned this Aug 27, 2022
matthiasdg (Author)

Yup, that's more or less what I thought was the case. (Not sure exactly which hudi setting governs it; I mentioned commit retention in one of the earlier comments, but it could be something else.)
Is this behavior, and the settings that impact it, documented somewhere? It's not a big deal, just something to take into consideration.

@xushiyan xushiyan linked a pull request Sep 16, 2022 that will close this issue
GI Tracker Board automation moved this from Awaiting Triage to Done Sep 19, 2022