Do we have any TTL mechanism in Hudi? #2743

aditiwari01 · 2021-03-30T08:31:56Z

Do we have any TTL mechanism in Hudi? We have streaming jobs writing to Hudi and we do not want the data to grow indefinitely. Let's say we are only interested in rows updated in the last month. Is there any way to specify a row-level TTL?

bvaradar · 2021-03-31T00:44:37Z

I have opened a feature request for this: https://issues.apache.org/jira/browse/HUDI-1741

@nsivabalan @n3nash : FYI

nsivabalan · 2021-03-31T03:51:19Z

@aditiwari01: if you are not running a continuous job (deltastreamer), is it possible to trigger a job at a regular cadence that fetches records from Hudi that are more than 1 month old and issues deletes back to Hudi? I know this may not be efficient, but it will at least unblock you for now.
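A minimal PySpark-style sketch of such a periodic job follows. It assumes the row's `_hoodie_commit_time` meta column is an acceptable proxy for "last updated" and that instant times sort lexicographically in the `yyyyMMddHHmmss` layout. The helper names, the table name, and the field names passed in are hypothetical placeholders, and the load path may need a partition glob depending on the Hudi version.

```python
from datetime import datetime, timedelta


def commit_time_cutoff(now, retention_days):
    """Return the retention cutoff formatted like a Hudi instant (yyyyMMddHHmmss)."""
    return (now - timedelta(days=retention_days)).strftime("%Y%m%d%H%M%S")


def delete_write_options(table_name, record_key, partition_field, precombine_field):
    """Build Hudi datasource options for a delete write.

    All argument values are placeholders supplied by the caller.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "delete",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


def expire_old_rows(spark, base_path, options, cutoff):
    """Read rows last written before `cutoff` and write them back as deletes."""
    table = spark.read.format("hudi").load(base_path)
    # String comparison is enough because instants are zero-padded timestamps.
    expired = table.filter(f"_hoodie_commit_time < '{cutoff}'")
    expired.write.format("hudi").options(**options).mode("append").save(base_path)
```

This job is a second writer on the table, so it would have to run inside the same streaming application (for example between mini-batches) or while ingestion is paused.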

aditiwari01 · 2021-03-31T05:44:35Z

I was thinking along similar lines, but we do have continuous jobs (not deltastreamer, but Spark streaming jobs with 5/10 minute mini-batches). We can't have a separate job for deletion since we do not support concurrent writers.

Another possible solution is to partition our table by commit time and then run a scheduled manual cleanup job that deletes older partitions. The challenge here is that I am not sure whether deleting a partition from outside Hudi can mess up the hoodie metadata in some way. Also, this will require us to use global indexing to avoid duplicates, which in turn can increase latencies.

What are your thoughts on this?

nsivabalan · 2021-04-05T05:31:04Z

Yeah, we don't recommend deleting partitions outside of Hudi. Hudi supports the insert_overwrite operation, but I am not sure whether it can just delete a partition w/o ingesting any data.
@satishkotha: Is there a way to trigger deletion of a partition w/o any ingestion?

aditiwari01 · 2021-04-05T15:21:31Z

I think the ideal approach would be built around the compactor and cleaner.
A time-based cleaner, plus filtering out records with older commit times while compacting base files, should solve the issue.

If there's a workaround using any delete option, please let me know.

satishkotha · 2021-04-05T18:43:35Z

@nsivabalan @aditiwari01 We actually introduced a delete_partition operation. See https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/WriteOperationType.java#L46

I think this can be called on an empty dataframe (I haven't tested it).
Looks like documentation is missing for this. @nsivabalan Could you open a ticket?

cc: @lw309637554 who implemented this API.

lw309637554 · 2021-04-06T01:51:58Z

@aditiwari01
Hello,

You can now batch delete partitions using hudiTable.deletePartitions(), as in https://issues.apache.org/jira/browse/HUDI-1350. There is also a PR in progress that adds an easier tool for deleting partitions: [HUDI-1531] Introduce HoodiePartitionCleaner to delete specific partition #2452
@aditiwari01 why do you need to deduplicate the data? In my scenario, where the data is user behavior logs from Kafka, there is no need to deduplicate.
There is now an issue to add the docs: https://issues.apache.org/jira/browse/HUDI-1674 @satishkotha @nsivabalan

aditiwari01 · 2021-04-06T02:35:59Z

Thanks @lw309637554. I will look into deletePartitions in depth.

As for my use case, the ideal situation would be to have some kind of row-level TTL taken care of by the cleaner/compactor. In the absence of any such feature, I was wondering if I could partition on commit time and regularly delete older partitions. Our use case includes updates, and for a given key the updates can land in different partitions. We want to avoid such duplicates. I think global indexing can help here (I haven't tested it yet).

nsivabalan · 2021-04-06T13:58:17Z

@n3nash @vinothchandar @bvaradar: any thoughts here, or workarounds until we have per-record support for TTL?

nsivabalan · 2021-04-07T15:39:12Z

@lw309637554 @satishkotha: FYI, we have yet to add Spark datasource support for this "delete_partition" operation.
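Since the datasource did not accept this operation at the time of this comment, the following is only a sketch of what the write options could look like once it does. The operation value mirrors the `delete_partition` entry in WriteOperationType; the `hoodie.datasource.write.partitions.to.delete` key is an assumption to verify against the Hudi release in use, and the helper name and sample values are hypothetical.

```python
def delete_partition_options(table_name, partitions):
    """Build datasource options asking Hudi to drop whole partitions.

    `partitions` is a list of partition paths relative to the table base path.
    The "partitions.to.delete" key is assumed, not confirmed by this thread.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "delete_partition",
        "hoodie.datasource.write.partitions.to.delete": ",".join(partitions),
    }


# Hypothetical usage with an empty DataFrame that has the table's schema:
# empty_df.write.format("hudi") \
#     .options(**delete_partition_options("events", ["2021/02/01"])) \
#     .mode("append").save(base_path)
```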

n3nash · 2021-04-08T05:24:06Z

@aditiwari01 I think you mentioned 2 issues here:

Record level TTL -> We don't have such a feature in Hudi. As others have pointed out, the hudiTable.deletePartitions() API is a way to manage older partitions. Yes, you could partition on _hoodie_commit_time, or use any other date-based partitioning that structures your table so that older partitions can be deleted completely.
Duplicates across partitions -> If you have an update workload and are using the upsert API, then yes, using a GlobalIndex will help eliminate duplicates for your table.

As @nsivabalan pointed out, we don't have such support in the Spark datasource, but there is a low-level API, as pointed out above. We welcome contributions, and it would be good to add this support to the Spark datasource. Let me know if you want to contribute this feature and we can guide you.
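The two points above can be sketched roughly as follows, assuming date-based partition paths in a `yyyy/MM/dd` layout. The helper names and the layout are hypothetical; the index settings are shown as one plausible way to enable a global index, and should be checked against the configuration reference for the Hudi version in use.

```python
from datetime import date, datetime, timedelta


def expired_partitions(partitions, today, retention_days, fmt="%Y/%m/%d"):
    """Return the partition paths whose date is older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(
        p for p in partitions if datetime.strptime(p, fmt).date() < cutoff
    )


def global_index_options():
    """Upsert options that look up record keys across all partitions."""
    return {
        "hoodie.index.type": "GLOBAL_BLOOM",
        # Move a record to its new partition when an update changes the path,
        # so the same key does not survive in two partitions.
        "hoodie.bloom.index.update.partition.path": "true",
    }
```

The list returned by `expired_partitions` is what would be handed to the partition-delete API; the global index lookup is the part that can add write latency, as noted earlier in the thread.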

aditiwari01 · 2021-04-08T05:38:52Z

@n3nash Thanks for the clarification. Can we create a JIRA for this? I can't pick it up right away, but I will try to contribute as and when I get time.
Meanwhile, I will try to use the low-level API directly to unblock myself.

n3nash · 2021-04-08T06:03:07Z

@aditiwari01 Here is the ticket; it is assigned to you for now :) BTW, there is some relevant work happening in #2452. Please comment on the PR for further changes.

nsivabalan added the awaiting-community-help label Mar 30, 2021

bvaradar removed the awaiting-community-help label Mar 31, 2021

n3nash added the awaiting-user-response label Mar 31, 2021

nsivabalan added awaiting-community-help and removed awaiting-user-response labels Mar 31, 2021

nsivabalan self-assigned this Apr 2, 2021

nsivabalan removed the awaiting-community-help label Apr 5, 2021

lw309637554 mentioned this issue Apr 6, 2021 in [HUDI-1531] Introduce HoodiePartitionCleaner to delete specific partition #2452 (Closed)

nsivabalan added the feature-enquiry and jira-filed labels Apr 7, 2021

n3nash added the priority:minor label Apr 8, 2021

n3nash closed this as completed Apr 8, 2021
