Do we have any TTL mechanism in Hudi? #2743
Have opened a feature request for this: https://issues.apache.org/jira/browse/HUDI-1741 @nsivabalan @n3nash: FYI
@aditiwari01: if you are not running a continuous job (deltastreamer), is it possible to trigger a job at a regular cadence to fetch records from Hudi that are more than 1 month old and issue deletes back to Hudi? I know this may not be efficient, but it will at least unblock you for now.
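The periodic-delete job suggested above could look roughly like this. This is a minimal sketch in plain Python; the record layout, the field names `record_key` and `updated_at`, and the 30-day TTL are assumptions. A real job would read the table via Spark and write the expired keys back to Hudi using a delete write operation.

```python
from datetime import datetime, timedelta

def expired_keys(records, ttl_days=30, now=None):
    """Return keys of records whose last update is older than the TTL.

    `records` is a list of dicts with hypothetical fields
    "record_key" and "updated_at" (a datetime).
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=ttl_days)
    return [r["record_key"] for r in records if r["updated_at"] < cutoff]

records = [
    {"record_key": "a", "updated_at": datetime(2021, 1, 1)},
    {"record_key": "b", "updated_at": datetime(2021, 4, 1)},
]
# With now = 2021-04-10 and a 30-day TTL, only "a" has expired.
print(expired_keys(records, ttl_days=30, now=datetime(2021, 4, 10)))
```

The keys returned here would then be fed back into the same pipeline as delete records, so only one writer ever touches the table.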
I was thinking along similar lines, but we do have continuous jobs (not deltastreamer, but Spark streaming jobs with 5/10-minute mini batches). We can't have a separate job for deletion since we do not support concurrent writers. Another possible solution could be to partition our table by commit time and then have a scheduled manual cleanup job that deletes older partitions. The challenges here are that I am not sure whether deleting a partition from outside Hudi can mess up the Hoodie metadata in some way. Also, this would require us to use global indexing to avoid duplicates, which in turn can result in increased latencies. What are your thoughts on this?
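The commit-time-partitioning idea above amounts to naming partitions by commit date and dropping any partition older than the retention window. A small sketch, assuming a hypothetical `yyyy-MM-dd` partition path layout and a 30-day retention:

```python
from datetime import datetime, timedelta

def partitions_to_drop(partitions, retention_days=30, now=None):
    """Select partition paths (assumed format "YYYY-MM-DD") older than
    the retention window."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=retention_days)
    return [p for p in partitions
            if datetime.strptime(p, "%Y-%m-%d") < cutoff]

parts = ["2021-02-01", "2021-03-20", "2021-04-05"]
# With now = 2021-04-10, only the February partition is past 30 days.
print(partitions_to_drop(parts, retention_days=30, now=datetime(2021, 4, 10)))
```

As the thread notes, actually removing the selected partitions should go through Hudi rather than the filesystem, since deleting files underneath Hudi can leave the timeline/metadata inconsistent.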
Yeah, we don't recommend deleting partitions outside of Hudi. Hudi supports the insert_overwrite operation, but I am not sure if it can delete a partition without ingesting any data.
I think the ideal approach should be built around the compactor and cleaner. If there's a workaround using any delete option, please let me know.
@nsivabalan @aditiwari01 We actually introduced a delete_partition operation. See https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/WriteOperationType.java#L46 I think this can be called on an empty dataframe (I haven't tested it). cc: @lw309637554 who implemented this API.
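For illustration, here is how the delete_partition operation might be wired up through Spark datasource write options. As noted later in the thread, datasource support did not exist yet at the time, so the option keys and values below are assumptions based on Hudi's usual `hoodie.datasource.write.*` naming, not a confirmed API. Only the options dict is built here so the snippet runs without Spark; the actual write call is shown in a comment.

```python
def delete_partition_opts(table_name, partitions):
    """Build hypothetical Hudi write options for dropping partitions.

    The "delete_partition" operation value and the
    "partitions.to.delete" key are assumptions, not a confirmed API.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "delete_partition",
        # Comma-separated list of partition paths to drop (assumed key).
        "hoodie.datasource.write.partitions.to.delete": ",".join(partitions),
    }

opts = delete_partition_opts("events", ["2021-02-01", "2021-02-02"])
print(opts["hoodie.datasource.write.operation"])
# With Spark available, this would be applied roughly as:
#   empty_df.write.format("hudi").options(**opts).mode("append").save(path)
```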
Thanks @lw309637554, will look into this deletePartition in depth. As for my use case, the ideal solution would be some kind of row-level TTL handled by the cleaner/compactor. In the absence of any such feature, I was wondering if I could partition on commit time and regularly delete older partitions. Our use case includes updates, and for a given key the updates can exist in different partitions; we want to avoid such duplicates. I think global indexing can help here (haven't tested yet).
@n3nash @vinothchandar @bvaradar: any thoughts here, or workarounds until we have per-record support for TTL?
@lw309637554 @satishkotha : fyi we are yet to add spark ds support for this "delete_partition" operation. |
@aditiwari01 I think you mentioned 2 issues here:
As @nsivabalan pointed out, we don't have such support in the Spark datasource, but we have a low-level API as pointed out above. We welcome contributions, and it would be good to add this support to the Spark datasource - let me know if you want to contribute this feature and we can guide you
@n3nash Thanks for the clarification. Can we create a JIRA for the same? I can't pick this up right away but would try to contribute as and when I get time.
@aditiwari01 Here is the ticket and is assigned to you for now :) BTW, there is some relevant work happening here #2452. Please comment on the PR for further changes. |
Do we have any TTL mechanism in Hudi? We have streaming jobs writing to Hudi and we do not want the data to grow indefinitely. Let's say we are only interested in rows updated in the last month. Is there any way to specify row-level TTL?