Do we have any TTL mechanism in Hudi? #2743

Closed
aditiwari01 opened this issue Mar 30, 2021 · 13 comments
Labels
feature-enquiry issue contains feature enquiries/requests or great improvement ideas priority:minor everything else; usability gaps; questions; feature reqs

Comments

@aditiwari01
Contributor

Do we have any TTL mechanism in Hudi? We have streaming jobs writing to Hudi and we do not want the data to grow indefinitely. Let's say we are only interested in rows updated in the last month. Is there any way to specify a row-level TTL?

@bvaradar
Contributor

I have opened a feature request for this: https://issues.apache.org/jira/browse/HUDI-1741

@nsivabalan @n3nash : FYI

@nsivabalan
Contributor

@aditiwari01: If you are not running a continuous job (deltastreamer), is it possible to trigger a job at a regular cadence that fetches records from Hudi that are more than 1 month old and issues deletes back to Hudi? I know this may not be efficient, but at least it will unblock you for now.
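
A minimal sketch of the selection step in such a periodic job, in plain Python rather than Spark. It assumes rows carry Hudi's `_hoodie_record_key` and `_hoodie_commit_time` meta columns and that commit times use the fixed-width `yyyyMMddHHmmss` instant format (longer instants are truncated to 14 digits). The helper name `expired_keys` and the 30-day default are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: commit instants look like "20210330101500" (yyyyMMddHHmmss).
COMMIT_TIME_FORMAT = "%Y%m%d%H%M%S"


def expired_keys(rows, now, ttl=timedelta(days=30)):
    """Return the record keys whose latest commit is older than `ttl`.

    `rows` is any iterable of dicts holding the two Hudi meta columns.
    """
    cutoff = (now - ttl).strftime(COMMIT_TIME_FORMAT)
    # Fixed-width numeric timestamps sort the same way as strings,
    # so a plain string comparison is enough here.
    return [
        row["_hoodie_record_key"]
        for row in rows
        if row["_hoodie_commit_time"][:14] < cutoff
    ]


rows = [
    {"_hoodie_record_key": "a", "_hoodie_commit_time": "20210201120000"},
    {"_hoodie_record_key": "b", "_hoodie_commit_time": "20210315000000"},
]
stale = expired_keys(rows, now=datetime(2021, 3, 30))
```

The resulting keys are what the scheduled job would write back to the table as deletes; in a real job the filter would run as a Spark query over the table instead of a Python list.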

@aditiwari01
Contributor Author

I was thinking along similar lines, but we do have continuous jobs (not deltastreamer, but Spark streaming jobs with 5/10-minute mini-batches). We can't have a separate job for deletion since we do not support concurrent writers.

Another possible solution is to partition our table by commit time and then run a manually scheduled cleanup job that deletes older partitions. The challenge here is that I am not sure whether deleting a partition from outside Hudi can mess up the hoodie metadata in some way. This would also require global indexing to avoid duplicates, which in turn can increase latencies.

What are your thoughts on this?

@nsivabalan
Contributor

Yeah, we don't recommend deleting partitions outside of Hudi. Hudi supports an insert_overwrite operation, but I am not sure whether it can just delete a partition w/o ingesting any data.
@satishkotha : Is there a way to trigger deletion of a partition w/o any ingestion?

@aditiwari01
Contributor Author

I think the ideal solution would be built around the compactor and cleaner.
A time-based cleaner, plus filtering out records with an older commit time while compacting the base file, should solve the issue.

If there's a workaround using any delete option, please let me know.

@satishkotha
Member

satishkotha commented Apr 5, 2021

@nsivabalan @aditiwari01 We actually introduced a delete_partition operation. See https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/WriteOperationType.java#L46

I think this can be called on an empty dataframe (I haven't tested it).
It looks like documentation is missing for this. @nsivabalan Could you open a ticket?

cc: @lw309637554 who implemented this API.

@lw309637554
Contributor

@aditiwari01
Hello,

  1. You can now batch-delete partitions using hudiTable.deletePartitions(), as in https://issues.apache.org/jira/browse/HUDI-1350. There is also a PR in progress that adds an easier tool for deleting partitions: [HUDI-1531] Introduce HoodiePartitionCleaner to delete specific partition #2452
  2. @aditiwari01 Why do you need to deduplicate the data? In my scenario, the data is a user-behavior log from Kafka, so there is no need to deduplicate.
  3. There is now an issue for adding the docs: https://issues.apache.org/jira/browse/HUDI-1674 @satishkotha @nsivabalan

@aditiwari01
Contributor Author

Thanks @lw309637554. I will look into deletePartitions in depth.

As for my use case, the ideal situation would be to have some kind of row-level TTL taken care of by the cleaner/compactor. In the absence of any such feature, I was wondering if I could partition on commit time and regularly delete older partitions. Our use case includes updates, and for a given key the updates can land in different partitions. We want to avoid such duplicates. I think global indexing can help here (I haven't tested it yet).
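
To illustrate the duplicate problem described above, here is a toy model in plain Python. It is not Hudi code: the table is a dict keyed by `(partition, key)`, and the hypothetical `upsert` helper only mimics the difference between a partition-scoped key lookup and a global one that relocates the record to the incoming partition.

```python
def upsert(table, key, partition, value, global_index):
    """Write `value` for `key` into `partition` of a toy table.

    `table` maps (partition, key) -> value. With `global_index` set, any
    copy of the key in another partition is dropped first, so the key
    exists only once across the whole table.
    """
    if global_index:
        stale = [pk for pk in table if pk[1] == key and pk[0] != partition]
        for pk in stale:
            del table[pk]
    table[(partition, key)] = value


def copies(table, key):
    """Return the sorted partitions that currently hold `key`."""
    return sorted(p for (p, k) in table if k == key)


# Partition-scoped lookup: the same key ends up in two commit-date partitions.
scoped = {}
upsert(scoped, "user-1", "2021/03/01", "v1", global_index=False)
upsert(scoped, "user-1", "2021/04/01", "v2", global_index=False)

# Global lookup: the older copy is removed when the update arrives.
global_table = {}
upsert(global_table, "user-1", "2021/03/01", "v1", global_index=True)
upsert(global_table, "user-1", "2021/04/01", "v2", global_index=True)
```

The scan over every partition in the global case is also a rough picture of why global indexing can add latency, as raised earlier in the thread.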

@nsivabalan
Contributor

@n3nash @vinothchandar @bvaradar: Any thoughts here, or workarounds until we have per-record TTL support?

@nsivabalan nsivabalan added feature-enquiry issue contains feature enquiries/requests or great improvement ideas jira-filed labels Apr 7, 2021
@nsivabalan
Contributor

@lw309637554 @satishkotha : fyi we are yet to add spark ds support for this "delete_partition" operation.

@n3nash
Contributor

n3nash commented Apr 8, 2021

@aditiwari01 I think you mentioned 2 issues here

  1. Record level TTL -> We don't have such a feature in Hudi. Like others have pointed out, using the hudiTable.deletePartitions() API is a way to manage older partitions. Yes, you could partition based on _hoodie_commit_time or any other date based partitioning that structures your table to be eligible for deleting older partitions completely.
  2. Duplicates across partitions -> If you have an update workload and are using the upsert API, yes, using a GlobalIndex will help eliminate duplicates for your table.

As @nsivabalan pointed out, we don't have such support in the Spark datasource yet, but there is a low-level API, as pointed out above. We welcome contributions, and it would be good to add this support to the Spark datasource. Let me know if you want to contribute this feature and we can guide you.
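
The partition-based workaround above needs a list of partitions that have aged out. A minimal sketch of that step in plain Python, assuming date-based partition paths of the form `yyyy/MM/dd`; the helper name `expired_partitions` and the 30-day window are hypothetical, and the resulting list is what would be handed to the delete-partition API mentioned in this thread.

```python
from datetime import date, timedelta


def expired_partitions(partitions, today, retention_days=30):
    """Return the partition paths that fall before the retention window.

    Assumption: each path is a date laid out as "yyyy/MM/dd".
    """
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for path in partitions:
        year, month, day = (int(part) for part in path.split("/"))
        if date(year, month, day) < cutoff:
            expired.append(path)
    return sorted(expired)


partitions = ["2021/02/01", "2021/03/20", "2021/01/15"]
to_delete = expired_partitions(partitions, today=date(2021, 4, 8))
```

Dropping whole partitions this way only approximates a row-level TTL: a record expires when its partition does, so the partition granularity sets how precise the retention is.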

@n3nash n3nash added the priority:minor everything else; usability gaps; questions; feature reqs label Apr 8, 2021
@aditiwari01
Contributor Author

@n3nash Thanks for the clarification. Can we create a Jira for this? I can't pick it up right away, but I will try to contribute as and when I get time.
Meanwhile, I will try to use the low-level API directly to unblock myself.

@n3nash
Contributor

n3nash commented Apr 8, 2021

@aditiwari01 Here is the ticket; it is assigned to you for now :) BTW, there is some relevant work happening in #2452. Please comment on the PR for further changes.

@n3nash n3nash closed this as completed Apr 8, 2021

6 participants