Description
When performing a clean, the earliest commit to retain, returned by the getEarliestCommitToRetain method in CleanPlanner, is used as the endpoint of the clean. However, when a pending commit takes a long time and all the commits earlier than it have been archived, the pending commit becomes the earliest instant on the active timeline. In this situation, getEarliestCommitToRetain returns empty, because there is no commit earlier than the pending one. An incremental clean uses the previous endpoint, that is, the earliest commit retained by the previous clean, as its starting point. If that starting point is empty, a full clean is triggered instead, which is very resource-intensive.
To solve this problem, I set the EarliestCommitToRetain obtained in this case to the earliest pending commit. Since the endpoint itself is not cleaned in the current clean, this gives the next incremental clean a valid starting point without affecting normal clean behavior.
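The behavior described above can be illustrated with a minimal, self-contained sketch. It is not the actual Hudi code: the class `CleanEndpointSketch`, the nested `Instant` type, the method signature, and the `fallBackToPending` flag are all hypothetical, and the retention rule is a simplified assumption (keep the latest `commitsRetained` completed commits that precede the earliest pending one). The flag only exists to compare the behavior before and after this change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified model of the clean endpoint selection; not Hudi's API.
class CleanEndpointSketch {
    /** Simplified instant: a sortable timestamp plus a completion flag. */
    static final class Instant {
        final String timestamp;
        final boolean completed;

        Instant(String timestamp, boolean completed) {
            this.timestamp = timestamp;
            this.completed = completed;
        }
    }

    /**
     * Picks the earliest commit to retain from an active timeline that is
     * sorted oldest first. Assumes commitsRetained >= 1.
     */
    static Optional<String> earliestCommitToRetain(
            List<Instant> activeTimeline, int commitsRetained, boolean fallBackToPending) {
        // Earliest pending instant on the active timeline, if any.
        Optional<Instant> earliestPending =
                activeTimeline.stream().filter(i -> !i.completed).findFirst();

        // Completed instants that precede the earliest pending instant
        // (all completed instants when nothing is pending).
        List<String> completedBefore = new ArrayList<>();
        for (Instant i : activeTimeline) {
            if (!i.completed) {
                break;
            }
            completedBefore.add(i.timestamp);
        }

        // The case from the description: everything older than the pending
        // commit has been archived, so no completed commit precedes it.
        if (completedBefore.isEmpty() && earliestPending.isPresent()) {
            return fallBackToPending
                    ? Optional.of(earliestPending.get().timestamp) // proposed behavior
                    : Optional.empty();                            // previous behavior
        }

        // Normal case: keep the latest `commitsRetained` completed commits.
        if (completedBefore.size() > commitsRetained) {
            return Optional.of(completedBefore.get(completedBefore.size() - commitsRetained));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Pending commit "003" is the earliest instant on the active timeline.
        List<Instant> stalled = List.of(
                new Instant("003", false),
                new Instant("004", true),
                new Instant("005", true));
        System.out.println(earliestCommitToRetain(stalled, 2, false)); // Optional.empty
        System.out.println(earliestCommitToRetain(stalled, 2, true));  // Optional[003]
    }
}
```

In this sketch, an empty result stands for the situation in which the next incremental clean has no starting point and falls back to a full clean. Returning the pending commit avoids that, and the normal path is unchanged whenever completed commits still precede the pending one.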
JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-6574
- Type: Bug