Fix cleaner based on hours for earliest commit to retain #15925

@hudi-bot

Description

When the cleaner policy is based on hours (KEEP_LATEST_BY_HOURS), the earliest commit to retain is estimated using the system's current time zone rather than UTC or the time zone used to generate the commit times. As a result, the cutoff can be miscalculated, which can lead to deleting additional file slices.

 

Ref: [https://github.com/apache/hudi/blob/c6760772f8dc62eb44c45b022ed07858d895d804/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L511]

 
{code:java}
else if (config.getCleanerPolicy() == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
  Instant instant = Instant.now();
  ZonedDateTime currentDateTime = ZonedDateTime.ofInstant(instant, ZoneId.systemDefault());
  String earliestTimeToRetain = HoodieActiveTimeline.formatDate(Date.from(currentDateTime.minusHours(hoursRetained).toInstant()));
  earliestCommitToRetain = Option.fromJavaOptional(commitTimeline.getInstantsAsStream().filter(i -> HoodieTimeline.compareTimestamps(i.getTimestamp(),
      HoodieTimeline.GREATER_THAN_OR_EQUALS, earliestTimeToRetain)).findFirst());
}
{code}
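
To make the concern concrete, here is a small standalone illustration of the kind of skew described above. It does not call Hudi: the class `CleanerCutoffSkew`, the method `cutoff`, the fixed instant, and the `Asia/Kolkata` zone are all hypothetical, and the `yyyyMMddHHmmssSSS` pattern is only assumed to resemble the commit time format. It shows that the same instant yields different cutoff strings depending on the zone used to format it.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

class CleanerCutoffSkew {
    // Assumed timestamp pattern for illustration only.
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS");

    // Formats (now - hoursRetained) as a timestamp string in the given zone.
    static String cutoff(Instant now, int hoursRetained, ZoneId zone) {
        return ZonedDateTime.ofInstant(now, zone).minusHours(hoursRetained).format(FMT);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-01-15T12:00:00Z");
        // Cutoff formatted in UTC: 20240114120000000
        System.out.println(cutoff(now, 24, ZoneId.of("UTC")));
        // Cutoff formatted in a UTC+05:30 zone: 20240114173000000
        System.out.println(cutoff(now, 24, ZoneId.of("Asia/Kolkata")));
    }
}
```

If commit times were written in UTC but the cutoff string were formatted in the UTC+05:30 zone, a commit stamped `20240114130000000` would compare as older than the cutoff even though it is only 23 hours old, so it would fall outside the retained window.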
 

 

Potential fixes:

  • Compute the cutoff time using the time zone set in the table config.

  • Fetch the latest completed commit and derive the earliest commit to retain from its timestamp.
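
The second option could be sketched roughly as below. This is a minimal sketch under stated assumptions, not the Hudi implementation: `EarliestCommitSketch` and `earliestCommitToRetain` are hypothetical names, commit times are assumed to be `yyyyMMddHHmmssSSS` strings, and the input list is assumed to hold completed commits sorted in ascending order. Because the cutoff is derived from a commit timestamp rather than the wall clock, no time zone conversion is involved.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.List;
import java.util.Optional;

class EarliestCommitSketch {
    // Built with appendValue for the millisecond field so that parsing also works on Java 8.
    static final DateTimeFormatter FMT = new DateTimeFormatterBuilder()
            .appendPattern("yyyyMMddHHmmss")
            .appendValue(ChronoField.MILLI_OF_SECOND, 3)
            .toFormatter();

    // completedCommits: assumed sorted ascending, all in the same (unspecified) zone.
    static Optional<String> earliestCommitToRetain(List<String> completedCommits, int hoursRetained) {
        if (completedCommits.isEmpty()) {
            return Optional.empty();
        }
        String latest = completedCommits.get(completedCommits.size() - 1);
        // Subtract the retention window from the latest commit time itself.
        String cutoff = LocalDateTime.parse(latest, FMT).minusHours(hoursRetained).format(FMT);
        // Fixed-width timestamps compare correctly as plain strings.
        return completedCommits.stream().filter(t -> t.compareTo(cutoff) >= 0).findFirst();
    }
}
```

One trade-off of this approach is that the retention window is measured back from the last completed commit rather than from the current time, so on a table with no recent writes it retains more than a wall-clock cutoff would.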
