Request for decoupling archival from savepoints (infinite time travel) #18686

@bhavya-ganatra

Task Description

What needs to be done:
This issue is based on a discussion in the Hudi Slack channel regarding performance degradation with a large active timeline: https://apache-hudi.slack.com/archives/C4D716NPQ/p1774948526496729

Propose and/or implement a solution to decouple savepoints from timeline archival (e.g., enable archival without losing restore capability).

Additionally, update documentation to clearly state the impact of hoodie.archive.beyond.savepoint on savepoint restore behaviour.

Why this task is needed:

We are running a streaming pipeline writing to multiple Hudi MOR tables with:

  • Async compaction and cleaner
  • Commit frequency: every 5 minutes
  • Savepoints retained for 7 days (1 per 24 hours)

Savepoints are required for our backup/recovery strategy and cannot be reduced. However, savepoints block archival of commits in the timeline, leading to continuous timeline growth and noticeable performance degradation in both reads and writes.
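To put a rough number on the growth described above, here is a back-of-the-envelope sketch. It assumes, as this issue describes, that archival cannot move past the oldest retained savepoint, and it counts only write commits (compaction and clean instants would add more). The function name is made up for illustration.

```python
from datetime import timedelta


def commits_pinned_by_savepoints(commit_interval: timedelta,
                                 savepoint_retention: timedelta) -> int:
    """Approximate number of write commits that stay in the active timeline
    when archival is blocked at the oldest retained savepoint."""
    # timedelta // timedelta yields an integer count of whole intervals.
    return savepoint_retention // commit_interval


# Values taken from the pipeline described in this issue.
pinned = commits_pinned_by_savepoints(
    commit_interval=timedelta(minutes=5),
    savepoint_retention=timedelta(days=7),
)
print(pinned)  # 2016 (288 commits per day over 7 days)
```

So each table carries on the order of two thousand write instants in its active timeline at steady state, before counting compaction and cleaner instants.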

Currently, the config hoodie.archive.beyond.savepoint allows archival beyond savepoints, but at the cost of losing savepoint restore capability (i.e., savepoints become non-recoverable); see #6239.
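For context, this is roughly how the flag would be set today, as a minimal sketch. The table name and the helper `restore_at_risk` are hypothetical and only illustrate the trade-off; the restore behaviour in the comments is as described in this issue, not independently verified.

```python
# Writer options for one of the MOR tables (table name is a placeholder).
hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # Lets archival proceed past savepointed instants. Per this issue,
    # savepoints that get archived over can no longer be restored.
    "hoodie.archive.beyond.savepoint": "true",
}


def restore_at_risk(options: dict) -> bool:
    """Hypothetical guard: True when the options enable archival beyond
    savepoints, i.e. the setting that trades restore capability for a
    shorter active timeline."""
    value = str(options.get("hoodie.archive.beyond.savepoint", "false"))
    return value.strip().lower() == "true"


print(restore_at_risk(hudi_options))
```

With PySpark, a dictionary like this would typically be passed as `df.write.format("hudi").options(**hudi_options)`. A check like `restore_at_risk` could run in a deployment pipeline to flag tables whose backup strategy depends on savepoints.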

To resolve this, savepoints need to be decoupled from the timeline archival process, so that restore capability is preserved without significant performance degradation.

For reference, this request was previously tracked in Jira: https://issues.apache.org/jira/browse/HUDI-4500. Since Hudi has moved to GitHub Issues, I am re-filing it here.

Task Type

Performance optimization

Related Issues

Parent feature issue: https://issues.apache.org/jira/browse/HUDI-4500
Related issues: https://issues.apache.org/jira/browse/HUDI-4501

Labels: type:devtask (Development tasks and maintenance work)