feat: Adding support to block archival on last known ECTR for v6 tables #18380

Open
nsivabalan wants to merge 1 commit into apache:master from nsivabalan:archivalBlockECTRV6
@nsivabalan
Contributor

Describe the issue this Pull Request addresses

This PR adds support to block archival based on the Earliest Commit To Retain (ECTR) from the last completed clean operation, preventing potential data leaks when cleaning configurations change between clean and archival runs.

Problem: Currently, archival recomputes ECTR independently based on cleaning configs at archival time, rather than reading it from the last clean plan. When cleaning configs change between clean and archival operations, archival may archive commits whose data files haven't been cleaned yet, leading to timeline metadata loss for existing data files.

Example scenario:

  1. Clean runs with retainCommits=5, computes ECTR=commit_100, cleans files older than commit_100
  2. Config changes to retainCommits=2 before next clean
  3. Archival runs with new config, recomputes ECTR=commit_103
  4. Archival archives commits 100-102, but their data files still exist (weren't cleaned yet)
  5. Result: Timeline metadata is lost for existing data files → data leak
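
The scenario above can be replayed with a small, self-contained sketch. It is not Hudi code: the class `EctrScenario`, the helper `archivable`, and the use of plain integers for commit times are all assumptions made for illustration, and real archival also applies min/max commit bounds that are ignored here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class EctrScenario {

    // Hypothetical helper: commits strictly older than the ECTR are eligible for archival.
    public static List<Integer> archivable(List<Integer> commits, int ectr) {
        return commits.stream()
            .filter(commit -> commit < ectr)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Illustrative active timeline; integers stand in for commit timestamps.
        List<Integer> active = Arrays.asList(100, 101, 102, 103, 104, 105);

        int ectrFromLastClean = 100; // computed by the clean that ran with retainCommits=5
        int ectrRecomputed = 103;    // recomputed by archival after the change to retainCommits=2

        // Recomputing ECTR lets archival take commits whose files were never cleaned.
        System.out.println("recomputed ECTR archives: " + archivable(active, ectrRecomputed));

        // Reading ECTR from the last clean keeps those commits in the active timeline.
        System.out.println("last-clean ECTR archives: " + archivable(active, ectrFromLastClean));
    }
}
```

With the recomputed ECTR, commits 100 to 102 are archived while their data files still exist; with the ECTR from the last clean, none of them are.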

Summary and Changelog

User-facing summary: Users can now optionally enable archival blocking based on ECTR from the last clean to prevent archiving commits whose data files haven't been cleaned. This is useful when cleaning configurations may change over time or when strict data retention guarantees are needed.

Detailed changelog:

Configuration Changes:

  • Added new advanced config hoodie.archive.block.on.latest.clean.ectr (default: false)
    • When enabled, archival reads ECTR from last completed clean metadata
    • Blocks archival of commits with timestamp >= ECTR
    • Marked as advanced config for power users
    • Available since version 1.2.0
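
As a rough illustration of opting in, the sketch below sets the new key on a plain `java.util.Properties` object. Only the key name and its `false` default come from this PR; the class `ArchivalEctrConfigExample` and its helper methods are hypothetical, and how the properties reach a Hudi writer depends on the engine in use.

```java
import java.util.Properties;

public class ArchivalEctrConfigExample {

    // Config key introduced by this PR.
    public static final String BLOCK_ON_ECTR_KEY = "hoodie.archive.block.on.latest.clean.ectr";

    // Hypothetical helper: build writer properties with the flag set explicitly.
    public static Properties writerProps(boolean blockOnEctr) {
        Properties props = new Properties();
        props.setProperty(BLOCK_ON_ECTR_KEY, Boolean.toString(blockOnEctr));
        return props;
    }

    // Hypothetical helper: read the flag back, falling back to the documented default (false).
    public static boolean isBlockingEnabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty(BLOCK_ON_ECTR_KEY, "false"));
    }

    public static void main(String[] args) {
        System.out.println("explicitly enabled: " + isBlockingEnabled(writerProps(true)));
        System.out.println("key not set:        " + isBlockingEnabled(new Properties()));
    }
}
```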

Implementation Changes:

  • TimelineArchiverV1.java: Added ECTR blocking logic in getCommitInstantsToArchive() method
    • Reads ECTR from last completed clean's metadata (lines 274-294)
    • Filters commit timeline to exclude commits >= ECTR (lines 322-326)
    • Follows same pattern as existing compaction/clustering retention checks
    • Includes error handling with graceful degradation (logs warning if metadata read fails)
  • HoodieArchivalConfig.java: Added config property BLOCK_ARCHIVAL_ON_LATEST_CLEAN_ECTR
    • Builder method: withBlockArchivalOnCleanECTR(boolean)
  • HoodieWriteConfig.java: Added access method shouldBlockArchivalOnCleanECTR()
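
The filtering and fallback behaviour described for TimelineArchiverV1.java can be sketched roughly as follows. This is not the code in the PR: `EctrArchivalFilter`, `readEctrSafely`, and `filterArchivable` are hypothetical names, the clean metadata read is replaced by a `Supplier`, and timestamps are assumed to be fixed-width strings that compare lexicographically.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;
import java.util.stream.Collectors;

public class EctrArchivalFilter {

    // Hypothetical stand-in for reading ECTR from the last completed clean's metadata.
    // A null or empty value, or a failed read, degrades to "no ECTR known".
    public static Optional<String> readEctrSafely(Supplier<String> cleanMetadataReader) {
        try {
            String ectr = cleanMetadataReader.get();
            return (ectr == null || ectr.isEmpty()) ? Optional.empty() : Optional.of(ectr);
        } catch (RuntimeException e) {
            // Log a warning and continue instead of failing the archival run.
            System.err.println("WARN: could not read ECTR from clean metadata: " + e.getMessage());
            return Optional.empty();
        }
    }

    // Keeps only candidates older than ECTR; commits with timestamp >= ECTR stay active.
    public static List<String> filterArchivable(List<String> candidates,
                                                Optional<String> ectr,
                                                boolean blockOnEctr) {
        if (!blockOnEctr || !ectr.isPresent()) {
            return candidates; // feature disabled or no usable ECTR: behave as before
        }
        String bound = ectr.get();
        return candidates.stream()
            .filter(ts -> ts.compareTo(bound) < 0)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("100", "101", "102", "103");
        System.out.println(filterArchivable(candidates, readEctrSafely(() -> "102"), true));
        System.out.println(filterArchivable(candidates, readEctrSafely(() -> null), true));
        System.out.println(filterArchivable(candidates,
            readEctrSafely(() -> { throw new IllegalStateException("metadata unreadable"); }), true));
    }
}
```

Treating a missing, empty, or unreadable ECTR as "no bound" mirrors the test names listed below (missing metadata, empty ECTR, and a FILE_VERSIONS clean with a null ECTR all let archival proceed).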

Test Changes:

  • Added 7 comprehensive tests in TestHoodieTimelineArchiver.java (633 lines):
    a. testArchivalBlocksOnCleanECTRWhenEnabled - Core blocking functionality
    b. testArchivalProceedsNormallyWhenECTRBlockingDisabled - Backward compatibility
    c. testArchivalMakesProgressWhenECTRIsLaterThanArchivalWindow - Progress validation
    d. testArchivalContinuesWhenCleanMetadataIsMissing - Missing metadata handling
    e. testArchivalHandlesEmptyECTRInCleanMetadata - Empty ECTR handling
    f. testArchivalProceedsWhenCleanHasFileVersionsPolicyWithNullECTR - FILE_VERSIONS policy compatibility
    g. testArchivalBlocksOnCleanECTRWithTimelineArchiverV2AndVersion9 - Version 9 / LSM timeline compatibility

Impact

Public API Changes:

  • New config property: hoodie.archive.block.on.latest.clean.ectr (opt-in, default: false)
  • New builder method: HoodieArchivalConfig.Builder.withBlockArchivalOnCleanECTR(boolean)
  • New accessor: HoodieWriteConfig.shouldBlockArchivalOnCleanECTR()

User-facing changes:

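  • When enabled, archival may retain more commits in the active timeline if their data files haven't been cleaned yet
  • Archival advances only as far as the ECTR from the last clean operation, so the active timeline can grow until a later clean moves ECTR forward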
  • No behavior change when config is disabled (default)

Performance impact:

  • Minimal: One additional metadata read per archival operation when enabled
  • Read operation is fast (single clean metadata file)
  • No impact when config is disabled (default)

Breaking changes: None - opt-in feature with no default behavior changes

Risk Level

low

Documentation Update

Config documentation:
The new config hoodie.archive.block.on.latest.clean.ectr is documented inline:
.withDocumentation("If enabled, archival will block on latest ECTR from last known clean")

Website documentation needed:

  • Add entry to config reference page for HoodieArchivalConfig.BLOCK_ARCHIVAL_ON_LATEST_CLEAN_ECTR
  • Update archival section in Hudi docs to explain ECTR blocking feature
  • Add usage example showing when to enable this config
  • Document interaction with different cleaning policies (KEEP_LATEST_COMMITS vs KEEP_LATEST_FILE_VERSIONS)

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions bot added the "size:L" label (PR with lines of changes in (300, 1000]) on Mar 25, 2026
@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@codecov-commenter

Codecov Report

❌ Patch coverage is 89.65517% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.36%. Comparing base (2f07364) to head (cbd2f32).
⚠️ Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ent/timeline/versioning/v1/TimelineArchiverV1.java | 85.00% | 2 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18380   +/-   ##
=========================================
  Coverage     68.36%   68.36%           
+ Complexity    27566    27554   -12     
=========================================
  Files          2432     2432           
  Lines        133175   133204   +29     
  Branches      16023    16029    +6     
=========================================
+ Hits          91047    91068   +21     
- Misses        35068    35074    +6     
- Partials       7060     7062    +2     
| Flag | Coverage Δ |
|---|---|
| common-and-other-modules | 44.32% <41.37%> (+<0.01%) ⬆️ |
| hadoop-mr-java-client | 45.14% <20.68%> (-0.01%) ⬇️ |
| spark-client-hadoop-common | 48.45% <89.65%> (-0.12%) ⬇️ |
| spark-java-tests | 48.70% <41.37%> (-0.01%) ⬇️ |
| spark-scala-tests | 45.39% <20.68%> (+<0.01%) ⬆️ |
| utilities | 38.52% <41.37%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| ...a/org/apache/hudi/config/HoodieArchivalConfig.java | 89.88% <100.00%> (+0.99%) ⬆️ |
| ...java/org/apache/hudi/config/HoodieWriteConfig.java | 89.85% <100.00%> (+<0.01%) ⬆️ |
| ...ent/timeline/versioning/v1/TimelineArchiverV1.java | 80.25% <85.00%> (+0.44%) ⬆️ |

... and 14 files with indirect coverage changes

