
feat(rollback): add RollbackOrphanDetector utility for archive orphan guard (#18783) #18795

Open
shangxinli wants to merge 1 commit into apache:master from shangxinli:feat/rollback-orphan-detector

Conversation

@shangxinli
Contributor

@shangxinli shangxinli commented May 20, 2026

Describe the issue this Pull Request addresses

See issue #18783 for the full problem statement. In summary: when a rollback partially fails (crash mid-rollback, marker loss, or a blocked storage close() that lands data after rollback completed) and the rollback instant is later archived, the system loses the metadata anchor that lets readers filter out the orphan files. Readers then return corrupt-parquet errors or duplicate records — a hard violation of the reader/writer isolation guarantee.

This PR is the foundation for an archive-time precondition check that prevents that loss-of-anchor scenario. It introduces a RollbackOrphanDetector and a feature-flag config; the actual wiring into the archive planner is a follow-up cascade PR. No behavior change yet.

Related:

  • Cascade PR (wires the detector in): feat/rollback-orphan-archive-precondition on the fork; will be opened after this merges
  • Companion CLI PR: feat/rollback-orphan-repair-cli on the fork; will be opened after this merges

Summary and Changelog

A new RollbackOrphanDetector utility plus the config that will eventually gate archival of rollback instants when their orphan files are still on storage. No behavior change yet — the config defaults to OFF.

  1. New RollbackOrphanDetector in hudi-client/hudi-client-common under org.apache.hudi.table.action.rollback. Two detection modes:
    • LIGHT — reads HoodieRollbackMetadata.failedDeleteFiles. O(metadata size). Catches files the rollback explicitly tried and failed to delete but misses post-rollback late landings.
    • THOROUGH — additionally lists the partitions named in the rollback metadata and matches filenames against the rollback's target instant time(s) (both base parquet and MoR log file naming). Bounded by partition count in the rollback metadata, not whole-table size.
  2. Safety floor: every candidate is cross-checked against completed instants in the active timeline — a file whose embedded instant is a COMPLETED commit is never flagged as an orphan, even if its filename matches the regex.
  3. New config hoodie.archive.rollback.orphan.guard.mode (values OFF / LIGHT / THOROUGH, default OFF) and a getter on HoodieWriteConfig.
  4. Two overloads: (HoodieTable, HoodieInstant, Mode) for the archive planner context and (HoodieTableMetaClient, HoodieInstant, Mode) for the hudi-cli context used by the companion CLI PR.
  5. TestRollbackOrphanDetector with 5 tests covering OFF, LIGHT with empty/non-empty failedDeleteFiles, THOROUGH with a real partition listing, and the safety-floor case.
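
For illustration only, the two modes and the safety floor described above could be sketched roughly as follows. This is not the code in this PR: `OrphanGuardSketch`, `detect`, and `instantOf` are hypothetical names, the inputs are plain collections rather than Hudi timeline or metadata types, and only the base-file (parquet) naming is handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the detector; not the class added by this PR. */
final class OrphanGuardSketch {
  enum Mode { OFF, LIGHT, THOROUGH }

  // Base-file shape only: a trailing "_<instant>.parquet".
  private static final Pattern BASE_FILE = Pattern.compile("_(\\d+)\\.parquet$");

  /**
   * @param failedDeleteFiles files the rollback tried and failed to delete (LIGHT input)
   * @param listedFiles       file names found by listing the rolled-back partitions (THOROUGH input)
   * @param targetInstants    instant times the rollback targeted
   * @param completedInstants instants that are COMPLETED on the active timeline (safety floor)
   */
  static List<String> detect(Mode mode, List<String> failedDeleteFiles, List<String> listedFiles,
                             Set<String> targetInstants, Set<String> completedInstants) {
    List<String> candidates = new ArrayList<>();
    if (mode == Mode.OFF) {
      return candidates; // short-circuit before any work happens
    }
    candidates.addAll(failedDeleteFiles);
    if (mode == Mode.THOROUGH) {
      for (String name : listedFiles) {
        String instant = instantOf(name);
        if (instant != null && targetInstants.contains(instant) && !candidates.contains(name)) {
          candidates.add(name);
        }
      }
    }
    // Safety floor: never flag a file whose embedded instant is a completed commit.
    candidates.removeIf(name -> {
      String instant = instantOf(name);
      return instant != null && completedInstants.contains(instant);
    });
    return candidates;
  }

  /** Returns the embedded instant of a base file name, or null if the name does not match. */
  static String instantOf(String fileName) {
    Matcher m = BASE_FILE.matcher(fileName);
    return m.find() ? m.group(1) : null;
  }
}
```

In this sketch LIGHT only ever returns entries from `failedDeleteFiles`, while THOROUGH also picks up listed files whose embedded instant is one of the rollback targets, which is how a late-landing file would surface.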

Impact

None until the new config is set to LIGHT or THOROUGH. The default OFF short-circuits before any work happens.

Risk Level

low

Documentation Update

A user-facing config doc update is appropriate once the cascade PR (which actually wires the detector in) lands, since that is the setting users would actually flip. This PR is plumbing only.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

…chive guard

Introduces RollbackOrphanDetector and a feature-flag config that will
later gate archival of rollback instants when their orphan files are
still on storage. No behavior change yet — this PR only lands the
building block and config; wiring into the archive planner follows in a
separate PR.

See apache#18783 for the motivating
problem: when a rollback partially fails (crash mid-rollback, marker
loss, or a blocked storage close() that lands data after rollback
completed) and the rollback instant is later archived, the system loses
the metadata anchor that lets readers filter out the orphan files,
leading to corrupt-parquet errors or duplicate records.

Two detection modes:

  - LIGHT    : reads HoodieRollbackMetadata.failedDeleteFiles.
  - THOROUGH : additionally lists partitions named in the rollback
               metadata and matches filenames against the target
               instant time(s). Catches late-landing writes.

A safety floor cross-checks every candidate against the completed
timeline so a legitimate committed file with a matching filename is
never flagged.

Config: hoodie.archive.rollback.orphan.guard.mode = OFF | LIGHT | THOROUGH
Default: OFF (no behavior change).

Test: TestRollbackOrphanDetector covers OFF / LIGHT / THOROUGH and the
safety-floor case.
@shangxinli shangxinli changed the title from "[HUDI-XXXX] Add RollbackOrphanDetector utility for rollback orphan archive guard (#18783, foundation)" to "feat(rollback): add RollbackOrphanDetector utility for archive orphan guard (#18783)" May 20, 2026
@github-actions github-actions Bot added the size:L PR with lines of changes in (300, 1000] label May 20, 2026
@codecov-commenter

Codecov Report

❌ Patch coverage is 63.06306% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.25%. Comparing base (f044d3d) to head (8d8a87e).
⚠️ Report is 6 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../table/action/rollback/RollbackOrphanDetector.java | 62.74% | 22 Missing and 16 partials ⚠️ |
| ...a/org/apache/hudi/config/HoodieArchivalConfig.java | 75.00% | 2 Missing ⚠️ |
| ...java/org/apache/hudi/config/HoodieWriteConfig.java | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18795      +/-   ##
============================================
+ Coverage     68.22%   68.25%   +0.02%     
- Complexity    29290    29357      +67     
============================================
  Files          2525     2528       +3     
  Lines        141733   141969     +236     
  Branches      17614    17650      +36     
============================================
+ Hits          96698    96899     +201     
- Misses        37065    37086      +21     
- Partials       7970     7984      +14     
| Flag | Coverage Δ |
|---|---|
| common-and-other-modules | 44.43% <63.06%> (+0.05%) ⬆️ |
| hadoop-mr-java-client | 44.86% <5.40%> (-0.10%) ⬇️ |
| spark-client-hadoop-common | 48.17% <5.40%> (-0.10%) ⬇️ |
| spark-java-tests | 48.81% <5.40%> (-0.04%) ⬇️ |
| spark-scala-tests | 44.89% <5.40%> (-0.06%) ⬇️ |
| utilities | 37.41% <5.40%> (-0.06%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| ...java/org/apache/hudi/config/HoodieWriteConfig.java | 89.91% <0.00%> (-0.07%) ⬇️ |
| ...a/org/apache/hudi/config/HoodieArchivalConfig.java | 88.65% <75.00%> (-1.23%) ⬇️ |
| .../table/action/rollback/RollbackOrphanDetector.java | 62.74% <62.74%> (ø) |

... and 30 files with indirect coverage changes


@hudi-bot
Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR introduces a RollbackOrphanDetector utility that's off by default and scoped as the foundation for a follow-up archive-precondition PR; that staging keeps the blast radius low. The main concern from this pass is the log-file regex in THOROUGH mode (and the matching INSTANT_REGEX_LOG constant): it appears to inspect the writeToken position rather than the embedded instant, so MoR log-file detection likely doesn't behave as the Javadoc suggests. A couple of smaller items (a Javadoc/implementation mismatch on the hasOrphans short-circuit, and a locale concern in Mode.parse) are noted inline, along with a few naming and documentation-accuracy nits in RollbackOrphanDetector.java. Please take a look at the inline comments; after that, this should be ready for a Hudi committer or PMC member to take it from here.

private static final Logger LOG = LoggerFactory.getLogger(RollbackOrphanDetector.class);

/** Matches Hudi base/log filenames carrying an embedded instant time at the trailing position. */
private static final String INSTANT_REGEX_BASE = "_(\\d+)\\.parquet$";
Contributor

🤖 I think this log regex (and the matching one in buildPatterns below) inspects the wrong position. Hudi log files are named .<fileId>_<deltaCommitTime>.log.<version>_<writeToken> per FSUtils.LOG_FILE_PATTERN, with writeToken formatted as <digits>-<digits>-<digits> (e.g. 0-1-0). The current pattern \.log\.\d+_(\d+)(?:_[^/]*)?$ matches digits AFTER .log.<version>_, i.e. the writeToken position — for a typical name like .fileId_20260101000000000.log.1_0-1-0 it wouldn't match at all (trailing -1-0 blocks $), and for the rare _0 writeToken it would capture 0, not the instant. Either way, late-landing log files won't be detected in THOROUGH mode (the test only exercises parquet via createBaseFileToRollback, so this isn't caught). Did you mean something like _(\d+)\.log\.\d+(?:_[^/]*)?$ here, with the same fix to the buildPatterns variant? @yihua could you sanity-check the intended log-file matching here?

- AI-generated; verify before applying. React 👍/👎 to flag quality.
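
To make the comment above easy to check, here is a small standalone sketch that runs both patterns against the two example file names from the comment. The class and method names (`LogRegexCheck`, `capture`) are hypothetical, and the patterns are copied from the comment text, not from the PR diff.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical harness; the patterns are the ones quoted in the review comment. */
final class LogRegexCheck {
  // Pattern described as currently in the PR: digits captured AFTER ".log.<version>_".
  static final Pattern CURRENT = Pattern.compile("\\.log\\.\\d+_(\\d+)(?:_[^/]*)?$");
  // Alternative suggested in the comment: digits captured BEFORE ".log.".
  static final Pattern PROPOSED = Pattern.compile("_(\\d+)\\.log\\.\\d+(?:_[^/]*)?$");

  /** Returns capture group 1, or null when the pattern does not match. */
  static String capture(Pattern pattern, String fileName) {
    Matcher m = pattern.matcher(fileName);
    return m.find() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    String typical = ".fileId_20260101000000000.log.1_0-1-0"; // writeToken 0-1-0
    String rare = ".fileId_20260101000000000.log.1_0";        // writeToken 0

    System.out.println(capture(CURRENT, typical));  // null: "-1-0" blocks the end anchor
    System.out.println(capture(CURRENT, rare));     // 0: the writeToken, not the instant
    System.out.println(capture(PROPOSED, typical)); // 20260101000000000
    System.out.println(capture(PROPOSED, rare));    // 20260101000000000
  }
}
```

Whether the proposed pattern is the right fix still depends on the log-file naming the detector needs to cover, so treat this as a check of the two regexes only.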

}

/**
* Returns true iff any orphan files are detected for {@code rollbackInstant}.
Contributor

🤖 The Javadoc says hasOrphans is 'Optimised to short-circuit as soon as the first orphan is found in THOROUGH', but the implementation below just calls detectOrphans(...).isEmpty() — which still lists every partition in the rollback metadata. Either the Javadoc should be relaxed, or hasOrphans should actually break out of the partition loop on first hit. For an archive-precondition call site, the difference can matter on rollbacks touching many partitions.

- AI-generated; verify before applying. React 👍/👎 to flag quality.
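
As a rough sketch of the second option (actually breaking out on the first hit), the loop could look something like the following. `ShortCircuitSketch` and the `orphansIn` callback are hypothetical stand-ins for the real per-partition listing, which this sketch does not model.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of a short-circuiting hasOrphans; not the PR's implementation. */
final class ShortCircuitSketch {
  /**
   * @param partitions partitions named in the rollback metadata
   * @param orphansIn  assumed callback that lists one partition and returns its orphan files
   */
  static boolean hasOrphans(List<String> partitions, Function<String, List<String>> orphansIn) {
    for (String partition : partitions) {
      if (!orphansIn.apply(partition).isEmpty()) {
        return true; // stop here: the remaining partitions are never listed
      }
    }
    return false;
  }
}
```

With this shape, a rollback touching many partitions costs one listing in the best case instead of one listing per partition; the worst case (no orphans) is unchanged.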

if (value == null || value.isEmpty()) {
  return OFF;
}
try {
Contributor

🤖 Minor: toUpperCase() without a Locale uses the default locale. In a Turkish-locale JVM, 'light'.toUpperCase() produces LİGHT (dotted I), and Mode.valueOf would then fall back to OFF. Could you use toUpperCase(Locale.ROOT) here for safety?

- AI-generated; verify before applying. React 👍/👎 to flag quality.
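
A minimal sketch of the locale-safe variant, assuming the parse method falls back to OFF on unknown values as the comment describes. `ModeParseSketch` is a hypothetical name and the enum is redeclared here only to keep the example self-contained.

```java
import java.util.Locale;

/** Hypothetical sketch of a locale-independent Mode.parse; not the PR's code. */
final class ModeParseSketch {
  enum Mode { OFF, LIGHT, THOROUGH }

  static Mode parse(String value) {
    if (value == null || value.isEmpty()) {
      return Mode.OFF;
    }
    try {
      // Locale.ROOT keeps the mapping i -> I regardless of the JVM default locale.
      return Mode.valueOf(value.toUpperCase(Locale.ROOT));
    } catch (IllegalArgumentException e) {
      return Mode.OFF; // assumed fallback for unrecognised values
    }
  }

  public static void main(String[] args) {
    // Under Turkish casing rules the lowercase i maps to a dotted capital I (U+0130).
    System.out.println("light".toUpperCase(Locale.forLanguageTag("tr"))); // LİGHT
    System.out.println(parse("light"));                                   // LIGHT
  }
}
```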

* <p>A safety floor cross-checks every candidate against the completed
* timeline: a file whose embedded instant appears as a {@code COMPLETED}
* commit in either the active or archived timeline is never flagged as an
* orphan, even if its filename matches the regex.
Contributor

🤖 nit: the class-level Javadoc says the safety floor checks "either the active or archived timeline", but the implementation in filterAgainstCompletedTimeline only examines the active timeline (and the inline comment there even explains why). Could you update this to say "the active timeline" so the documented guarantee matches what's actually implemented?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

for (HoodieRollbackPartitionMetadata pm : metadata.getPartitionMetadata().values()) {
  List<String> ff = pm.getFailedDeleteFiles();
  if (ff != null) {
    failed.addAll(ff);
Contributor

🤖 nit: ff is pretty terse here — could you rename it to failedFiles or failedDeleteFiles to make it easier to follow at a glance?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

String name = path.substring(path.lastIndexOf('/') + 1);
java.util.regex.Matcher m = Pattern.compile(INSTANT_REGEX_BASE).matcher(name);
if (m.find()) {
  return m.group(1);
Contributor

🤖 nit: java.util.regex.Matcher is used with a fully-qualified name here even though Pattern is already imported — worth adding import java.util.regex.Matcher;. Also, since INSTANT_REGEX_BASE and INSTANT_REGEX_LOG are constants, promoting them to pre-compiled static final Pattern fields would avoid recompiling the same patterns on every path checked.

- AI-generated; verify before applying. React 👍/👎 to flag quality.
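
One possible shape for that cleanup, sketched with the base-file pattern only. `InstantExtractorSketch`, `INSTANT_PATTERN_BASE`, and `extractInstant` are hypothetical names, and returning null on a non-match is an assumption since the surrounding method is not shown in the snippet.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: the pattern is compiled once instead of on every path checked. */
final class InstantExtractorSketch {
  private static final Pattern INSTANT_PATTERN_BASE = Pattern.compile("_(\\d+)\\.parquet$");

  /** Returns the instant embedded in a base-file path, or null if none is found (assumed). */
  static String extractInstant(String path) {
    String name = path.substring(path.lastIndexOf('/') + 1);
    Matcher m = INSTANT_PATTERN_BASE.matcher(name);
    return m.find() ? m.group(1) : null;
  }
}
```

`Pattern` instances are immutable and safe to share across threads, so a static field is fine here; only the `Matcher` must be created per call.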

@nsivabalan
Contributor

Responded in the GitHub issue.

@danny0405 and @cshuo tried simulating this and ran Flink jobs for hours, and we could not reproduce the issue. So, with OSS Hudi and OSS GCS, the issue is not a valid one. Likely you have a custom Hudi or custom GCS deployment internally in your org that could cause the lingering-files issue.

If you suspect it's a valid OSS bug, can you help us with a reproducible case so we can take it from there?
As of now, we don't think this warrants a tool.


Labels

size:L PR with lines of changes in (300, 1000]


5 participants