Skip to content

fix: do not fall back to timeline server markers on transient HDFS failures#18887

Open
vamshipasunuru1 wants to merge 2 commits into
apache:masterfrom
vamshipasunuru1:fix/marker-rollback-transient-io-failure
Open

fix: do not fall back to timeline server markers on transient HDFS failures#18887
vamshipasunuru1 wants to merge 2 commits into
apache:masterfrom
vamshipasunuru1:fix/marker-rollback-transient-io-failure

Conversation

@vamshipasunuru1
Copy link
Copy Markdown

Change Description

When MARKERS.type is absent, MarkerBasedRollbackUtils.getAllMarkerPaths() first tries DIRECT markers and catches IOException | IllegalArgumentException to fall back to TIMELINE_SERVER_BASED. This catch is too broad.

The problem: A transient HDFS error (e.g., "Server too busy" / RetriableException) is also an IOException. When it's caught, the code falls back to the timeline server marker path, which looks in a different location and finds 0 markers — causing the rollback to skip deleting data files and leaving orphan files on the table.

Fix

Split the exception handling:

  • IOException → propagate, let rollback fail and retry (transient HDFS failures should not silently produce an incorrect rollback)
  • IllegalArgumentException → keep the fallback (this indicates a marker path format mismatch, the existing intended behavior)

Testing

Added TestMarkerBasedRollbackUtils with unit tests covering:

  • IOException is propagated (no fallback)
  • IllegalArgumentException still falls back to timeline server markers

Risk

Low — the change only affects the MARKERS.type-absent code path, and only for transient IO failures. The IllegalArgumentException fallback behavior is preserved.

…ilures

When MARKERS.type is absent, MarkerBasedRollbackUtils tries DIRECT markers
and catches IOException to fall back to TIMELINE_SERVER_BASED. This was too
broad: a transient "Server too busy" RetriableException is also an IOException,
causing rollback to use the timeline server marker path which finds 0 markers
and deletes nothing, leaving orphan data files behind.

Only catch IllegalArgumentException (marker path format mismatch) for the
fallback. Let IOException propagate so the rollback fails and retries rather
than silently producing an incorrect result.
@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label May 31, 2026
@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The intent — propagating transient HDFS IOExceptions instead of silently falling back to timeline-server markers (and leaving orphan data files) — is a real correctness improvement. A couple of issues worth a closer look in the inline comments: the new IllegalArgumentException branch references an undefined fileSystem variable (the surrounding code uses storage), and the new test mocks metaClient.getFs() even though the production code goes through table.getStorage()/readMarkerType(storage, …), so the injected IOException likely never reaches the catch block under test. Please take a look at the inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. One logging style nit below — both new catch blocks switched from SLF4J parameterized logging to String.format, losing the exception stack trace in the process.

// IllegalArgumentException indicates a marker path format mismatch — fall back to timeline server.
log.warn(String.format("%s not present and %s marker failed with error: %s. So, falling back to %s marker",
MARKER_TYPE_FILENAME, DIRECT, e.getMessage(), TIMELINE_SERVER_BASED));
return getTimelineServerBasedMarkers(context, parallelism, markerDir, fileSystem);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Is fileSystem defined anywhere in scope? The surrounding method uses storage (a HoodieStorage from table.getStorage()), and the line you removed passed storage to getTimelineServerBasedMarkers. As written this looks like it won't compile — did you mean storage here?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

when(mockMetaClient.getMarkerFolderPath(instant)).thenReturn(markerDir);
when(mockContext.getHadoopConf()).thenReturn(new SerializableConfiguration(new Configuration()));

// MARKERS.type is absent — simulates a write that was interrupted before writing the type file
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 The test stubs metaClient.getFs() and HoodieWrapperFileSystem.listStatus(...), but getAllMarkerPaths reaches the catch block via table.getStorage()readMarkerType(storage, markerDir)WriteMarkers.allMarkerFilePaths() (which goes through HoodieStorage, not the Hadoop FS). Without stubbing table.getStorage(), storage is null and the call should NPE before ever reaching the IOException path you're trying to exercise — so the assertion would be matching the wrong exception type. Could you verify this test actually fails before the fix and passes after?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

// Do NOT fall back to TIMELINE_SERVER_BASED on transient IO failures (e.g., HDFS throttling).
// The timeline server looks in a different location and would return 0 markers, causing the
// rollback to skip deleting data files and leaving orphan files on the table.
log.warn(String.format("%s not present and %s marker listing failed with IO error: %s. "
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: both new catch blocks switched from SLF4J parameterized logging to String.format, which eagerly formats the string and drops the stack trace. Could you use the parameterized form and pass e as the last argument? e.g. log.warn("{} not present and {} marker listing failed with IO error, propagating rather than falling back to {}", MARKER_TYPE_FILENAME, DIRECT, TIMELINE_SERVER_BASED, e); — same applies to the IllegalArgumentException block at line 83.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:S PR with lines of changes in (10, 100]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants