fix: do not fall back to timeline server markers on transient HDFS failures#18887
fix: do not fall back to timeline server markers on transient HDFS failures#18887vamshipasunuru1 wants to merge 2 commits into
Conversation
…ilures When MARKERS.type is absent, MarkerBasedRollbackUtils tries DIRECT markers and catches IOException to fall back to TIMELINE_SERVER_BASED. This was too broad: a transient "Server too busy" RetriableException is also an IOException, causing rollback to use the timeline server marker path which finds 0 markers and deletes nothing, leaving orphan data files behind. Only catch IllegalArgumentException (marker path format mismatch) for the fallback. Let IOException propagate so the rollback fails and retries rather than silently producing an incorrect result.
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for working on this! The intent — propagating transient HDFS IOExceptions instead of silently falling back to timeline-server markers (and leaving orphan data files) — is a real correctness improvement. A couple of issues worth a closer look in the inline comments: the new IllegalArgumentException branch references an undefined fileSystem variable (the surrounding code uses storage), and the new test mocks metaClient.getFs() even though the production code goes through table.getStorage()/readMarkerType(storage, …), so the injected IOException likely never reaches the catch block under test. Please take a look at the inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. One logging style nit below — both new catch blocks switched from SLF4J parameterized logging to String.format, losing the exception stack trace in the process.
| // IllegalArgumentException indicates a marker path format mismatch — fall back to timeline server. | ||
| log.warn(String.format("%s not present and %s marker failed with error: %s. So, falling back to %s marker", | ||
| MARKER_TYPE_FILENAME, DIRECT, e.getMessage(), TIMELINE_SERVER_BASED)); | ||
| return getTimelineServerBasedMarkers(context, parallelism, markerDir, fileSystem); |
There was a problem hiding this comment.
🤖 Is fileSystem defined anywhere in scope? The surrounding method uses storage (a HoodieStorage from table.getStorage()), and the line you removed passed storage to getTimelineServerBasedMarkers. As written this looks like it won't compile — did you mean storage here?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
| when(mockMetaClient.getMarkerFolderPath(instant)).thenReturn(markerDir); | ||
| when(mockContext.getHadoopConf()).thenReturn(new SerializableConfiguration(new Configuration())); | ||
|
|
||
| // MARKERS.type is absent — simulates a write that was interrupted before writing the type file |
There was a problem hiding this comment.
🤖 The test stubs metaClient.getFs() and HoodieWrapperFileSystem.listStatus(...), but getAllMarkerPaths reaches the catch block via table.getStorage() → readMarkerType(storage, markerDir) → WriteMarkers.allMarkerFilePaths() (which goes through HoodieStorage, not the Hadoop FS). Without stubbing table.getStorage(), storage is null and the call should NPE before ever reaching the IOException path you're trying to exercise — so the assertion would be matching the wrong exception type. Could you verify this test actually fails before the fix and passes after?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
| // Do NOT fall back to TIMELINE_SERVER_BASED on transient IO failures (e.g., HDFS throttling). | ||
| // The timeline server looks in a different location and would return 0 markers, causing the | ||
| // rollback to skip deleting data files and leaving orphan files on the table. | ||
| log.warn(String.format("%s not present and %s marker listing failed with IO error: %s. " |
There was a problem hiding this comment.
🤖 nit: both new catch blocks switched from SLF4J parameterized logging to String.format, which eagerly formats the string and drops the stack trace. Could you use the parameterized form and pass e as the last argument? e.g. log.warn("{} not present and {} marker listing failed with IO error, propagating rather than falling back to {}", MARKER_TYPE_FILENAME, DIRECT, TIMELINE_SERVER_BASED, e); — same applies to the IllegalArgumentException block at line 83.
- AI-generated; verify before applying. React 👍/👎 to flag quality.
Change Description
When
MARKERS.typeis absent,MarkerBasedRollbackUtils.getAllMarkerPaths()first triesDIRECTmarkers and catchesIOException | IllegalArgumentExceptionto fall back toTIMELINE_SERVER_BASED. This catch is too broad.The problem: A transient HDFS error (e.g., "Server too busy" /
RetriableException) is also anIOException. When it's caught, the code falls back to the timeline server marker path, which looks in a different location and finds 0 markers — causing the rollback to skip deleting data files and leaving orphan files on the table.Fix
Split the exception handling:
IOException→ propagate, let rollback fail and retry (transient HDFS failures should not silently produce an incorrect rollback)IllegalArgumentException→ keep the fallback (this indicates a marker path format mismatch, the existing intended behavior)Testing
Added
TestMarkerBasedRollbackUtilswith unit tests covering:IOExceptionis propagated (no fallback)IllegalArgumentExceptionstill falls back to timeline server markersRisk
Low — the change only affects the
MARKERS.type-absent code path, and only for transient IO failures. TheIllegalArgumentExceptionfallback behavior is preserved.