HBASE-29987 Replication position corruption when WAL file switch detected in ReplicationSourceWALReader run loop #7909
Merged
Apache9 merged 2 commits into apache:master on Mar 13, 2026
Conversation
added 2 commits
March 11, 2026 11:34
When ReplicationSourceWALReader.run() detects a WAL file switch via the switched() check, it enqueues an EOF batch but does not update currentPosition. If the outer loop restarts (e.g., due to WALEntryFilterRetryableException), the new WALEntryStream is created with the stale position from the old file, applied to the new file. This causes an infinite retry loop (EOFException: Cannot seek after EOF) and the corrupted position may be persisted to ZK, surviving restarts.

The fix resets currentPosition to entryStream.getPosition() (which returns 0 after dequeueCurrentLog()) before enqueuing the EOF batch.

Includes a regression test that reproduces the bug by using nb.capacity=1 to force EOF detection at line 153 (not inside readWALEntries), combined with a WALEntryFilterRetryableException on the first entry of the new file to trigger the outer loop restart.
Apache9
approved these changes
Mar 13, 2026
Apache9
pushed a commit
that referenced
this pull request
Mar 13, 2026
…cted in ReplicationSourceWALReader run loop (#7909) Co-authored-by: skhillon <skhillon@hubspot.com> Signed-off-by: Duo Zhang <zhangduo@apache.org> (cherry picked from commit e4f9c65)
Apache9
pushed a commit
that referenced
this pull request
Mar 13, 2026
…cted in ReplicationSourceWALReader run loop (#7909) Co-authored-by: skhillon <skhillon@hubspot.com> Signed-off-by: Duo Zhang <zhangduo@apache.org> (cherry picked from commit e4f9c65)
sidkhillon
added a commit
to HubSpot/hbase
that referenced
this pull request
Mar 13, 2026
… file switch detected in ReplicationSourceWALReader run loop (apache#7909) Co-authored-by: skhillon <skhillon@hubspot.com> Signed-off-by: Duo Zhang <zhangduo@apache.org> (cherry picked from commit e4f9c65)
Apache9
pushed a commit
that referenced
this pull request
Mar 14, 2026
…cted in ReplicationSourceWALReader run loop (#7909) Co-authored-by: skhillon <skhillon@hubspot.com> Signed-off-by: Duo Zhang <zhangduo@apache.org> (cherry picked from commit e4f9c65)
Summary
When ReplicationSourceWALReader detected a WAL file switch via the switched() check, it did not reset currentPosition. If the outer loop later restarted and recreated the WALEntryStream, it would use the old file's position on the new file, causing EOFException: Cannot seek after EOF. The corrupted position also gets persisted to ZK, making it unrecoverable without manual intervention (recreating the replication peer).
The fix resets currentPosition to the stream's position (0) when a file switch is detected, so any subsequent stream creation opens the new file at the correct offset.
A regression test reproduces the bug, using a WALEntryFilterRetryableException to trigger the outer loop restart.
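The failure mode can be modeled without any HBase code. The sketch below is a toy simulation, not the actual patch: FakeWal, FakeStream, and switchThenRestart are hypothetical names invented for illustration, and the file sizes are arbitrary. It only assumes what the description states: a reader-side currentPosition that survives the file switch, and a stream whose position is 0 after the current log is dequeued.

```java
import java.io.EOFException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of the stale-position bug. None of these types are HBase APIs. */
final class StalePositionSketch {

  /** Hypothetical stand-in for a WAL file: a name and a length in bytes. */
  static final class FakeWal {
    final String name;
    final long length;

    FakeWal(String name, long length) {
      this.name = name;
      this.length = length;
    }
  }

  /** Hypothetical stand-in for WALEntryStream: opens the head of the queue at an offset. */
  static final class FakeStream {
    private final Deque<FakeWal> queue;
    private long position;

    FakeStream(Deque<FakeWal> queue, long startPosition) throws EOFException {
      FakeWal head = queue.peek();
      if (startPosition > head.length) {
        // Mirrors the symptom in the report: seeking past the end of the new file.
        throw new EOFException("Cannot seek after EOF (offset " + startPosition
            + ", " + head.name + " is " + head.length + " bytes)");
      }
      this.queue = queue;
      this.position = startPosition;
    }

    /** Drops the finished file; the next file is read from offset 0. */
    void dequeueCurrentLog() {
      queue.poll();
      position = 0;
    }

    long getPosition() {
      return position;
    }
  }

  /** Reads the old file to its end, switches files, then simulates an outer loop restart. */
  static long switchThenRestart(boolean applyFix) throws EOFException {
    Deque<FakeWal> queue = new ArrayDeque<>();
    queue.add(new FakeWal("wal.1", 4096));
    queue.add(new FakeWal("wal.2", 128));

    long currentPosition = 0;
    FakeStream stream = new FakeStream(queue, currentPosition);
    currentPosition = 4096;      // the reader consumed all of wal.1
    stream.dequeueCurrentLog();  // file switch detected
    if (applyFix) {
      currentPosition = stream.getPosition(); // reset to the new file's offset (0)
    }
    // Outer loop restart: a new stream is created from currentPosition.
    return new FakeStream(queue, currentPosition).getPosition();
  }

  /** Wraps the scenario so callers get a plain description instead of a checked exception. */
  static String describe(boolean applyFix) {
    try {
      return "reopened wal.2 at offset " + switchThenRestart(applyFix);
    } catch (EOFException e) {
      return "EOFException: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println("without fix: " + describe(false));
    System.out.println("with fix:    " + describe(true));
  }
}
```

Without the reset, the restart tries to open the 128-byte file at offset 4096 and fails every time, which corresponds to the retry loop in the report. With the reset, the new file is opened at offset 0.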