HBASE-29987 Replication position corruption when WAL file switch detected in ReplicationSourceWALReader run loop by sidkhillon · Pull Request #7909 · apache/hbase

sidkhillon · 2026-03-11T18:43:07Z

Summary

When ReplicationSourceWALReader detected a WAL file switch via the switched() check, it did not reset currentPosition. If the outer loop later restarted and recreated the WALEntryStream, the stream used the old file's position on the new file, causing EOFException: Cannot seek after EOF. The corrupted position could also be persisted to ZK, making it unrecoverable without manual intervention (recreating the replication peer).
Fix: reset currentPosition to the stream's position (0) when a file switch is detected, so that any subsequent stream creation opens the new file at the correct offset.
Add a regression test that reproduces the scenario, using a WALEntryFilterRetryableException to trigger the outer loop restart.

When ReplicationSourceWALReader.run() detects a WAL file switch via the switched() check, it enqueues an EOF batch but does not update currentPosition. If the outer loop restarts (e.g., due to WALEntryFilterRetryableException), the new WALEntryStream is created with the stale position from the old file, applied to the new file. This causes an infinite retry loop (EOFException: Cannot seek after EOF) and the corrupted position may be persisted to ZK, surviving restarts. The fix resets currentPosition to entryStream.getPosition() (which returns 0 after dequeueCurrentLog()) before enqueuing the EOF batch. Includes a regression test that reproduces the bug by using nb.capacity=1 to force EOF detection at line 153 (not inside readWALEntries), combined with a WALEntryFilterRetryableException on the first entry of the new file to trigger the outer loop restart.
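The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the HBase implementation: `ToyStream`, `restartAfterSwitch`, the file lengths (4096 and 128), and the use of `IllegalStateException` in place of the real `EOFException` are all hypothetical. Only the names `currentPosition`, `getPosition()`, and `dequeueCurrentLog()` are borrowed from the description, and their behavior here is simplified.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Toy model only: none of these classes are HBase classes.
class PositionResetSketch {

  /** Hypothetical stand-in for a WAL entry stream over a queue of file lengths. */
  static final class ToyStream {
    private final Deque<Long> fileLengths;
    private long position;

    ToyStream(Deque<Long> fileLengths, long startPosition) {
      long headLength = fileLengths.peekFirst();
      if (startPosition > headLength) {
        // Imitates the symptom named in the PR text, not the real exception type.
        throw new IllegalStateException(
            "Cannot seek after EOF: " + startPosition + " > " + headLength);
      }
      this.fileLengths = fileLengths;
      this.position = startPosition;
    }

    /** Pretend every entry of the current file has been read. */
    void readToEndOfCurrentFile() {
      position = fileLengths.peekFirst();
    }

    /** Switch to the next file; the stream position goes back to 0. */
    void dequeueCurrentLog() {
      fileLengths.removeFirst();
      position = 0;
    }

    long getPosition() {
      return position;
    }
  }

  /**
   * Reads the first file, switches to the second, then simulates an outer-loop
   * restart. Returns the offset at which the new stream opened the second file,
   * or throws if the remembered position is stale.
   */
  static long restartAfterSwitch(boolean resetOnSwitch) {
    Deque<Long> files = new ArrayDeque<>(Arrays.asList(4096L, 128L));
    long currentPosition = 0;

    ToyStream stream = new ToyStream(files, currentPosition);
    stream.readToEndOfCurrentFile();
    currentPosition = stream.getPosition(); // end of the old file

    stream.dequeueCurrentLog(); // file switch detected
    if (resetOnSwitch) {
      currentPosition = stream.getPosition(); // the fix: back to 0
    }

    // Outer loop restart: a fresh stream is created from the remembered position.
    ToyStream restarted = new ToyStream(files, currentPosition);
    return restarted.getPosition();
  }

  public static void main(String[] args) {
    System.out.println("with reset: reopened at " + restartAfterSwitch(true));
    try {
      restartAfterSwitch(false);
    } catch (IllegalStateException e) {
      System.out.println("without reset: " + e.getMessage());
    }
  }
}
```

In this model, skipping the reset leaves the old file's end offset in `currentPosition`, so the recreated stream tries to open the shorter new file past its end. With the reset, the recreated stream opens the new file at offset 0.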

…cted in ReplicationSourceWALReader run loop (#7909)
Co-authored-by: skhillon <skhillon@hubspot.com>
Signed-off-by: Duo Zhang <zhangduo@apache.org>
(cherry picked from commit e4f9c65)

skhillon added 2 commits March 11, 2026 11:34

Clean up test comments for position reset regression test

72df82d

sidkhillon marked this pull request as ready for review March 11, 2026 18:48

Apache9 approved these changes Mar 13, 2026

Apache9 merged commit e4f9c65 into apache:master Mar 13, 2026
8 checks passed

sidkhillon mentioned this pull request Mar 16, 2026

HubSpot Backport HBASE-29987 Replication position corruption when WAL… HubSpot/hbase#242

Open

Apache9 merged 2 commits into apache:master from HubSpot:fix-replication-stuck-upstream
