
HBASE-29987 Replication position corruption when WAL file switch detected in ReplicationSourceWALReader run loop #7909

Merged
Apache9 merged 2 commits into apache:master from HubSpot:fix-replication-stuck-upstream
Mar 13, 2026

Conversation

sidkhillon (Contributor) commented Mar 11, 2026

Summary

  • When ReplicationSourceWALReader detects a WAL file switch via the switched() check, it was not resetting currentPosition. If the outer loop later restarted and recreated the WALEntryStream, it would use the old file's position on the new file, causing EOFException: Cannot seek after EOF. The corrupted position also gets persisted to ZK, making it unrecoverable without manual intervention (recreating the replication peer).
  • Fix: reset currentPosition to the stream's position (0) when a file switch is detected, so any subsequent stream creation opens the new file at the correct offset.
  • Add a regression test that reproduces the scenario using a WALEntryFilterRetryableException to trigger the outer loop restart.
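To make the failure mode concrete, here is a minimal toy model of the stale-position problem. It is a sketch under stated assumptions, not the HBase code: the class `WalSwitchSketch`, its methods, and the file lengths are all hypothetical, and the real reader, stream, and queue are reduced to a couple of `long` values. It only illustrates why carrying the old file's offset over to a shorter new file fails on reopen, and why taking the stream's position (0) at the switch avoids that.

```java
import java.io.EOFException;

/** Toy model only. Names and sizes are hypothetical, not taken from HBase. */
final class WalSwitchSketch {

  // Assumed lengths: the old WAL is longer than the freshly rolled one.
  static final long OLD_WAL_LENGTH = 4096L;
  static final long NEW_WAL_LENGTH = 128L;

  /** Simulates opening a file at an offset; seeking past the end fails. */
  static void openAt(long fileLength, long offset) throws EOFException {
    if (offset > fileLength) {
      throw new EOFException("Cannot seek after EOF");
    }
  }

  /**
   * Models the reader's remembered position right after a file switch.
   * The reader has consumed the whole old file, and the stream has moved
   * on to the start of the new file.
   */
  static long positionAfterSwitch(boolean resetOnSwitch) {
    long currentPosition = OLD_WAL_LENGTH; // offset within the OLD file
    long streamPosition = 0L;              // stream now sits at the start of the NEW file
    if (resetOnSwitch) {
      // The fix described above: adopt the stream's position at the switch.
      currentPosition = streamPosition;
    }
    return currentPosition;
  }

  /** Models an outer-loop restart that reopens the new file at the remembered position. */
  static boolean restartSucceeds(boolean resetOnSwitch) {
    try {
      openAt(NEW_WAL_LENGTH, positionAfterSwitch(resetOnSwitch));
      return true;
    } catch (EOFException e) {
      return false; // a real reader would retry here with the same stale offset
    }
  }

  public static void main(String[] args) {
    System.out.println("without reset: restart succeeds = " + restartSucceeds(false));
    System.out.println("with reset:    restart succeeds = " + restartSucceeds(true));
  }
}
```

In this model the restart without the reset fails because offset 4096 is beyond the 128-byte new file, while the restart with the reset opens the new file at offset 0. The model does not cover persistence of the position to ZK.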

skhillon added 2 commits March 11, 2026 11:34
When ReplicationSourceWALReader.run() detects a WAL file switch via the
switched() check, it enqueues an EOF batch but does not update
currentPosition. If the outer loop restarts (e.g., due to
WALEntryFilterRetryableException), the new WALEntryStream is created
with the stale position from the old file, applied to the new file.
This causes an infinite retry loop (EOFException: Cannot seek after EOF)
and the corrupted position may be persisted to ZK, surviving restarts.

The fix resets currentPosition to entryStream.getPosition() (which
returns 0 after dequeueCurrentLog()) before enqueuing the EOF batch.

Includes a regression test that reproduces the bug by using
nb.capacity=1 to force EOF detection at line 153 (not inside
readWALEntries), combined with a WALEntryFilterRetryableException on
the first entry of the new file to trigger the outer loop restart.
@sidkhillon sidkhillon marked this pull request as ready for review March 11, 2026 18:48
@Apache9 Apache9 merged commit e4f9c65 into apache:master Mar 13, 2026
8 checks passed
Apache9 pushed a commit that referenced this pull request Mar 13, 2026
…cted in ReplicationSourceWALReader run loop (#7909)

Co-authored-by: skhillon <skhillon@hubspot.com>
Signed-off-by: Duo Zhang <zhangduo@apache.org>
(cherry picked from commit e4f9c65)
sidkhillon added a commit to HubSpot/hbase that referenced this pull request Mar 13, 2026
… file switch detected in ReplicationSourceWALReader run loop (apache#7909)

Apache9 pushed a commit that referenced this pull request Mar 14, 2026
…cted in ReplicationSourceWALReader run loop (#7909)
