ZOOKEEPER-2789: Reassign ZXID for solving 32bit overflow problem (base master and fix conflict) #2382

Open
rooookkie wants to merge 3 commits into apache:master from rooookkie:ZOOKEEPER-2789-zxid-overflow

@rooookkie
This PR addresses ZOOKEEPER-2789, the ZXID 32-bit counter overflow problem.

In ZooKeeper, the ZXID is a 64-bit number composed of a 32-bit epoch and a 32-bit counter. When the 32-bit counter overflows, it forces a leader re-election and can cause serious issues in the cluster. This change:

  1. Extends the counter bit width: ZxidUtils now supports configurable epoch/counter bit positions, enabling a 40-bit counter (up from the original 32 bits). This raises the per-epoch transaction limit from 2^32 to 2^40, which significantly delays the overflow threshold.
  2. Supports smooth rolling upgrade: Upgrade coordination logic in QuorumPeer and Learner allows the cluster to transition from 32-bit to 40-bit counter mode via rolling upgrade, without taking the whole cluster down.
  3. Replaces hardcoded bit operations: All hardcoded ZXID bit operations (& 0xffffffffL, >> 32, << 32) across the codebase now go through ZxidUtils utility methods, making the code more maintainable and consistent.
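
To make the layout change concrete, here is a minimal sketch of splitting a 64-bit ZXID with a configurable counter width. It is an illustration only and not the ZxidUtils code in this PR: the class `ZxidLayoutSketch` and its methods (`makeZxid`, `epochOf`, `counterOf`, `capacity`) are hypothetical names chosen for the example. It assumes the counter occupies the low bits and the epoch the remaining high bits, so a 40-bit counter leaves 24 bits for the epoch.

```java
// Illustrative sketch only; names are hypothetical, not the PR's ZxidUtils API.
final class ZxidLayoutSketch {
    private final int counterBits;   // 32 for the legacy layout, 40 for the extended one
    private final long counterMask;  // low counterBits bits set

    ZxidLayoutSketch(int counterBits) {
        if (counterBits < 1 || counterBits > 62) {
            throw new IllegalArgumentException("counterBits must be in [1, 62]");
        }
        this.counterBits = counterBits;
        this.counterMask = (1L << counterBits) - 1;
    }

    // Pack an epoch and a counter into one 64-bit ZXID.
    long makeZxid(long epoch, long counter) {
        return (epoch << counterBits) | (counter & counterMask);
    }

    // High bits: the epoch (unsigned shift, so the sign bit is not smeared).
    long epochOf(long zxid) {
        return zxid >>> counterBits;
    }

    // Low bits: the per-epoch transaction counter.
    long counterOf(long zxid) {
        return zxid & counterMask;
    }

    // Number of distinct counter values available within one epoch.
    long capacity() {
        return 1L << counterBits;
    }

    public static void main(String[] args) {
        ZxidLayoutSketch legacy = new ZxidLayoutSketch(32);
        ZxidLayoutSketch wide = new ZxidLayoutSketch(40);
        long zxid = wide.makeZxid(5, 7);
        System.out.printf("epoch=%d counter=%d%n", wide.epochOf(zxid), wide.counterOf(zxid));
        System.out.println("headroom factor = " + wide.capacity() / legacy.capacity());
    }
}
```

The headroom factor is 2^40 / 2^32 = 256, so under this assumed layout an epoch can hold 256 times as many transactions before the counter is exhausted, at the cost of a narrower epoch field.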

Brief changelog

  • Extended ZxidUtils with configurable epoch high position (32-bit / 40-bit), clearEpoch(), clearCounter(), and getCounterLowPosition() methods
  • Added smooth upgrade support in QuorumPeer and Learner for transitioning from 32-bit to 40-bit counter
  • Replaced all hardcoded ZXID bit operations with ZxidUtils methods across Leader, LearnerHandler, FollowerZooKeeperServer, and ObserverMaster
  • Updated Leader.propose() to use ZxidUtils.getCounterLowPosition() for rollover detection
  • Added unit tests in LearnerHandlerTest for ZXID reassignment scenarios
  • Updated Zab1_0Test, ZxidRolloverTest, FollowerResyncConcurrencyTest, and ReconfigTest to use ZxidUtils methods
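
For reviewers less familiar with the rollover check, the idea can be sketched as follows. This is a hedged illustration, not the code in Leader.propose(): `RolloverCheckSketch` and `counterExhausted` are hypothetical names, and the sketch simply assumes that rollover is detected when every counter bit of the ZXID is set, with the counter width passed in rather than hardcoded as 32.

```java
// Illustrative sketch only; not the actual Leader.propose() logic.
final class RolloverCheckSketch {

    // True when the counter portion of the ZXID has reached its maximum value
    // for the given counter width, i.e. the next increment would overflow.
    static boolean counterExhausted(long zxid, int counterBits) {
        long counterMask = (1L << counterBits) - 1;
        return (zxid & counterMask) == counterMask;
    }

    public static void main(String[] args) {
        // Epoch 3 with all low 32 bits set.
        long zxid = (3L << 32) | 0xffffffffL;
        System.out.println(counterExhausted(zxid, 32)); // exhausted under a 32-bit counter
        System.out.println(counterExhausted(zxid, 40)); // still has room under a 40-bit counter
    }
}
```

The same ZXID value that would force a re-election with a 32-bit counter is well below the limit when the counter is 40 bits wide, which is why the check needs to know the active counter width.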

How does this change relate to the original PR?

This is a re-submission of #2164. The original authors (@asdf2014 and @ganzichen) have been inactive for a long time and the PR has gone stale, so I'm picking it up and rebasing it on the latest master branch.

Conflict resolution:

  • The buildRequestToProcess method in FollowerZooKeeperServer was removed upstream by ZOOKEEPER-4925, so the conflict was resolved by dropping that method and applying the bit-operation replacement directly to the current logRequest() method.
  • Additionally replaced the bit operations in ObserverMaster.processAck(), which the original PR did not cover.

Testing done

  • Unit tests added in LearnerHandlerTest for ZXID reassignment
  • Existing ZxidRolloverTest, Zab1_0Test, FollowerResyncConcurrencyTest, and ReconfigTest updated
  • All existing tests pass

Original Authors

Credit to the original authors of #2164: @asdf2014 and @ganzichen.

@anmolnar requested a review from kezhuw on May 7, 2026, 18:56