Fix backoff when connecting to cluster with leader election in progress #32
When the client attempts to connect to a cluster that is in the middle of a leader re-election, the TCP connection initially succeeds but is then closed by the ZK node. The current backoff logic treats this as a "success" and resets the backoff, so the client reconnects immediately in a tight loop. With many clients, this can put considerable load on the ZK servers at a very critical moment, potentially delaying the re-election and recovery.
The effect can be seen with https://github.com/anttirt/zktestcluster (you'll need VirtualBox and Vagrant) and any client connecting to it using ZooKeeperNet. Set up the cluster (`vagrant up`), connect a client to it, and run `vagrant ssh -c //vagrant/restart-leader.sh`. You should see the client attempting many (dozens of) reconnects with no backoff.

This patch delays resetting the backoff until we have a confirmed session on the ZK node.
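The fix can be sketched as follows. This is a minimal illustration, not the actual ZooKeeperNet patch: the class and method names (`ReconnectLoop`, `on_tcp_connected`, `on_session_confirmed`) are hypothetical stand-ins for the real connection and session-handshake callbacks. The key point is which event is allowed to reset the failure counter.

```python
class ReconnectLoop:
    """Exponential backoff whose reset is tied to a confirmed ZK session,
    not to a mere TCP connect (which the server may immediately close
    while leader election is in progress)."""

    def __init__(self, base=1, cap=32):
        self.base = base      # initial backoff, seconds
        self.cap = cap        # maximum backoff, seconds
        self.failures = 0     # attempts since the last confirmed session

    def next_delay(self):
        # Capped exponential backoff: base, 2*base, 4*base, ... up to cap.
        return min(self.cap, self.base * (2 ** self.failures))

    def on_tcp_connected(self):
        # Old (buggy) behaviour would reset self.failures here, treating a
        # connection that the node closes moments later as a success.
        pass

    def on_session_confirmed(self):
        # Fixed behaviour: only a confirmed session resets the backoff.
        self.failures = 0

    def on_disconnected(self):
        self.failures += 1
```

With this shape, a node that accepts the TCP connection but drops it before the session handshake completes leaves the backoff untouched, so the reconnect interval keeps growing until the cluster has a leader again.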