Fix backoff when connecting to cluster with leader election in progress #32
When the client attempts to connect to a cluster that is in the middle of a leader re-election, the TCP connection initially succeeds but is then closed by the ZK node. The current backoff logic treats this as a "success" and resets the backoff, so the client reconnects immediately in a tight loop. With many clients, this can put considerable load on the ZK servers at a very critical moment, potentially delaying the re-election and recovery.
The effect can be seen with https://github.com/anttirt/zktestcluster (you'll need VirtualBox and Vagrant) and any client connecting to it using ZooKeeperNet. Set up the cluster (`vagrant up`), connect a client to it, and run `vagrant ssh -c //vagrant/restart-leader.sh`. You should see the client attempting many (dozens of) reconnects with no backoff.

This patch delays resetting the backoff until we have a confirmed session on the ZK node.
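The fix can be sketched as follows. This is a minimal illustration, not the actual ZooKeeperNet patch: the class and method names (`ReconnectLoop`, `on_tcp_connected`, `on_session_confirmed`) are hypothetical stand-ins for the real connection and session-handshake callbacks. The key point is which event is allowed to reset the failure counter.

```python
class ReconnectLoop:
    """Exponential backoff whose reset is tied to a confirmed ZK session,
    not to a mere TCP connect (which the server may immediately close
    while leader election is in progress)."""

    def __init__(self, base=1, cap=32):
        self.base = base      # initial backoff, seconds
        self.cap = cap        # maximum backoff, seconds
        self.failures = 0     # attempts since the last confirmed session

    def next_delay(self):
        # Capped exponential backoff: base, 2*base, 4*base, ... up to cap.
        return min(self.cap, self.base * (2 ** self.failures))

    def on_tcp_connected(self):
        # Old (buggy) behaviour would reset self.failures here, treating a
        # connection that the node closes moments later as a success.
        pass

    def on_session_confirmed(self):
        # Fixed behaviour: only a confirmed session resets the backoff.
        self.failures = 0

    def on_disconnected(self):
        self.failures += 1
```

With this shape, a node that accepts the TCP connection but drops it before the session handshake completes leaves the backoff untouched, so the reconnect interval keeps growing until the cluster has a leader again.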