Recover from transient gossip failures #1446

Merged: 1 commit merged on Sep 20, 2016

Conversation

@mrjana (Contributor) commented Sep 15, 2016

Currently, if there is a transient gossip failure in any node, the
recovery process depends on other nodes propagating the information
indirectly. If these transient failures affect all the nodes that this
node has in its memberlist, then this node will be permanently cut off
from the gossip channel. Added node state management code in networkdb
to address these problems by trying to rejoin the cluster via the
failed nodes when there is a failure. This also necessitates adding
new messages, called node event messages, to differentiate between
node leave and node failure.

Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
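
A minimal sketch of the rejoin idea described above, in Go, using memberlist's Join API; the retryJoin name, the failedAddrs callback, and the 10-second interval are illustrative assumptions, not the PR's actual code.

package gossipretry

import (
	"log"
	"time"

	"github.com/hashicorp/memberlist"
)

// retryJoin periodically attempts to rejoin the gossip cluster through the
// addresses of nodes we believe have failed, instead of waiting for other
// members to propagate our presence indirectly. failedAddrs and the 10s
// interval are illustrative choices, not the PR's actual code.
func retryJoin(ml *memberlist.Memberlist, failedAddrs func() []string, stop <-chan struct{}) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			addrs := failedAddrs()
			if len(addrs) == 0 {
				continue
			}
			// memberlist.Join returns the number of nodes successfully contacted.
			if n, err := ml.Join(addrs); err != nil {
				log.Printf("rejoin attempt failed: %v", err)
			} else {
				log.Printf("rejoined gossip cluster via %d node(s)", n)
			}
		case <-stop:
			return
		}
	}
}

The point is that the recovering node actively dials the peers it marked as failed rather than waiting for a third node to gossip its presence back.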

@sanimej commented Sep 16, 2016

@mrjana In the working state, if I shut down the interface on the worker that is used for connectivity with the manager, the gossip cluster fails. If I bring it back up, even after a minute, the gossip cluster seems to be set up correctly again; i.e., new services created across nodes can be reached. Without the changes from this PR, is there some other sequence that can re-establish the gossip cluster? In this case the gRPC session also gets created again; not sure if that is playing a role.

@mrjana (Contributor, Author) commented Sep 16, 2016

@sanimej If you directly shut down the interface on the node, it clears the IP address on the interface, so the Go API call to send the UDP packet out fails immediately. When that happens, memberlist (correctly) does not consider it a remote node failure but rather a problem on the local end, and it will keep retrying. Once you re-establish the link, the probe will succeed and everything will work. If you want to simulate a remote node failure, you need to drop the packets or bring down the switch/bridge that connects these nodes.
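
As a rough illustration of the local-error case described above (not part of the PR), a UDP send from a node whose interface has lost its address fails synchronously on the sender, which is why it can be told apart from an unresponsive peer; probeOnce is a hypothetical helper:

package gossipretry

import (
	"log"
	"net"
)

// probeOnce illustrates the local-failure case: when the interface has no
// address or route, the UDP dial or write fails immediately on the sender
// (e.g. "network is unreachable"), so the failure is visibly local rather
// than a silent timeout that would look like a dead peer.
func probeOnce(peerAddr string) error {
	conn, err := net.Dial("udp", peerAddr)
	if err != nil {
		// Local problem: no usable source address or route.
		return err
	}
	defer conn.Close()

	if _, err := conn.Write([]byte("ping")); err != nil {
		log.Printf("local send error (not a remote node failure): %v", err)
		return err
	}
	return nil
}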

// has these in leaving/deleting state still. This is to
// facilitate fast convergence after recovering from a gossip
// failure.
nDB.updateLocalStateTime()

Review comment on this diff:

If we update the local state to a new time, shouldn't the updates be sent to all nodes, including the ones still in the cluster?

@mrjana (Contributor, Author) replied:

This update is just to make sure fast convergence is achieved on the node that is recovering from the failure. Since there is no real state change, there is no need to update all the nodes in the cluster.
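
A hedged sketch of the behavior being discussed: the recovering node re-stamps its own table entries with a newer logical time so that peers still holding them in leaving/deleting state pick up the fresh copies during the next push/pull sync. The entry type, field names, and clock handling below are hypothetical stand-ins, not networkdb's actual structures:

package gossipretry

// entry is a hypothetical stand-in for a networkdb table entry; the real
// networkdb stamps entries with a Lamport clock, and the field names here
// are illustrative only.
type entry struct {
	owner string
	ltime uint64
	value []byte
}

// updateLocalStateTime advances a local logical clock and re-stamps every
// entry owned by this node. The entry values do not change, so peers that
// already have them see nothing new; a peer that still holds these entries
// in leaving/deleting state accepts the newer timestamps during the next
// push/pull sync, which is what gives fast convergence after recovering
// from a gossip failure.
func updateLocalStateTime(localNode string, clock *uint64, entries []*entry) {
	*clock++
	for _, e := range entries {
		if e.owner == localNode {
			e.ltime = *clock
		}
	}
}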

@sanimej commented Sep 20, 2016

LGTM
