Recover from transient gossip failures #1446

Merged: 1 commit merged on Sep 20, 2016

Conversation

@mrjana (Contributor) commented Sep 15, 2016

Currently, if there is a transient gossip failure in any node, the
recovery process depends on other nodes propagating the information
indirectly. If these transient failures affect all the nodes that this
node has in its memberlist, then this node will be permanently cut off
from the gossip channel. Added node state management code in networkdb
to address these problems by trying to rejoin the cluster via the
failed nodes when there is a failure. This also necessitates adding
new messages, called node event messages, to differentiate between
node leave and node failure.

Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
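
A minimal sketch of the rejoin idea described above, in Go, using memberlist's Join API; the retryJoin name, the failedAddrs callback, and the 10-second interval are illustrative assumptions, not the PR's actual code.

package gossipretry

import (
	"log"
	"time"

	"github.com/hashicorp/memberlist"
)

// retryJoin periodically attempts to rejoin the gossip cluster through the
// addresses of nodes we believe have failed, instead of waiting for other
// members to propagate our presence indirectly. failedAddrs and the 10s
// interval are illustrative choices, not the PR's actual code.
func retryJoin(ml *memberlist.Memberlist, failedAddrs func() []string, stop <-chan struct{}) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			addrs := failedAddrs()
			if len(addrs) == 0 {
				continue
			}
			// memberlist.Join returns the number of nodes successfully contacted.
			if n, err := ml.Join(addrs); err != nil {
				log.Printf("rejoin attempt failed: %v", err)
			} else {
				log.Printf("rejoined gossip cluster via %d node(s)", n)
			}
		case <-stop:
			return
		}
	}
}

The point is that the recovering node actively dials the peers it marked as failed rather than waiting for a third node to gossip its presence back.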

@sanimej commented Sep 16, 2016

@mrjana In the working state, if I shut down the interface on the worker that is used for connectivity with the manager, the gossip cluster fails. If I bring it back up, even after a minute, the gossip cluster seems to be set up correctly again; i.e., new services created across nodes can be reached. Without the changes from this PR, is there some other sequence that can re-establish the gossip cluster? In this case the gRPC session also gets created again; not sure if that is playing a role.

@mrjana (Contributor, Author) commented Sep 16, 2016

@sanimej If you directly shut down the interface on the node, it clears the IP address on the interface, so the Go API call to send the UDP packet out fails immediately. When that happens, memberlist (correctly) does not consider it a remote node failure but rather a problem on the local end, and it will keep retrying. Once you re-establish the link, the probe will succeed and everything will work. If you want to simulate a remote node failure, you need to drop the packets or bring down the switch/bridge that connects these nodes.
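
As a rough illustration of the local-error case described above (not part of the PR), a UDP send from a node whose interface has lost its address fails synchronously on the sender, which is why it can be told apart from an unresponsive peer; probeOnce is a hypothetical helper:

package gossipretry

import (
	"log"
	"net"
)

// probeOnce illustrates the local-failure case: when the interface has no
// address or route, the UDP dial or write fails immediately on the sender
// (e.g. "network is unreachable"), so the failure is visibly local rather
// than a silent timeout that would look like a dead peer.
func probeOnce(peerAddr string) error {
	conn, err := net.Dial("udp", peerAddr)
	if err != nil {
		// Local problem: no usable source address or route.
		return err
	}
	defer conn.Close()

	if _, err := conn.Write([]byte("ping")); err != nil {
		log.Printf("local send error (not a remote node failure): %v", err)
		return err
	}
	return nil
}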

// has these in leaving/deleting state still. This is to
// facilitate fast convergence after recovering from a gossip
// failure.
nDB.updateLocalStateTime()

Review comment on this diff:

If we update the local state to a new time, shouldn't the updates be sent to all nodes, including the ones still in the cluster?

@mrjana (Contributor, Author) replied:

This update is just to make sure fast convergence is achieved on the node that is recovering from the failure. Since there is no real state change, there is no need to update all the nodes in the cluster.
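
A hedged sketch of the behavior being discussed: the recovering node re-stamps its own table entries with a newer logical time so that peers still holding them in leaving/deleting state pick up the fresh copies during the next push/pull sync. The entry type, field names, and clock handling below are hypothetical stand-ins, not networkdb's actual structures:

package gossipretry

// entry is a hypothetical stand-in for a networkdb table entry; the real
// networkdb stamps entries with a Lamport clock, and the field names here
// are illustrative only.
type entry struct {
	owner string
	ltime uint64
	value []byte
}

// updateLocalStateTime advances a local logical clock and re-stamps every
// entry owned by this node. The entry values do not change, so peers that
// already have them see nothing new; a peer that still holds these entries
// in leaving/deleting state accepts the newer timestamps during the next
// push/pull sync, which is what gives fast convergence after recovering
// from a gossip failure.
func updateLocalStateTime(localNode string, clock *uint64, entries []*entry) {
	*clock++
	for _, e := range entries {
		if e.owner == localNode {
			e.ltime = *clock
		}
	}
}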

@sanimej commented Sep 20, 2016

LGTM
