Minority failure during cluster configuration change risks deadlock #3699
Comments
@andres-erbsen I could reproduce, so this is a bug. The reason AFAIK is that
@andres-erbsen Interesting... This is a bug. As far as I can see, the root cause is not raft but a bug in the transport layer. We will fix it soon.
cc @aphyr
Any updates on this?
@daniel-ziegler I am going to look into this soon. This was not a priority since it does not happen frequently in the real world.
cluster integration now supports adding members with stopped nodes, too Fixes etcd-io#3699
Adding a node to a degraded cluster and then electing it as the leader causes the addition not to reach the nodes that were down. In particular, starting with a cluster of 3 nodes `a`, `b`, `c` and adding `d` while `a` is not available and then electing `d` as the leader makes `a` unable to participate in the cluster. `a` will remain useless as long as `d` is the leader. This is especially bad in case another node (e.g. `c`) goes down, and the cluster would have to rely on the participation of `a` to make progress. Thus it is possible to have a situation where there is a live majority from both the old cluster (`a`, `b`) and the new cluster (`a`, `b`, `d`) but the cluster is stuck because `a` does not listen to `d` even though `d` is the leader.

I am not sure how to fix this. The invariant from the thesis does not seem applicable since etcd configuration changes apply at a different time and I do not understand the correctness reasoning behind the etcd configuration change algorithm well enough to make changes to it.
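To make the stuck state concrete, here is a toy quorum count for the scenario above. It is only a sketch and not etcd code: the `node` type and the `quorum`, `acks`, `canCommit`, `set`, and `scenario` names are hypothetical, and it assumes that a follower ignores a leader it does not know as a member, which is how `a` behaves in the report.

```go
package main

// Toy model only: every type and function here is hypothetical
// and none of it is part of etcd's API.

// node is one cluster member as modeled in this sketch.
type node struct {
	up      bool            // whether the process is running
	members map[string]bool // membership this node has learned so far
}

// quorum returns the majority size for n voting members.
func quorum(n int) int { return n/2 + 1 }

// acks counts members of the leader's configuration that would acknowledge
// an entry from the leader: the leader itself plus every live follower
// that recognizes the leader as a member (assumption of this sketch).
func acks(leader string, cluster map[string]*node) int {
	count := 0
	for id := range cluster[leader].members {
		n, ok := cluster[id]
		if !ok || !n.up {
			continue
		}
		if id == leader || n.members[leader] {
			count++
		}
	}
	return count
}

// canCommit reports whether the leader reaches a quorum of its own configuration.
func canCommit(leader string, cluster map[string]*node) bool {
	return acks(leader, cluster) >= quorum(len(cluster[leader].members))
}

// set builds a membership set from a list of node IDs.
func set(ids ...string) map[string]bool {
	m := make(map[string]bool, len(ids))
	for _, id := range ids {
		m[id] = true
	}
	return m
}

// scenario builds the state from the report: a missed the addition of d,
// c has since gone down, and d is the leader.
func scenario() map[string]*node {
	return map[string]*node{
		"a": {up: true, members: set("a", "b", "c")},
		"b": {up: true, members: set("a", "b", "c", "d")},
		"c": {up: false, members: set("a", "b", "c", "d")},
		"d": {up: true, members: set("a", "b", "c", "d")},
	}
}
```

In this model `canCommit("d", scenario())` is false: only `b` and `d` acknowledge, while the four-member configuration needs three. Once `a` learns about `d`, the same call returns true, which matches the observation that three of the four members are alive and the cluster is stuck only because `a` does not listen to `d`.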
A screencast of me reproducing this issue is available at http://web.mit.edu/andreser/Public/etcd-reconfiguration-deadlock/index.html