inconsistency in writing to etcd (v3.0.14) - I think I have a broken cluster. #7533
I think that one of my clusters is broken.
In a working cluster (our test env):
In my broken cluster (our production env):
This issue causes major problems in our production environment.
I think I have a lead on the root cause, though I'm not sure:
When trying:
What should I do in such a case?
Updates should be visible on a majority of members, so something is clearly wrong.
This was tested with
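For a direct per-member comparison, a serializable read against each endpoint returns that member's local copy of the key without going through the leader. A minimal sketch (the endpoint URLs are placeholders for the real cluster addresses):

```sh
# Serializable reads (--consistency=s) are answered from each member's
# local store rather than the leader, so differing results here expose
# divergence between members directly.
for ep in http://etcd01:2379 http://etcd02:2379 http://etcd03:2379; do
  echo "== $ep =="
  ETCDCTL_API=3 ./bin/etcdctl --endpoints=$ep get abc --consistency=s
done
```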
This is strange because if 01 is accepting updates that aren't visible on 02 and 03, it should have a raft index larger than the other members, but instead it has 51548878 < 51548922, 51548944.
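For reference, each member's raft index can be pulled with `endpoint status` (a sketch, assuming an etcdctl build that includes this subcommand; the endpoint URLs are placeholders):

```sh
# The JSON output includes a raftIndex field per endpoint, so a member
# that has drifted from the others stands out immediately.
ETCDCTL_API=3 ./bin/etcdctl \
  --endpoints=http://etcd01:2379,http://etcd02:2379,http://etcd03:2379 \
  endpoint status -w json
```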
"eventually made it" sounds like something could be misconfigured. What were the problems / what was the workaround?
The following information should help with debugging:
@heyitsanthony yeah, it was tested just like the test env.
Could it be that whenever kube-apiserver starts, it picks one of the etcd servers to talk to?
`ETCDCTL_API=3 ./bin/etcdctl -w json get abc`:
`ETCDCTL_API=3 ./bin/etcdctl member list`
It shouldn't matter so long as the requests go through consensus (which they do).
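To illustrate: a put accepted through any one endpoint is committed by the raft majority before it is acknowledged, so it must be readable through every other endpoint. A quick check along these lines (endpoint URLs are placeholders):

```sh
# Write through one member, then read back (linearizable, the default)
# through a different member; in a healthy cluster the value must appear.
ETCDCTL_API=3 ./bin/etcdctl --endpoints=http://etcd01:2379 put abc 123
ETCDCTL_API=3 ./bin/etcdctl --endpoints=http://etcd02:2379 get abc
```

If a value written through 01 never shows up through 02 or 03, the write was acknowledged without reaching a majority, which matches the symptom described above.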
This is bad.
What steps were taken to add
As a workaround, the easiest fix would be to
It didn't work, so we used this thread to solve it:
If I want to remove 02 and 03 and then re-add them with fresh data directories, what flags should I use? The current command I use to start the etcd binary is this:
I'd prefer not to send you the data directories since they contain information about our production environment, such as secret keys, etc.
@eran-totango that command looks OK. Some comments:
Note that there'll be a brief loss of availability when going from 1->2 nodes since the cluster has to wait until the second member comes up. It's possible to do this without the major outage (there'll be a short leader election) by removing/adding the leader node until 01 is elected (this will be reflected in
No thoughts on what to do without direct access to the wal/snap. /cc @xiang90 any thoughts?
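For the record, the remove/re-add cycle for one member looks roughly like this (a sketch only; the member ID, name, and URLs are placeholders, and exact flags can differ between etcdctl versions):

```sh
# 1. Look up the hex ID of the member to cycle out.
ETCDCTL_API=3 ./bin/etcdctl member list

# 2. Remove it (the ID below is a placeholder).
ETCDCTL_API=3 ./bin/etcdctl member remove 8e9e05c52164694d

# 3. Add it back; etcdctl prints the ETCD_NAME / ETCD_INITIAL_CLUSTER /
#    ETCD_INITIAL_CLUSTER_STATE values the rejoining process needs.
ETCDCTL_API=3 ./bin/etcdctl member add etcd02 --peer-urls=http://etcd02:2380

# 4. On that host, wipe the old data directory and restart etcd with
#    --initial-cluster-state existing and a fresh --data-dir.
```

Note that each `member add` grows the cluster size immediately, so quorum can briefly depend on the new member actually coming up (the 1->2 case mentioned above).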
and I'm getting this error:
I did the exact same steps (1-4) in my test environment and it worked perfectly.