Inconsistency in writing to etcd (v3.0.14) - I think I have a broken cluster. #7533
How did you recover the new server? Did you try to restore from backup, or did you do something else?
@philips I didn't restore from backup.
Updates should be visible on a majority of members, so something is clearly wrong.
This was tested with
This is strange because if 01 is accepting updates that aren't visible on 02 and 03, it should have a raft index larger than the other members, but instead it has 51548878 < 51548922, 51548944.
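For reference, one way to compare those raft indexes is to ask every member at once; the endpoint addresses below are placeholders for illustration, not the poster's actual hosts:

# Print ID, raft term, and raft index for each member in one table
ETCDCTL_API=3 etcdctl \
  --endpoints=http://etcd-01:2379,http://etcd-02:2379,http://etcd-03:2379 \
  endpoint status -w table

A member whose raft index keeps falling behind the others while still accepting writes would point at the kind of divergence described above.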
"eventually made it" sounds like something could be misconfigured. What were the problems / what was the workaround? The following information should help with debugging:
@heyitsanthony yeah, it was tested just like the test env. Could it be that whenever kube-apiserver starts, it picks one of the etcd servers to talk to?
ETCDCTL_API=3 ./bin/etcdctl -w json get abc :
ETCDCTL_API=3 ./bin/etcdctl member list
server logs:
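To make the divergence explicit, the same read can be issued against each member individually so the returned values and mod_revisions can be compared side by side. This is a sketch with placeholder endpoint addresses; --consistency=s asks for a serializable read served from that member's local data (the default is a linearizable, quorum read):

# Compare how each member sees the key 'abc'
for ep in http://etcd-01:2379 http://etcd-02:2379 http://etcd-03:2379; do
  echo "== $ep =="
  ETCDCTL_API=3 ./bin/etcdctl --endpoints=$ep get abc -w json --consistency=s
done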
It shouldn't matter so long as the requests go through consensus (which they do).
This is bad. What steps were taken to add the replacement member? As a workaround, the easiest fix would be to remove the out-of-sync members (02 and 03) and re-add them with fresh data directories.
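A rough sketch of that remove/re-add sequence, using placeholder member IDs and peer URLs (the real ID comes from member list), might look like:

# Find the hex member IDs of 02 and 03
ETCDCTL_API=3 etcdctl member list

# Remove one out-of-sync member (the ID below is a placeholder)
ETCDCTL_API=3 etcdctl member remove 8e9e05c52164694d

# Re-register it, then restart it with an empty data directory
ETCDCTL_API=3 etcdctl member add etcd-prod02 --peer-urls=http://etcd-02:2380

The same steps would then be repeated for the other lagging member.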
@heyitsanthony
It didn't work, then we used this thread to solve it:
If I want to remove 02 and 03 and then re-add them with fresh data directories, what flags should I use? The current command I use to start the etcd binary is this:
I'd prefer not to send you the data directories since they contain information about our production environment, such as secret keys, etc.
@eran-totango that command looks OK. Some comments:
Note that there'll be a brief loss of availability when going from 1->2 nodes, since the cluster has to wait until the second member comes up. It's possible to do this without the major outage (there'll be a short leader election) by removing/adding the leader node until 01 is elected (this will be reflected in endpoint status). No thoughts on what to do without direct access to the wal/snap. /cc @xiang90 any thoughts?
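For completeness, a member that rejoins with a wiped data directory is normally started with --initial-cluster-state existing; the names, URLs, and paths below are placeholders, not the poster's actual configuration:

# Start the re-added member against an empty data dir, joining the existing cluster
etcd --name etcd-prod02 \
  --data-dir /var/lib/etcd \
  --initial-advertise-peer-urls http://etcd-02:2380 \
  --listen-peer-urls http://etcd-02:2380 \
  --listen-client-urls http://etcd-02:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://etcd-02:2379 \
  --initial-cluster etcd-prod01=http://etcd-01:2380,etcd-prod02=http://etcd-02:2380,etcd-prod03=http://etcd-03:2380 \
  --initial-cluster-state existing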
@heyitsanthony We can write a tool to clear out the actual values and leave the metadata. @eran-totango We do want to figure out the root cause of the issue. If you would like to help, we can probably hack out a tool to wipe out the sensitive data before you send anything to us.
@xiang90 sure, let's do it.
@heyitsanthony @xiang90
and I'm getting this error:
I did the exact same steps in my test environment (1-4) and it worked perfectly.
@eran-totango k8s never works with etcd 3.0.14. It works with etcd 3.0.17+.
@eran-totango Well, I was wrong. etcd 3.0.12+ should be OK. But have you ever run your cluster with a previous version of etcd? Or was it created with etcd 3.0.14?
This is fixed by #7203. You need a newer version of etcdctl to recover the backup. Try etcd 3.0.17.
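If it helps, the v3 snapshot save/restore flow with a 3.0.17+ etcdctl generally looks like the sketch below; the endpoint, file name, data directory, and URLs are placeholders:

# Take a snapshot from a healthy endpoint
ETCDCTL_API=3 etcdctl --endpoints=http://etcd-01:2379 snapshot save backup.db

# Restore it into a fresh data directory, seeding a new single-member cluster
ETCDCTL_API=3 etcdctl snapshot restore backup.db \
  --name etcd-prod01 \
  --data-dir /var/lib/etcd-restored \
  --initial-cluster etcd-prod01=http://etcd-01:2380 \
  --initial-advertise-peer-urls http://etcd-01:2380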
@xiang90 When I created the Kubernetes cluster I was running etcd v3.0.7 and then I upgraded to v3.0.14. I'll try to restore with etcdctl v3.0.17 and will let you know. Should I upgrade my etcd to v3.0.17 as well?
@eran-totango yes, upgrading is recommended.
Appears to be a configuration issue and possibly a state machine inconsistency that's since been fixed; not much else to do here. Closing.
Hi,
I think that one of my clusters is broken.
In a working cluster (our test env):
When changing a setting in a Kubernetes deployment spec (for example, the number of replicas),
all of my etcd servers (01, 02, and 03) are immediately updated with the new setting,
which I verify by checking ETCDCTL_API=3 etcdctl get /registry/deployments/default/
on each server.
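For a check like this against a single server, the read can be pointed at one member at a time; the host name and the --prefix/--keys-only flags below are illustrative additions, not necessarily the exact command used:

# Point --endpoints at one member and list keys under the deployments prefix
ETCDCTL_API=3 etcdctl --endpoints=http://etcd-01:2379 get /registry/deployments/default/ --prefix --keys-only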
In my broken cluster (our production env):
when I try to change a setting in a deployment, only etcd-01 is updated with the new setting; 02 and 03 aren't being updated.
This issue causes major problems in our production environment,
for example: if kube-apiserver restarts, it reads an old configuration, which causes our microservices to run with old versions, etc.
I think I have a lead on the root cause, though I'm not sure:
A few weeks ago, etcd-prod03 died (it failed its EC2 status checks).
We had problems while trying to replace it with a new server, but we eventually made it work.
Could that be the problem? That even though it has successfully connected to the cluster,
we're facing problems because of it?
When trying:
ETCDCTL_API=3 etcdctl endpoint status
I get:
What should I do in such a case?
Any help will be appreciated. Thanks.