
Inconsistency in writing to etcd (v3.0.14) - I think I have a broken cluster. #7533

Closed
eran-totango opened this issue Mar 19, 2017 · 16 comments


4 participants
@eran-totango commented Mar 19, 2017

Hi,

I think one of my clusters is broken.

In a working cluster (our test env):
When I change a setting in a Kubernetes deployment spec (for example, the number of replicas), all of my etcd servers (01, 02, and 03) are immediately updated with the new setting. I verify this by running ETCDCTL_API=3 etcdctl get /registry/deployments/default/ on each server.

In my broken cluster (our production env):
When I change a setting in a deployment, only etcd-01 is updated with the new setting; 02 and 03 are not.

This issue causes major problems in our production environment. For example, if kube-apiserver restarts, it picks up an old configuration, which causes our microservices to run with old versions, etc.

I think I have a lead on the root cause, though I'm not sure:
A few weeks ago, etcd-prod03 died (it failed its EC2 status checks). We had problems replacing it with a new server, but we eventually managed. Could that be the problem? That even though it has successfully rejoined the cluster, we're facing problems because of it?

When I run ETCDCTL_API=3 etcdctl endpoint status, I get:

etcd-prod01: 127.0.0.1:2379, e2f74e1dab85cd6, 3.0.14, 20 MB, false, 5568, 51548878
etcd-prod02: 127.0.0.1:2379, 89bb99adae595af9, 3.0.14, 55 MB, false, 5568, 51548922
etcd-prod03: 127.0.0.1:2379, 44170dda23246fe, 3.0.14, 55 MB, true, 5568, 51548944

What should I do in such a case?
Any help would be appreciated. Thanks.

@philips (Contributor) commented Mar 19, 2017

How did you recover the new server? Did you try to restore from backup, or do something else?

@eran-totango (Author) commented Mar 19, 2017

@philips I didn't restore from backup.
This server was added as a "new" server to the cluster.

@heyitsanthony (Contributor) commented Mar 19, 2017

Updates should be visible on a majority of members, so something is clearly wrong.

When I change a setting in a deployment, only etcd-01 is updated with the new setting; 02 and 03 are not.

This was tested with etcdctl get, like on the test env?

etcd-prod01: 127.0.0.1:2379, e2f74e1dab85cd6, 3.0.14, 20 MB, false, 5568, 51548878
etcd-prod02: 127.0.0.1:2379, 89bb99adae595af9, 3.0.14, 55 MB, false, 5568, 51548922
etcd-prod03: 127.0.0.1:2379, 44170dda23246fe, 3.0.14, 55 MB, true, 5568, 51548944

This is strange because if 01 is accepting updates that aren't visible on 02 and 03, it should have a raft index larger than the other members, but instead it has 51548878 < 51548922, 51548944.

We had problems replacing it with a new server, but we eventually managed. Could that be the problem? That even though it has successfully rejoined the cluster, we're facing problems because of it?

"eventually made it" sounds like something could be misconfigured. What were the problems / what was the workaround?

The following information should help with debugging:

  1. etcd server logs for each member
  2. ETCDCTL_API=3 ./bin/etcdctl -w json get abc for each member
  3. ETCDCTL_API=3 ./bin/etcdctl member list for each member
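Items 2 and 3 can be collected across all members in one pass; a minimal sketch, assuming the hostnames used elsewhere in this thread resolve from wherever you run it:

```shell
# Collect revision/member info from each member (hostnames assumed
# from this thread; adjust to your environment).
for ep in etcd-prod01.internal etcd-prod02.internal etcd-prod03.internal; do
  echo "== $ep =="
  ETCDCTL_API=3 etcdctl --endpoints "http://$ep:2379" -w json get abc
  ETCDCTL_API=3 etcdctl --endpoints "http://$ep:2379" member list
done
```

Comparing the `revision` fields in the JSON headers side by side is usually the quickest way to spot a diverged member.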
@eran-totango (Author) commented Mar 20, 2017

@heyitsanthony yeah, it was tested just like in the test env.

Could it be that whenever kube-apiserver starts, it picks one of the etcd servers to talk to? That would maybe explain why it currently writes to 01 (and before its latest restart it might have written to another server). Not sure though.

ETCDCTL_API=3 ./bin/etcdctl -w json get abc :

etcd-prod01 - {"header":{"cluster_id":16818498639733619310,"member_id":1022164153822371030,"revision":19787000,"raft_term":5568}}
etcd-prod02 - {"header":{"cluster_id":16818498639733619310,"member_id":9924695175074503417,"revision":15390316,"raft_term":5568}}
etcd-prod03 - {"header":{"cluster_id":16818498639733619310,"member_id":306650346849191678,"revision":15390319,"raft_term":5568}}

ETCDCTL_API=3 ./bin/etcdctl member list

gives the exact same output on all 3 servers:
44170dda23246fe, started, etcd-prod03, http://etcd-prod03.internal:2380, http://10.0.107.64:2379
e2f74e1dab85cd6, started, etcd-prod01, http://etcd-prod01.internal:2380, http://10.0.107.200:2379
89bb99adae595af9, started, etcd-prod02, http://etcd-prod02.internal:2380, http://10.0.107.60:2379

server logs:
etcd-prod01 : https://ufile.io/bcf0c
etcd-prod02 : https://ufile.io/ff8ed
etcd-prod03: https://ufile.io/96da81

@heyitsanthony (Contributor) commented Mar 20, 2017

Could it be that whenever kube-apiserver starts, it picks one of the etcd servers to talk to?

It shouldn't matter so long as the requests go through consensus (which they do).

etcd-prod01 - {"header":{"cluster_id":16818498639733619310,"member_id":1022164153822371030,"revision":19787000,"raft_term":5568}}
etcd-prod02 - {"header":{"cluster_id":16818498639733619310,"member_id":9924695175074503417,"revision":15390316,"raft_term":5568}}
etcd-prod03 - {"header":{"cluster_id":16818498639733619310,"member_id":306650346849191678,"revision":15390319,"raft_term":5568}}

This is bad. 01's revision is way ahead of 02 and 03.

What steps were taken to add 03? @xiang90 thinks it's using an old snapshot that's out of sync with the raft log, which is causing k8s's compare-and-swaps to fail on 02 and 03 but succeed on 01 (there's a safeguard in 3.1.0 for that now). Would it be possible to send the etcd data directories (the snap and wal directories) for each member to team-etcd@coreos.com to confirm?

As a workaround, the easiest fix would be to etcdctl member remove etcd-02 and etcd-03 through etcd-01's endpoint (backing up their member directories first), then etcdctl member add them back through 01 with fresh data directories.
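A rough sketch of that remove/add cycle for one member, using the member ID and hostnames shown earlier in this thread (assumptions: these IDs/hosts match your cluster, and you back up each member's /mnt/etcd before removing it):

```shell
export ETCDCTL_API=3
EP=http://etcd-prod01.internal:2379   # etcd-01's client endpoint

# 1. Remove the stale member (etcd-prod02's ID, per `member list` above),
#    after backing up its data directory on that host.
etcdctl --endpoints "$EP" member remove 89bb99adae595af9

# 2. Register it again before restarting the process on its host.
etcdctl --endpoints "$EP" member add etcd-prod02 \
  --peer-urls=http://etcd-prod02.internal:2380

# 3. On etcd-prod02: wipe /mnt/etcd and start etcd with
#    --initial-cluster-state existing. Repeat 1-3 for etcd-prod03.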

@eran-totango (Author) commented Mar 20, 2017

@heyitsanthony
In order to re-add 03 to the cluster, we launched a new EC2 instance and used this configuration:

etcd --name etcd-prod03 \
--initial-advertise-peer-urls http://10.0.107.200:2380 \
--listen-peer-urls http://10.0.107.200:2380 \
--listen-client-urls http://10.0.107.200:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.0.107.200:2379 \
--initial-cluster etcd-prod01=http://etcd-prod01.internal:2380,etcd-prod02=http://etcd-prod02.internal:2380,etcd-prod03=http://etcd-prod03.internal:2380 \
--initial-cluster-state new \
--data-dir /mnt/etcd

It didn't work, so we used this thread to solve it:
#2780
From what I remember, we deleted the data directory and used --initial-cluster-state existing.

If I want to remove 02 and 03 and then re-add them with fresh data directories, what flags should I use? The current command I use to start the etcd binary is this:

etcd --name <node_name> \
--initial-advertise-peer-urls http://<node_ip>:2380 \
--listen-peer-urls http://<node_ip>:2380 \
--listen-client-urls http://<node_ip>:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://<node_ip>:2379 \
--initial-cluster etcd-prod01=http://etcd-prod01.internal:2380,etcd-prod02=http://etcd-prod02.internal:2380,etcd-prod03=http://etcd-prod03.internal:2380 \
--initial-cluster-state new --data-dir /mnt/etcd

I'd prefer not to send you the data directories, since they contain information about our production environment, such as secret keys, etc.
Is there anything I can check for you instead?

@heyitsanthony (Contributor) commented Mar 20, 2017

@eran-totango that command looks OK. Some comments:

Note that there will be a brief loss of availability when going from 1 -> 2 nodes, since the cluster has to wait until the second member comes up. It's possible to do this without a major outage (there will only be a short leader election): remove/re-add the leader node until 01 is elected (this will be reflected in etcdctl endpoint status and the logs), then remove/re-add the remaining member.
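One way to check which member is currently the leader, parsing the simple endpoint status output (the fifth comma-separated field is the is-leader flag, as in the listings above; hostnames are the ones used in this thread):

```shell
# Query all three members at once and print the leader's endpoint.
export ETCDCTL_API=3
etcdctl \
  --endpoints http://etcd-prod01.internal:2379,http://etcd-prod02.internal:2379,http://etcd-prod03.internal:2379 \
  endpoint status | awk -F', ' '$5 == "true" { print $1 }'
```

Re-run this after each remove/re-add until it reports 01's endpoint.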

No thoughts on what to do without direct access to the wal/snap files. /cc @xiang90, any thoughts?

@xiang90 (Contributor) commented Mar 20, 2017

@heyitsanthony We can write a tool that clears out the actual values and leaves the metadata.

@eran-totango We do want to figure out the root cause of the issue. If you're willing to help, we can probably hack together a tool to wipe the sensitive data before you send anything to us.

@eran-totango (Author) commented Mar 21, 2017

@xiang90 sure, let's do it.

@eran-totango (Author) commented Mar 21, 2017

@heyitsanthony @xiang90
I'm now having problems when trying to restore from a snapshot :(

  1. I created a snapshot of etcd-prod01 using this command:
    ETCDCTL_API=3 etcdctl --endpoints http://localhost:2379 snapshot save snapshot.db
  2. I created a new etcd cluster from scratch.
    (etcd-prod-01, etcd-prod-02, etcd-prod-03 instead of etcd-prod01, etcd-prod02, etcd-prod03)
  3. I copied the etcd-prod01 snapshot file to etcd-prod-01.
  4. I tried to restore it with this command, from etcd-prod-01 (the new server):

ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
--name etcd-prod-01 \
--initial-cluster etcd-prod-01=http://etcd-prod-01.internal:2380,etcd-prod-02=http://etcd-prod-02.internal:2380,etcd-prod-03=http://etcd-prod-03.internal:2380 \
--initial-advertise-peer-urls http://10.0.107.103:2380

and I'm getting this error:

2017-03-21 11:37:16.057468 I | netutil: resolving etcd-prod-01.internal:2380 to 10.0.107.103:2380
2017-03-21 11:37:16.253211 I | mvcc: restore compact to 19944978
panic: no lessor to attach lease

goroutine 1 [running]:
panic(0xc60aa0, 0xc82000c300)
	/usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*store).restore(0xc8200b0180, 0x0, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:420 +0x1335
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.NewStore(0x7f2fcd3af138, 0xc820219b60, 0x0, 0x0, 0x7f2fcd3af190, 0xc8201b55a8, 0x20)
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:120 +0x39e
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdctl/ctlv3/command.makeDB(0xc8201c4fe0, 0x1d, 0x7ffe3c7d0808, 0xb, 0x3)
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdctl/ctlv3/command/snapshot_command.go:374 +0xaef
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdctl/ctlv3/command.snapshotRestoreCommandFunc(0xc8201ba200, 0xc8201bc380, 0x1, 0x7)
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdctl/ctlv3/command/snapshot_command.go:194 +0x66a
github.com/coreos/etcd/cmd/vendor/github.com/spf13/cobra.(*Command).execute(0xc8201ba200, 0xc8201bc310, 0x7, 0x7, 0x0, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/spf13/cobra/command.go:572 +0x85a
github.com/coreos/etcd/cmd/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x1592160, 0xc8201ba200, 0x0, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/spf13/cobra/command.go:662 +0x53f
github.com/coreos/etcd/cmd/vendor/github.com/spf13/cobra.(*Command).Execute(0x1592160, 0x0, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/spf13/cobra/command.go:618 +0x2d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdctl/ctlv3.Start()
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdctl/ctlv3/ctl.go:96 +0x8f
main.main()
	/home/gyuho/go/src/github.com/coreos/etcd/cmd/etcdctl/main.go:40 +0x111

I did the exact same steps (1-4) in my test environment and it worked perfectly.
Both clusters are running v3.0.14. What etcd version should I use?
(I'm running a standalone Kubernetes cluster, v1.5.2.)

@xiang90 (Contributor) commented Mar 21, 2017

@eran-totango k8s doesn't work with etcd 3.0.14. It works with etcd 3.0.17+.

@xiang90 (Contributor) commented Mar 21, 2017

@eran-totango Well, I was wrong: etcd 3.0.12+ should be OK.

But have you ever run your cluster with a previous version of etcd, or was it created with etcd 3.0.14?

@xiang90 (Contributor) commented Mar 21, 2017

2017-03-21 11:37:16.253211 I | mvcc: restore compact to 19944978
panic: no lessor to attach lease

This is fixed by #7203.

You need a newer version of etcdctl to recover the backup. Try etcd 3.0.17.
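A sketch of fetching a standalone 3.0.17 etcdctl for the restore (the release tarball URL pattern is assumed from the coreos/etcd GitHub releases page; the restore flags are the ones from the command earlier in this thread):

```shell
VER=v3.0.17
# Download and unpack the release (Linux amd64 assumed).
curl -L "https://github.com/coreos/etcd/releases/download/${VER}/etcd-${VER}-linux-amd64.tar.gz" \
  | tar xz
# Restore with the newer etcdctl; same snapshot and flags as before.
ETCDCTL_API=3 "./etcd-${VER}-linux-amd64/etcdctl" snapshot restore snapshot.db \
  --name etcd-prod-01 \
  --initial-cluster etcd-prod-01=http://etcd-prod-01.internal:2380,etcd-prod-02=http://etcd-prod-02.internal:2380,etcd-prod-03=http://etcd-prod-03.internal:2380 \
  --initial-advertise-peer-urls http://10.0.107.103:2380
```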

@eran-totango (Author) commented Mar 21, 2017

@xiang90 When I created the Kubernetes cluster I was running etcd v3.0.7, and then I upgraded to v3.0.14. I'll try to restore with etcdctl v3.0.17 and will let you know.

Should I upgrade my etcd servers to v3.0.17 as well?

@heyitsanthony (Contributor) commented Mar 28, 2017

@eran-totango yes, upgrading is recommended.

@heyitsanthony (Contributor) commented May 26, 2017

This appears to be a configuration issue, and possibly a state-machine inconsistency that has since been fixed; not much else to do here. Closing.
