Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcd crashes after restart with panic: unexpected removal of unknown remote peer #13119

Closed
smira opened this issue Jun 17, 2021 · 12 comments
Closed
Labels

Comments

@smira
Copy link

smira commented Jun 17, 2021

This seems to be similar to some old issues, e.g. #3106

etcd version: 3.5.0

The exact steps to reproduce are not clear, but the bug is reproducible with the integration test for Talos (github.com/talos-systems/talos).

The integration test performs multiple steps reconfiguring the 3 node cluster:

  • removing members and adding them back
  • recovering etcd from snapshot on all nodes
  • moving leadership from one node to another

Eventually a test reboots a node (which eventually restarts etcd), and one of the members goes into infinite crash loop with the panic:

172.20.0.4: panic: unexpected removal of unknown remote peer
172.20.0.4: 
172.20.0.4: goroutine 180 [running]:
172.20.0.4: go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0001143c0, 0xc00020d540, 0x1, 0x1)
172.20.0.4: 	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
172.20.0.4: go.uber.org/zap.(*Logger).Panic(0xc0003f0320, 0x122d9ab, 0x29, 0xc00020d540, 0x1, 0x1)
172.20.0.4: 	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
172.20.0.4: go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).removePeer(0xc000451500, 0x429dbf9564993e31)
172.20.0.4: 	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/api/rafthttp/transport.go:346 +0x58c
172.20.0.4: go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).RemovePeer(0xc000451500, 0x429dbf9564993e31)
172.20.0.4: 	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/api/rafthttp/transport.go:329 +0x7d
172.20.0.4: go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange(0xc000483800, 0x1, 0x429dbf9564993e31, 0x0, 0x0, 0x0, 0x3e317a1b9d752e72, 0xc000414c80, 0x0, 0x0, ...)
172.20.0.4: 	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:2301 +0x872
172.20.0.4: go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply(0xc000483800, 0xc0000ea030, 0x2a3, 0x2a9, 0xc000414c80, 0x0, 0x0, 0x0)
172.20.0.4: 	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:2133 +0x59a
172.20.0.4: go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries(0xc000483800, 0xc000414c80, 0xc000424fd0)
172.20.0.4: 	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:1357 +0xe5
172.20.0.4: go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll(0xc000483800, 0xc000414c80, 0xc000424fd0)
172.20.0.4: 	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:1179 +0x88
172.20.0.4: go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8(0x1399c10, 0xc0003c7a80)
172.20.0.4: 	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:1111 +0x3c
172.20.0.4: go.etcd.io/etcd/pkg/v3/schedule.(*fifo).run(0xc000390cc0)
172.20.0.4: 	/tmp/etcd-release-3.5.0/etcd/release/etcd/pkg/schedule/schedule.go:157 +0xf3
172.20.0.4: created by go.etcd.io/etcd/pkg/v3/schedule.NewFIFOScheduler
172.20.0.4: 	/tmp/etcd-release-3.5.0/etcd/release/etcd/pkg/schedule/schedule.go:70 +0x13b
@smira
Copy link
Author

smira commented Jun 17, 2021

Complete log of the etcd server:

172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.318Z","caller":"etcdmain/etcd.go:72","msg":"Running: ","args":["/usr/local/bin/etcd","--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt","--peer-client-cert-auth=true","--listen-client-urls=https://0.0.0.0:2379","--advertise-client-urls=https://172.20.0.4:2379","--data-dir=/var/lib/etcd","--key-file=/etc/kubernetes/pki/etcd/peer.key","--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt","--peer-key-file=/etc/kubernetes/pki/etcd/peer.key","--name=talos-default-master-3","--listen-peer-urls=https://0.0.0.0:2380","--cert-file=/etc/kubernetes/pki/etcd/peer.crt","--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt"]}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.318Z","caller":"etcdmain/etcd.go:93","msg":"detected default host for advertise","host":"172.20.0.4"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.318Z","caller":"etcdmain/etcd.go:115","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.318Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["https://0.0.0.0:2380"]}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.318Z","caller":"embed/etcd.go:478","msg":"starting with peer TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, client-cert=, client-key=, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.324Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["https://0.0.0.0:2379"]}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.324Z","caller":"embed/etcd.go:307","msg":"starting an etcd server","etcd-version":"3.5.0","git-sha":"946a5a6f2","go-version":"go1.16.3","go-os":"linux","go-arch":"amd64","max-cpu-set":2,"max-cpu-available":2,"member-initialized":true,"name":"talos-default-master-3","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://172.20.0.4:2380"],"listen-peer-urls":["https://0.0.0.0:2380"],"advertise-client-urls":["https://172.20.0.4:2379"],"listen-client-urls":["https://0.0.0.0:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.326Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"1.649487ms"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.329Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":37,"snapshot-size":"9.2 kB"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.329Z","caller":"etcdserver/server.go:518","msg":"recovered v3 backend from snapshot","backend-size-bytes":3563520,"backend-size":"3.6 MB","backend-size-in-use-bytes":3547136,"backend-size-in-use":"3.5 MB"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.331Z","caller":"etcdserver/raft.go:483","msg":"restarting local member","cluster-id":"a3390e43eb5274e2","local-member-id":"37665d9de7f5c22b","commit-index":712}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.332Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"37665d9de7f5c22b switched to configuration voters=(4800203426205613617 16955893525476640094) learners=(3991981052493414955)"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.332Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"37665d9de7f5c22b became follower at term 7"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.332Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"newRaft 37665d9de7f5c22b [peers: [429dbf9564993e31,eb4f6f534361855e], term: 7, commit: 712, applied: 37, lastindex: 712, lastterm: 7]"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.332Z","caller":"api/capability.go:75","msg":"enabled capabilities for version","cluster-version":"3.5"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.332Z","caller":"membership/cluster.go:276","msg":"recovered/added member from store","cluster-id":"a3390e43eb5274e2","local-member-id":"37665d9de7f5c22b","recovered-remote-peer-id":"37665d9de7f5c22b","recovered-remote-peer-urls":["https://172.20.0.4:2380"]}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.332Z","caller":"membership/cluster.go:276","msg":"recovered/added member from store","cluster-id":"a3390e43eb5274e2","local-member-id":"37665d9de7f5c22b","recovered-remote-peer-id":"9daebfd8954a9ca0","recovered-remote-peer-urls":["https://172.20.0.3:2380"]}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.332Z","caller":"membership/cluster.go:276","msg":"recovered/added member from store","cluster-id":"a3390e43eb5274e2","local-member-id":"37665d9de7f5c22b","recovered-remote-peer-id":"eb4f6f534361855e","recovered-remote-peer-urls":["https://172.20.0.2:2380"]}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.332Z","caller":"membership/cluster.go:285","msg":"set cluster version from store","cluster-version":"3.5"}
172.20.0.4: {"level":"warn","ts":"2021-06-17T20:22:48.333Z","caller":"auth/store.go:1220","msg":"simple token is not cryptographically signed"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.334Z","caller":"mvcc/kvstore.go:345","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":12075}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.338Z","caller":"mvcc/kvstore.go:415","msg":"kvstore restored","current-rev":14276}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.339Z","caller":"etcdserver/quota.go:94","msg":"enabled backend quota with default value","quota-name":"v3-applier","quota-size-bytes":2147483648,"quota-size":"2.1 GB"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.340Z","caller":"rafthttp/peer.go:133","msg":"starting remote peer","remote-peer-id":"9daebfd8954a9ca0"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.340Z","caller":"rafthttp/pipeline.go:72","msg":"started HTTP pipelining with remote peer","local-member-id":"37665d9de7f5c22b","remote-peer-id":"9daebfd8954a9ca0"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.340Z","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"37665d9de7f5c22b","remote-peer-id":"9daebfd8954a9ca0"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.340Z","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"37665d9de7f5c22b","remote-peer-id":"9daebfd8954a9ca0"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.340Z","caller":"rafthttp/peer.go:137","msg":"started remote peer","remote-peer-id":"9daebfd8954a9ca0"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.340Z","caller":"rafthttp/transport.go:317","msg":"added remote peer","local-member-id":"37665d9de7f5c22b","remote-peer-id":"9daebfd8954a9ca0","remote-peer-urls":["https://172.20.0.3:2380"]}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.340Z","caller":"rafthttp/peer.go:133","msg":"starting remote peer","remote-peer-id":"eb4f6f534361855e"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.340Z","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"37665d9de7f5c22b","remote-peer-id":"9daebfd8954a9ca0"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.340Z","caller":"rafthttp/pipeline.go:72","msg":"started HTTP pipelining with remote peer","local-member-id":"37665d9de7f5c22b","remote-peer-id":"eb4f6f534361855e"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.340Z","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"37665d9de7f5c22b","remote-peer-id":"9daebfd8954a9ca0"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.341Z","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"37665d9de7f5c22b","remote-peer-id":"eb4f6f534361855e"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.341Z","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"37665d9de7f5c22b","remote-peer-id":"eb4f6f534361855e"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.342Z","caller":"rafthttp/peer.go:137","msg":"started remote peer","remote-peer-id":"eb4f6f534361855e"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.342Z","caller":"rafthttp/transport.go:317","msg":"added remote peer","local-member-id":"37665d9de7f5c22b","remote-peer-id":"eb4f6f534361855e","remote-peer-urls":["https://172.20.0.2:2380"]}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.342Z","caller":"etcdserver/server.go:834","msg":"starting etcd server","local-member-id":"37665d9de7f5c22b","local-server-version":"3.5.0","cluster-id":"a3390e43eb5274e2","cluster-version":"3.5"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.343Z","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"37665d9de7f5c22b","remote-peer-id":"eb4f6f534361855e"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.343Z","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"37665d9de7f5c22b","remote-peer-id":"eb4f6f534361855e"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.345Z","caller":"rafthttp/peer_status.go:53","msg":"peer became active","peer-id":"9daebfd8954a9ca0"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.345Z","caller":"rafthttp/stream.go:412","msg":"established TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"37665d9de7f5c22b","remote-peer-id":"9daebfd8954a9ca0"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.346Z","caller":"embed/etcd.go:687","msg":"starting with client TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, client-cert=, client-key=, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = false, crl-file = ","cipher-suites":[]}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.346Z","caller":"embed/etcd.go:276","msg":"now serving peer/client/metrics","local-member-id":"37665d9de7f5c22b","initial-advertise-peer-urls":["http://172.20.0.4:2380"],"listen-peer-urls":["https://0.0.0.0:2380"],"advertise-client-urls":["https://172.20.0.4:2379"],"listen-client-urls":["https://0.0.0.0:2379"],"listen-metrics-urls":[]}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.346Z","caller":"etcdserver/server.go:744","msg":"starting initial election tick advance","election-ticks":10}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.346Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"37665d9de7f5c22b switched to configuration voters=(3991981052493414955 4800203426205613617 16955893525476640094)"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.346Z","caller":"membership/cluster.go:482","msg":"promote member","cluster-id":"a3390e43eb5274e2","local-member-id":"37665d9de7f5c22b"}
172.20.0.4: {"level":"info","ts":"2021-06-17T20:22:48.347Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"37665d9de7f5c22b switched to configuration voters=(3991981052493414955 16955893525476640094)"}
172.20.0.4: {"level":"warn","ts":"2021-06-17T20:22:48.347Z","caller":"membership/cluster.go:427","msg":"skipped removing already removed member","cluster-id":"a3390e43eb5274e2","local-member-id":"37665d9de7f5c22b","removed-remote-peer-id":"429dbf9564993e31"}
172.20.0.4: {"level":"panic","ts":"2021-06-17T20:22:48.347Z","caller":"rafthttp/transport.go:346","msg":"unexpected removal of unknown remote peer","remote-peer-id":"429dbf9564993e31","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).removePeer\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/api/rafthttp/transport.go:346\ngo.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).RemovePeer\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/api/rafthttp/transport.go:329\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:2301\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:2133\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:1357\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:1179\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:1111\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/pkg/schedule/schedule.go:157"}
172.20.0.4: panic: unexpected removal of unknown remote peer
172.20.0.4: 
172.20.0.4: goroutine 180 [running]:
172.20.0.4: go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0001143c0, 0xc00020d540, 0x1, 0x1)
172.20.0.4: 	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
172.20.0.4: go.uber.org/zap.(*Logger).Panic(0xc0003f0320, 0x122d9ab, 0x29, 0xc00020d540, 0x1, 0x1)
172.20.0.4: 	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
172.20.0.4: go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).removePeer(0xc000451500, 0x429dbf9564993e31)
172.20.0.4: 	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/api/rafthttp/transport.go:346 +0x58c
172.20.0.4: go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).RemovePeer(0xc000451500, 0x429dbf9564993e31)
172.20.0.4: 	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/api/rafthttp/transport.go:329 +0x7d
172.20.0.4: go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange(0xc000483800, 0x1, 0x429dbf9564993e31, 0x0, 0x0, 0x0, 0x3e317a1b9d752e72, 0xc000414c80, 0x0, 0x0, ...)
172.20.0.4: 	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:2301 +0x872
172.20.0.4: go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply(0xc000483800, 0xc0000ea030, 0x2a3, 0x2a9, 0xc000414c80, 0x0, 0x0, 0x0)
172.20.0.4: 	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:2133 +0x59a
172.20.0.4: go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries(0xc000483800, 0xc000414c80, 0xc000424fd0)
172.20.0.4: 	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:1357 +0xe5
172.20.0.4: go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll(0xc000483800, 0xc000414c80, 0xc000424fd0)
172.20.0.4: 	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:1179 +0x88
172.20.0.4: go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8(0x1399c10, 0xc0003c7a80)
172.20.0.4: 	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:1111 +0x3c
172.20.0.4: go.etcd.io/etcd/pkg/v3/schedule.(*fifo).run(0xc000390cc0)
172.20.0.4: 	/tmp/etcd-release-3.5.0/etcd/release/etcd/pkg/schedule/schedule.go:157 +0xf3
172.20.0.4: created by go.etcd.io/etcd/pkg/v3/schedule.NewFIFOScheduler
172.20.0.4: 	/tmp/etcd-release-3.5.0/etcd/release/etcd/pkg/schedule/schedule.go:70 +0x13b

@smira
Copy link
Author

smira commented Jun 17, 2021

what is interesting, one line before the panic etcd seems to ignore already removed member:

172.20.0.4: {"level":"warn","ts":"2021-06-17T20:22:48.347Z","caller":"membership/cluster.go:427","msg":"skipped removing already removed member","cluster-id":"a3390e43eb5274e2","local-member-id":"37665d9de7f5c22b","removed-remote-peer-id":"429dbf9564993e31"}

but panics right after that.

@smira
Copy link
Author

smira commented Jun 17, 2021

exactly same integration test steps pass just fine with 3.4.15, but with etcd 3.5.0 Talos also switched to adding new members as learners and promoting them, so not clear if learner mode plays some role here.

@nickpoorman
Copy link

This can be reproduced by running the https://github.com/etcd-io/etcd/tree/main/contrib/raftexample and issuing a delete for a node that doesn't exist:

curl -L http://127.0.0.1:12380/9 -XDELETE

@hangzhouwanjun
Copy link

I also encountered the same problem. That change 3 nodes cluster to 1 nodes cluster, and restart etcd.service.

@abezard
Copy link

abezard commented Aug 13, 2021

I'm also hitting this exact issue when upgrading from 3.4.X to 3.5.0. Here are the steps I followed to reproduce:

  • I had a 7 nodes cluster
  • I rotated all the nodes in my cluster 1 by 1 (maintenance)
  • a week later I tried to upgrade my cluster, started with one node
  • all the nodes I tried to upgrade failed to start with the same panic: unexpected removal of unknown remote peer
  • if I delete 1 node and add it back, it will first start fine and then eventually fail on the same error.

@medva1997
Copy link

etcd cant start with --force-new-cluster=true with panic: unexpected removal of unknown remote peer , and i can`t downgrade to 3.4.16.
I was able to start my cluster via etcdctl snapshot restore

@kvaps
Copy link

kvaps commented Aug 26, 2021

Same here tried to recover cluster after #13196 (comment), by starting it with --force-new-cluster but etcd didn't start with the following error:

{"level":"panic","ts":"2021-08-26T20:36:06.308Z","caller":"rafthttp/transport.go:346","msg":"unexpected removal of unknown remote peer","remote-peer-id":"16c4e714954750a7","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).removePeer\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/api/rafthttp/transport.go:346\ngo.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).RemovePeer\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/api/rafthttp/transport.go:329\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:2301\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:2133\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:1357\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:1179\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:1111\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/pkg/schedule/schedule.go:157"}
panic: unexpected removal of unknown remote peer

goroutine 143 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0002026c0, 0xc004de4e00, 0x1, 0x1)
	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000091270, 0x122d9ab, 0x29, 0xc004de4e00, 0x1, 0x1)
	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).removePeer(0xc000212fc0, 0x16c4e714954750a7)
	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/api/rafthttp/transport.go:346 +0x58c
go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).RemovePeer(0xc000212fc0, 0x16c4e714954750a7)
	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/api/rafthttp/transport.go:329 +0x7d
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange(0xc00047ea00, 0x1, 0x16c4e714954750a7, 0x0, 0x0, 0x0, 0x0, 0xc0035b0500, 0x0, 0x0, ...)
	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:2301 +0x872
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply(0xc00047ea00, 0xc003081ca0, 0x132, 0x167, 0xc0035b0500, 0xc003b0ab98, 0xc003b0ab50, 0xc705fd)
	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:2133 +0x59a
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries(0xc00047ea00, 0xc0035b0500, 0xc003b24dc0)
	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:1357 +0xe5
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll(0xc00047ea00, 0xc0035b0500, 0xc003b24dc0)
	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:1179 +0x88
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8(0x1399c10, 0xc003df8540)
	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:1111 +0x3c
go.etcd.io/etcd/pkg/v3/schedule.(*fifo).run(0xc000485980)
	/tmp/etcd-release-3.5.0/etcd/release/etcd/pkg/schedule/schedule.go:157 +0xf3
created by go.etcd.io/etcd/pkg/v3/schedule.NewFIFOScheduler
	/tmp/etcd-release-3.5.0/etcd/release/etcd/pkg/schedule/schedule.go:70 +0x13b
command terminated with exit code 2

smira added a commit to smira/talos that referenced this issue Oct 15, 2021
See https://github.com/etcd-io/etcd/releases/tag/v3.5.1

This version has a fix for member info getting out of sync:
etcd-io/etcd#13119

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
smira added a commit to smira/talos that referenced this issue Oct 15, 2021
See https://github.com/etcd-io/etcd/releases/tag/v3.5.1

This version has a fix for member info getting out of sync:
etcd-io/etcd#13119

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
@smira
Copy link
Author

smira commented Oct 15, 2021

The issue seems to be fixed with etcd 3.5.1

smira added a commit to smira/talos that referenced this issue Oct 18, 2021
See https://github.com/etcd-io/etcd/releases/tag/v3.5.1

This version has a fix for member info getting out of sync:
etcd-io/etcd#13119

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
@moonovo
Copy link

moonovo commented Nov 15, 2021

is this issue fixed by PR 13348 ?

@ahrtr
Copy link
Member

ahrtr commented Nov 22, 2021

@stale
Copy link

stale bot commented Feb 20, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

8 participants