single-node etcd cluster for kubernetes spontaneously corrupted, cannot restore #16471

dbeal-eth · 2023-08-25T05:41:04Z

dbeal-eth
Aug 25, 2023

Hello! I have been running a Kubernetes cluster with an etcd backend for 2-3 years now. This weekend, however, I ran into what appears to be a completely unpredictable issue that has stopped me from restoring my Kubernetes cluster (and I would rather avoid having to reimport all my manifests from scratch).

I run my kubernetes cluster with a single node on microk8s, mostly for running my personal production software. Upon initial inspection, my etcd node was down, and it was failing to start with the following logs:

Aug 24 23:39:12 dell-ubuntu systemd[1]: Started Service for snap application microk8s.daemon-etcd.
Aug 24 23:39:12 dell-ubuntu microk8s.daemon-etcd[4173482]: {"level":"info","ts":"2023-08-24T23:39:12.888Z","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["/snap/microk8s/5643/etcd","--data-dir=/var/snap/microk8s/common/var/run/etcd","--advertise-client-urls=https://192.168.85.7:12379","--listen-client-urls=http://0.0.0.0:2379,https://0.0.0.0:12379","--trusted-ca-file=/var/snap/microk8s/5643/certs/ca.crt","--cert-file=/var/snap/microk8s/5643/certs/server.crt","--key-file=/var/snap/microk8s/5643/certs/server.key","--enable-v2=false","--force-new-cluster"]}
Aug 24 23:39:12 dell-ubuntu microk8s.daemon-etcd[4173482]: {"level":"info","ts":"2023-08-24T23:39:12.889Z","caller":"etcdmain/etcd.go:116","msg":"server has been already initialized","data-dir":"/var/snap/microk8s/common/var/run/etcd","dir-type":"member"}
Aug 24 23:39:12 dell-ubuntu microk8s.daemon-etcd[4173482]: {"level":"info","ts":"2023-08-24T23:39:12.889Z","caller":"embed/etcd.go:124","msg":"configuring peer listeners","listen-peer-urls":["http://localhost:2380"]}
Aug 24 23:39:12 dell-ubuntu microk8s.daemon-etcd[4173482]: {"level":"info","ts":"2023-08-24T23:39:12.889Z","caller":"embed/etcd.go:132","msg":"configuring client listeners","listen-client-urls":["http://0.0.0.0:2379","https://0.0.0.0:12379"]}
Aug 24 23:39:12 dell-ubuntu microk8s.daemon-etcd[4173482]: {"level":"warn","ts":"2023-08-24T23:39:12.889Z","caller":"embed/etcd.go:610","msg":"scheme is HTTP while key and cert files are present; ignoring key and cert files","client-url":"http://0.0.0.0:2379"}
Aug 24 23:39:12 dell-ubuntu microk8s.daemon-etcd[4173482]: {"level":"info","ts":"2023-08-24T23:39:12.890Z","caller":"embed/etcd.go:306","msg":"starting an etcd server","etcd-version":"3.5.5","git-sha":"19002cf","go-version":"go1.20.6","go-os":"linux","go-arch":"amd64","max-cpu-set":32,"max-cpu-available":32,"member-initialized":true,"name":"default","data-dir":"/var/snap/microk8s/common/var/run/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/snap/microk8s/common/var/run/etcd/member","force-new-cluster":true,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["http://localhost:2380"],"advertise-client-urls":["https://192.168.85.7:12379"],"listen-client-urls":["http://0.0.0.0:2379","https://0.0.0.0:12379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-backend-bytes":2147483648,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
Aug 24 23:39:12 dell-ubuntu microk8s.daemon-etcd[4173482]: {"level":"info","ts":"2023-08-24T23:39:12.896Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/snap/microk8s/common/var/run/etcd/member/snap/db","took":"5.755896ms"}
Aug 24 23:39:12 dell-ubuntu microk8s.daemon-etcd[4173482]: {"level":"warn","ts":"2023-08-24T23:39:12.896Z","caller":"wal/util.go:90","msg":"ignored file in WAL directory","path":"00000000000007fd-000000000fa954d7.wal.broken"}
Aug 24 23:39:14 dell-ubuntu microk8s.daemon-etcd[4173482]: {"level":"info","ts":"2023-08-24T23:39:14.199Z","caller":"embed/etcd.go:371","msg":"closing etcd server","name":"default","data-dir":"/var/snap/microk8s/common/var/run/etcd","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["https://192.168.85.7:12379"]}
Aug 24 23:39:14 dell-ubuntu microk8s.daemon-etcd[4173482]: {"level":"info","ts":"2023-08-24T23:39:14.200Z","caller":"embed/etcd.go:373","msg":"closed etcd server","name":"default","data-dir":"/var/snap/microk8s/common/var/run/etcd","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["https://192.168.85.7:12379"]}
Aug 24 23:39:14 dell-ubuntu microk8s.daemon-etcd[4173482]: {"level":"fatal","ts":"2023-08-24T23:39:14.200Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"wal: max entry size limit exceeded, recBytes: 167, fileSize(50860032) - offset(50860008) - padBytes(1) = entryLimit(23)","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/build/microk8s/parts/etcd/build/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/build/microk8s/parts/etcd/build/etcd/server/etcdmain/main.go:40\nmain.main\n\t/build/microk8s/parts/etcd/build/etcd/server/main.go:32\nruntime.main\n\t/snap/go/current/src/runtime/proc.go:250"}
Aug 24 23:39:14 dell-ubuntu systemd[1]: snap.microk8s.daemon-etcd.service: Main process exited, code=exited, status=1/FAILURE

I am not sure how I ran into this error, as my cluster has not been using any of the options that have a history of causing this kind of issue (e.g. #14025). So I tried deleting files from my member/wal directory in an attempt to clear out the offending log. The most recent .wal file turned out to be the offender, and after renaming it to .broken, my etcd node now starts up without error, with the following startup logs:

Aug 25 05:04:10 dell-ubuntu systemd[1]: Started Service for snap application microk8s.daemon-etcd.
Aug 25 05:04:10 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:10.714Z","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["/snap/microk8s/5643/etcd","--data-dir=/var/snap/microk8s/common/var/run/etcd","--advertise-client-urls=https://192.168.85.7:12379","--listen-client-urls=https://0.0.0.0:12379","--trusted-ca-file=/var/snap/microk8s/5643/certs/ca.crt","--cert-file=/var/snap/microk8s/5643/certs/server.crt","--key-file=/var/snap/microk8s/5643/certs/server.key","--enable-v2=false"]}
Aug 25 05:04:10 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:10.715Z","caller":"etcdmain/etcd.go:116","msg":"server has been already initialized","data-dir":"/var/snap/microk8s/common/var/run/etcd","dir-type":"member"}
Aug 25 05:04:10 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:10.715Z","caller":"embed/etcd.go:124","msg":"configuring peer listeners","listen-peer-urls":["http://localhost:2380"]}
Aug 25 05:04:10 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:10.715Z","caller":"embed/etcd.go:132","msg":"configuring client listeners","listen-client-urls":["https://0.0.0.0:12379"]}
Aug 25 05:04:10 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:10.716Z","caller":"embed/etcd.go:306","msg":"starting an etcd server","etcd-version":"3.5.5","git-sha":"19002cf","go-version":"go1.20.6","go-os":"linux","go-arch":"amd64","max-cpu-set":32,"max-cpu-available":32,"member-initialized":true,"name":"default","data-dir":"/var/snap/microk8s/common/var/run/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/snap/microk8s/common/var/run/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["http://localhost:2380"],"advertise-client-urls":["https://192.168.85.7:12379"],"listen-client-urls":["https://0.0.0.0:12379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-backend-bytes":2147483648,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
Aug 25 05:04:10 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:10.722Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/snap/microk8s/common/var/run/etcd/member/snap/db","took":"5.835682ms"}
Aug 25 05:04:10 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:10.723Z","caller":"wal/util.go:90","msg":"ignored file in WAL directory","path":"00000000000007fd-000000000fa954d7.wal.broken"}
Aug 25 05:04:10 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:10.723Z","caller":"wal/util.go:90","msg":"ignored file in WAL directory","path":"0000000000000ce6-00000000196639ee.wal.broken"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:11.576Z","caller":"snap/snapshotter.go:249","msg":"found unexpected non-snap file; skipping","path":"000000000000030a-0000000019676210.snap.broken"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:11.576Z","caller":"snap/snapshotter.go:249","msg":"found unexpected non-snap file; skipping","path":"0000000000000e52-00000000196778a4.snap.broken"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:11.576Z","caller":"snap/snapshotter.go:249","msg":"found unexpected non-snap file; skipping","path":"00000000000001f7-0000000019675fe8.snap.broken"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:11.576Z","caller":"snap/snapshotter.go:249","msg":"found unexpected non-snap file; skipping","path":"0000000000000daf-000000001967775e.snap.broken"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:11.576Z","caller":"snap/snapshotter.go:249","msg":"found unexpected non-snap file; skipping","path":"000000000000057d-00000000196766f5.snap.broken"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:11.576Z","caller":"snap/snapshotter.go:249","msg":"found unexpected non-snap file; skipping","path":"0000000000000a36-000000001967706b.snap.broken"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:11.576Z","caller":"snap/snapshotter.go:249","msg":"found unexpected non-snap file; skipping","path":"0000000000000a08-000000001967700e.snap.broken"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:11.576Z","caller":"snap/snapshotter.go:249","msg":"found unexpected non-snap file; skipping","path":"0000000000000446-0000000019676487.snap.broken"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:11.576Z","caller":"snap/snapshotter.go:249","msg":"found unexpected non-snap file; skipping","path":"0000000000000c66-00000000196774cc.snap.broken"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:11.597Z","caller":"etcdserver/server.go:716","msg":"detected custom v2store content. Etcd v3.5 is the last version allowing to access it using API v2. Please remove the content."}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:11.598Z","caller":"etcdserver/server.go:509","msg":"recovered v2 store from snapshot","snapshot-index":426104371,"snapshot-size":"545 kB"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:11.598Z","caller":"etcdserver/server.go:522","msg":"recovered v3 backend from snapshot","backend-size-bytes":17571840,"backend-size":"18 MB","backend-size-in-use-bytes":4681728,"backend-size-in-use":"4.7 MB"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:11.598Z","caller":"wal/util.go:90","msg":"ignored file in WAL directory","path":"00000000000007fd-000000000fa954d7.wal.broken"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:11.598Z","caller":"wal/util.go:90","msg":"ignored file in WAL directory","path":"0000000000000ce6-00000000196639ee.wal.broken"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:11.922Z","caller":"etcdserver/raft.go:529","msg":"restarting local member","cluster-id":"cdf818194e3a8c32","local-member-id":"8e9e05c52164694d","commit-index":426129901}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:11.924Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d switched to configuration voters=(10276657743932975437)"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:11.924Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d became follower at term 111"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:11.924Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"newRaft 8e9e05c52164694d [peers: [8e9e05c52164694d], term: 111, commit: 426129901, applied: 426104371, lastindex: 426129901, lastterm: 111]"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:11.924Z","caller":"api/capability.go:75","msg":"enabled capabilities for version","cluster-version":"3.5"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:11.924Z","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"cdf818194e3a8c32","local-member-id":"8e9e05c52164694d","recovered-remote-peer-id":"8e9e05c52164694d","recovered-remote-peer-urls":["http://localhost:2380"]}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:11.924Z","caller":"membership/cluster.go:287","msg":"set cluster version from store","cluster-version":"3.5"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:11.924Z","caller":"auth/store.go:1233","msg":"simple token is not cryptographically signed"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:11.925Z","caller":"mvcc/kvstore.go:323","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":227316496}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:11.930Z","caller":"mvcc/kvstore.go:393","msg":"kvstore restored","current-rev":227317188}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:11.931Z","caller":"etcdserver/quota.go:94","msg":"enabled backend quota with default value","quota-name":"v3-applier","quota-size-bytes":2147483648,"quota-size":"2.1 GB"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:11.931Z","caller":"etcdserver/server.go:845","msg":"starting etcd server","local-member-id":"8e9e05c52164694d","local-server-version":"3.5.5","cluster-id":"cdf818194e3a8c32","cluster-version":"3.5"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:11.931Z","caller":"etcdserver/server.go:738","msg":"started as single-node; fast-forwarding election ticks","local-member-id":"8e9e05c52164694d","forward-ticks":9,"forward-duration":"900ms","election-ticks":10,"election-timeout":"1s"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:11.934Z","caller":"embed/etcd.go:685","msg":"starting with client TLS","tls-info":"cert = /var/snap/microk8s/5643/certs/server.crt, key = /var/snap/microk8s/5643/certs/server.key, client-cert=, client-key=, trusted-ca = /var/snap/microk8s/5643/certs/ca.crt, client-cert-auth = false, crl-file = ","cipher-suites":[]}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:11.934Z","caller":"embed/etcd.go:584","msg":"serving peer traffic","address":"127.0.0.1:2380"}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:11.934Z","caller":"embed/etcd.go:275","msg":"now serving peer/client/metrics","local-member-id":"8e9e05c52164694d","initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["http://localhost:2380"],"advertise-client-urls":["https://192.168.85.7:12379"],"listen-client-urls":["https://0.0.0.0:12379"],"listen-metrics-urls":[]}
Aug 25 05:04:11 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:11.934Z","caller":"embed/etcd.go:556","msg":"cmux::serve","address":"127.0.0.1:2380"}
Aug 25 05:04:12 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:12.925Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d is starting a new election at term 111"}
Aug 25 05:04:12 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:12.925Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d became pre-candidate at term 111"}
Aug 25 05:04:12 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:12.925Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d received MsgPreVoteResp from 8e9e05c52164694d at term 111"}
Aug 25 05:04:12 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:12.925Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d became candidate at term 112"}
Aug 25 05:04:12 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:12.925Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d received MsgVoteResp from 8e9e05c52164694d at term 112"}
Aug 25 05:04:12 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:12.925Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d became leader at term 112"}
Aug 25 05:04:12 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:12.925Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 112"}
Aug 25 05:04:12 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:12.928Z","caller":"wal/wal.go:785","msg":"created a new WAL segment","path":"/var/snap/microk8s/common/var/run/etcd/member/wal/0000000000000ce6-00000000196639f0.wal"}
Aug 25 05:04:12 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:12.928Z","caller":"embed/serve.go:100","msg":"ready to serve client requests"}
Aug 25 05:04:12 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:12.928Z","caller":"etcdserver/server.go:2054","msg":"published local member to cluster through raft","local-member-id":"8e9e05c52164694d","local-member-attributes":"{Name:default ClientURLs:[https://192.168.85.7:12379]}","request-path":"/0/members/8e9e05c52164694d/attributes","cluster-id":"cdf818194e3a8c32","publish-timeout":"7s"}
Aug 25 05:04:12 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:12.929Z","caller":"etcdmain/main.go:44","msg":"notifying init daemon"}
Aug 25 05:04:12 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:12.929Z","caller":"etcdmain/main.go:50","msg":"successfully notified init daemon"}
Aug 25 05:04:12 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"info","ts":"2023-08-25T05:04:12.933Z","caller":"embed/serve.go:198","msg":"serving client traffic securely","address":"[::]:12379"}
Aug 25 05:04:20 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:20.836Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2023-08-25T05:04:15.931Z","time spent":"4.904799908s","remote":"127.0.0.1:49282","response type":"/etcdserverpb.KV/DeleteRange","request count":0,"request size":28,"response count":0,"response size":0,"request content":"key:\"/coreos.com/network/config\" "}
Aug 25 05:04:22 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:22.271Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2023-08-25T05:04:15.270Z","time spent":"7.000651843s","remote":"127.0.0.1:44364","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""}
Aug 25 05:04:22 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:22.375Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2023-08-25T05:04:15.374Z","time spent":"7.000276802s","remote":"127.0.0.1:44580","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""}
Aug 25 05:04:22 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:22.391Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2023-08-25T05:04:15.390Z","time spent":"7.000973249s","remote":"127.0.0.1:44452","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""}
Aug 25 05:04:22 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:22.391Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2023-08-25T05:04:15.390Z","time spent":"7.000783114s","remote":"127.0.0.1:44580","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""}
Aug 25 05:04:25 dell-ubuntu microk8s.daemon-etcd[344511]: {"level":"warn","ts":"2023-08-25T05:04:25.871Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2023-08-25T05:04:20.887Z","time spent":"4.98361771s","remote":"127.0.0.1:49290","response type":"/etcdserverpb.KV/Put","request count":1,"request size":86,"response count":0,"response size":0,"request content":"key:\"/coreos.com/network/config\" value_size:56 "}
... repeats ...

However, despite the node appearing healthy and running normally, any write operation I attempt on the data already inside the cluster fails. For example, if I attempt to delete a rogue pod on my Kubernetes cluster:

beald@dell-ubuntu:~$ kubectl delete po some-pod
Error from server: etcdserver: request timed out

Read operations appear to work OK (ex. I am able to list pods, deployments, etc.)

This type of error appears to happen for any write to existing data. As another example, I noticed this message in the logs: {"level":"warn","ts":"2023-08-25T05:11:43.511Z","caller":"etcdserver/server.go:1159","msg":"failed to revoke lease","lease-id":"694d89c114805b7f","error":"etcdserver: request timed out"}. Thinking that this lease might have something to do with why all the writes were blocked, I tried revoking it with etcdctl lease revoke 694d89c114805b7f, but this gave the error Error: failed to revoke lease (context deadline exceeded). I later tried running etcdctl lease timetolive 694d89c114805b7f, which yielded lease 694d89c114805b7f granted with TTL(15s), remaining(-1735s). That means the cluster was already trying to clear the expired lease but can't, so this is most likely a symptom and not the cause. BTW, the time to failure is always 7 seconds, and the logs are full of requests that take 7s, but I am not sure what that means or what causes it.
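
One possible reading of the constant 7 seconds: as far as I know, etcd derives its server-side request timeout as 5 seconds plus twice the election timeout. With the election-timeout of 1s shown in the startup log above, that works out to 7 seconds, which also matches the publish-timeout of 7s in the same log. The sketch below only redoes that arithmetic. The helper name req_timeout_ms is made up for illustration, and the formula is an assumption about etcd 3.5 that nobody in this thread has confirmed.

```shell
# Assumed formula (not confirmed in this thread):
#   request timeout = 5s base allowance + 2 * election timeout
# req_timeout_ms is a hypothetical helper name used only for illustration.
req_timeout_ms() {
  # $1: election timeout in milliseconds
  echo $(( 5000 + 2 * $1 ))
}

# election-timeout is "1s" in the startup log, i.e. 1000 ms.
req_timeout_ms 1000   # prints 7000
```

If that reading is right, the 7s figure is just the point at which etcd gives up on a proposal that never got applied, so it describes the symptom rather than the cause.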

BTW there are many other things I have tried but trying to keep this trimmed down to things relevant to the discussion.

Happy to provide original database files/WAL to maintainers of the project if the solution turns out to be sophisticated enough to share.

Answered by jmhbnz

jmhbnz · 2023-08-25T09:49:24Z

jmhbnz
Aug 25, 2023
Maintainer

Hey @dbeal-eth - Have you tried restoring with snapshot/db file to a new data directory?

Refer: https://etcd.io/docs/v3.5/op-guide/recovery/#restoring-a-cluster

4 replies

dbeal-eth Aug 25, 2023
Author

Yes, I tried using etcd restore. However, the new data directory it created was identical to the original database dir, just without all the extra wal or snap files (a state I had already tried), so I assumed it had no effect.

dbeal-eth Aug 25, 2023
Author

Is it required to supply --initial-cluster option to have a positive effect?

dbeal-eth Aug 26, 2023
Author

I managed to restore the cluster with the instructions you provided. However, a few things threw me off while following this solution that I think are worth mentioning:

I originally tried using --force-new-cluster to restore the cluster, since it was already a single-node cluster, but that didn't seem to have an effect.
I had already tried using etcdctl snapshot restore snapshot.db, but I didn't use the --initial-cluster options as you suggested. Once I added --initial-cluster, the database was restored successfully.
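
For other readers, the restore that eventually worked might look roughly like the sketch below. It only prints the command and runs nothing. The member name default and the peer URL http://localhost:2380 are taken from the startup logs in the original post, and snapshot.db is the filename mentioned above. The target data directory is a placeholder, and the exact --initial-cluster value used here is not stated anywhere in this thread, so treat all of these values as assumptions.

```shell
# Dry run: build and print an etcdctl restore command; nothing is executed.
# NAME and PEER_URL come from the startup logs; NEW_DATA_DIR is a placeholder.
NAME="default"
PEER_URL="http://localhost:2380"
SNAPSHOT="snapshot.db"
NEW_DATA_DIR="/path/to/new-etcd-data"   # hypothetical, pick an empty directory

build_restore_cmd() {
  # printf reuses the '%s ' format for every argument, one word each.
  printf '%s ' etcdctl snapshot restore "$SNAPSHOT" \
    --name "$NAME" \
    --data-dir "$NEW_DATA_DIR" \
    --initial-cluster "${NAME}=${PEER_URL}" \
    --initial-advertise-peer-urls "$PEER_URL"
  printf '\n'
}

build_restore_cmd
```

Printing first makes it easy to review the member name and peer URL before touching any data directory. After a real restore, etcd would then need to be started with --data-dir pointing at the new directory.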

Also, though the issue has been resolved, it would be nice to know what caused the original error I posted at the start of this thread. Was this a bug in etcd of some sort? I am not aware of any shutdown or unclean restart of the server, but who knows, I guess.

jmhbnz Aug 26, 2023
Maintainer

Glad to hear the restore procedure worked and thanks for the notes on how it went.

I'm really not sure on how it ended up in the broken state sorry!

ahrtr · 2023-08-26T13:01:08Z

ahrtr
Aug 26, 2023
Maintainer

The "wal: max entry size limit exceeded" error indicates that the last record of a WAL file is corrupted. Etcd will repair the file automatically. If etcd fails to repair the file, then the instance will fatal out. Please provide the complete log.
The issue of "remaining(-1735s)" may be caused by etcd not being able to delete objects, so the lease always exists in the db file. Again, please provide the complete log.
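
The numbers in the fatal log line from the original post can be checked by hand. As a rough sketch: the record header claims 167 bytes, but only 23 bytes remain between the last valid offset and the end of the file. The helper name entry_limit is made up, and the comparison is inferred from the wording of the error message rather than taken from the etcd source.

```shell
# entry_limit is a hypothetical helper: bytes left in the WAL file for the
# next record, computed the way the error message spells it out.
entry_limit() {
  # $1: file size, $2: offset of last valid record end, $3: pad bytes
  echo $(( $1 - $2 - $3 ))
}

# Values copied from the fatal log line.
rec_bytes=167
limit=$(entry_limit 50860032 50860008 1)
echo "entryLimit=$limit"   # prints entryLimit=23

if [ "$rec_bytes" -gt "$limit" ]; then
  echo "record claims $rec_bytes bytes but only $limit remain"
fi
```

A record that claims more bytes than the file has left is consistent with a torn or corrupted final record, as described above.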

3 replies

dbeal-eth Aug 26, 2023
Author

The "wal: max entry size limit exceeded" error indicates that the last record of a WAL file is corrupted. Etcd will repair the file automatically. If etcd fails to repair the file, then the instance will fatal out. Please provide the complete log.

Yes, etcd was fatalling out. The log I sent above is complete (notice that I included the journald messages systemd[1]: Started Service for snap application microk8s.daemon-etcd. and snap.microk8s.daemon-etcd.service: Main process exited, code=exited, status=1/FAILURE in the logs above).

ahrtr Aug 26, 2023
Maintainer

It turns out that you ran into a known issue, which was resolved in 3.5.7. See #15069 for more details.
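
A quick way to check whether a given etcd build predates that fix is to compare its version with 3.5.7. The sketch below is a minimal illustration. version_lt is a hypothetical helper, and it assumes plain major.minor.patch strings like the 3.5.5 reported in the startup log.

```shell
# version_lt A B: succeeds (exit 0) when version A is strictly older than B.
# Assumes both arguments are plain numeric major.minor.patch strings.
# The body runs in a subshell so the IFS change does not leak to the caller.
version_lt() (
  IFS=.
  set -- $1 $2   # "3.5.5" "3.5.7" becomes the six fields 3 5 5 3 5 7
  [ "$1" -lt "$4" ] && return 0
  [ "$1" -gt "$4" ] && return 1
  [ "$2" -lt "$5" ] && return 0
  [ "$2" -gt "$5" ] && return 1
  [ "$3" -lt "$6" ] && return 0
  return 1       # equal or newer versions are not "older"
)

running="3.5.5"   # etcd-version from the startup log
fixed="3.5.7"     # release named in the comment above
if version_lt "$running" "$fixed"; then
  echo "$running predates $fixed"
fi
```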

ahrtr Aug 26, 2023
Maintainer

For the issue of failing to write data after the recovery, can you provide the complete log after you perform a write operation (e.g. kubectl delete po xxx)?
