We recently encountered etcd member start failures in our Kubernetes cluster: the affected etcd members keep crash-looping.
The failure is "no space left on the disk"; that is the error reported in the etcd server container logs.
After checking the disk usage, it is confirmed that all of the disk space is used by the etcd snap dir:
-bash-4.2# ls -lht
total 983G
-rw------- 1 root root 145M Aug 4 22:11 tmp158215125
-rw------- 1 root root 799M Aug 4 22:10 db
-rw------- 1 root root 256M Aug 4 22:04 tmp300815437
-rw------- 1 root root 405M Aug 4 22:01 tmp320617345
-rw------- 1 root root 255M Aug 4 21:53 tmp153885235
-rw------- 1 root root 288M Aug 4 21:49 tmp208855777
-rw------- 1 root root 268M Aug 4 21:41 tmp525631539
......
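A minimal sketch of how to confirm this from the node (the data-dir path /var/lib/etcd is an assumption; adjust it to your --data-dir):

# Disk usage of the volume holding the etcd data dir (path is an assumption)
df -h /var/lib/etcd
# Per-directory usage; the snap dir is expected to dominate here
du -sh /var/lib/etcd/member/snap /var/lib/etcd/member/wal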
After checking the etcd server logs, there are start-up entries showing that the server starts up normally and then exits without any specific error. The container exit code is 137, which means the container was killed by the container runtime; it is not an OOM kill, so the container was killed on purpose.
After checking the kubelet logs, it shows that the container is killed because of liveness check failures.
Checking the etcd server logs again, there are entries about receiving snapshot data from the leader, but none about the snapshot being successfully loaded. This means the etcd container is killed before it is able to finish starting up:
...
2019-08-05 05:12:50.179645 I | rafthttp: receiving database snapshot [index:200843417, from 9df0abd4b7831a12] ...
...
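For reference, a hedged way to confirm the kill reason from the Kubernetes side (the pod name and namespace are assumptions; etcd usually runs as a static pod in kube-system):

# Inspect the last terminated state and probe events of the etcd container
kubectl -n kube-system describe pod etcd-<node-name> | grep -E 'Exit Code|Reason|Liveness'
# Exit Code 137 with Reason "Error" indicates a SIGKILL from the runtime (e.g. after a
# failed liveness probe); an OOM kill would show Reason "OOMKilled" instead.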
My problem was related to never compacting and defragging the DB. This caused the DB to become huge (1.5GB), so the new pod failed to start in time and the readiness probe failed.
After compacting and defragging, the DB is 700kB and the pods start with no issues.
Maybe this helps.
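For anyone hitting the same thing, a rough sketch of the manual compaction and defrag (the endpoint address is an assumption; secured clusters also need --cacert/--cert/--key):

export ETCDCTL_API=3
# Read the current revision, then compact everything older than it
rev=$(etcdctl --endpoints=https://127.0.0.1:2379 endpoint status --write-out=json \
      | grep -o '"revision":[0-9]*' | grep -o '[0-9]*')
etcdctl --endpoints=https://127.0.0.1:2379 compact "$rev"
# Defragment to actually return the freed space to the filesystem
etcdctl --endpoints=https://127.0.0.1:2379 defrag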
The liveness check of the etcd Pod is configured so that the kubelet starts probing 10s after the container starts, probes every 10s, and allows 3 failures before killing the container. That gives the container at most 10 + 3 * 10 = 40s to become healthy.
That time is too short to transmit a snapshot from the leader to the member.
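A hedged sketch of reading those numbers off the pod spec (the pod name is an assumption; the field values are the ones described above):

# Show the liveness probe of the etcd static pod
kubectl -n kube-system get pod etcd-<node-name> -o yaml | grep -A8 'livenessProbe:'
# With initialDelaySeconds: 10, periodSeconds: 10, failureThreshold: 3 the kubelet kills
# the container about 10 + 3 * 10 = 40s after start if the probe never passes, which is
# far less than the time needed to receive a several-hundred-MB snapshot from the leader.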