We recently encountered etcd member start failures in our Kubernetes cluster: the affected etcd members keep crash-looping.
The failure is "no space left on the disk"; that is the error reported in the etcd server container logs.
After checking the disk usage, it is confirmed that all of the disk space is used by the etcd snap dir:
-bash-4.2# ls -lht
total 983G
-rw------- 1 root root 145M Aug 4 22:11 tmp158215125
-rw------- 1 root root 799M Aug 4 22:10 db
-rw------- 1 root root 256M Aug 4 22:04 tmp300815437
-rw------- 1 root root 405M Aug 4 22:01 tmp320617345
-rw------- 1 root root 255M Aug 4 21:53 tmp153885235
-rw------- 1 root root 288M Aug 4 21:49 tmp208855777
-rw------- 1 root root 268M Aug 4 21:41 tmp525631539
......
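A minimal sketch of how to confirm this from the node (the data-dir path /var/lib/etcd is an assumption; adjust it to your --data-dir):

# Disk usage of the volume holding the etcd data dir (path is an assumption)
df -h /var/lib/etcd
# Per-directory usage; the snap dir is expected to dominate here
du -sh /var/lib/etcd/member/snap /var/lib/etcd/member/wal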
After checking the etcd server logs, there are start-up entries showing that the server starts up normally and then exits without any specific error. The container exit code is 137, which means the container was killed by the container runtime; it is not an OOM kill, so the container was killed on purpose.
After checking the kubelet logs, it shows that the container is killed because of liveness check failures.
Checking the etcd server logs again, there are entries about receiving snapshot data from the leader, but none about the snapshot being successfully loaded. This means the etcd container is killed before it is able to finish starting up:
...
2019-08-05 05:12:50.179645 I | rafthttp: receiving database snapshot [index:200843417, from 9df0abd4b7831a12] ...
...
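For reference, a hedged way to confirm the kill reason from the Kubernetes side (the pod name and namespace are assumptions; etcd usually runs as a static pod in kube-system):

# Inspect the last terminated state and probe events of the etcd container
kubectl -n kube-system describe pod etcd-<node-name> | grep -E 'Exit Code|Reason|Liveness'
# Exit Code 137 with Reason "Error" indicates a SIGKILL from the runtime (e.g. after a
# failed liveness probe); an OOM kill would show Reason "OOMKilled" instead.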
My problem was related to never compacting and defragging the DB. This caused the DB to become huge (1.5GB), so the new pod failed to start in time and the readiness probe failed.
After compacting and defragging, the DB is 700kB and the pods start with no issues.
Maybe this helps.
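For anyone hitting the same thing, a rough sketch of the manual compaction and defrag (the endpoint address is an assumption; secured clusters also need --cacert/--cert/--key):

export ETCDCTL_API=3
# Read the current revision, then compact everything older than it
rev=$(etcdctl --endpoints=https://127.0.0.1:2379 endpoint status --write-out=json \
      | grep -o '"revision":[0-9]*' | grep -o '[0-9]*')
etcdctl --endpoints=https://127.0.0.1:2379 compact "$rev"
# Defragment to actually return the freed space to the filesystem
etcdctl --endpoints=https://127.0.0.1:2379 defrag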
The liveness check of the etcd Pod is configured so that the kubelet starts probing 10s after the container starts, probes every 10s, and allows 3 failures before killing the container. That gives the container at most 10 + 3 * 10 = 40s to become healthy.
That time is too short to transmit a snapshot from the leader to the member.
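A hedged sketch of reading those numbers off the pod spec (the pod name is an assumption; the field values are the ones described above):

# Show the liveness probe of the etcd static pod
kubectl -n kube-system get pod etcd-<node-name> -o yaml | grep -A8 'livenessProbe:'
# With initialDelaySeconds: 10, periodSeconds: 10, failureThreshold: 3 the kubelet kills
# the container about 10 + 3 * 10 = 40s after start if the probe never passes, which is
# far less than the time needed to receive a several-hundred-MB snapshot from the leader.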