Description
We have noticed strange downtimes all over our Kubernetes infrastructure since the recent upgrades to v3510. The root cause seems to be that the containerd service is sometimes not restarted properly when it exits. The symptoms are as follows: the containerd systemd unit stays in the active (running) state even after the main process has exited, although the systemd unit has ExitType set to main and not cgroup. Containers that are already running stay active, but everything else (e.g. Kubernetes, Docker) stops working.
The status of the systemd unit looks like this:
containerd.service - containerd container runtime
Loaded: loaded (/run/systemd/system/containerd.service; enabled; preset: disabled)
Drop-In: /etc/systemd/system/containerd.service.d
└─10-kubeone.conf
Active: active (running) since Wed 2023-08-09 19:27:23 UTC; 11min ago
Docs: https://containerd.io
Process: 1778 ExecStartPre=mkdir -p /run/docker/libcontainerd (code=exited, status=0/SUCCESS)
Process: 1802 ExecStartPre=ln -fs /run/containerd/containerd.sock /run/docker/libcontainerd/docker-containerd.sock (code=exited, status=0/SUCCESS)
Process: 1809 ExecStart=/usr/bin/env PATH=${TORCX_BINDIR}:${PATH} ${TORCX_BINDIR}/containerd --config ${CONTAINERD_CONFIG} (code=killed, signal=HUP)
Main PID: 1809 (code=killed, signal=HUP)
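The stuck state shown above (unit reported as active although the main process is gone) can be checked for mechanically. The following is only a minimal sketch: the function name is_unit_stuck is hypothetical, and it assumes that systemctl show reports MainPID=0 once the unit no longer has a main process.

```shell
#!/bin/sh
# Sketch: read "Key=Value" lines (as printed by `systemctl show`) from stdin
# and succeed if the unit claims to be active but has no main process.
# Assumption: MainPID=0 means systemd no longer tracks a main process.
is_unit_stuck() {
    active_state=''
    main_pid=''
    while IFS='=' read -r key value; do
        case "$key" in
            ActiveState) active_state=$value ;;
            MainPID)     main_pid=$value ;;
        esac
    done
    [ "$active_state" = "active" ] && [ "$main_pid" = "0" ]
}
```

On an affected node it could be used as: systemctl show containerd -p ActiveState -p MainPID | is_unit_stuck && echo "containerd unit looks stuck"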
Impact
This effectively breaks any environment that requires containerd restarts/reloads, such as environments using alternative runtimes that are loaded after the initial boot process, e.g. the nvidia runtime or kata-containers.
Environment and steps to reproduce
- Set-up: Boot a fresh Flatcar 3510.2.6 VM using flatcar_production_qemu.sh
- Task: Restart containerd when at least one container is running (if no containers are running, the bug does not occur)
- Action(s):
a. Start a container, e.g. docker run -d busybox sleep 9999999
b. Look at systemctl status containerd, make sure the main process and the container are running, and note the main PID
c. kill -SIGHUP <containerd-main-pid>
- Error: Look at systemctl status containerd again and wait for a restart (which will not happen)
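For convenience, the steps above can be collected into one script. This is only a sketch of the same sequence: the function names run and reproduce_containerd_hang, the DRY_RUN switch, and the 10 second wait are illustrative assumptions, not part of the original report. It needs root on the test VM.

```shell
#!/bin/sh
# Sketch of the reproduction steps. With DRY_RUN=1 the commands are only
# printed, which is useful for reviewing them before touching a real node.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce_containerd_hang() {
    # a. Start a long-running container (the bug needs at least one).
    run docker run -d busybox sleep 9999999

    # b. Note the main PID of containerd and show the current status.
    if [ -n "${DRY_RUN:-}" ]; then
        pid='<containerd-main-pid>'
    else
        pid=$(systemctl show -p MainPID --value containerd)
    fi
    run systemctl status containerd --no-pager

    # c. Send SIGHUP to the containerd main process.
    run kill -HUP "$pid"

    # Give systemd some time (10 s is an arbitrary choice), then check
    # whether the unit was restarted.
    run sleep 10
    run systemctl status containerd --no-pager
}
```

Calling DRY_RUN=1 reproduce_containerd_hang prints the command sequence; calling reproduce_containerd_hang without DRY_RUN executes it.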
Expected behavior
Containerd is restarted as specified in the systemd unit. I have just verified this with a fresh 3374.2.5 VM, and containerd restarts as expected.
Additional information
I am not entirely sure which release introduced this behaviour, since it took me a while to track this down, but it must have happened somewhere between 3374.2.5 and 3510.2.6. It was probably 3510.2.5, though, since that release upgraded systemd to 252.11.
I would greatly appreciate it if someone had an idea for a temporary hotfix other than switching to LTS until this is fixed.