Containerd not restarting properly after upgrade to systemd 252.11 on stable #1157

@heilerich

Description

We have noticed unexplained downtimes all over our Kubernetes infrastructure since the recent upgrades to v3510. The root cause seems to be that the containerd service sometimes does not restart after it exits. The symptoms are as follows: the containerd systemd unit stays in the active (running) state even after the main process has exited, although the unit has ExitType set to main and not cgroup. Containers that are already running stay active, but everything else that depends on containerd (i.e. Kubernetes, Docker) stops working.

The status of the systemd unit looks like this:

containerd.service - containerd container runtime
     Loaded: loaded (/run/systemd/system/containerd.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/containerd.service.d
             └─10-kubeone.conf
     Active: active (running) since Wed 2023-08-09 19:27:23 UTC; 11min ago
       Docs: https://containerd.io
    Process: 1778 ExecStartPre=mkdir -p /run/docker/libcontainerd (code=exited, status=0/SUCCESS)
    Process: 1802 ExecStartPre=ln -fs /run/containerd/containerd.sock /run/docker/libcontainerd/docker-containerd.sock (code=exited, status=0/SUCCESS)
    Process: 1809 ExecStart=/usr/bin/env PATH=${TORCX_BINDIR}:${PATH} ${TORCX_BINDIR}/containerd --config ${CONTAINERD_CONFIG} (code=killed, signal=HUP)
   Main PID: 1809 (code=killed, signal=HUP)
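
The stuck state in the status output above (unit active, main process gone) can also be checked from a script. The following is a minimal sketch, not a fix: `is_stuck` and `check_unit` are hypothetical helper names, and the check assumes that systemd reports `MainPID=0` once the main process has exited.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when a unit reports "active" although its
# main process is gone.
#   $1 = ActiveState value, $2 = MainPID value
is_stuck() {
    [ "$1" = "active" ] || return 1
    # Assumption: systemd reports MainPID=0 once the main process has exited.
    [ "$2" = "0" ] && return 0
    # Otherwise treat a PID without a /proc entry as gone (Linux only).
    [ ! -d "/proc/$2" ]
}

# Hypothetical wrapper: queries systemd for a unit (default: containerd.service)
# and returns 0 if the unit looks stuck, 1 otherwise.
check_unit() {
    unit=${1:-containerd.service}
    state=$(systemctl show --property=ActiveState --value "$unit")
    pid=$(systemctl show --property=MainPID --value "$unit")
    if is_stuck "$state" "$pid"; then
        echo "$unit is active but has no live main process"
        return 0
    fi
    return 1
}
```

Running `check_unit` on an affected node would be expected to print the message; whether to follow it with `systemctl restart containerd` is a judgment call, since that only papers over the missing automatic restart.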

Impact

This effectively breaks any environment that requires containerd restarts or reloads, such as environments using alternative runtimes that are loaded after the initial boot process, e.g. the nvidia runtime or kata-containers.

Environment and steps to reproduce

  1. Set-up: Boot a fresh Flatcar 3510.2.6 VM using flatcar_production_qemu.sh
  2. Task: Restart containerd when at least one container is running (if no containers are running the bug does not occur)
  3. Action(s):
    a. Start a container e.g. docker run -d busybox sleep 9999999
    b. Look at systemctl status containerd, make sure the main process and container are running, note the main PID
    c. kill -SIGHUP <containerd-main-pid>
  4. Error: Look at systemctl status containerd again and wait for a restart (which will not happen)
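
The steps above can be sketched as a small script for a disposable test VM. This is only an illustration of the manual procedure: the function name `reproduce` and the 30-second wait are assumptions, and it needs to run as root.

```shell
#!/bin/sh
# Sketch of the reproduction steps; run as root on a throwaway VM only.
reproduce() {
    # a. Start a long-running container so that containerd has work to do.
    docker run -d busybox sleep 9999999

    # b. Record containerd's main PID before the signal.
    pid=$(systemctl show --property=MainPID --value containerd.service)
    echo "containerd main PID before: $pid"

    # c. Send SIGHUP to the main process.
    kill -HUP "$pid"

    # Give systemd time to restart the unit (30 s is an arbitrary choice).
    sleep 30

    # Step 4: inspect the unit again and compare the main PID.
    systemctl status containerd.service --no-pager
    echo "containerd main PID after: $(systemctl show --property=MainPID --value containerd.service)"
}
```

Calling `reproduce` on a working image should show a new, running main PID after the wait; on an affected image the unit is expected to stay active without one.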

Expected behavior

Containerd is restarted as specified in the systemd unit. I have just verified this with a fresh 3374.2.5 VM, and containerd restarts as expected.

Additional information

I am not entirely sure which release introduced this behaviour, since it took me a while to track this down, but it must have happened somewhere between 3374.2.5 and 3510.2.6. It was probably 3510.2.5, though, since that release upgraded systemd to 252.11.

I would greatly appreciate any ideas for a temporary hotfix, other than switching to LTS, until this is fixed.

Metadata

Labels

channel/alpha, channel/beta, channel/stable, kind/bug