Containerd not restarting properly after upgrade to systemd 252.11 on stable #1157

Closed
heilerich opened this issue Aug 9, 2023 · 9 comments · Fixed by flatcar/scripts#1058
Assignees: tormath1
Labels: channel/alpha, channel/beta, channel/stable, kind/bug

Comments

@heilerich

Description

We have noticed strange downtimes all over our Kubernetes infrastructure since the recent upgrades to v3510. The root cause seems to be that the containerd service sometimes does not restart properly when it exits. The symptoms are as follows: the containerd systemd unit stays in active (running) state even when the main process has exited, despite the unit having ExitType set to main and not cgroup. Containers that are already running stay active, but everything else (i.e. Kubernetes, Docker) obviously stops working.

The systemd unit status looks like this:

containerd.service - containerd container runtime
     Loaded: loaded (/run/systemd/system/containerd.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/containerd.service.d
             └─10-kubeone.conf
     Active: active (running) since Wed 2023-08-09 19:27:23 UTC; 11min ago
       Docs: https://containerd.io
    Process: 1778 ExecStartPre=mkdir -p /run/docker/libcontainerd (code=exited, status=0/SUCCESS)
    Process: 1802 ExecStartPre=ln -fs /run/containerd/containerd.sock /run/docker/libcontainerd/docker-containerd.sock (code=exited, status=0/SUCCESS)
    Process: 1809 ExecStart=/usr/bin/env PATH=${TORCX_BINDIR}:${PATH} ${TORCX_BINDIR}/containerd --config ${CONTAINERD_CONFIG} (code=killed, signal=HUP)
   Main PID: 1809 (code=killed, signal=HUP)
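
For reference, one can confirm that the unit really is using ExitType=main (not cgroup) and inspect the recorded main-process exit status with a plain systemd query; nothing here is Flatcar-specific:

systemctl show containerd -p Type -p ExitType -p MainPID -p ExecMainCode -p ExecMainStatus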

Impact

This effectively breaks any environment that requires containerd restarts/reloads, such as environments using alternative runtimes that are loaded after the initial boot process (e.g. the NVIDIA runtime or kata-containers).

Environment and steps to reproduce

  1. Set-up: Boot a fresh Flatcar 3510.2.6 VM using the flatcar_production_qemu.sh script
  2. Task: Restart containerd while at least one container is running (if no containers are running, the bug does not occur)
  3. Action(s):
    a. Start a container e.g. docker run -d busybox sleep 9999999
    b. Look at systemctl status containerd, make sure the main process and container are running, note the main PID
    c. kill -SIGHUP <containerd-main-pid>
  4. Error: Look at systemctl status containerd again and wait for a restart, which never happens (a condensed shell version of these steps follows below)
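
A condensed shell version of these steps might look like this (a sketch; the container name "sleeper" is just illustrative):

docker run -d --name sleeper busybox sleep 9999999
systemctl status containerd --no-pager   # note the Main PID, unit is active (running)
MAIN_PID=$(systemctl show -p MainPID --value containerd)
kill -HUP "$MAIN_PID"
sleep 5
systemctl status containerd --no-pager   # unit still reports active (running) although the main process is gone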

Expected behavior

Containerd is restarted as specified in the systemd unit. I have just verified this with a fresh 3374.2.5 VM, and containerd restarts as expected.

Additional information

I am not entirely sure which release introduced this behaviour since it took me a while to track this down, but it must have happened somewhere between 3374.2.5 and 3510.2.6. It was probably 3510.2.5, though, since that release upgraded systemd to 252.11.

I would greatly appreciate it if someone has an idea for a temporary hotfix other than switching to LTS until this is fixed.

heilerich added the kind/bug label on Aug 9, 2023
@jepio
Member

jepio commented Aug 10, 2023

Thanks for the detailed report! This is a regression in systemd v252.11, fixed in v252.12. Here are some links:

A workaround is to execute systemctl restart containerd (which works well for k8s/kata but stops docker).

@heilerich
Author

Understood. Thanks for the quick reaction.

We are using these systemd units, distributed via Ignition, as a rather crude workaround for now:

# /etc/systemd/system/restart-containerd.timer
[Unit]
Description=Check containerd health and restart if necessary
After=containerd.service

[Timer]
OnBootSec=1min
OnUnitActiveSec=1min

[Install]
WantedBy=timers.target
# /etc/systemd/system/restart-containerd.service
[Unit]
Description=Restart containerd if unreachable
After=containerd.service

[Service]
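# If crictl can reach containerd, the subshell exits 1 and the restart is skipped;
# if crictl fails (containerd unreachable), the condition exits 0 and ExecStart runs.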
ExecCondition=/bin/sh -c '! /usr/bin/crictl info -q > /dev/null || (echo "containerd is running"; exit 1)'
ExecStart=/usr/bin/systemctl restart containerd
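
If it helps anyone else: after shipping these files, something along the lines of the following should activate the timer (the unit names simply match the file names above):

systemctl daemon-reload
systemctl enable --now restart-containerd.timer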

Is there an estimate as to when systemd v252.12 or higher might hit stable?

@jepio
Member

jepio commented Aug 10, 2023

@tormath1 is (likely) going to try cherry-picking the commit into stable before v252.12+ arrives. In that case we would aim for the next stable.

tormath1 added the channel/alpha, channel/beta and channel/stable labels on Aug 10, 2023
tormath1 self-assigned this on Aug 10, 2023
@tormath1
Contributor

tormath1 commented Aug 10, 2023

@heilerich I'm curious, who's sending the SIGHUP signal? Did you try SIGTERM or SIGKILL as a workaround? While extending our test case, I noticed that I was not able to reproduce the problem with SIGKILL or SIGTERM.

@heilerich
Author

@tormath1 You are absolutely right, SIGTERM and SIGKILL do not cause the same problem.

I know that NVIDIA's container toolkit is doing this, but we also had problems on a cluster that does not have NVIDIA devices. I can't say right now what was killing containerd there. Possibly kata or kubevirt? I would have to ask.

tormath1 added a commit to flatcar/scripts that referenced this issue Aug 10, 2023
It fixes an issue with systemd service restart when the main process is killed by a SIGHUP signal.

See also: flatcar/Flatcar#1157

Commit-Ref: systemd/systemd-stable@34e834f

Signed-off-by: Mathieu Tortuyaux <mtortuyaux@microsoft.com>
@tormath1
Contributor

@heilerich thanks for the information. Looks like with NVIDIA you can configure it to restart containerd via systemctl restart containerd.service rather than via a signal: https://github.com/NVIDIA/nvidia-container-toolkit/blob/22d7b52a58d9af932f0313f6adff8437522f7e10/tools/container/container.go#L43

@heilerich
Author

I can confirm that this works. The gpu-operator-related problems can be fixed by adding

toolkit:
  env:
  - name: RUNTIME_ARGS
    value: --restart-mode=systemd

to the NVIDIA ClusterPolicy.
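
For reference, the equivalent one-off patch would look roughly like this (assuming the ClusterPolicy instance is named cluster-policy, which is the gpu-operator default; adjust if yours differs):

kubectl patch clusterpolicy cluster-policy --type merge \
  -p '{"spec":{"toolkit":{"env":[{"name":"RUNTIME_ARGS","value":"--restart-mode=systemd"}]}}}'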

tormath1 added a commit to flatcar/scripts that referenced this issue Aug 11, 2023
It fixes an issue with systemd service restart when the main process is killed by a SIGHUP signal.

See also: flatcar/Flatcar#1157

Commit-Ref: systemd/systemd-stable@34e834f

Signed-off-by: Mathieu Tortuyaux <mtortuyaux@microsoft.com>
@tormath1
Contributor

The fix has been backported to all channels and will be available in the next set of releases, around the first week of September (https://github.com/orgs/flatcar/projects/7/views/8). A test case has been added as well.

@heilerich
Author

Just as a note for other people hitting this problem ... we identified one more culprit causing downtimes related to this bug. The system upgrade component of Rancher / RKE2 can get stuck in a deadlock while upgrading.
