On container creation: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms) #3808

Closed
nefelim4ag opened this issue May 29, 2020 · 22 comments

Comments

@nefelim4ag

nefelim4ag commented May 29, 2020


BUG REPORT INFORMATION

Description
We moved to CRI-O 1.17.2 on our clusters.
OS: latest Ubuntu 18.04
K8s: 1.16.8

Since then we sometimes experience various errors from the D-Bus daemon, like these:

May 29 12:11:22 K8S-VA1 dbus-daemon[1516]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 29 12:11:47 K8S-VA1 dbus-daemon[1516]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 29 12:12:12 K8S-VA1 dbus-daemon[1516]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 29 12:12:37 K8S-VA1 dbus-daemon[1516]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)

systemctl reload dbus helps for some time, because it drops all dbus connections, IIRC.

I also see some messages about limits, so I raised them in the hope that this would help:

<busconfig>
  <limit name="max_connections_per_user">2048</limit>   
  <limit name="max_replies_per_connection">2048</limit>
</busconfig>
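
For completeness, a minimal sketch of how such an override can be applied (assuming the snippet lives in /etc/dbus-1/system-local.conf; the include path and the required DOCTYPE header can vary by distribution):

# drop the <busconfig> limits above into /etc/dbus-1/system-local.conf,
# then reload dbus so the new limits take effect
systemctl reload dbus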

I currently think this happens because too many pods try to start at the same time, i.e. we mostly experience it when updating workloads.

Maybe you can point me to where to look?

Steps to reproduce the issue:
(Will be added later, once I have reliable steps to reproduce it)

Thanks!

@mrunalp
Member

mrunalp commented May 29, 2020

@nefelim4ag Can you check if there are any leaks in the number of units/scopes with systemctl list-units?

@nefelim4ag
Author

nefelim4ag commented Jun 1, 2020

@mrunalp
You mean something like this?

K8S-VA1:~# crictl ps | wc -l
242
K8S-VA1:~# systemctl list-units | grep scope | wc -l
843
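
A quick sketch for comparing the total number of crio scopes to the failed ones (systemctl accepts unit-name patterns):

# all crio container scopes known to systemd
systemctl list-units --all --no-legend 'crio-*.scope' | wc -l
# only the failed ones
systemctl list-units --all --no-legend --state=failed 'crio-*.scope' | wc -l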

@haircommander
Member

@nefelim4ag I have a suspicion that many of your units are inactive and could be cleaned up.
What's the output of systemctl list-units --all | grep crio | grep inactive ?

For context, we are working on having systemd do this automatically (by setting the CollectMode to inactive-or-failed), but we need to do some branching on the systemd version to make sure the option is available. You can follow that work here.
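
For illustration only (this is not the CRI-O code path), the property in question can be set on a transient unit at creation time, e.g. with systemd-run; CollectMode requires systemd >= 236:

# transient scope that systemd garbage-collects even if it ends up failed
systemd-run --scope -p CollectMode=inactive-or-failed /bin/true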

@nefelim4ag
Author

nefelim4ag commented Jun 2, 2020

@haircommander all of them are in the failed state:

K8S-VA1:~# systemctl status crio-fd525157e9bc5c56c9b89d50d5cefa768577e215a9b2a07b7aa87248087d9547.scope
● crio-fd525157e9bc5c56c9b89d50d5cefa768577e215a9b2a07b7aa87248087d9547.scope - libcontainer container fd525157e9bc5c56c9b89d50d5cefa768577e215a9b2a07b7aa87248087d9547
   Loaded: loaded (/run/systemd/transient/crio-fd525157e9bc5c56c9b89d50d5cefa768577e215a9b2a07b7aa87248087d9547.scope; transient)
Transient: yes
   Active: failed (Result: resources)
      CPU: 0

i.e. I don't have any inactive units :)

@haircommander
Member

Ah! Per https://www.freedesktop.org/software/systemd/man/systemd.unit.html, CollectMode defaults to inactive, and failed scopes are not cleaned up. Changing the CollectMode would solve the leak.
(I should have asked for failed scopes, but we arrived at the same conclusion despite my being confused 😃)

@nefelim4ag
Author

nefelim4ag commented Jun 3, 2020

@haircommander, can you please explain why those units lead to problems with D-Bus?
And can I fix that somehow?
i.e. write a simple service that collects/deletes the failed units?

@mrunalp
Member

mrunalp commented Jun 4, 2020

@nefelim4ag Can you try running systemctl reset-failed and see if the failed units get cleaned up? Meanwhile, we will fix this in CRI-O by setting the CollectMode for the runc containers.
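
As a stopgap until that lands, a periodic sweep of failed units is one option (a sketch, not an official recommendation; the unit name is arbitrary):

# clear failed units every 5 minutes via a transient timer
systemd-run --on-calendar='*:0/5' --unit=reset-failed-sweep /bin/systemctl reset-failed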

@nefelim4ag
Author

@mrunalp It doesn't help, i.e. we recreated many containers and got this again:

Jun 04 16:53:53 P20B-SR7-1R1-K8S-VA6 dbus-daemon[1620]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Jun 04 16:54:18 P20B-SR7-1R1-K8S-VA6 dbus-daemon[1620]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Jun 04 16:54:43 P20B-SR7-1R1-K8S-VA6 dbus-daemon[1620]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Jun 04 16:55:08 P20B-SR7-1R1-K8S-VA6 dbus-daemon[1620]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Jun 04 13:38:11 P20B-SR7-1R1-K8S-VA6 crio[1661]: time="2020-06-04 13:38:11.072010439+03:00" level=info msg="exec'd [rm /tmp/rbd-remounter.health] in kube-system/rbd-remounter-gjdwk/rbd-remounter" id=e8c726d9-8265-4105-8240-17583cef186f
Jun 04 13:38:43 P20B-SR7-1R1-K8S-VA6 crio[1661]: time="2020-06-04 13:38:43.476211150+03:00" level=error msg="Container creation error: time=\"2020-06-04T13:38:18+03:00\" level=warning msg=\"signal: killed\"\ntime=\"2020-06-04T13:38:43+03:00\" level=error msg=\"container_linux.go:349: starting container process caused \\\"process_linux.go:297: applying cgroup configuration for process caused \\\\\\\"Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)\\\\\\\"\\\"\"\ncontainer_linux.go:349: starting container process caused \"process_linux.go:297: applying cgroup configuration for process caused \\\"Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)\\\"\"\n"

Running systemctl reset-failed does not resolve the problem.

@nefelim4ag
Author

nefelim4ag commented Jun 29, 2020

lnykryn/systemd-rhel#266

Looks like a bug in old systemd/dbus that was fixed in 242.
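
For reference, the running systemd version can be checked with:

systemctl --version | head -n1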

@haircommander
Member

Ahh, good find @nefelim4ag. Can we close this issue then?

@nefelim4ag
Author

@haircommander
Not tested yet; I will close this after testing the new systemd.

@nefelim4ag
Author

@haircommander
We updated our clusters to this systemd version: https://launchpad.net/ubuntu/+source/systemd/237-3ubuntu10.41
It has the fix for the cookie overflow, but that does not fix the timeout problem.

@nefelim4ag
Author

#  dbus-send --system --dest=org.freedesktop.systemd1 --type=method_call --print-reply /org/freedesktop/systemd1 org.freedesktop.DBus.Introspectable.Introspect
Failed to open connection to "system" message bus: The maximum number of active connections for UID 0 has been reached
# dbus-monitor --system
signal time=1594289017.252996 sender=org.freedesktop.DBus -> destination=:1.4767 serial=2 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=NameAcquired
   string ":1.4767"
signal time=1594289017.253031 sender=org.freedesktop.DBus -> destination=:1.4767 serial=4 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=NameLost
   string ":1.4767"
signal time=1594289027.003428 sender=:1.3 -> destination=(null destination) serial=13 path=/org/freedesktop/Accounts/User1001; interface=org.freedesktop.Accounts.User; member=Changed
error time=1594289037.374962 sender=org.freedesktop.DBus -> destination=:1.8 error_name=org.freedesktop.DBus.Error.TimedOut reply_serial=4311
   string "Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)"
method call time=1594289037.375331 sender=:1.8 -> destination=org.freedesktop.systemd1 serial=4312 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=StartTransientUnit
   string "kubepods-burstable-pod14dede99_34ea_42f8_98fd_36097937b515.slice"
   string "replace"
   array [
      struct {
         string "Description"
         variant             string "libcontainer container kubepods-burstable-pod14dede99_34ea_42f8_98fd_36097937b515.slice"
      }
      struct {
         string "Wants"
         variant             array [
               string "kubepods-burstable.slice"
            ]
      }
      struct {
         string "Delegate"
         variant             boolean true
      }
      struct {
         string "MemoryAccounting"
         variant             boolean true
      }
      struct {
         string "CPUAccounting"
         variant             boolean true
      }
      struct {
         string "BlockIOAccounting"
         variant             boolean true
      }
      struct {
         string "DefaultDependencies"
         variant             boolean false
      }
      struct {
         string "MemoryLimit"
         variant             uint64 2151677952
      }
      struct {
         string "CPUShares"
         variant             uint64 870
      }
      struct {
         string "CPUQuotaPerSecUSec"
         variant             uint64 1050000
      }
   ]
   array [
   ]
method call time=1594289047.602720 sender=:not.active.yet -> destination=org.freedesktop.DBus serial=1 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=Hello
error time=1594289047.602736 sender=org.freedesktop.DBus -> destination=:not.active.yet error_name=org.freedesktop.DBus.Error.LimitsExceeded reply_serial=1
   string "The maximum number of active connections for UID 0 has been reached"
error time=1594289062.384765 sender=org.freedesktop.DBus -> destination=:1.8 error_name=org.freedesktop.DBus.Error.TimedOut reply_serial=4312
   string "Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)"
method call time=1594289062.385155 sender=:1.8 -> destination=org.freedesktop.systemd1 serial=4313 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=StartTransientUnit
   string "kubepods-burstable-pod1f4c3e81_e98e_4523_bd88_734405c91afd.slice"
   string "replace"
   array [
      struct {
         string "Description"
         variant             string "libcontainer container kubepods-burstable-pod1f4c3e81_e98e_4523_bd88_734405c91afd.slice"
      }
      struct {
         string "Wants"
         variant             array [
               string "kubepods-burstable.slice"
            ]
      }
      struct {
         string "Delegate"
         variant             boolean true
      }
      struct {
         string "MemoryAccounting"
         variant             boolean true
      }
      struct {
         string "CPUAccounting"
         variant             boolean true
      }
      struct {
         string "BlockIOAccounting"
         variant             boolean true
      }
      struct {
         string "DefaultDependencies"
         variant             boolean false
      }
      struct {
         string "MemoryLimit"
         variant             uint64 1742733312
      }
      struct {
         string "CPUShares"
         variant             uint64 665
      }
      struct {
         string "CPUQuotaPerSecUSec"
         variant             uint64 750000
      }
   ]
   array [
   ]
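
One way to see how many connections root currently holds on the system bus (a sketch; busctl ships with systemd, and the column layout may differ between versions):

# count system-bus connections owned by root
busctl list --no-legend | awk '$4 == "root"' | wc -l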

@haircommander
Member

If you set the max connections to something really big (100000 or so), does this still happen? I wonder if we're still leaking, or if your container workload really needs > 2048 connections.

@haircommander
Member

Unrelated aside: why are your k8s and CRI-O versions not in sync? Usually users should match minor versions (CRI-O 1.17.z with k8s 1.17.z).

@nefelim4ag
Author

@haircommander, because when we moved to CRI-O we were not yet ready for a K8s update, and I think of version compatibility like the matching in the go/python k8s client libraries, i.e. in my opinion the client must be at least at the cluster version - so I'm not afraid of issues with that.

Does that really matter in this case with the sessions?

@nefelim4ag
Author

nefelim4ag commented Jul 9, 2020

@haircommander
You mean this?

<busconfig>
<limit name="max_connections_per_user">2048</limit>
</busconfig>

On Ubuntu the default value was 256; I've tried 2048 on one node (I assume at least 1-3 connections per container in a bad scenario; we have a pod limit of 500 per node, but we never reach it).

I want to ask: is 256 just a very small limit for CRI-O? i.e. do you assume that every modern Linux system has a limit of at least 2048 or so?

@haircommander
Member

Bumping max_connections_per_user is a sanity check for me at this point. No problem if you think 2048 is sufficient, but I'm mostly curious what happens with a preposterously large value.

Also, what version of runc are you using?
Are you using the systemd cgroup manager (the default in CRI-O)? Does this happen with cgroupfs?
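
If it helps, the configured cgroup manager can usually be read straight from the CRI-O config (a sketch; the option may be commented out in the default config, and the path can differ per installation):

# show which cgroup manager CRI-O is configured with
grep cgroup_manager /etc/crio/crio.conf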

I don't think it's related; I'm mostly curious why you didn't try CRI-O 1.16.z first.

@nefelim4ag
Author

nefelim4ag commented Jul 9, 2020

@haircommander
The latest runc from the CRI-O PPA:

~# apt-cache policy cri-o-runc 
cri-o-runc:
  Installed: 1.0.0-3~dev2
  Candidate: 1.0.0-3~dev2
  Version table:
 *** 1.0.0-3~dev2 500

Yep, we use the systemd cgroup manager.
No, I've never tried cgroupfs with CRI-O, and I'm not sure it's needed. Of course I never experienced such dbus problems with docker/cgroupfs, but I have seen different, more interesting issues :)

IIRC I did try 1.16 - that did not change the dbus behavior.

@nefelim4ag
Author

I think you can try to reproduce this behavior; I just use the deployment below as a test and scale it up/down several times (one time in my case):

---
#application/deployment.yaml 
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: default
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 100
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 50m
            ephemeral-storage: 32Mi
            memory: 128Mi
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - node-name
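
To exercise it, something like the following (a sketch; node-name above is a placeholder for the target node's hostname):

kubectl apply -f application/deployment.yaml
kubectl -n default scale deployment nginx-deployment --replicas=0
kubectl -n default scale deployment nginx-deployment --replicas=100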

@nefelim4ag
Author

I tried setting 131768 as the connection limit; yep, it's harder to hit (takes more time), but still reproducible.

@nefelim4ag
Author

nefelim4ag commented Nov 20, 2020

Everything works on 20.04 (systemd 245.4-4ubuntu3.3), with k8s 1.18 & cri-o 1.18.
