On container creation: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms) #3808
Comments
@nefelim4ag Can you check if there are any leaks in the number of units/scopes?
@mrunalp
@nefelim4ag I have a suspicion that many of your units are inactive and could be cleaned up. For context, we are working on having systemd do this automatically (by setting CollectMode to inactive-or-failed), but we need to do some branching on the systemd version to make sure the option is available. You can follow that work here
@haircommander all of them are in the failed state:
i.e. I don't have inactive units :)
ah! per https://www.freedesktop.org/software/systemd/man/systemd.unit.html, CollectMode defaults to inactive
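For reference, the cleanup being discussed amounts to setting one unit property. A minimal sketch of what that looks like in unit-file syntax (the drop-in path and unit name are illustrative; transient container scopes are normally configured via D-Bus properties at creation time, which is what the linked CRI-O work does):

```ini
# Illustrative drop-in, e.g. /etc/systemd/system/<some>.scope.d/collect.conf
# CollectMode= requires systemd >= 236; with inactive-or-failed, systemd
# garbage-collects the unit even when it ends up in the failed state.
[Unit]
CollectMode=inactive-or-failed
```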
@haircommander, can you please explain why those units lead to problems with D-Bus?
@nefelim4ag Can you try running
@mrunalp it doesn't help, i.e. we recreate many containers and get it again:
Run
Looks like a bug in old systemd/dbus which was fixed in 242.
ahh, good find @nefelim4ag. Can we close this issue then?
@haircommander |
If you set max sessions to something really big (100000 or something), does this still happen? I wonder if we're still leaking, or if your container workload really needs > 2048 connections.
Unrelated aside: why are your k8s and cri-o versions not in sync? Usually users should match minor versions (cri-o 1.17.z with k8s 1.17.z).
@haircommander, because when we moved to CRI-O we were not ready for a K8s update, and I think about version compatibility like the match in the go/python k8s client libs, i.e. in my opinion the client must be at least at the cluster version - so I'm not afraid of issues with that. Does it really matter in this case with the sessions?
@haircommander
On Ubuntu the default value was 256; I've tried 2048 on one node (I assume at least 1-3 connections per container in a bad scenario; we have a pod limit of 500 per node, but never reach it). I want to ask - is 256 just a very small limit for CRI-O? i.e. do you assume that every modern Linux system has a limit of at least 2048?
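For anyone following along, the limit being discussed is a system-bus setting. A minimal sketch of raising it, assuming the stock dbus layout where the system bus reads an optional local override file (the path and the value 2048 are taken from this thread, not a recommendation):

```xml
<!-- Illustrative drop-in, e.g. /etc/dbus-1/system-local.conf.
     The Ubuntu 18.04 default discussed above was 256; each container
     managed via the systemd cgroup manager can hold D-Bus connections,
     so dense nodes can exhaust the per-user cap for root. -->
<!DOCTYPE busconfig PUBLIC "-//freedesktop//DTD D-Bus Bus Configuration 1.0//EN"
 "http://www.freedesktop.org/standards/dbus/1.0/busconfig.dtd">
<busconfig>
  <limit name="max_connections_per_user">2048</limit>
</busconfig>
```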
Bumping max_connections_per_user is a sanity check for me at this point. No prob if you think 2048 is sufficient, but I'm mostly curious what happens with a preposterously large value. Also, what version of runc are you using? I don't think it's related; I'm mostly curious why you didn't try cri-o 1.16.z first.
@haircommander
We use the systemd cgroup manager - yep, IIRC I did try 1.16 - it didn't change the dbus behavior.
I think you can try to reproduce that behavior; I just use that as a test and scale it up/down several times (one time in my case).
I tried setting 131768 as the connection limit; yep, it's harder to hit (takes more time), but still reproducible.
Working on 20.04 (systemd 245.4-4ubuntu3.3), with k8s 1.18 & cri-o 1.18 |
BUG REPORT INFORMATION
Description
We moved to CRI-O 1.17.2 on our clusters.
OS: latest Ubuntu 18.04
K8s 1.16.8
After that we sometimes experience different errors from the D-Bus daemon, like this:
systemctl reload dbus helps for some time, because it drops all dbus connections IIRC.
Also I see some messages about limits, so I raised them in the hope that it will help:
I currently think this happens because too many pods try to start at the same time, i.e. we mostly experience it when updating workloads.
Maybe you can point me at where to look?
Steps to reproduce the issue:
(Will be added later, when I have steps that reproduce it 99% of the time)
Thanks!