Review / revisit systemd unit files #73

Open · thaJeztah opened this issue Aug 2, 2017 · 15 comments

@thaJeztah (Member) commented Aug 2, 2017

We currently have two different unit files: one for .deb-based packages, and one for .rpm. The .rpm version currently assumes systemd 226 or older, which is correct for CentOS and RHEL (RHEL 7.4 ships systemd-219-42.el7.x86_64), but incorrect for (at least) Fedora.

Default install of Docker CE 17.07 on Fedora 26:

systemctl cat docker.service
# /usr/lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service
Wants=network-online.target

[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd
ExecReload=/bin/kill -s HUP $MAINPID
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
# TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process
# restart the docker process if it exits prematurely
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s

[Install]
WantedBy=multi-user.target

Version of systemd running:

$ systemctl --version
systemd 233
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN default-hierarchy=hybrid

Things to notice:

  • We set LimitNOFILE, LimitNPROC and LimitCORE to infinity to prevent overhead due to accounting
  • We don't set TasksMax as it's not supported on older versions of systemd (and those versions are not affected by systemd setting a low value)

Configuring TasksMax

On newer versions of systemd we should set TasksMax, because the default set by systemd is too low. All Docker processes, including containers, are started as children of dockerd, so 4915 processes can easily be reached on bigger servers (see moby/moby#23332). (It looks like the limit was raised since the original default of 512; see systemd/systemd#3211.)

$ cat /sys/fs/cgroup/pids/system.slice/docker.service/pids.max
4915

In our .deb packages we automatically set this option based on systemd version; we should have a similar approach for our RPM packages.

From the systemd man-page:

TasksMax=N
Specify the maximum number of tasks that may be created in the unit. This
ensures that the number of tasks accounted for the unit (see above) stays below
a specific limit. This either takes an absolute number of tasks or a percentage
value that is taken relative to the configured maximum number of tasks on the
system. If assigned the special value "infinity", no tasks limit is applied.
This controls the "pids.max" control group attribute. For details about this
control group attribute, see pids.txt.

Implies "TasksAccounting=true". The system default for this setting may be
controlled with DefaultTasksMax= in systemd-system.conf(5).
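
As a rough sketch of what a version-conditional approach for the RPMs could look like (a hypothetical post-install snippet, not the actual packaging code; the drop-in path and file name are only for illustration):

# Hypothetical post-install snippet: only set TasksMax when systemd is new enough.
# TasksMax / TasksAccounting were added in systemd 227 (see comment further down).
SYSTEMD_VERSION="$(systemctl --version | awk 'NR==1 {print $2}')"
if [ "$SYSTEMD_VERSION" -ge 227 ]; then
    mkdir -p /etc/systemd/system/docker.service.d
    printf '[Service]\nTasksMax=infinity\n' > /etc/systemd/system/docker.service.d/tasksmax.conf
    systemctl daemon-reload
fi

(The debs modify the unit file itself; a drop-in as above would be one of the alternatives listed in the questions below.)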

Disable accounting (if possible)

Reading the blog post "Enable CPU and Memory accounting for docker (or any systemd service)", I found that systemd has options to disable accounting. We should consider using these options instead of setting the limits to infinity (which has the same effect). I have not yet found which version of systemd introduced these options.

The following options are available (see systemd.resource-control):

  • MemoryAccounting=no
  • TasksAccounting=no (same result as our current TasksMax=infinity)
  • CPUAccounting=no
  • IOAccounting=no (replaces BlockIOAccounting)
  • BlockIOAccounting=no (deprecated, see IOAccounting)

The defaults on Fedora 26 look like this:

[root@fedora-2gb-ams3-01 ~]# systemctl show docker | grep Accounting
CPUAccounting=no
IOAccounting=no
BlockIOAccounting=no
MemoryAccounting=no
TasksAccounting=yes
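
For illustration, explicitly turning accounting off (instead of relying on infinity limits) could be done with a drop-in along these lines (a sketch; the file name is arbitrary, and whether each option is available depends on the minimum systemd version we need to support):

# /etc/systemd/system/docker.service.d/accounting.conf (name is illustrative)
[Service]
CPUAccounting=no
MemoryAccounting=no
TasksAccounting=no
IOAccounting=no
# BlockIOAccounting=no on older systemd that lacks IOAccounting

followed by systemctl daemon-reload && systemctl restart docker to apply.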

Questions to answer

  • Which version of systemd introduced the xxAccounting options?
  • What's the right approach for the RPM packages to override the default based on systemd version? (custom crafted unit file, drop-in file, or modify the unit file like we do for the debs?)
@thaJeztah (Member Author)

ping @andrewhsu @seemethere PTAL

@thaJeztah (Member Author)

TasksAccounting was added in the same version as TasksMax; both in version 227

@thaJeztah (Member Author)

The LimitNOFILE, LimitNPROC and LimitCORE options all set ulimits (see the systemd.exec man page).

It looks like there's no common flag to set all of these to "unlimited":

[root@fedora-2gb-ams3-01 ~]# systemctl show docker | grep Limit
MemoryLimit=18446744073709551615
LimitCPU=18446744073709551615
LimitCPUSoft=18446744073709551615
LimitFSIZE=18446744073709551615
LimitFSIZESoft=18446744073709551615
LimitDATA=18446744073709551615
LimitDATASoft=18446744073709551615
LimitSTACK=18446744073709551615
LimitSTACKSoft=8388608
LimitCORE=18446744073709551615
LimitCORESoft=18446744073709551615
LimitRSS=18446744073709551615
LimitRSSSoft=18446744073709551615
LimitNOFILE=18446744073709551615
LimitNOFILESoft=18446744073709551615
LimitAS=18446744073709551615
LimitASSoft=18446744073709551615
LimitNPROC=18446744073709551615
LimitNPROCSoft=18446744073709551615
LimitMEMLOCK=65536
LimitMEMLOCKSoft=65536
LimitLOCKS=18446744073709551615
LimitLOCKSSoft=18446744073709551615
LimitSIGPENDING=7912
LimitSIGPENDINGSoft=7912
LimitMSGQUEUE=819200
LimitMSGQUEUESoft=819200
LimitNICE=0
LimitNICESoft=0
LimitRTPRIO=0
LimitRTPRIOSoft=0
LimitRTTIME=18446744073709551615
LimitRTTIMESoft=18446744073709551615

@seemethere

@thaJeztah is this still an issue?

@thaJeztah (Member Author) commented Jan 18, 2018

Actually, yes, this is still an issue: I forgot I opened this one, but this is what @kolyshkin ran into while working on these for Docker EE.

We should also review the other options, and see if we should use those.

@jmarcos-cano commented Jul 13, 2018

This might not be the right place to ask, but I'm currently facing some problems with my Node.js app in Docker.

According to this article, the way to enable core dumps is by setting ulimit -c unlimited; apparently Docker for CentOS 7 turns this on by default (its unit file sets it to 'infinity').

Regarding Docker: is it safe to set LimitCORE=0?

I'm afraid of:

Having non-zero Limit*s causes performance problems due to accounting overhead in the kernel. We recommend using cgroups to do container-local accounting.
We set LimitNOFILE, LimitNPROC and LimitCORE to infinity to prevent overhead due to accounting

@thaJeztah (Member Author)

Looks like you want to enable this for a specific container; in that case, use the --ulimit option on docker run: https://docs.docker.com/engine/reference/commandline/run/#set-ulimits-in-container---ulimit
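
For example (the image name is only a placeholder), overriding the core ulimit for a single container looks like:

# disable core dumps only for this container (soft and hard limit both 0)
docker run --ulimit core=0 my-nodejs-image

# or the reverse: allow unlimited core dumps only where you actually need them
docker run --ulimit core=-1:-1 my-nodejs-image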

@jmarcos-cano commented Jul 13, 2018

Sorry, I didn't mention it earlier: I'm in swarm mode, sir.

@thaJeztah (Member Author)

Nope, that option isn't available (yet?) for swarm mode. Is the problem not reproducible when starting a container manually (docker run)? Setting the daemon option would affect every container, so I doubt that's what you want.
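
For reference, the daemon-wide option referred to here is default-ulimits in the daemon configuration (typically /etc/docker/daemon.json); something like the following changes the default for every container, which is probably broader than you want (values are only illustrative):

{
  "default-ulimits": {
    "core": {
      "Name": "core",
      "Hard": 0,
      "Soft": 0
    }
  }
}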

@jmarcos-cano

Yep, it is not reproducible with a single docker run. And yep, I could disable that feature for every container; however, I'm just afraid of:

Having non-zero Limit*s causes performance problems due to accounting overhead in the kernel. > We recommend using cgroups to do container-local accounting.
We set LimitNOFILE, LimitNPROC and LimitCORE to infinity to prevent overhead due to accounting

What are the performance problems (kernel overhead)?

@thaJeztah (Member Author)

The performance concern is that the kernel enables accounting when those options are set; I'm not sure if the overhead has been benchmarked, or how large it is.

@thaJeztah (Member Author)

Wondering though: we currently set these options on dockerd because (before Docker 18.09) containerd, and thus the containers, were started as child processes of dockerd.

With 18.09, containerd is started as a standalone unit and is no longer a child process of dockerd, so I think we need to tweak those limits keeping in mind that they only apply to dockerd itself 🤔

@thaJeztah (Member Author)

Also, traefik/traefik#4302 mentions some options that I didn't know about, but they may not be applicable for our daemons.

@polarathene commented Mar 6, 2023

UPDATE: With the exception of some talk below about accounting, the bulk of this has been revised into a much lengthier document: containerd/containerd#7566 (comment)


Personal experience

FWIW, this was the cause of some difficult-to-debug behaviour with a container that ran a daemonized process which initialized by iterating through the inherited range of FDs and closing each one. That's a fairly common practice, although we now have more modern syscalls to do it efficiently (and the program has since been fixed upstream).

For the maintainers of a Docker image that used that program, this delayed program start-up by 8 minutes (over a billion FDs to loop through); because the maintainers were on different distros, it was originally unclear how to reproduce this until the issue was identified.

We've been using --ulimit "nofile=$(ulimit -Sn):$(ulimit -Hn)" with docker run, so running our test suite locally no longer hits that issue causing tests to fail from timeouts.


I recently saw the comment in the docker.service file (which led me here from Google) and wasn't sure what the advice was advocating, but now I see in this issue that I should enable various accounting features? (It's a bit unclear, since the comment says accounting in the kernel is the perf issue?)

I'm not sure what the actual perf impact that the config warns against would otherwise be. But I do know that these infinity defaults have caused surprises twice when running third-party software (postsrsd and fail2ban) in containers that would otherwise be fine on the host, because the limits are not excessive there. Both projects have since addressed the issue on their end (although some distros, like Debian, still have not pulled in the update with the fix for postsrsd).

When it's not practical to adjust docker.service, should we advise users to add --ulimit "nofile=$(ulimit -Sn):$(ulimit -Hn)"? That inherits the limits as if running on the host, but I do recall reading that dockerd may need higher limits than the host defaults to support many containers or specific workloads (but how often do they really need limits over a million?).
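
As a concrete invocation of that suggestion (busybox is only a stand-in image), the container then starts with the calling shell's soft/hard nofile limits:

docker run --rm --ulimit "nofile=$(ulimit -Sn):$(ulimit -Hn)" busybox sh -c 'ulimit -n'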


Impact of cited kernel accounting overhead is ambiguous

Looking into CPUAccounting, I am wondering if things have changed since the issue was opened in 2017?

History of changes to limits in docker.service

Looking at the history of docker.service (also summed up well by @thaJeztah in Oct 2022 here):

  • May 2021: LimitNOFILE=infinity + LimitNPROC=infinity brought back into docker.service to sync with Docker CE's equivalent config.
    • This PR was a merge commit of this one from Sep 2018 (commit history interleaved ranges from 2017-2021).
    • As the PR's main diff shows, LimitNPROC=infinity was already added, and LimitNOFILE=1048576 was changed to infinity by the PR merge (this initially confused me, since I was using git blame on the master branch).
  • July 2016: LimitNOFILE=infinity changed to LimitNOFILE=1048576 (this number is 2^20).
    • Discussion references a 2009 StackOverflow answer about infinity being capped to 2^20 in a specific distro release / kernel. On some systems today, that ceiling can be roughly 1024 times higher (about 2^30: 1073741816, over 1 billion).
  • July 2016: LimitNOFILE and LimitNPROC changed from 1048576 to infinity.
    • This PR reverted the LimitNOFILE change shortly after as described above.
  • March 2014: Original LimitNOFILE + LimitNPROC added with 1048576.
    • Linked PR comment mentions that this 2^20 value is already higher than Docker needs:

      there's nothing special about 1048576. In fact it's higher than is actually necessary (at the moment). It is however what we've tested with

    • It appears it was later changed to infinity to improve CI times where a smaller limit was applied (like this comment about Ubuntu 14.04 adjusting any limit exceeding 2^20 to 2^10?).
    • PR also referenced relevant systemd docs (which may have changed since 2014):

      LimitNOFILE: Don't use. Be careful when raising the soft limit above 1024, since select(2) cannot function with file descriptors above 1023 on Linux.
      Nowadays, the hard limit defaults to 524288, a very high value compared to historical defaults

      Typically applications should increase their soft limit to the hard limit on their own, if they are OK with working with file descriptors above 1023, (i.e. do not use select(2)).
      Note that file descriptors are nowadays accounted like any other form of memory, thus there should not be any need to lower the hard limit.

I've seen a project where the hard limit on an Amazon image needed to be raised to roughly 100k, well below the 2^20 hard-coded limit set historically (the AWS team had their own issues with limits being set excessively low instead).

There's the docker-library/mysql issue mentioned above my comment, where memory consumption can be ridiculous. Similar scenarios are mentioned in the related Oct 2022 containerd discussion.

So is it still relevant to follow the linked advice to enable CPU and Memory accounting? I have not checked, but I assume it doesn't work around the problems caused by excessive ulimit values in docker.service?


Advice for non-zero limits seems contradictory and can result in far worse side-effects

It seems that infinity is more problematic / surprising than having no limit config in docker.service at all? Comment the lines out if you think they're worth retaining for some Docker deployments (and to prevent future regressions).

This seems more like a problem affecting large-scale enterprise deployments, which should already have the expertise (or be informed enough) to adjust the limits in docker.service to infinity or another appropriate value, rather than a sane default to ship?

Docker images that actually require raised limits can document that, which seems far more relevant than expecting images to document lowering limits (or expecting users to run containers with sane limits), especially when the problem comes from third-party programs daemonizing or otherwise interacting with the limits (like mysql), programs that behave appropriately with sane limits instead of the maximum possible via infinity. (It's extra frustrating when some environments, like Debian, were limited to 2^20 despite docker.service being configured with LimitNOFILE=infinity.)

In these situations, the inline documentation that justifies infinity with overhead concerns is conflicting, since infinity causes much higher resource usage, with far more severe impact than the overhead concerns I sourced from investigating above.

@polarathene commented Oct 6, 2023

Memory accounting overhead resource

I have a WIP document from earlier in the year, when I was active here, that documented references regarding accounting overhead, IIRC. I may not get around to resuming work on it to publish here, but I recently saw this article about the upcoming 6.7 kernel that might be a helpful reference.

In particular, it demonstrates that memory accounting overhead is now minimized for root cgroups, while user cgroups still aren't quite there, but both see roughly a 30% reduction:

[benchmark table from the linked article: memory accounting overhead reductions]

That is from a micro-benchmark designed to stress such overhead, so in real-world usage the overhead is probably significantly less still?

If you are interested in the draft resource I had been working on previously, I could dig it up and provide it as-is. It's probably not necessary, as it's better to have known problems that workarounds can reference, which IIRC wasn't the case for the config concerns here? (The status in 2019 was that it was unclear whether the config's overhead disclaimer remained accurate.)


Overhead disclaimer

# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.

FWIW, for anybody landing here and reading the earlier discussion above on LimitNOFILE=infinity: both moby and containerd have now merged a fix that removes the explicit LimitNOFILE line, relying instead on the implicit 1024:524288 limits.

In these situations, the inline documentation that justifies infinity with overhead concerns is conflicting, since infinity causes much higher resource usage, with far more severe impact than the overhead concerns I sourced from investigating above. - End of prior comment

This observation of mine may only have been due to LimitNOFILE=infinity.

The now reasonable 1024:524288 limits still easily permit running 65k busybox containers.
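
A quick way to check which nofile limits a container actually ends up with (illustrative; assumes busybox's ulimit supports -S/-H):

# prints the soft and hard RLIMIT_NOFILE as seen inside the container
docker run --rm busybox sh -c 'ulimit -Sn; ulimit -Hn'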

Additionally, see the earlier 2019 comment in this discussion regarding dockerd no longer starting containerd as a child process. Other changes and improvements may have happened over the years too.

An opinion on LimitNPROC=infinity shared in Feb 2023 suggests leaving it as-is is probably fine for now (until there are reports similar to the RLIMIT_NOFILE ones to suggest a change is warranted).

LimitCORE=infinity has a known problem (reported July 2021):

Several customers had full disks because of too many core dumps.
There is no reason to allow more than a few MB. (the number is in 1K units)

I found that this is the issue causing our devices to brick themselves.
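
For deployments hit by that, one option (a sketch, not an official recommendation; the size is only illustrative) is a drop-in that bounds core dumps instead of allowing unlimited ones:

# /etc/systemd/system/docker.service.d/limit-core.conf (name is illustrative)
[Service]
LimitCORE=512M

followed by systemctl daemon-reload && systemctl restart docker to apply.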
