Review / revisit systemd unit files #73

Open · thaJeztah opened this issue Aug 2, 2017 · 15 comments

@thaJeztah (Member) commented Aug 2, 2017

We currently have two different unit files: one for .deb-based packages, and one for .rpm. The .rpm version currently assumes systemd 226 or older, which is correct for CentOS and RHEL (RHEL 7.4 ships systemd-219-42.el7.x86_64), but incorrect for (at least) Fedora.

Default install of Docker CE 17.07 on Fedora 26:

systemctl cat docker.service
# /usr/lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service
Wants=network-online.target

[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd
ExecReload=/bin/kill -s HUP $MAINPID
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
# TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process
# restart the docker process if it exits prematurely
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s

[Install]
WantedBy=multi-user.target

Version of systemd running:

$ systemctl --version
systemd 233
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN default-hierarchy=hybrid

Things to notice:

  • We set LimitNOFILE, LimitNPROC and LimitCORE to infinity to prevent overhead due to accounting
  • We don't set TasksMax as it's not supported on older versions of systemd (and those versions are not affected by systemd setting a low value)

Configuring TasksMax

On newer versions of systemd we should set TasksMax, because the default set by systemd is too low. All Docker processes, including containers, are started as children of dockerd, so 4915 processes can easily be reached on bigger servers (see moby/moby#23332). (It looks like the limit was raised since the original default of 512; see systemd/systemd#3211.)

$ cat /sys/fs/cgroup/pids/system.slice/docker.service/pids.max
4915

In our .deb packages we automatically set this option based on systemd version; we should have a similar approach for our RPM packages.

From the systemd man-page:

TasksMax=N
Specify the maximum number of tasks that may be created in the unit. This
ensures that the number of tasks accounted for the unit (see above) stays below
a specific limit. This either takes an absolute number of tasks or a percentage
value that is taken relative to the configured maximum number of tasks on the
system. If assigned the special value "infinity", no tasks limit is applied.
This controls the "pids.max" control group attribute. For details about this
control group attribute, see pids.txt.

Implies "TasksAccounting=true". The system default for this setting may be
controlled with DefaultTasksMax= in systemd-system.conf(5).
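
As a rough sketch of what a version-conditional approach for the RPMs could look like (a hypothetical post-install snippet, not the actual packaging code; the drop-in path and file name are only for illustration):

# Hypothetical post-install snippet: only set TasksMax when systemd is new enough.
# TasksMax / TasksAccounting were added in systemd 227 (see comment further down).
SYSTEMD_VERSION="$(systemctl --version | awk 'NR==1 {print $2}')"
if [ "$SYSTEMD_VERSION" -ge 227 ]; then
    mkdir -p /etc/systemd/system/docker.service.d
    printf '[Service]\nTasksMax=infinity\n' > /etc/systemd/system/docker.service.d/tasksmax.conf
    systemctl daemon-reload
fi

(The debs modify the unit file itself; a drop-in as above would be one of the alternatives listed in the questions below.)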

Disable accounting (if possible)

Reading the blog post "Enable CPU and Memory accounting for docker (or any systemd service)", I found that systemd has options to disable accounting. We should consider using these options instead of setting the limits to infinity (which has the same effect). I have not yet found which version of systemd introduced these options.

The following options are available (see systemd.resource-control):

  • MemoryAccounting=no
  • TasksAccounting=no (same result as our current TasksMax=infinity)
  • CPUAccounting=no
  • IOAccounting=no (replaces BlockIOAccounting)
  • BlockIOAccounting=no (deprecated, see IOAccounting)

The defaults on Fedora 26 look like this:

[root@fedora-2gb-ams3-01 ~]# systemctl show docker | grep Accounting
CPUAccounting=no
IOAccounting=no
BlockIOAccounting=no
MemoryAccounting=no
TasksAccounting=yes
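
For illustration, explicitly turning accounting off (instead of relying on infinity limits) could be done with a drop-in along these lines (a sketch; the file name is arbitrary, and whether each option is available depends on the minimum systemd version we need to support):

# /etc/systemd/system/docker.service.d/accounting.conf (name is illustrative)
[Service]
CPUAccounting=no
MemoryAccounting=no
TasksAccounting=no
IOAccounting=no
# BlockIOAccounting=no on older systemd that lacks IOAccounting

followed by systemctl daemon-reload && systemctl restart docker to apply.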

Questions to answer

  • Which version of systemd introduced the xxAccounting options?
  • What's the right approach for the RPM packages to override the default based on systemd version? (custom crafted unit file, drop-in file, or modify the unit file like we do for the debs?)
@thaJeztah (Member Author)

ping @andrewhsu @seemethere PTAL

@thaJeztah (Member Author)

TasksAccounting was added in the same version as TasksMax; both in version 227

@thaJeztah (Member Author)

The LimitNOFILE, LimitNPROC and LimitCORE options all set ulimits (see the systemd.exec man page).

It looks like there's no common flag to set all of these to "unlimited":

[root@fedora-2gb-ams3-01 ~]# systemctl show docker | grep Limit
MemoryLimit=18446744073709551615
LimitCPU=18446744073709551615
LimitCPUSoft=18446744073709551615
LimitFSIZE=18446744073709551615
LimitFSIZESoft=18446744073709551615
LimitDATA=18446744073709551615
LimitDATASoft=18446744073709551615
LimitSTACK=18446744073709551615
LimitSTACKSoft=8388608
LimitCORE=18446744073709551615
LimitCORESoft=18446744073709551615
LimitRSS=18446744073709551615
LimitRSSSoft=18446744073709551615
LimitNOFILE=18446744073709551615
LimitNOFILESoft=18446744073709551615
LimitAS=18446744073709551615
LimitASSoft=18446744073709551615
LimitNPROC=18446744073709551615
LimitNPROCSoft=18446744073709551615
LimitMEMLOCK=65536
LimitMEMLOCKSoft=65536
LimitLOCKS=18446744073709551615
LimitLOCKSSoft=18446744073709551615
LimitSIGPENDING=7912
LimitSIGPENDINGSoft=7912
LimitMSGQUEUE=819200
LimitMSGQUEUESoft=819200
LimitNICE=0
LimitNICESoft=0
LimitRTPRIO=0
LimitRTPRIOSoft=0
LimitRTTIME=18446744073709551615
LimitRTTIMESoft=18446744073709551615

@seemethere

@thaJeztah is this still an issue?

@thaJeztah (Member Author) commented Jan 18, 2018

Actually, yes, this is still an issue: I forgot I opened this one, but this is what @kolyshkin ran into while working on these for Docker EE.

We should also review the other options, and see if we should use those.

@jmarcos-cano commented Jul 13, 2018

This might not be the right place to ask, but I'm currently facing some problems with my Node.js app in Docker.

According to this article, the way to enable core dumps is by setting ulimit -c unlimited; apparently Docker for CentOS 7 turns this on by default (its unit file sets it to 'infinity').

Regarding Docker: is it safe to set LimitCORE=0?

I'm afraid of:

Having non-zero Limit*s causes performance problems due to accounting overhead in the kernel. We recommend using cgroups to do container-local accounting.
We set LimitNOFILE, LimitNPROC and LimitCORE to infinity to prevent overhead due to accounting

@thaJeztah (Member Author)

Looks like you want to enable this for a specific container; in that case, use the --ulimit option on docker run: https://docs.docker.com/engine/reference/commandline/run/#set-ulimits-in-container---ulimit
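
For example (the image name is only a placeholder), overriding the core ulimit for a single container looks like:

# disable core dumps only for this container (soft and hard limit both 0)
docker run --ulimit core=0 my-nodejs-image

# or the reverse: allow unlimited core dumps only where you actually need them
docker run --ulimit core=-1:-1 my-nodejs-image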

@jmarcos-cano commented Jul 13, 2018

Sorry, I didn't mention it earlier: I'm in swarm mode, sir.

@thaJeztah (Member Author)

Nope, that option isn't available (yet?) for swarm mode. Is the problem not reproducible when starting a container manually (docker run)? Setting the daemon option would affect every container, so I doubt that's what you want.
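
For reference, the daemon-wide option referred to here is default-ulimits in the daemon configuration (typically /etc/docker/daemon.json); something like the following changes the default for every container, which is probably broader than you want (values are only illustrative):

{
  "default-ulimits": {
    "core": {
      "Name": "core",
      "Hard": 0,
      "Soft": 0
    }
  }
}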

@jmarcos-cano

Yep, it is not reproducible with a single docker run. And yep, I could disable that feature for every container; however, I'm just afraid of:

Having non-zero Limit*s causes performance problems due to accounting overhead in the kernel. > We recommend using cgroups to do container-local accounting.
We set LimitNOFILE, LimitNPROC and LimitCORE to infinity to prevent overhead due to accounting

What are the performance problems (kernel overhead)?

@thaJeztah (Member Author)

The performance concern is that the kernel enables accounting when those options are set; I'm not sure if the overhead has been benchmarked, or how large it is.

@thaJeztah (Member Author)

Wondering though: we currently set these options on dockerd because (before Docker 18.09) containerd, and thus the containers, were started as child processes of dockerd.

With 18.09, containerd is started as a standalone unit and is no longer a child process of dockerd, so I think we need to tweak those limits keeping in mind that they only apply to dockerd itself 🤔

@thaJeztah (Member Author)

Also, traefik/traefik#4302 mentions some options that I didn't know about, but they may not be applicable for our daemons.

@polarathene commented Mar 6, 2023

UPDATE: With the exception of some talk below about accounting, the bulk of this has been revised into a much lengthier document: containerd/containerd#7566 (comment)


Personal experience

FWIW, this was the cause of some difficult-to-debug behaviour with a container that ran a daemonized process which initialized by iterating through the inherited range of FDs and closing each one. That's a fairly common practice, although we now have more modern syscalls to do it efficiently (and the program has since been fixed upstream).

For the maintainers of a Docker image that used that program, this delayed program start-up by 8 minutes (over a billion FDs to loop through); because the maintainers were on different distros, it was originally unclear how to reproduce this until the issue was identified.

We've been using --ulimit "nofile=$(ulimit -Sn):$(ulimit -Hn)" with docker run, so running our test suite locally no longer hits that issue causing tests to fail from timeouts.


I recently saw the comment in the docker.service file (which led me here from Google) and wasn't sure what the advice was advocating, but now I see in this issue that I should enable various accounting features? (It's a bit unclear, since the comment says accounting in the kernel is the perf issue?)

I'm not sure what the actual perf impact that the config warns against would otherwise be. But I do know that these infinity defaults have caused surprises twice when running third-party software (postsrsd and fail2ban) in containers that would otherwise be fine on the host, because the limits are not excessive there. Both projects have since addressed the issue on their end (although some distros, like Debian, still have not pulled in the update with the fix for postsrsd).

When it's not practical to adjust docker.service, should we advise users to add --ulimit "nofile=$(ulimit -Sn):$(ulimit -Hn)"? That inherits the limits as if running on the host, but I do recall reading that dockerd may need higher limits than the host defaults to support many containers or specific workloads (but how often do they really need limits over a million?).
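
As a concrete invocation of that suggestion (busybox is only a stand-in image), the container then starts with the calling shell's soft/hard nofile limits:

docker run --rm --ulimit "nofile=$(ulimit -Sn):$(ulimit -Hn)" busybox sh -c 'ulimit -n'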


Impact of cited kernel accounting overhead is ambiguous

Looking into CPUAccounting, I am wondering if things have changed since the issue was opened in 2017?

History of changes to limits in docker.service

Looking at the history of docker.service (also summed up well by @thaJeztah in Oct 2022 here):

  • May 2021: LimitNOFILE=infinity + LimitNPROC=infinity brought back into docker.service to sync with Docker CE's equivalent config.
    • This PR was a merge commit of this one from Sep 2018 (commit history interleaved ranges from 2017-2021).
    • As the PR's main diff shows, LimitNPROC=infinity was already added, and LimitNOFILE=1048576 was changed to infinity by the PR merge (this initially confused me, since I was using git blame on the master branch).
  • July 2016: LimitNOFILE=infinity changed to LimitNOFILE=1048576 (this number is 2^20).
    • Discussion references a 2009 StackOverflow answer about infinity being capped to 2^20 in a specific distro release / kernel. On some systems today, that ceiling can be roughly 1024 times higher (about 2^30: 1073741816, over 1 billion).
  • July 2016: LimitNOFILE and LimitNPROC changed from 1048576 to infinity.
    • This PR reverted the LimitNOFILE change shortly after as described above.
  • March 2014: Original LimitNOFILE + LimitNPROC added with 1048576.
    • Linked PR comment mentions that this 2^20 value is already higher than Docker needs:

      there's nothing special about 1048576. In fact it's higher than is actually necessary (at the moment). It is however what we've tested with

    • It appears it was later changed to infinity to improve CI times where a smaller limit was applied (like this comment about Ubuntu 14.04 adjusting any limit exceeding 2^20 to 2^10?).
    • PR also referenced relevant systemd docs (which may have changed since 2014):

      LimitNOFILE: Don't use. Be careful when raising the soft limit above 1024, since select(2) cannot function with file descriptors above 1023 on Linux.
      Nowadays, the hard limit defaults to 524288, a very high value compared to historical defaults

      Typically applications should increase their soft limit to the hard limit on their own, if they are OK with working with file descriptors above 1023, (i.e. do not use select(2)).
      Note that file descriptors are nowadays accounted like any other form of memory, thus there should not be any need to lower the hard limit.

I've seen a project where the hard limit on an Amazon image needed to be raised to roughly 100k, well below the 2^20 hard-coded limit set historically (the AWS team had their own issues with limits being set excessively low instead).

There's the docker-library/mysql issue mentioned above my comment, where memory consumption can be ridiculous. Similar scenarios are mentioned in the related Oct 2022 containerd discussion.

So is it still relevant to follow the linked advice to enable CPU and Memory accounting? I have not checked, but I assume it doesn't work around the problems caused by excessive ulimit values in docker.service?


Advice for non-zero limits seems contradictory and can result in far worse side-effects

It seems that infinity is more problematic / surprising than having no limit config in docker.service at all? Comment the lines out if you think they're worth retaining for some Docker deployments (and to prevent future regressions).

This seems more like a problem affecting large-scale enterprise deployments, which should already have the expertise (or be informed enough) to adjust the limits in docker.service to infinity or another appropriate value, rather than a sane default to ship?

Docker images that actually require raised limits can document that, which seems far more relevant than expecting images to document lowering limits (or expecting users to run containers with sane limits), especially when the problem comes from third-party programs daemonizing or otherwise interacting with the limits (like mysql), programs that behave appropriately with sane limits instead of the maximum possible via infinity. (It's extra frustrating when some environments, like Debian, were limited to 2^20 despite docker.service being configured with LimitNOFILE=infinity.)

In these situations, the inline documentation that justifies infinity with overhead concerns is conflicting, since infinity causes much higher resource usage, with far more severe impact than the overhead concerns I sourced from investigating above.

@polarathene commented Oct 6, 2023

Memory accounting overhead resource

I have a WIP document from earlier in the year, when I was active here, that documented references regarding accounting overhead, IIRC. I may not get around to resuming work on it to publish here, but I recently saw this article about the upcoming 6.7 kernel that might be a helpful reference.

In particular, it demonstrates that memory accounting overhead is now minimized for root cgroups, while user cgroups still aren't quite there, but both see roughly a 30% reduction:

[benchmark table from the linked article: memory accounting overhead reductions]

That is from a micro-benchmark designed to stress such overhead, so in real-world usage the overhead is probably significantly less still?

If you are interested in the draft resource I had been working on previously, I could dig it up and provide it as-is. It's probably not necessary, as it's better to have known problems that workarounds can reference, which IIRC wasn't the case for the config concerns here? (The status in 2019 was that it was unclear whether the config's overhead disclaimer remained accurate.)


Overhead disclaimer

# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.

FWIW, for anybody landing here and reading the earlier discussion above on LimitNOFILE=infinity: both moby and containerd have now merged a fix that removes the explicit LimitNOFILE line, relying instead on the implicit 1024:524288 limits.

In these situations, the inline documentation that justifies infinity with overhead concerns is conflicting, since infinity causes much higher resource usage, with far more severe impact than the overhead concerns I sourced from investigating above. - End of prior comment

This observation of mine may only have been due to LimitNOFILE=infinity.

The now reasonable 1024:524288 limits still easily permit running 65k busybox containers.
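
A quick way to check which nofile limits a container actually ends up with (illustrative; assumes busybox's ulimit supports -S/-H):

# prints the soft and hard RLIMIT_NOFILE as seen inside the container
docker run --rm busybox sh -c 'ulimit -Sn; ulimit -Hn'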

Additionally, see the earlier 2019 comment in this discussion regarding dockerd no longer starting containerd as a child process. Other changes and improvements may have happened over the years too.

An opinion on LimitNPROC=infinity shared in Feb 2023 suggests leaving it as-is is probably fine for now (until there are reports similar to the RLIMIT_NOFILE ones to suggest a change is warranted).

LimitCORE=infinity has a known problem (reported July 2021):

Several customers had full disks because of too many core dumps.
There is no reason to allow more than a few MB. (the number is in 1K units)

I found that this is the issue causing our devices to brick themselves.
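
For deployments hit by that, one option (a sketch, not an official recommendation; the size is only illustrative) is a drop-in that bounds core dumps instead of allowing unlimited ones:

# /etc/systemd/system/docker.service.d/limit-core.conf (name is illustrative)
[Service]
LimitCORE=512M

followed by systemctl daemon-reload && systemctl restart docker to apply.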
