v20.10 docs for cgroup v2 and rootless
* Docker now supports cgroup v2 (both rootful and rootless)
* Rootless mode graduated from experimental
* New storage driver: fuse-overlayfs

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
AkihiroSuda committed Oct 29, 2020
1 parent ea91172 commit b52599b
Showing 5 changed files with 134 additions and 48 deletions.
58 changes: 57 additions & 1 deletion config/containers/runmetrics.md
@@ -55,6 +55,18 @@ $ grep cgroup /proc/mounts

### Enumerate cgroups

The file layout of cgroups is significantly different between v1 and v2.

If `/sys/fs/cgroup/cgroup.controllers` is present on your system, you are using v2;
otherwise, you are using v1.
Refer to the subsection that corresponds to your cgroup version.
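
As a minimal sketch of the check described above, the following shell function tests for the `cgroup.controllers` file. The function name `cgroup_version` and its optional mount point argument are illustrative and are not part of Docker.

```bash
# Print "v2" if cgroup.controllers exists at the given cgroup root, otherwise "v1".
# The root defaults to /sys/fs/cgroup; the argument exists only to ease testing.
cgroup_version() {
    root=${1:-/sys/fs/cgroup}
    if [ -f "$root/cgroup.controllers" ]; then
        echo v2
    else
        echo v1
    fi
}

cgroup_version
```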

> **Note**
>
> As of 2020, Fedora is the only well-known Linux distribution that uses cgroup v2 by default.
> Fedora has used cgroup v2 by default since Fedora 31.

#### cgroup v1
You can look into `/proc/cgroups` to see the different control group subsystems
known to the system, the hierarchy they belong to, and how many groups they contain.

@@ -64,6 +76,41 @@ the hierarchy mountpoint. `/` means the process has not been assigned to a
group, while `/lxc/pumpkin` indicates that the process is a member of a
container named `pumpkin`.

#### cgroup v2

On cgroup v2 hosts, the content of `/proc/cgroups` isn't meaningful.
See `/sys/fs/cgroup/cgroup.controllers` for the available controllers.
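
The following sketch prints the controllers from that file one per line, which can be easier to scan or to pipe into `grep`. The helper name `list_controllers` and its optional root argument are assumptions made for illustration.

```bash
# Print the cgroup v2 controllers listed at the given cgroup root, one per line.
# The root defaults to /sys/fs/cgroup; the argument exists only to ease testing.
list_controllers() {
    root=${1:-/sys/fs/cgroup}
    tr ' ' '\n' < "$root/cgroup.controllers"
}

# Only meaningful on a cgroup v2 host, so guard the call.
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
    list_controllers
fi
```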

### Changing cgroup version

Changing cgroup version requires rebooting the entire system.

On systemd-based systems, cgroup v2 can be enabled by adding `systemd.unified_cgroup_hierarchy=1`
to the kernel cmdline.
To revert the cgroup version to v1, you need to set `systemd.unified_cgroup_hierarchy=0` instead.

If the `grubby` command is available on your system (for example, on Fedora), the cmdline can be modified as follows:

```console
$ sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"
```

If the `grubby` command is not available, edit the `GRUB_CMDLINE_LINUX` line in `/etc/default/grub`
and run `sudo update-grub`.
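
After rebooting, you can confirm that the parameter reached the kernel. The sketch below reads a kernel command line file and reports the value of `systemd.unified_cgroup_hierarchy`. The function name `unified_hierarchy_setting` is hypothetical, and taking the last occurrence when the parameter appears more than once is an assumption of this sketch.

```bash
# Print the value of systemd.unified_cgroup_hierarchy found in a kernel command
# line file, or "unset" if the parameter is absent.
# The file defaults to /proc/cmdline; the argument exists only to ease testing.
unified_hierarchy_setting() {
    cmdline_file=${1:-/proc/cmdline}
    value=$(tr ' ' '\n' < "$cmdline_file" \
        | sed -n 's/^systemd\.unified_cgroup_hierarchy=//p' \
        | tail -n 1)
    echo "${value:-unset}"
}

if [ -r /proc/cmdline ]; then
    unified_hierarchy_setting
fi
```

A result of `unset` means the parameter is not on the command line, in which case the cgroup version is presumably whatever the distribution selects by default.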

### Running Docker on cgroup v2

Docker has experimentally supported cgroup v2 since Docker 20.10.
Running Docker on cgroup v2 also requires the following conditions to be satisfied:
* containerd: v1.4 or later
* runc: v1.0.0-rc91 or later
* Kernel: v4.15 or later (v5.2 or later is recommended)

Note that cgroup v2 mode behaves slightly differently from cgroup v1 mode:
* The default cgroup driver (`dockerd --exec-opt native.cgroupdriver`) is "systemd" on v2, "cgroupfs" on v1.
* The default cgroup namespace mode (`docker run --cgroupns`) is "private" on v2, "host" on v1.
* The `docker run` flags `--oom-kill-disable` and `--kernel-memory` are discarded on v2.
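
The kernel requirement listed above can be checked with a small helper. This is a sketch only: `kernel_at_least` is a hypothetical function that compares just the major and minor numbers of the release string reported by `uname -r`.

```bash
# Return success if kernel release HAVE is at least WANT (major.minor only).
# Usage: kernel_at_least WANT [HAVE]
# HAVE defaults to the running kernel as reported by `uname -r`.
kernel_at_least() {
    want=$1
    have=${2:-$(uname -r)}
    want_major=${want%%.*}
    want_minor=${want#*.}
    want_minor=${want_minor%%[!0-9]*}
    have_major=${have%%.*}
    have_minor=${have#*.}
    have_minor=${have_minor%%[!0-9]*}
    if [ "$have_major" -ne "$want_major" ]; then
        [ "$have_major" -gt "$want_major" ]
    else
        [ "$have_minor" -ge "$want_minor" ]
    fi
}

if kernel_at_least 5.2; then
    echo "kernel meets the recommended version"
elif kernel_at_least 4.15; then
    echo "kernel meets the minimum version"
else
    echo "kernel is older than the minimum version"
fi
```

On a running daemon, `docker info --format '{{.CgroupDriver}}'` prints the cgroup driver in use, which you can compare against the defaults listed above.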

### Find the cgroup for a given container

For each container, one cgroup is created in each hierarchy. On
@@ -78,10 +125,19 @@ in `docker ps`, its long ID might be something like
look it up with `docker inspect` or `docker ps --no-trunc`.

Putting everything together to look at the memory metrics for a Docker
container, take a look at `/sys/fs/cgroup/memory/docker/<longid>/`.
container, take a look at the following paths:
- `/sys/fs/cgroup/memory/docker/<longid>/` on cgroup v1, `cgroupfs` driver
- `/sys/fs/cgroup/memory/system.slice/docker-<longid>.scope/` on cgroup v1, `systemd` driver
- `/sys/fs/cgroup/docker/<longid>/` on cgroup v2, `cgroupfs` driver
- `/sys/fs/cgroup/system.slice/docker-<longid>.scope/` on cgroup v2, `systemd` driver
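
The four cases above can be captured in a small lookup helper. The function name `container_memory_cgroup` and its argument order are illustrative. The paths are taken from the list above and may not apply to other setups, such as rootless mode.

```bash
# Print the memory cgroup directory for a container.
# Usage: container_memory_cgroup VERSION DRIVER LONGID
#   VERSION is v1 or v2; DRIVER is cgroupfs or systemd.
container_memory_cgroup() {
    case "$1/$2" in
        v1/cgroupfs) echo "/sys/fs/cgroup/memory/docker/$3/" ;;
        v1/systemd)  echo "/sys/fs/cgroup/memory/system.slice/docker-$3.scope/" ;;
        v2/cgroupfs) echo "/sys/fs/cgroup/docker/$3/" ;;
        v2/systemd)  echo "/sys/fs/cgroup/system.slice/docker-$3.scope/" ;;
        *) echo "unsupported combination: $1/$2" >&2; return 1 ;;
    esac
}
```

For example, `container_memory_cgroup v2 systemd "$(docker inspect --format '{{.Id}}' mycontainer)"` prints the directory for a container named `mycontainer` (a placeholder name) on a cgroup v2 host that uses the `systemd` driver.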

### Metrics from cgroups: memory, CPU, block I/O

> **Note**
>
> This section has not yet been updated for cgroup v2.
> For further information about cgroup v2, refer to [the kernel documentation](https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html).

For each subsystem (memory, CPU, and block I/O), one or
more pseudo-files exist and contain statistics.

13 changes: 2 additions & 11 deletions engine/install/fedora.md
@@ -160,22 +160,13 @@ $ sudo dnf config-manager \

Docker is installed but not started. The `docker` group is created, but no users are added to the group.

3. Cgroups Exception:
For Fedora 31 and higher, you need to enable the [backward compatibility for Cgroups](https://fedoraproject.org/wiki/Common_F31_bugs#Other_software_issues).

```bash
$ sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"
```

After running the command, you must reboot for the changes to take effect.

4. Start Docker.
3. Start Docker.

```bash
$ sudo systemctl start docker
```

5. Verify that Docker Engine is installed correctly by running the `hello-world`
4. Verify that Docker Engine is installed correctly by running the `hello-world`
image.

```bash
99 changes: 63 additions & 36 deletions engine/security/rootless.md
@@ -11,12 +11,8 @@ the container runtime.
Rootless mode does not require root privileges even during the installation of
the Docker daemon, as long as the [prerequisites](#prerequisites) are met.

Rootless mode was introduced in Docker Engine v19.03.

> **Note**
>
> Rootless mode is an experimental feature and has some limitations. For details,
> see [Known limitations](#known-limitations).
Rootless mode was introduced in Docker Engine v19.03 as an experimental feature.
Rootless mode graduated from experimental in Docker Engine v20.10.

## How it works

@@ -78,35 +74,35 @@ testuser:231072:65536

#### Arch Linux

- Installing `fuse-overlayfs` is recommended. Run `sudo pacman -S fuse-overlayfs`.

- Add `kernel.unprivileged_userns_clone=1` to `/etc/sysctl.conf` (or
`/etc/sysctl.d`) and run `sudo sysctl --system`

#### openSUSE

- Installing `fuse-overlayfs` is recommended. Run `sudo zypper install -y fuse-overlayfs`.

- `sudo modprobe ip_tables iptable_mangle iptable_nat iptable_filter` is required.
This might be required on other distros as well depending on the configuration.

- Known to work on openSUSE 15.

#### Fedora 31 and later

- Fedora 31 uses cgroup v2 by default, which is not yet supported by the containerd runtime.
Run `sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"`
to use cgroup v1.
- You might need `sudo dnf install -y iptables`.
#### CentOS 8 and Fedora

#### CentOS 8
- Installing `fuse-overlayfs` is recommended. Run `sudo dnf install -y fuse-overlayfs`.

- You might need `sudo dnf install -y iptables`.

- Known to work on CentOS 8 and Fedora 32.

#### CentOS 7

- Add `user.max_user_namespaces=28633` to `/etc/sysctl.conf` (or
`/etc/sysctl.d`) and run `sudo sysctl --system`.

- `systemctl --user` does not work by default.
Run the daemon directly without systemd:
`dockerd-rootless.sh --experimental --storage-driver vfs`
Run `dockerd-rootless.sh` directly without systemd.

- Known to work on CentOS 7.7. Older releases require additional configuration
steps.
@@ -118,10 +114,12 @@ testuser:231072:65536

## Known limitations

- Only `vfs` graphdriver is supported. However, on Ubuntu and Debian 10,
`overlay2` and `overlay` are also supported.
- Only the following storage drivers are supported:
  - `overlay2` (only on Ubuntu and Debian 10 hosts)
  - `fuse-overlayfs` (only if running with kernel 4.18 or later, and `fuse-overlayfs` is installed)
  - `vfs`
- Cgroup is supported only when running with cgroup v2 and systemd. See [Limiting resources](#limiting-resources).
- The following features are not supported:
  - Cgroups (including `docker top`, which depends on the cgroups)
  - AppArmor
  - Checkpoint
  - Overlay network
@@ -206,16 +204,8 @@ $ sudo loginctl enable-linger $(whoami)
To run the daemon directly without systemd, you need to run
`dockerd-rootless.sh` instead of `dockerd`:

```console
$ dockerd-rootless.sh --experimental --storage-driver vfs
```

As Rootless mode is experimental, you need to run
`dockerd-rootless.sh` with `--experimental`.

You also need `--storage-driver vfs` unless you are using Ubuntu or Debian 10
kernel. You don't need to care about these flags if you manage the daemon using
systemd, as these flags are automatically added to the systemd unit file.
On Docker 19.03, you had to run `dockerd-rootless.sh` with `--experimental`.
As of Docker 20.10, the `--experimental` flag is no longer needed.

Remarks about directory paths:

@@ -232,7 +222,6 @@ Other remarks:
and network namespaces. You can enter the namespaces by running
`nsenter -U --preserve-credentials -n -m -t $(cat $XDG_RUNTIME_DIR/docker.pid)`.
- `docker info` shows `rootless` in `SecurityOptions`
- `docker info` shows `none` as `Cgroup Driver`

### Client

@@ -265,13 +254,19 @@ To run Rootless Docker inside "rootful" Docker, use the `docker:<version>-dind-r
image instead of `docker:<version>-dind`.

```console
$ docker run -d --name dind-rootless --privileged docker:19.03-dind-rootless --experimental
$ docker run -d --name dind-rootless --privileged docker:20.10-dind-rootless
```

The `docker:<version>-dind-rootless` image runs as a non-root user (UID 1000).
However, `--privileged` is required for disabling seccomp, AppArmor, and mount
masks.

To run Docker 19.03 in Docker, the `--experimental` flag is needed:

```console
$ docker run -d --name dind-rootless --privileged docker:19.03-dind-rootless --experimental
```

### Expose Docker API socket through TCP

To expose the Docker API socket through TCP, you need to launch `dockerd-rootless.sh`
@@ -314,11 +309,39 @@ Or add `net.ipv4.ip_unprivileged_port_start=0` to `/etc/sysctl.conf` (or
`/etc/sysctl.d`) and run `sudo sysctl --system`.

### Limiting resources
Limiting resources with cgroup-related `docker run` flags such as `--cpus`, `--memory`, and `--pids-limit`
is supported only when running with cgroup v2 and systemd.
See [Changing cgroup version](../../config/containers/runmetrics.md) to enable cgroup v2.

If `docker info` shows `none` as `Cgroup Driver`, the conditions are not satisfied.
When these conditions are not satisfied, rootless mode ignores the cgroup-related `docker run` flags.
See [Limiting resources without cgroup](#limiting-resources-without-cgroup) for workarounds.

If `docker info` shows `systemd` as `Cgroup Driver`, the conditions are satisfied.
However, only the `memory` and `pids` controllers are typically delegated to non-root users by default.

```console
$ cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers
memory pids
```
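
To see at a glance which controllers are delegated to the current user, the check above can be wrapped in a helper. This is a sketch: `controller_delegated` is a hypothetical function name, and the default path is the one shown in the command above.

```bash
# Return success if CONTROLLER is listed in the cgroup.controllers file of the
# current user's systemd user manager.
# Usage: controller_delegated CONTROLLER [FILE]
# FILE can be overridden for testing.
controller_delegated() {
    controller=$1
    file=${2:-/sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers}
    [ -r "$file" ] && tr ' ' '\n' < "$file" | grep -qx "$controller"
}

for c in cpu cpuset io memory pids; do
    if controller_delegated "$c"; then
        echo "$c: delegated"
    else
        echo "$c: not delegated"
    fi
done
```

Matching whole lines with `grep -x` keeps `cpu` from being mistaken for `cpuset`.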

In Docker 19.03, rootless mode ignores cgroup-related `docker run` flags such as
`--cpus`, `--memory`, `--pids-limit`.
To allow delegation of all controllers, you need to change the systemd configuration as follows:

However, you can still use the traditional `ulimit` and [`cpulimit`](https://github.com/opsengine/cpulimit),
```console
# mkdir -p /etc/systemd/system/user@.service.d
# cat > /etc/systemd/system/user@.service.d/delegate.conf << EOF
[Service]
Delegate=cpu cpuset io memory pids
EOF
# systemctl daemon-reload
```

> **Note**
>
> Delegating `cpuset` requires systemd 244 or later.

#### Limiting resources without cgroup

Even when cgroup is not available, you can still use the traditional `ulimit` and [`cpulimit`](https://github.com/opsengine/cpulimit),
though they work at process granularity rather than at container granularity,
and can be arbitrarily disabled by the container process.

@@ -388,7 +411,7 @@ On a non-systemd host, you need to create a directory and then set the path:
$ export XDG_RUNTIME_DIR=$HOME/.docker/xrd
$ rm -rf $XDG_RUNTIME_DIR
$ mkdir -p $XDG_RUNTIME_DIR
$ dockerd-rootless.sh --experimental
$ dockerd-rootless.sh
```

> **Note**:
@@ -420,9 +443,11 @@ up automatically. See [Usage](#usage).

**`dockerd` fails with "rootless mode is supported only when running in experimental mode"**

This error occurs when the daemon is launched without the `--experimental` flag.
This error occurs when the daemon is launched without the `--experimental` flag on Docker 19.03.
See [Usage](#usage).

As of Docker 20.10, the `--experimental` flag is no longer needed.

### `docker pull` errors

**docker: failed to register layer: Error processing tar file(exit status 1): lchown &lt;FILE&gt;: invalid argument**
@@ -436,7 +461,9 @@ images. However, 65,536 entries are sufficient for most images. See

**`--cpus`, `--memory`, and `--pids-limit` are ignored**

This is an expected behavior in Docker 19.03. For more information, see [Limiting resources](#limiting-resources).
This is expected behavior in cgroup v1 mode.
To use these flags, the host needs to be configured to use cgroup v2.
For more information, see [Limiting resources](#limiting-resources).

**Error response from daemon: cgroups: cgroup mountpoint does not exist: unknown.**

2 changes: 2 additions & 0 deletions storage/storagedriver/overlayfs-driver.md
@@ -21,6 +21,8 @@ storage driver as `overlay` or `overlay2`.
> For more information about differences between `overlay` vs `overlay2`, check
> [Docker storage drivers](select-storage-driver.md).
> **Note**: For the `fuse-overlayfs` driver, see the [Rootless mode documentation](../../engine/security/rootless.md).

## Prerequisites

OverlayFS is the recommended storage driver, and supported if you meet the following
10 changes: 10 additions & 0 deletions storage/storagedriver/select-storage-driver.md
@@ -34,6 +34,11 @@ Docker supports the following storage drivers:
Linux distributions, and requires no extra configuration.
* `aufs` was the preferred storage driver for Docker 18.06 and older, when
running on Ubuntu 14.04 on kernel 3.13 which had no support for `overlay2`.
* `fuse-overlayfs` is preferred only for running Rootless Docker
on a host that does not provide support for rootless `overlay2`.
On Ubuntu and Debian 10, the `fuse-overlayfs` driver is not needed,
because `overlay2` works even in rootless mode.
See the [Rootless mode documentation](../../engine/security/rootless.md).
* `devicemapper` is supported, but requires `direct-lvm` for production
environments, because `loopback-lvm`, while zero-configuration, has very
poor performance. `devicemapper` was the recommended storage driver for
@@ -98,6 +103,10 @@ release. It is recommended that users of the `overlay` storage driver migrate to
release. It is recommended that users of the `devicemapper` storage driver migrate
to `overlay2`.

> **Note**
>
> The comparison table above does not apply to Rootless mode.
> For the drivers available in Rootless mode, see [the Rootless mode documentation](../../engine/security/rootless.md).

When possible, `overlay2` is the recommended storage driver. When installing
Docker for the first time, `overlay2` is used by default. Previously, `aufs` was
@@ -147,6 +156,7 @@ backing filesystems.
| Storage driver | Supported backing filesystems |
|:----------------------|:------------------------------|
| `overlay2`, `overlay` | `xfs` with ftype=1, `ext4` |
| `fuse-overlayfs` | any filesystem |
| `aufs` | `xfs`, `ext4` |
| `devicemapper` | `direct-lvm` |
| `btrfs` | `btrfs` |