docker restarts containers before containerd socket is opened #203

Closed
bersace opened this issue Oct 6, 2020 · 5 comments
Labels
good first issue, kind/bug

Comments

@bersace

bersace commented Oct 6, 2020

Hi,

Description

For the past few days, my Flatcar server has not restarted containers after a reboot, even though their restart policy is set to always.

Impact

Since updates trigger a reboot, all my services often end up down :-((

Environment and steps to reproduce

  1. Create a container with restart policy as always
  2. reboot

Expected behavior

I expect the containers to successfully restart.

Additional information

My server was migrated from CoreOS. I disabled docker.socket and enabled docker.service according to #175, using systemctl enable --now docker.service and systemctl disable --now docker.socket.

# cat /etc/os-release 
NAME="Flatcar Container Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=2605.6.0
VERSION_ID=2605.6.0
BUILD_ID=2020-09-28-2140
PRETTY_NAME="Flatcar Container Linux by Kinvolk 2605.6.0 (Oklo)"
ANSI_COLOR="38;5;75"
HOME_URL="https://flatcar-linux.org/"
BUG_REPORT_URL="https://issues.flatcar-linux.org"
FLATCAR_BOARD="amd64-usr"

# journalctl --boot -u docker
-- Logs begin at Tue 2020-10-06 15:49:25 UTC, end at Tue 2020-10-06 17:44:38 UTC. --
Oct 06 17:38:09 mantienne systemd[1]: Starting Docker Application Container Engine...
Oct 06 17:38:09 mantienne dockerd[682]: time="2020-10-06T17:38:09.644756869Z" level=info msg="Starting up"
Oct 06 17:38:09 mantienne dockerd[682]: time="2020-10-06T17:38:09.699907778Z" level=info msg="libcontainerd: started new containerd process" pid=738
Oct 06 17:38:09 mantienne dockerd[682]: time="2020-10-06T17:38:09.701025860Z" level=info msg="parsed scheme: \"unix\"" module=grpc
Oct 06 17:38:09 mantienne dockerd[682]: time="2020-10-06T17:38:09.701144665Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Oct 06 17:38:09 mantienne dockerd[682]: time="2020-10-06T17:38:09.701262568Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/docker/containerd/containerd.sock 0  }] }" modul>
Oct 06 17:38:09 mantienne dockerd[682]: time="2020-10-06T17:38:09.701387933Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Oct 06 17:38:09 mantienne dockerd[738]: time="2020-10-06T17:38:09.828401077Z" level=info msg="starting containerd" revision=40779f9760e207feb7ff24cf21236bf5e63b2b17 version=1.3.7
...
Oct 06 17:39:49 mantienne dockerd[682]: time="2020-10-06T17:39:42.841452604Z" level=error msg="Failed to start container 4ed4990aae075c3f833eb6e333c6ab36baee711ba44717a7489cf2ccb20a4a39: OCI runtime create failed> ...
...
Oct 06 17:40:16 mantienne dockerd[682]: time="2020-10-06T17:40:15.945040333Z" level=error msg="stream copy error: reading from a closed fifo"
Oct 06 17:40:16 mantienne dockerd[682]: time="2020-10-06T17:40:15.951512208Z" level=error msg="stream copy error: reading from a closed fifo"
Oct 06 17:40:16 mantienne dockerd[682]: time="2020-10-06T17:40:15.951589580Z" level=error msg="stream copy error: reading from a closed fifo"
Oct 06 17:40:16 mantienne dockerd[682]: time="2020-10-06T17:40:15.951642001Z" level=error msg="stream copy error: reading from a closed fifo"
Oct 06 17:40:16 mantienne dockerd[682]: time="2020-10-06T17:40:15.951676107Z" level=error msg="stream copy error: reading from a closed fifo"
Oct 06 17:40:16 mantienne dockerd[682]: time="2020-10-06T17:40:16.012045326Z" level=error msg="stream copy error: reading from a closed fifo"
Oct 06 17:40:16 mantienne dockerd[682]: time="2020-10-06T17:40:16.015254719Z" level=error msg="stream copy error: reading from a closed fifo"
Oct 06 17:40:16 mantienne dockerd[682]: time="2020-10-06T17:40:16.019335322Z" level=error msg="stream copy error: reading from a closed fifo"
...
Oct 06 17:40:21 mantienne dockerd[682]: time="2020-10-06T17:40:21.834153935Z" level=info msg="Daemon has completed initialization"
Oct 06 17:40:21 mantienne systemd[1]: Started Docker Application Container Engine.
Oct 06 17:40:21 mantienne dockerd[682]: time="2020-10-06T17:40:21.890268502Z" level=info msg="API listen on [::]:2376"
Oct 06 17:40:21 mantienne dockerd[682]: time="2020-10-06T17:40:21.890467617Z" level=info msg="API listen on /var/run/docker.sock"

Do you have any clue about this?

@pothos
Member

pothos commented Oct 7, 2020

This looks like a race condition: the containerd service unit treats the service as ready as soon as the process starts, but it should only count as ready once it can actually accept requests on the socket. The unit needs to be changed to Type=notify, which requires containerd to make an sd_notify call after it has set up the socket. That seems to be supported, so we can try to use it: https://github.com/containerd/containerd/blob/master/containerd.service
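
To illustrate the ordering that Type=notify relies on, here is a minimal Python sketch of the sd_notify readiness protocol: the service binds and listens on its socket first, and only then sends READY=1 to the datagram socket that systemd names in the NOTIFY_SOCKET environment variable. The function name notify_ready and the listen_path parameter are hypothetical and only serve the illustration; this is not containerd's actual code, which is written in Go.

```python
import os
import socket


def notify_ready(listen_path: str) -> socket.socket:
    """Bind and listen on a Unix socket, then tell systemd the service is ready.

    Hypothetical helper for illustration only; not taken from containerd.
    """
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(listen_path)
    server.listen()

    # For Type=notify units, systemd exports the notification socket
    # path in the NOTIFY_SOCKET environment variable.
    notify_path = os.environ.get("NOTIFY_SOCKET")
    if notify_path:
        # A leading "@" denotes a Linux abstract socket (NUL-prefixed name).
        if notify_path.startswith("@"):
            notify_path = "\0" + notify_path[1:]
        with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as notifier:
            # Sent only after listen() succeeded, so units ordered
            # After= this one start once the socket accepts connections.
            notifier.sendto(b"READY=1", notify_path)
    return server
```

With Type=simple, systemd considers the unit started as soon as the process has been launched, so nothing guarantees that the socket exists when dependent units start. With Type=notify, systemd waits for the READY=1 message before starting them.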

@bersace
Author

bersace commented Oct 7, 2020

Hi @pothos, thanks for the quick answer :-)

Here is the content of the docker.service unit:

# systemctl cat docker.service
# /run/systemd/system/docker.service
[Unit]
Requires=torcx.target
After=torcx.target
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=containerd.service docker.socket network-online.target
Wants=network-online.target
Requires=containerd.service docker.socket

[Service]
EnvironmentFile=/run/metadata/torcx
Environment=TORCX_IMAGEDIR=/docker
Type=notify
EnvironmentFile=-/run/flannel/flannel_docker_opts.env
Environment=DOCKER_SELINUX=--selinux-enabled=true

# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/env PATH=${TORCX_BINDIR}:${PATH} ${TORCX_BINDIR}/dockerd --host=fd:// --containerd=/var/run/docker/libcontainerd/docker-containerd.sock $DOCKER_SELINUX $DOCKER_OPTS $DOCKER_CGROUPS $DOCKER_OPT_>
ExecReload=/bin/kill -s HUP $MAINPID
LimitNOFILE=1048576
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process
# restart the docker process if it exits prematurely
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/docker.service.d/10-machine.conf
[Service]
Environment=TMPDIR=/var/tmp
ExecStart=
ExecStart=/usr/lib/coreos/dockerd  --host=unix:///var/run/docker.sock --host=tcp://0.0.0.0:2376 --tlsverify --tlscacert /etc/docker/ca.pem --tlscert /etc/docker/server.pem --tlskey /etc/docker/server-key.pem --la>
Environment=

Why does docker.service run two services: dockerd and containerd?

@bersace
Author

bersace commented Oct 7, 2020

Here is the tree of processes in docker.service:

* docker.service - Docker Application Container Engine
     Loaded: loaded (/run/systemd/system/docker.service; enabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/docker.service.d
             `-10-machine.conf
     Active: active (running) since Tue 2020-10-06 17:40:21 UTC; 13h ago
       Docs: http://docs.docker.com
   Main PID: 682 (dockerd)
      Tasks: 417
     Memory: 276.5M
     CGroup: /system.slice/docker.service
             |- 682 /run/torcx/bin/dockerd --host=unix:///var/run/docker.sock --host=tcp://0.0.0.0:2376 --tlsverify --tlscacert /etc/docker/ca.pem --tlscert /etc/docker/server.pem --tlskey /etc/docker/server-key.>
             |- 738 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
...
             |-3327 containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/73b4f2747a169d9c5f5c122f0e93997858711f4989be413e5fb78218397198f8 -address /var/ru>
             |-3362 /run/torcx/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 443 -container-ip 172.18.0.5 -container-port 443
             |-5687 containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/b020549628ed66f4bd882a913e4230ace964be0ab1bad529a054e92a347ddc64 -address /var/ru>
...
             `-5704 containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/6fcba295ec14069b8c2c495a2d383a78b81c690845cc43ac42338574708027e2 -address /var/ru>

@pothos
Member

pothos commented Jan 28, 2022

Action to be done here: We are still using Type=simple and should migrate to Type=notify in https://github.com/flatcar-linux/coreos-overlay/blob/main/app-emulation/containerd/files/containerd.service#L6
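
As a rough way to check whether a given unit file has been migrated, the following Python sketch reads the text of a unit and reports the Type= declared in its [Service] section. The function name service_type is hypothetical, and the parser is deliberately simplified: it assumes the unit sets ExecStart= and no BusName= (so an absent Type= means simple), and it ignores drop-ins and line continuations.

```python
def service_type(unit_text: str) -> str:
    """Return the Type= value from the [Service] section of a unit file.

    Simplified, hypothetical helper: ignores drop-ins and line continuations.
    """
    section = None
    # Assumption: ExecStart= is set and BusName= is not, in which case
    # systemd defaults to Type=simple when Type= is absent.
    value = "simple"
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
        elif section == "Service" and "=" in line:
            key, _, val = line.partition("=")
            if key.strip() == "Type":
                value = val.strip()  # a later assignment overrides an earlier one
    return value
```

For a unit whose [Service] section contains Type=notify this returns "notify"; for one without any Type= line it returns "simple". On a running system, systemctl show -p Type containerd.service reports the effective value directly.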

@pothos added the good first issue and kind/bug labels on Jan 28, 2022
@krishjainx

Action to be done here: We are still using Type=simple and should migrate to Type=notify in https://github.com/flatcar-linux/coreos-overlay/blob/main/app-emulation/containerd/files/containerd.service#L6

Done in flatcar/scripts#866, @pothos
