
On server reboot, container exits with code 128, won't retry #293

Open
1 of 3 tasks
Enderer opened this issue May 1, 2018 · 29 comments
Comments

@Enderer

Enderer commented May 1, 2018

  • This is a bug report
  • This is a feature request
  • I searched existing issues before opening this one

Actual behavior

After rebooting the server, the container does not start back up. The container tries to start but exits with code 128. This looks like it's due to the network volume not being available at startup; it takes a few seconds before the volume is ready. The message "no such device" appears in the error log. Manually starting the container works because the network volume is then available.

The container is set to restart=always but Docker does not attempt to restart the container. RestartCount is 0.

Here is the docker command:

docker run -d \
--name=plex \
--net=host \
--restart=always \
-v /home/user/plex/config:/config \
-v /home/user/plex/transcode:/transcode \
-v /mnt/tanagra/public:/tanagra/public \
linuxserver/plex

Here is the error message from docker inspect:

        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 128,
            "Error": "OCI runtime create failed: container_linux.go:348: starting container process caused \"process_linux.go:402: container init caused \\\"rootfs_linux.go:58: mounting \\\\\\\"/mnt/tanagra/public\\\\\\\" to rootfs \\\\\\\"/var/lib/docker/overlay2/6a990b540b574977de4d0b6197b3b033e4ab6890813eb592058d005db70337be/merged\\\\\\\" at \\\\\\\"/var/lib/docker/overlay2/6a990b540b574977de4d0b6197b3b033e4ab6890813eb592058d005db70337be/merged/tanagra/public\\\\\\\" caused \\\\\\\"no such device\\\\\\\"\\\"\": unknown",

Output of docker version:

Client:                                    
 Version:      18.03.1-ce                  
 API version:  1.37                        
 Go version:   go1.9.5                     
 Git commit:   9ee9f40                     
 Built:        Thu Apr 26 07:17:20 2018    
 OS/Arch:      linux/amd64                 
 Experimental: false                       
 Orchestrator: swarm                       
                                           
Server:                                    
 Engine:                                   
  Version:      18.03.1-ce                 
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5                    
  Git commit:   9ee9f40                    
  Built:        Thu Apr 26 07:15:30 2018   
  OS/Arch:      linux/amd64                
  Experimental: false                      

Output of docker info:

Containers: 1
 Running: 0
 Paused: 0
 Stopped: 1
Images: 10
Server Version: 18.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.0-6-amd64
Operating System: Debian GNU/Linux 9 (stretch)
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 15.54GiB
Name: risa
ID: LFCE:TKPE:JDFJ:MZ4E:JDRJ:4HCN:BO2D:SBBT:2HGF:KCDW:OROP:RZWZ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: enderer
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
@andymadge

andymadge commented Jun 14, 2018

I'm seeing the same issue, also with a Plex container, and I'm also bind-mounting a network share.

There are a few differences in my situation - I'm using the official Plex docker image, I'm using macvlan network, and I'm running it with docker-compose.

I'm seeing exactly the same symptoms though.

There are no application logs at all inside the container and no entries in the container logs either (docker-compose logs)

The container starts normally if I do docker-compose up.
The container also starts normally if I restart the docker daemon.
The issue only occurs at boot.

If I remove the bind-mounted network share, the container starts normally at boot, so it seems that the issue is that the container tries to start before the network share has been mounted.

Therefore I'm not sure whether this constitutes a Docker bug to be honest.

Excerpt from my docker-compose.yml

version: '3.1'
services:
  plex:
    image: plexinc/pms-docker:plexpass
    restart: unless-stopped
    networks:
      physical:
        ipv4_address: 192.168.20.208
    hostname: pms-docker
    volumes:
      - plex-config:/config
      - plex-temp:/transcode
      - /mnt/qnap2/multimedia:/media
    devices:
      - /dev/dri:/dev/dri

networks:
  physical:
    external: true

volumes:
  plex-config:
  plex-temp:

$ docker-compose ps
Name   Command    State     Ports
---------------------------------
plex   /init     Exit 128    

Error from docker inspect is the same as above:

        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 128,
            "Error": "OCI runtime create failed: container_linux.go:348: starting container process caused \"process_linux.go:402: container init caused \\\"rootfs_linux.go:58: mounting \\\\\\\"/mnt/qnap2/multimedia\\\\\\\" to rootfs \\\\\\\"/var/lib/docker/overlay2/2f7c5ceb2dd5ddb0788aa9272b600edef6a4a0edbf154f8963b7075552e7bd16/merged\\\\\\\" at \\\\\\\"/var/lib/docker/overlay2/2f7c5ceb2dd5ddb0788aa9272b600edef6a4a0edbf154f8963b7075552e7bd16/merged/mnt/qnap2/multimedia\\\\\\\" caused \\\\\\\"no such device\\\\\\\"\\\"\": unknown",
            "StartedAt": "2018-06-14T15:43:24.199564037Z",
            "FinishedAt": "2018-06-14T15:49:17.387003284Z",
            "Health": {
                "Status": "unhealthy",
                "FailingStreak": 0
            }
        }

I'm on a slightly later docker version and I'm on Ubuntu 18.04 LTS

$ docker version
Client:
 Version:      18.05.0-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   f150324
 Built:        Wed May  9 22:16:13 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.05.0-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   f150324
  Built:        Wed May  9 22:14:23 2018
  OS/Arch:      linux/amd64
  Experimental: false

@andymadge

andymadge commented Jun 14, 2018

This issue can be reproduced with this basic container which bind-mounts a network share:

docker container run -d \
--restart=always \
--name testmount \
-v /mnt/qnap2/multimedia:/media \
busybox ping 8.8.8.8

It gives the same behaviour and same error after reboot.

$ docker inspect testmount -f '{{ .State.Error }}'
OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"/mnt/qnap2/multimedia\\\" to rootfs \\\"/var/lib/docker/overlay2/4be5925b0d17e6c9c03ddf70ad7108ca184f3f7456599cdb6cfa08713a2af0f2/merged\\\" at \\\"/var/lib/docker/overlay2/4be5925b0d17e6c9c03ddf70ad7108ca184f3f7456599cdb6cfa08713a2af0f2/merged/media\\\" caused \\\"no such device\\\"\"": unknown

If I remove the network bind-mount, then it works and starts correctly after reboot:

docker container run -d \
--restart=always \
--name testnomount \
-v /tmp:/media \
busybox ping 8.8.8.8

Therefore the issue is simply that Docker is attempting to start the container before the mount has completed.

I don't think this can be considered a Docker bug - how is Docker daemon supposed to know to wait for the network mount?

I suspect the fix on a case by case basis is to add an After= rule to the systemd docker.service file.

@andymadge

andymadge commented Jun 14, 2018

Fix for this is to add x-systemd.after=docker.service to the fstab entry. This tells systemd that docker.service shouldn't be started until after the mount has been done.

If the mount fails, the docker server will start as normal.

~~Just for info, my full working entry from `/etc/fstab` is:~~

//qnap2/multimedia /mnt/qnap2/multimedia cifs uid=andym,x-systemd.automount,x-systemd.after=docker.service,credentials=/home/username/.smbcredentials,iocharset=utf8 0 0

I spoke too soon. The above does allow the container to start, but the share isn't actually mounted. The above should not be used.

A working fix is to modify the docker /lib/systemd/system/docker.service file. Add RequiresMountsFor=/mnt/qnap2/multimedia to the [Unit] section.

See https://www.freedesktop.org/software/systemd/man/systemd.unit.html#RequiresMountsFor=

This is not ideal since it requires modifying the Docker service each time a container needs a mount, but it does the job.
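Rather than editing the shipped unit file (which package upgrades overwrite), the same directive can live in a drop-in file. A minimal sketch, assuming the mount path from this thread (the file name is hypothetical):

```ini
# /etc/systemd/system/docker.service.d/wait-for-mounts.conf
[Unit]
RequiresMountsFor=/mnt/qnap2/multimedia
```

Run `systemctl daemon-reload` afterwards for the drop-in to take effect.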

@andymadge

It seems this is actually a recurrence of a previous issue moby/moby#17485

Repro steps are nearly identical, apart from different mount type.

@irsl

irsl commented Aug 14, 2018

I encounter the same issue. Even though the restart policy of my containers is set to unless-stopped, they don't come up if one of the prerequisite mount points is not available at the time Docker attempts to start them. The retry logic (which otherwise works fine) is not executed. The status is:

            "ExitCode": 255,
            "Error": "OCI runtime create failed: container_linux.go:348: starting container process caused \"process_linux.go:402: container init caused \\\"rootfs_linux.go:58: mounting \\\\\\\"...\\\\\\\" to rootfs \\\\\\\"/var/lib/docker/overlay2/.../merged\\\\\\\" at \\\\\\\"...\\\\\\\" caused \\\\\\\"stat ...: no such file or directory\\\\\\\"\\\"\": unknown",

@simonk83

Yep, struggling with this as well at the moment. The NFS mount is not set up before Docker starts, so the container doesn't work as expected.

@xardalph

Hello,
same here, but I only use Docker volumes on the same server with docker-compose; I need to restart every project each time.

@rishiloyola

Why is Docker not trying to restart this container?

@vishalmalli

Same issue here. If the CIFS share is not mounted, container exits and does not attempt to restart. Container will start fine when started manually once the network share is available.

@iroes

iroes commented Apr 22, 2019

Something similar happens in my case. I've got an encrypted folder in Synology with automount enabled. Since it's not mounted yet when the Docker service starts, the container doesn't start until I manually run docker-compose up or use the Synology UI. It doesn't retry even with restart: always set.

Result from docker-compose ps:

  Name                Command                State     Ports 
------------------------------------------------------------
test_bck   /entry.sh supervisord -n - ...   Exit 128         

This is really annoying, since I only use my Synology NAS several hours a day... and I need to start some docker services automatically.

@tmtron

tmtron commented Jun 16, 2019

I see the same issue. My Docker paths are mapped directly to the filesystem of locally attached SSDs.
And in some cases after reboot the containers show Exit 128 and Docker does not try to restart them, although restart: always is used.

When I check systemctl status docker, I can see that the docker service is running, but reports "id already in use"

Docker version 18.09.1, build 4c52b90
docker-compose version 1.23.2, build 1110ad01

Is there a way to force docker to restart the services in this case?

@alno74d

alno74d commented Nov 25, 2019

How is this not fixed??? This is extremely annoying, isn't it?

@alno74d

alno74d commented Nov 25, 2019

I gathered extra information for my case:
My docker-compose file:

plex:
    image: linuxserver/plex
    container_name: plex
    runtime: nvidia
    environment:
...

The output of docker inspect:

[
    {
        "Id": "c12c4d426f8f36848fbe1e4807a46cbd570be56b2534768cfc75e76e03b0e083",
        "Created": "2019-11-24T19:53:46.006747643Z",
        "Path": "/init",
        "Args": [],
        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 128,
            "Error": "error gathering device information while adding custom device \"/dev/nvidia-modeset\": no such file or directory",
            "StartedAt": "2019-11-25T08:48:31.115776398Z",
            "FinishedAt": "2019-11-25T08:55:31.358738772Z"
        },
...

And my /lib/systemd/system/docker.service :

[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
BindsTo=containerd.service
After=network-online.target firewalld.service containerd.service
Wants=network-online.target
Requires=docker.socket
RequiresMountsFor=/zdata/media /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools /dev/nvidia-modeset

Is there a way to wait for the nvidia driver to be properly loaded other than with "RequiresMountsFor"??
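RequiresMountsFor= only orders against mount units, so it is unlikely to help with device nodes like /dev/nvidia-modeset. One possible workaround (a sketch, not a verified NVIDIA-specific fix; the file name is hypothetical) is a drop-in that delays dockerd until the device node exists:

```ini
# /etc/systemd/system/docker.service.d/wait-nvidia.conf
[Service]
ExecStartPre=/bin/sh -c 'until [ -e /dev/nvidia-modeset ]; do sleep 1; done'
```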

@Exadra37

Exadra37 commented May 6, 2020

The same issue occurred today after running yum update on an AWS server, but unfortunately I had already restarted the container, so I can no longer inspect it for more details.

In my case the container is from the official image for Traefik, with restart set to always, and also some volumes, one of them being /var/run/docker.sock:

version: '2'

services:
  traefik:
    image: traefik:1.7
    restart: always
    ports:
      - 80:80
      - 443:443
    networks:
      - traefik
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /opt/traefik/traefik.toml:/traefik.toml
      - /opt/traefik/acme.json:/acme.json
    container_name: traefik

networks:
  traefik:
    external: true

Can anyone from Docker comment on this issue?

Maybe @andrewhsu, @tiborvass, @thaJeztah or @duglin can help in pointing this issue to anyone that can give a hand here.

@cherouvim

cherouvim commented Nov 4, 2020

I had this exact situation. I start my containers using --restart unless-stopped. At some point I updated/upgraded the server (Ubuntu) and then rebooted it. A couple of hours after the reboot, most containers stopped, with Exited (128).

$ docker container list --all
CONTAINER ID        IMAGE                                    COMMAND                  CREATED             STATUS                     PORTS                                      NAMES
9f843f571a17        jrcs/letsencrypt-nginx-proxy-companion   "/bin/bash /app/entr…"   5 months ago        Up 4 hours                                                            letsencrypt
2e2daceaa70b        proxy                                    "/app/docker-entrypo…"   5 months ago        Up 4 hours                 0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp   proxy
5882d5240bbe        foo3                                     "nginx -g 'daemon of…"   12 months ago       Exited (128) 4 hours ago   80/tcp                                     foo5
ace272f67536        foo3                                     "nginx -g 'daemon of…"   12 months ago       Exited (128) 4 hours ago   80/tcp                                     foo4
f89af68a44d6        foo3                                     "nginx -g 'daemon of…"   12 months ago       Exited (128) 4 hours ago   80/tcp                                     foo3
42be6050e8f2        foo2                                     "nginx -g 'daemon of…"   12 months ago       Exited (128) 4 hours ago   80/tcp                                     foo2
5043b220370f        foo1                                     "nginx -g 'daemon of…"   12 months ago       Exited (128) 4 hours ago   80/tcp                                     foo1

After another reboot everything was fixed. Any ideas on why this happened, or where I should look to debug the situation?

@thaJeztah
Member

From a quick glance at the errors mentioned, it looks like all cases are trying to bind-mount an extra disk that is not yet available at the moment Docker starts, as commented above as well #293 (comment)

runtime create failed: container_linux.go:348:
starting container process caused process_linux.go:402:
container init caused "rootfs_linux.go:58:
mounting "/mnt/tanagra/public"
to rootfs "/var/lib/docker/overlay2/6a990b540b574977de4d0b6197b3b033e4ab6890813eb592058d005db70337be/merged"
at "/var/lib/docker/overlay2/6a990b540b574977de4d0b6197b3b033e4ab6890813eb592058d005db70337be/merged/tanagra/public"
caused "no such device" "": unknown"

OCI runtime create failed: container_linux.go:348:
starting container process caused process_linux.go:402:
container init caused "rootfs_linux.go:58:
mounting "/mnt/qnap2/multimedia"
to rootfs "/var/lib/docker/overlay2/2f7c5ceb2dd5ddb0788aa9272b600edef6a4a0edbf154f8963b7075552e7bd16/merged"
at "/var/lib/docker/overlay2/2f7c5ceb2dd5ddb0788aa9272b600edef6a4a0edbf154f8963b7075552e7bd16/merged/mnt/qnap2/multimedia"
caused "no such device" "": unknown"

I think the reason the daemon might not continue trying is that it requires the container to start successfully "once" before it will start monitoring the container (to handle restarting it once it exits). I seem to recall this was done to prevent situations where (e.g., similar to what's discussed here) a "broken" container configuration causes a DoS of the whole daemon.

Perhaps the best solution is to create a systemd drop-in file to delay starting the docker service until after the required mounts are present, similar to containerd/containerd#3741

This thread on reddit https://www.reddit.com/r/linuxadmin/comments/5z819x/how_to_have_a_systemd_service_wait_for_a_network/ also mentions global.mount and remote-fs.target, which may be relevant for the NFS shares.
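For remote shares specifically, a drop-in ordering Docker after remote-fs.target might look like this (a sketch based on the suggestion above, not a verified fix; the file name is hypothetical):

```ini
# /etc/systemd/system/docker.service.d/wait-remote-fs.conf
[Unit]
After=remote-fs.target
Wants=remote-fs.target
```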

@thaJeztah
Member

Some details in https://www.freedesktop.org/software/systemd/man/systemd.mount.html

@Apollo3zehn

Apollo3zehn commented Nov 6, 2020

My "solution" so far is to create a cron job and let that restart the container until the mounted drive is available:

SHELL=/snap/bin/pwsh

@reboot root <path>/autorestart.ps1

Copy that cron file to /etc/cron.d.

autorestart.ps1 is a PowerShell script, but it can easily be replaced by another kind of script. The content is:

$isRunning = (docker inspect -f '{{.State.Running}}' <mycontainer>) | Out-String

while ($isRunning.TrimEnd() -ne "true")
{
    "Container is not running. Starting container ..."
    docker container start <mycontainer>
    Start-Sleep -Seconds 10
    $isRunning = (docker inspect -f '{{.State.Running}}' <mycontainer>) | Out-String
}

"Done."
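For hosts without PowerShell, a roughly equivalent POSIX shell sketch is below. The two docker commands are passed in as parameters (the container name and docker calls shown in the comments are stand-ins, not from this thread), so the loop itself works independently of Docker:

```shell
#!/bin/sh
# Hypothetical shell equivalent of the PowerShell loop above.
# $1: command that prints "true" once the container is running
#     (e.g. docker inspect -f '{{.State.Running}}' mycontainer)
# $2: command that starts the container
#     (e.g. docker container start mycontainer)
wait_for_container() {
    while [ "$($1)" != "true" ]; do
        echo "Container is not running. Starting container ..."
        $2
        sleep 10
    done
    echo "Done."
}
```

Usage would be `wait_for_container "docker inspect -f '{{.State.Running}}' mycontainer" "docker container start mycontainer"`, run from the same @reboot cron entry.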

@mattdale77

mattdale77 commented Jun 16, 2021

I am experiencing this same issue on Ubuntu 20.04 (and just upgraded to 21, same issue) using systemd.
The shares in question are from VirtualBox. My containers start up fine as they have access to their application configuration on /home, but they cannot access the shares for the data they need to function. The containers actually bind to the directory under the mount point and use up ghost space on the root filesystem (which was very tricky to track down).

I have tried the RequiresMountsFor directive but it does not resolve the issue.

@kkretsch

I had the same trouble with a simple docker-compose file for Loki without any remote folders. It seemed to fail just mounting a local file, quoting something about mounting through /proc.

I therefore created my own systemd unit file for Docker, which now seems to work even after I've rebooted:

I changed/added these two lines:

Requires=docker.socket containerd.service local-fs.target
RequiresMountsFor=/proc

Full file for reference is here:

root@logger:/etc/systemd/system# cat docker.service 
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service containerd.service
Wants=network-online.target
Requires=docker.socket containerd.service local-fs.target
RequiresMountsFor=/proc

[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
ExecReload=/bin/kill -s HUP $MAINPID
TimeoutSec=0
RestartSec=2
Restart=always

# Note that StartLimit* options were moved from "Service" to "Unit" in systemd 229.
# Both the old, and new location are accepted by systemd 229 and up, so using the old location
# to make them work for either version of systemd.
StartLimitBurst=3

# Note that StartLimitInterval was renamed to StartLimitIntervalSec in systemd 230.
# Both the old, and new name are accepted by systemd 230 and up, so using the old name to make
# this option work for either version of systemd.
StartLimitInterval=60s

# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity

# Comment TasksMax if your systemd version does not support it.
# Only systemd 226 and above support this option.
TasksMax=infinity

# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes

# kill only the docker process, not all processes in the cgroup
KillMode=process
OOMScoreAdjust=-500

[Install]
WantedBy=multi-user.target

@GeorgeMacFly

systemctl status docker

Did you resolve this issue? Please comment.

@mattdale77

I don't remember the details as I no longer use VirtualBox, but I solved this by changing the systemd priorities. I think I held Docker back until the auto-mount was complete, or I put a sleep in a startup script. I'm sorry I can't remember the details, but the solution lies in systemd.

@Majestic7979

I have this issue on a local bind mount, not a network share, so it's definitely not just that situation. Only one container does this. I'm not sure why. I have restart=always on it, still doesn't retry.

@zapotocnylubos

Experiencing the same problem with linuxserver/tvheadend, just a local bind volume for recordings.
Ubuntu 22.04.3 LTS

@tmeuze

tmeuze commented Jan 31, 2024

Having the same issue on Debian 12 and Vaultwarden - local binds only. Unfortunately, the fix suggested by @kkretsch did not work.

Oddly, I have both Vaultwarden and vaultwarden-backup in the same compose file, binding the same local directory (vaultwarden-backup has two additional unrelated binds), yet only Vaultwarden 128s every reboot; the other container starts up just fine.

On a separate host (Debian 11), I'm having the same issue with Traefik (sporadically, by contrast). In this case as well, multiple additional containers share a common local bind. However, testing without multiple containers binding a common directory yields inconsistent results for me.

@dodancs

dodancs commented Feb 1, 2024

Ubuntu 22.04, Docker 25.0.0, build e758fe5, this is still an issue. For me it happens with any container, that has restart=always.

@vimoxshah

vimoxshah commented Mar 12, 2024

I have the same issue with Ubuntu 22.04.1 Docker Version 24.0.5. Any solution?

@NAM1025

NAM1025 commented Mar 16, 2024

Just going to throw my "I have the same issue" out there. This is incredibly frustrating...

I've also tried mounting the drive via /etc/fstab, but if a docker container references it, even with RequiresMountsFor=/some/path in the systemd config, it causes the drive mount to fail. I've confirmed this by removing the container and rebooting, and the drive will mount fine, but restarting the container and rebooting, it fails to mount again. I'm at a complete loss....

The only work-around I have found is to delay docker from starting.

sudo systemctl edit docker.service

Add

### Editing /etc/systemd/system/docker.service.d/override.conf
### Anything between here and the comment below will become the new contents of the file

[Service]
ExecStartPre=/bin/sleep 30

....

This isn't a foolproof fix, though; there is definitely still a chance things will fail to load properly.

@FaySmash

FaySmash commented Jun 1, 2024

That's how I solved it: https://gitlab.com/-/snippets/3715249
