
Error "Network sandbox join failed: could not get network sandbox (oper true): failed get network namespace "": no such file or directory" #25215

Closed
sebi-hgdata opened this issue Jul 29, 2016 · 48 comments
Labels: area/networking, kind/bug, priority/P1, version/1.12

Comments


sebi-hgdata commented Jul 29, 2016

Output of docker version:

$ sudo docker version
Client:
 Version:      1.12.0
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   8eab29e
 Built:        Thu Jul 28 22:00:36 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.0
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   8eab29e
 Built:        Thu Jul 28 22:00:36 2016
 OS/Arch:      linux/amd64

Output of docker info:

$ sudo docker info
Containers: 6
 Running: 0
 Paused: 0
 Stopped: 6
Images: 4
Server Version: 1.12.0
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: overlay null bridge host
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor
Kernel Version: 4.3.0-040300-generic
Operating System: Ubuntu 14.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 14.94 GiB
Name: ip-10-3-0-92
ID: NQRY:TQDU:MZ7P:242T:S24G:6PNJ:I3HH:OTVY:IAHK:O5GY:2OVY:P7KP
Docker Root Dir: /home/ubuntu/hgdata/deployments/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 19
 Goroutines: 51
 System Time: 2016-07-29T09:49:26.373617108Z
 EventsListeners: 1
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Cluster Store: consul://localhost:8500
Cluster Advertise: 10.3.0.92:2375
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):

8-machine Consul cluster in AWS running Docker 1.11.2.

Steps to reproduce the issue:

  1. All machines have running containers with restart policies set (unless-stopped or always)
  2. Concurrent upgrade to 1.12 on all machines; we use Ansible to run the following command on every host (a minimal sketch follows below): sudo apt-get -o Dpkg::Options::="--force-confold" -o Dpkg::Options::="--force-confdef" install --reinstall -y docker-engine
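
A minimal sketch of how such a concurrent upgrade can be driven as an Ansible ad-hoc command (the inventory file and the group name docker_hosts are illustrative, not taken from the original setup):

$ ansible docker_hosts -i hosts.ini --become --forks 8 -m shell \
    -a 'apt-get -o Dpkg::Options::="--force-confold" -o Dpkg::Options::="--force-confdef" install --reinstall -y docker-engine'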

Describe the results you received:

Got the following error for all containers on a single host: network sandbox join failed: could not get network sandbox (oper true): failed get network namespace \"\": no such file or directory

Describe the results you expected:

No errors during container restarts

Additional information you deem important (e.g. issue happens only occasionally):

It happened only on a single node.

$ sudo docker ps -a
CONTAINER ID        IMAGE                              COMMAND                  CREATED             STATUS                           PORTS               NAMES
220e9b95844c        hgdata1/modsecurity:65f576adca8f   "./trap.sh"              20 hours ago        Exited (0) 20 hours ago                              modsecurity
65e7a632bbd8        hgdata1/haproxy:44c12be8862e       "/docker-entrypoint.s"   44 hours ago        Exited (128) About an hour ago                       haproxy-ops-o1-b
c98ab340c9e0        hgdata1/api:44c12be8862e           "./trap.sh"              44 hours ago        Exited (128) About an hour ago                       api-ops-o1-b-blue
02c3138ebbbd        hgdata1/api:44c12be8862e           "./trap.sh"              45 hours ago        Exited (0) 44 hours ago                              api-ops-o1-b-green_previous
d962e63be086        hgdata1/api:44c12be8862e           "./trap.sh"              45 hours ago        Exited (0) 45 hours ago                              api-ops-o1-b-blue_previous
65ff1b44494a        hgdata1/httpd:44c12be8862e         "./trap.sh"              45 hours ago        Exited (128) About an hour ago                       httpd-b

$ sudo docker inspect haproxy-ops-o1-b
[
    {
        "Id": "65e7a632bbd8813ad9b18b18dfd75579e1afb57809e13a7fcb3fb4804bfcfd6f",
        "Created": "2016-07-27T13:08:51.696950169Z",
        "Path": "/docker-entrypoint.sh",
        "Args": [
            "haproxy",
            "-f",
            "/usr/local/etc/haproxy/haproxy_global.cfg",
            "-f",
            "/usr/local/etc/haproxy/api.cfg",
            "-f",
            "/usr/local/etc/haproxy/ldap.cfg",
            "-f",
            "/usr/local/etc/haproxy/ui.cfg",
            "-f",
            "/usr/local/etc/haproxy/db.cfg"
        ],
        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 128,
            "Error": "network sandbox join failed: could not get network sandbox (oper true): failed get network namespace \"\": no such file or directory",
            "StartedAt": "2016-07-29T08:44:28.800313984Z",
            "FinishedAt": "2016-07-29T08:57:28.24702013Z"
        },
        "Image": "sha256:8dfa093839496da3025f1ab0e4492f0cb43823a79879e1a4f30075a1449775d9",
        "ResolvConfPath": "/home/ubuntu/hgdata/deployments/docker/containers/65e7a632bbd8813ad9b18b18dfd75579e1afb57809e13a7fcb3fb4804bfcfd6f/resolv.conf",
        "HostnamePath": "/home/ubuntu/hgdata/deployments/docker/containers/65e7a632bbd8813ad9b18b18dfd75579e1afb57809e13a7fcb3fb4804bfcfd6f/hostname",
        "HostsPath": "/home/ubuntu/hgdata/deployments/docker/containers/65e7a632bbd8813ad9b18b18dfd75579e1afb57809e13a7fcb3fb4804bfcfd6f/hosts",
        "LogPath": "/home/ubuntu/hgdata/deployments/docker/containers/65e7a632bbd8813ad9b18b18dfd75579e1afb57809e13a7fcb3fb4804bfcfd6f/65e7a632bbd8813ad9b18b18dfd75579e1afb57809e13a7fcb3fb4804bfcfd6f-json.log",
        "Name": "/haproxy-ops-o1-b",
        "RestartCount": 0,
        "Driver": "overlay",
        "MountLabel": "",
        "ProcessLabel": "",
        "AppArmorProfile": "",
        "ExecIDs": null,
        "HostConfig": {
            "Binds": [
                "/home/ubuntu/hgdata/deployments/ops/o1/haproxy:/usr/local/etc/haproxy/"
            ],
            "ContainerIDFile": "",
            "LogConfig": {
                "Type": "json-file",
                "Config": {}
            },
            "NetworkMode": "backbone2",
            "PortBindings": {},
            "RestartPolicy": {
                "Name": "always",
                "MaximumRetryCount": 0
            },
            "AutoRemove": false,
            "VolumeDriver": "",
            "VolumesFrom": null,
            "CapAdd": null,
            "CapDrop": null,
            "Dns": [],
            "DnsOptions": [],
            "DnsSearch": [],
            "ExtraHosts": null,
            "GroupAdd": null,
            "IpcMode": "",
            "Cgroup": "",
            "Links": null,
            "OomScoreAdj": 0,
            "PidMode": "",
            "Privileged": false,
            "PublishAllPorts": false,
            "ReadonlyRootfs": false,
            "SecurityOpt": null,
            "UTSMode": "",
            "UsernsMode": "",
            "ShmSize": 67108864,
            "Runtime": "runc",
            "ConsoleSize": [
                0,
                0
            ],
            "Isolation": "",
            "CpuShares": 0,
            "Memory": 0,
            "CgroupParent": "",
            "BlkioWeight": 0,
            "BlkioWeightDevice": null,
            "BlkioDeviceReadBps": null,
            "BlkioDeviceWriteBps": null,
            "BlkioDeviceReadIOps": null,
            "BlkioDeviceWriteIOps": null,
            "CpuPeriod": 0,
            "CpuQuota": 0,
            "CpusetCpus": "",
            "CpusetMems": "",
            "Devices": [],
            "DiskQuota": 0,
            "KernelMemory": 0,
            "MemoryReservation": 0,
            "MemorySwap": 0,
            "MemorySwappiness": -1,
            "OomKillDisable": false,
            "PidsLimit": 0,
            "Ulimits": null,
            "CpuCount": 0,
            "CpuPercent": 0,
            "IOMaximumIOps": 0,
            "IOMaximumBandwidth": 0
        },
        "GraphDriver": {
            "Name": "overlay",
            "Data": {
                "LowerDir": "/home/ubuntu/hgdata/deployments/docker/overlay/40a16bae1f1360008ecb289cc9d9994d6c101fdb3abf74b3ec49e4d874cd98c0/root",
                "MergedDir": "/home/ubuntu/hgdata/deployments/docker/overlay/4e29d7ecebfd17faf2f6a7d8a830220fc2ed58acd03ba50f8688ae06031d5352/merged",
                "UpperDir": "/home/ubuntu/hgdata/deployments/docker/overlay/4e29d7ecebfd17faf2f6a7d8a830220fc2ed58acd03ba50f8688ae06031d5352/upper",
                "WorkDir": "/home/ubuntu/hgdata/deployments/docker/overlay/4e29d7ecebfd17faf2f6a7d8a830220fc2ed58acd03ba50f8688ae06031d5352/work"
            }
        },
        "Mounts": [
            {
                "Source": "/home/ubuntu/hgdata/deployments/ops/o1/haproxy",
                "Destination": "/usr/local/etc/haproxy",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            }
        ],
        "Config": {
            "Hostname": "65e7a632bbd8",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": true,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "HAPROXY_MAJOR=1.6",
                "HAPROXY_VERSION=1.6.5",
                "HAPROXY_MD5=5290f278c04e682e42ab71fed26fc082"
            ],
            "Cmd": [
                "haproxy",
                "-f",
                "/usr/local/etc/haproxy/haproxy_global.cfg",
                "-f",
                "/usr/local/etc/haproxy/api.cfg",
                "-f",
                "/usr/local/etc/haproxy/ldap.cfg",
                "-f",
                "/usr/local/etc/haproxy/ui.cfg",
                "-f",
                "/usr/local/etc/haproxy/db.cfg"
            ],
            "Image": "hgdata1/haproxy:44c12be8862e",
            "Volumes": null,
            "WorkingDir": "",
            "Entrypoint": [
                "/docker-entrypoint.sh"
            ],
            "OnBuild": null,
            "Labels": {
                "counter": "b",
                "service": "haproxy-ops-o1"
            }
        },
        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "bc9b69b5db3bd8f4e11c3c64c781ae108e5a55e84c68d3309422c4dc4bbbb34e",
            "HairpinMode": false,
            "LinkLocalIPv6Address": "",
            "LinkLocalIPv6PrefixLen": 0,
            "Ports": null,
            "SandboxKey": "/var/run/docker/netns/bc9b69b5db3b",
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "EndpointID": "",
            "Gateway": "",
            "GlobalIPv6Address": "",
            "GlobalIPv6PrefixLen": 0,
            "IPAddress": "",
            "IPPrefixLen": 0,
            "IPv6Gateway": "",
            "MacAddress": "",
            "Networks": {
                "backbone2": {
                    "IPAMConfig": null,
                    "Links": null,
                    "Aliases": [
                        "haproxy-ops-o1",
                        "65e7a632bbd8"
                    ],
                    "NetworkID": "61c181cca3cf90c428b3360c503398c587395bf16ec3f314ecb734240250f203",
                    "EndpointID": "",
                    "Gateway": "",
                    "IPAddress": "",
                    "IPPrefixLen": 0,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": ""
                }
            }
        }
    }
]

$ sudo docker start haproxy-ops-o1-b
Error response from daemon: network sandbox join failed: could not get network sandbox (oper true): failed get network namespace "": no such file or directory
Error: failed to start containers: haproxy-ops-o1-b

Daemon log:

time="2016-07-29T08:57:36.410353378Z" level=debug msg="Using default logging driver json-file" 
time="2016-07-29T08:57:36.410425943Z" level=debug msg="Golang's threads limit set to 109980" 
time="2016-07-29T08:57:36.575057676Z" level=debug msg="[graphdriver] trying provided driver \"overlay\"" 
time="2016-07-29T08:57:36.582073752Z" level=debug msg="Using graph driver overlay" 
time="2016-07-29T08:57:36.598917317Z" level=debug msg="Max Concurrent Downloads: 3" 
time="2016-07-29T08:57:36.599150282Z" level=debug msg="Max Concurrent Uploads: 5" 
time="2016-07-29T08:57:36.615731333Z" level=info msg="Graph migration to content-addressability took 0.00 seconds" 
time="2016-07-29T08:57:36.615982284Z" level=debug msg="Initializing discovery service" name=consul uri="localhost:8500" 
time="2016-07-29T08:57:36.616014306Z" level=info msg="Initializing discovery without TLS" 
time="2016-07-29T08:57:36.616397192Z" level=warning msg="Your kernel does not support swap memory limit." 
time="2016-07-29T08:57:36.617982419Z" level=debug msg="Loaded container 02c3138ebbbd586daa043eca74a53029b281734237cf217dc97393268989c245" 
time="2016-07-29T08:57:36.618516963Z" level=debug msg="Loaded container 220e9b95844c132f5f672fc59402e5383d3b1f47535b2864b09b052583d8d23e" 
time="2016-07-29T08:57:36.619125163Z" level=debug msg="Loaded container 65e7a632bbd8813ad9b18b18dfd75579e1afb57809e13a7fcb3fb4804bfcfd6f" 
time="2016-07-29T08:57:36.619813526Z" level=debug msg="Loaded container 65ff1b44494ae51257a684169336db7b0085c51bc74cfb0a6e5fbe2e2474042c" 
time="2016-07-29T08:57:36.620683163Z" level=debug msg="Loaded container c98ab340c9e07355c89f2237f64ab03c9e2faae03720db2f2b3a556152e17034" 
time="2016-07-29T08:57:36.621685691Z" level=debug msg="Loaded container d962e63be0861a68ea7b2c7db109825428eb8fe8b4ccb2151349c8d634b9a83c" 
time="2016-07-29T08:57:36.621983909Z" level=debug msg="Option DefaultDriver: bridge" 
time="2016-07-29T08:57:36.622105987Z" level=debug msg="Option DefaultNetwork: bridge" 
time="2016-07-29T08:57:36.622202965Z" level=debug msg="Option OptionKVProvider: consul" 
time="2016-07-29T08:57:36.622312818Z" level=debug msg="Option OptionKVProviderURL: localhost:8500" 
time="2016-07-29T08:57:36.636492609Z" level=info msg="Firewalld running: false" 
time="2016-07-29T08:57:36.638227012Z" level=debug msg="/sbin/iptables, [--wait --version]" 
time="2016-07-29T08:57:36.643047095Z" level=debug msg="/sbin/iptables, [--wait -t nat -D PREROUTING -m addrtype --dst-type LOCAL -j DOCKER]" 
time="2016-07-29T08:57:36.646917498Z" level=debug msg="/sbin/iptables, [--wait -t nat -D OUTPUT -m addrtype --dst-type LOCAL ! --dst 127.0.0.0/8 -j DOCKER]" 
time="2016-07-29T08:57:36.650951552Z" level=debug msg="/sbin/iptables, [--wait -t nat -D OUTPUT -m addrtype --dst-type LOCAL -j DOCKER]" 
time="2016-07-29T08:57:36.652516386Z" level=debug msg="/sbin/iptables, [--wait -t nat -D PREROUTING]" 
time="2016-07-29T08:57:36.658954104Z" level=debug msg="/sbin/iptables, [--wait -t nat -D OUTPUT]" 
time="2016-07-29T08:57:36.662946484Z" level=debug msg="/sbin/iptables, [--wait -t nat -F DOCKER]" 
time="2016-07-29T08:57:36.666921197Z" level=debug msg="/sbin/iptables, [--wait -t nat -X DOCKER]" 
time="2016-07-29T08:57:36.670918488Z" level=debug msg="/sbin/iptables, [--wait -t filter -F DOCKER]" 
time="2016-07-29T08:57:36.674921400Z" level=debug msg="/sbin/iptables, [--wait -t filter -X DOCKER]" 
time="2016-07-29T08:57:36.678938066Z" level=debug msg="/sbin/iptables, [--wait -t filter -F DOCKER-ISOLATION]" 
time="2016-07-29T08:57:36.682921996Z" level=debug msg="/sbin/iptables, [--wait -t filter -X DOCKER-ISOLATION]" 
time="2016-07-29T08:57:36.685393359Z" level=debug msg="/sbin/iptables, [--wait -t nat -n -L DOCKER]" 
time="2016-07-29T08:57:36.690918247Z" level=debug msg="/sbin/iptables, [--wait -t nat -N DOCKER]" 
time="2016-07-29T08:57:36.694912483Z" level=debug msg="/sbin/iptables, [--wait -t filter -n -L DOCKER]" 
time="2016-07-29T08:57:36.698923382Z" level=debug msg="/sbin/iptables, [--wait -t filter -n -L DOCKER-ISOLATION]" 
time="2016-07-29T08:57:36.702952501Z" level=debug msg="/sbin/iptables, [--wait -t filter -C DOCKER-ISOLATION -j RETURN]" 
time="2016-07-29T08:57:36.710969379Z" level=debug msg="/sbin/iptables, [--wait -I DOCKER-ISOLATION -j RETURN]" 
time="2016-07-29T08:57:36.725502370Z" level=debug msg="/sbin/iptables, [--wait -t nat -C POSTROUTING -s 172.18.0.0/16 ! -o docker_gwbridge -j MASQUERADE]" 
time="2016-07-29T08:57:36.728255150Z" level=debug msg="/sbin/iptables, [--wait -t nat -C POSTROUTING -m addrtype --src-type LOCAL -o docker_gwbridge -j MASQUERADE]" 
time="2016-07-29T08:57:36.734972259Z" level=debug msg="/sbin/iptables, [--wait -D FORWARD -i docker_gwbridge -o docker_gwbridge -j ACCEPT]" 
time="2016-07-29T08:57:36.738958189Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -i docker_gwbridge -o docker_gwbridge -j DROP]" 
time="2016-07-29T08:57:36.742937228Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -i docker_gwbridge ! -o docker_gwbridge -j ACCEPT]" 
time="2016-07-29T08:57:36.746938064Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -o docker_gwbridge -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT]" 
time="2016-07-29T08:57:36.750964920Z" level=debug msg="/sbin/iptables, [--wait -t nat -C PREROUTING -m addrtype --dst-type LOCAL -j DOCKER]" 
time="2016-07-29T08:57:36.752778672Z" level=debug msg="/sbin/iptables, [--wait -t nat -A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER]" 
time="2016-07-29T08:57:36.754479724Z" level=debug msg="/sbin/iptables, [--wait -t nat -C OUTPUT -m addrtype --dst-type LOCAL -j DOCKER]" 
time="2016-07-29T08:57:36.756029444Z" level=debug msg="/sbin/iptables, [--wait -t nat -A OUTPUT -m addrtype --dst-type LOCAL -j DOCKER]" 
time="2016-07-29T08:57:36.757627864Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -o docker_gwbridge -j DOCKER]" 
time="2016-07-29T08:57:36.759032660Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -o docker_gwbridge -j DOCKER]" 
time="2016-07-29T08:57:36.760451130Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -j DOCKER-ISOLATION]" 
time="2016-07-29T08:57:36.761837268Z" level=debug msg="/sbin/iptables, [--wait -D FORWARD -j DOCKER-ISOLATION]" 
time="2016-07-29T08:57:36.763308962Z" level=debug msg="/sbin/iptables, [--wait -I FORWARD -j DOCKER-ISOLATION]" 
time="2016-07-29T08:57:36.764978658Z" level=debug msg="Network (3430034) restored" 
time="2016-07-29T08:57:36.765392252Z" level=debug msg="/sbin/iptables, [--wait -t nat -C POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE]" 
time="2016-07-29T08:57:36.766939063Z" level=debug msg="/sbin/iptables, [--wait -t nat -C POSTROUTING -m addrtype --src-type LOCAL -o docker0 -j MASQUERADE]" 
time="2016-07-29T08:57:36.768556631Z" level=debug msg="/sbin/iptables, [--wait -D FORWARD -i docker0 -o docker0 -j DROP]" 
time="2016-07-29T08:57:36.770024198Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -i docker0 -o docker0 -j ACCEPT]" 
time="2016-07-29T08:57:36.771496595Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -i docker0 ! -o docker0 -j ACCEPT]" 
time="2016-07-29T08:57:36.772825742Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT]" 
time="2016-07-29T08:57:36.774341646Z" level=debug msg="/sbin/iptables, [--wait -t nat -C PREROUTING -m addrtype --dst-type LOCAL -j DOCKER]" 
time="2016-07-29T08:57:36.775939945Z" level=debug msg="/sbin/iptables, [--wait -t nat -C PREROUTING -m addrtype --dst-type LOCAL -j DOCKER]" 
time="2016-07-29T08:57:36.777579301Z" level=debug msg="/sbin/iptables, [--wait -t nat -C OUTPUT -m addrtype --dst-type LOCAL -j DOCKER]" 
time="2016-07-29T08:57:36.779122874Z" level=debug msg="/sbin/iptables, [--wait -t nat -C OUTPUT -m addrtype --dst-type LOCAL -j DOCKER]" 
time="2016-07-29T08:57:36.780659511Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -o docker0 -j DOCKER]" 
time="2016-07-29T08:57:36.782060396Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -o docker0 -j DOCKER]" 
time="2016-07-29T08:57:36.783532237Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -j DOCKER-ISOLATION]" 
time="2016-07-29T08:57:36.784960011Z" level=debug msg="/sbin/iptables, [--wait -D FORWARD -j DOCKER-ISOLATION]" 
time="2016-07-29T08:57:36.786639322Z" level=debug msg="/sbin/iptables, [--wait -I FORWARD -j DOCKER-ISOLATION]" 
time="2016-07-29T08:57:36.788174851Z" level=debug msg="/sbin/iptables, [--wait -t filter -C DOCKER-ISOLATION -i docker0 -o docker_gwbridge -j DROP]" 
time="2016-07-29T08:57:36.789858465Z" level=debug msg="/sbin/iptables, [--wait -I DOCKER-ISOLATION -i docker0 -o docker_gwbridge -j DROP]" 
time="2016-07-29T08:57:36.791309541Z" level=debug msg="/sbin/iptables, [--wait -t filter -C DOCKER-ISOLATION -i docker_gwbridge -o docker0 -j DROP]" 
time="2016-07-29T08:57:36.792668345Z" level=debug msg="/sbin/iptables, [--wait -I DOCKER-ISOLATION -i docker_gwbridge -o docker0 -j DROP]" 
time="2016-07-29T08:57:36.794173671Z" level=debug msg="Network (a67d42e) restored" 
time="2016-07-29T08:57:36.910315146Z" level=debug msg="Watch triggered with 8 nodes" discovery=consul 
time="2016-07-29T08:57:36.911244077Z" level=info msg="2016/07/29 08:57:36 [INFO] serf: EventMemberJoin: ip-10-3-0-92 10.3.0.92\n" 
time="2016-07-29T08:57:36.912673342Z" level=debug msg="Allocating IPv4 pools for network docker_gwbridge (3430034048bbfaf02b5033d74e97ef652df97997b7b2c16f1394461b2cd9c150)" 
time="2016-07-29T08:57:36.912742600Z" level=debug msg="RequestPool(LocalDefault, 172.18.0.0/16, , map[], false)" 
time="2016-07-29T08:57:36.912817756Z" level=debug msg="RequestAddress(LocalDefault/172.18.0.0/16, 172.18.0.1, map[RequestAddressType:com.docker.network.gateway])" 
time="2016-07-29T08:57:36.913046459Z" level=debug msg="2016/07/29 08:57:36 [DEBUG] memberlist: Failed to join 10.3.0.127: dial tcp 10.3.0.127:7946: getsockopt: connection refused\n" 
time="2016-07-29T08:57:36.913103437Z" level=error msg="joining serf neighbor 10.3.0.127 failed: Failed to join the cluster at neigh IP 10.3.0.127: 1 error(s) occurred:\n\n* Failed to join 10.3.0.127: dial tcp 10.3.0.127:7946: getsockopt: connection refused" 
time="2016-07-29T08:57:36.913561284Z" level=debug msg="2016/07/29 08:57:36 [DEBUG] memberlist: Failed to join 10.3.0.217: dial tcp 10.3.0.217:7946: getsockopt: connection refused\n" 
time="2016-07-29T08:57:36.913644304Z" level=error msg="joining serf neighbor 10.3.0.217 failed: Failed to join the cluster at neigh IP 10.3.0.217: 1 error(s) occurred:\n\n* Failed to join 10.3.0.217: dial tcp 10.3.0.217:7946: getsockopt: connection refused" 
time="2016-07-29T08:57:36.914133128Z" level=debug msg="2016/07/29 08:57:36 [DEBUG] memberlist: Failed to join 10.3.0.226: dial tcp 10.3.0.226:7946: getsockopt: connection refused\n" 
time="2016-07-29T08:57:36.914200121Z" level=error msg="joining serf neighbor 10.3.0.226 failed: Failed to join the cluster at neigh IP 10.3.0.226: 1 error(s) occurred:\n\n* Failed to join 10.3.0.226: dial tcp 10.3.0.226:7946: getsockopt: connection refused" 
time="2016-07-29T08:57:36.914857855Z" level=debug msg="2016/07/29 08:57:36 [DEBUG] memberlist: Failed to join 10.3.0.227: dial tcp 10.3.0.227:7946: getsockopt: connection refused\n" 
time="2016-07-29T08:57:36.914945163Z" level=error msg="joining serf neighbor 10.3.0.227 failed: Failed to join the cluster at neigh IP 10.3.0.227: 1 error(s) occurred:\n\n* Failed to join 10.3.0.227: dial tcp 10.3.0.227:7946: getsockopt: connection refused" 
time="2016-07-29T08:57:36.915607010Z" level=debug msg="2016/07/29 08:57:36 [DEBUG] memberlist: Failed to join 10.3.0.235: dial tcp 10.3.0.235:7946: getsockopt: connection refused\n" 
time="2016-07-29T08:57:36.915652188Z" level=error msg="joining serf neighbor 10.3.0.235 failed: Failed to join the cluster at neigh IP 10.3.0.235: 1 error(s) occurred:\n\n* Failed to join 10.3.0.235: dial tcp 10.3.0.235:7946: getsockopt: connection refused" 
time="2016-07-29T08:57:36.916108148Z" level=debug msg="2016/07/29 08:57:36 [DEBUG] memberlist: Initiating push/pull sync with: 10.3.0.32:7946\n" 
time="2016-07-29T08:57:36.917181600Z" level=info msg="2016/07/29 08:57:36 [INFO] serf: EventMemberJoin: ip-10-3-0-32 10.3.0.32\n" 
time="2016-07-29T08:57:36.922811728Z" level=debug msg="Allocating IPv4 pools for network bridge (a67d42ee838952d681d8ede5cd9ebed5d809866c376fe840a55c6779a2b8ec9c)" 
time="2016-07-29T08:57:36.922862840Z" level=debug msg="RequestPool(LocalDefault, 172.17.42.1/16, , map[], false)" 
time="2016-07-29T08:57:36.922961802Z" level=debug msg="RequestAddress(LocalDefault/172.17.0.0/16, 172.17.42.1, map[RequestAddressType:com.docker.network.gateway])" 
time="2016-07-29T08:57:36.989982425Z" level=debug msg="/sbin/iptables, [--wait -t nat -C POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE]" 
time="2016-07-29T08:57:36.992319457Z" level=debug msg="/sbin/iptables, [--wait -t nat -D POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE]" 
time="2016-07-29T08:57:36.994097159Z" level=debug msg="/sbin/iptables, [--wait -t nat -C POSTROUTING -m addrtype --src-type LOCAL -o docker0 -j MASQUERADE]" 
time="2016-07-29T08:57:36.995846566Z" level=debug msg="/sbin/iptables, [--wait -t nat -D POSTROUTING -m addrtype --src-type LOCAL -o docker0 -j MASQUERADE]" 
time="2016-07-29T08:57:36.997624561Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -i docker0 -o docker0 -j ACCEPT]" 
time="2016-07-29T08:57:36.999143919Z" level=debug msg="/sbin/iptables, [--wait -D FORWARD -i docker0 -o docker0 -j ACCEPT]" 
time="2016-07-29T08:57:37.000744761Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -i docker0 ! -o docker0 -j ACCEPT]" 
time="2016-07-29T08:57:37.002269216Z" level=debug msg="/sbin/iptables, [--wait -D FORWARD -i docker0 ! -o docker0 -j ACCEPT]" 
time="2016-07-29T08:57:37.003906492Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT]" 
time="2016-07-29T08:57:37.006340380Z" level=debug msg="/sbin/iptables, [--wait -D FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT]" 
time="2016-07-29T08:57:37.008118305Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -o docker0 -j DOCKER]" 
time="2016-07-29T08:57:37.009768137Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -o docker0 -j DOCKER]" 
time="2016-07-29T08:57:37.011396881Z" level=debug msg="/sbin/iptables, [--wait -D FORWARD -o docker0 -j DOCKER]" 
time="2016-07-29T08:57:37.013091354Z" level=debug msg="/sbin/iptables, [--wait -t filter -C DOCKER-ISOLATION -i docker0 -o docker_gwbridge -j DROP]" 
time="2016-07-29T08:57:37.014700528Z" level=debug msg="/sbin/iptables, [--wait -D DOCKER-ISOLATION -i docker0 -o docker_gwbridge -j DROP]" 
time="2016-07-29T08:57:37.026987917Z" level=debug msg="/sbin/iptables, [--wait -t filter -C DOCKER-ISOLATION -i docker_gwbridge -o docker0 -j DROP]" 
time="2016-07-29T08:57:37.035027625Z" level=debug msg="/sbin/iptables, [--wait -D DOCKER-ISOLATION -i docker_gwbridge -o docker0 -j DROP]" 
time="2016-07-29T08:57:37.047514046Z" level=debug msg="releasing IPv4 pools from network bridge (a67d42ee838952d681d8ede5cd9ebed5d809866c376fe840a55c6779a2b8ec9c)" 
time="2016-07-29T08:57:37.047735557Z" level=debug msg="ReleaseAddress(LocalDefault/172.17.0.0/16, 172.17.42.1)" 
time="2016-07-29T08:57:37.047870412Z" level=debug msg="ReleasePool(LocalDefault/172.17.0.0/16)" 
time="2016-07-29T08:57:37.064822845Z" level=debug msg="Allocating IPv4 pools for network bridge (b8be3ccbc25a3e6b536ba4a09d16212bd616cbb5fce2f79b4d2fb925ec8b1c8c)" 
time="2016-07-29T08:57:37.065034400Z" level=debug msg="RequestPool(LocalDefault, 172.17.42.1/16, , map[], false)" 
time="2016-07-29T08:57:37.065222185Z" level=debug msg="RequestAddress(LocalDefault/172.17.0.0/16, 172.17.42.1, map[RequestAddressType:com.docker.network.gateway])" 
time="2016-07-29T08:57:37.065676816Z" level=debug msg="/sbin/iptables, [--wait -t nat -C POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE]" 
time="2016-07-29T08:57:37.067846625Z" level=debug msg="/sbin/iptables, [--wait -t nat -I POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE]" 
time="2016-07-29T08:57:37.069406523Z" level=debug msg="/sbin/iptables, [--wait -t nat -C POSTROUTING -m addrtype --src-type LOCAL -o docker0 -j MASQUERADE]" 
time="2016-07-29T08:57:37.071046040Z" level=debug msg="/sbin/iptables, [--wait -t nat -I POSTROUTING -m addrtype --src-type LOCAL -o docker0 -j MASQUERADE]" 
time="2016-07-29T08:57:37.072690892Z" level=debug msg="/sbin/iptables, [--wait -D FORWARD -i docker0 -o docker0 -j DROP]" 
time="2016-07-29T08:57:37.074266451Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -i docker0 -o docker0 -j ACCEPT]" 
time="2016-07-29T08:57:37.086907858Z" level=debug msg="/sbin/iptables, [--wait -I FORWARD -i docker0 -o docker0 -j ACCEPT]" 
time="2016-07-29T08:57:37.094938035Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -i docker0 ! -o docker0 -j ACCEPT]" 
time="2016-07-29T08:57:37.097563215Z" level=debug msg="/sbin/iptables, [--wait -I FORWARD -i docker0 ! -o docker0 -j ACCEPT]" 
time="2016-07-29T08:57:37.099139530Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT]" 
time="2016-07-29T08:57:37.100784038Z" level=debug msg="/sbin/iptables, [--wait -I FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT]" 
time="2016-07-29T08:57:37.102417256Z" level=debug msg="/sbin/iptables, [--wait -t nat -C PREROUTING -m addrtype --dst-type LOCAL -j DOCKER]" 
time="2016-07-29T08:57:37.104077999Z" level=debug msg="/sbin/iptables, [--wait -t nat -C PREROUTING -m addrtype --dst-type LOCAL -j DOCKER]" 
time="2016-07-29T08:57:37.105806238Z" level=debug msg="/sbin/iptables, [--wait -t nat -C OUTPUT -m addrtype --dst-type LOCAL -j DOCKER]" 
time="2016-07-29T08:57:37.107424313Z" level=debug msg="/sbin/iptables, [--wait -t nat -C OUTPUT -m addrtype --dst-type LOCAL -j DOCKER]" 
time="2016-07-29T08:57:37.108986292Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -o docker0 -j DOCKER]" 
time="2016-07-29T08:57:37.110494135Z" level=debug msg="/sbin/iptables, [--wait -I FORWARD -o docker0 -j DOCKER]" 
time="2016-07-29T08:57:37.112072970Z" level=debug msg="/sbin/iptables, [--wait -t filter -C FORWARD -j DOCKER-ISOLATION]" 
time="2016-07-29T08:57:37.113563442Z" level=debug msg="/sbin/iptables, [--wait -D FORWARD -j DOCKER-ISOLATION]" 
time="2016-07-29T08:57:37.115234272Z" level=debug msg="/sbin/iptables, [--wait -I FORWARD -j DOCKER-ISOLATION]" 
time="2016-07-29T08:57:37.116867000Z" level=debug msg="/sbin/iptables, [--wait -t filter -C DOCKER-ISOLATION -i docker0 -o docker_gwbridge -j DROP]" 
time="2016-07-29T08:57:37.118341490Z" level=debug msg="/sbin/iptables, [--wait -I DOCKER-ISOLATION -i docker0 -o docker_gwbridge -j DROP]" 
time="2016-07-29T08:57:37.120445515Z" level=debug msg="/sbin/iptables, [--wait -t filter -C DOCKER-ISOLATION -i docker_gwbridge -o docker0 -j DROP]" 
time="2016-07-29T08:57:37.123246878Z" level=debug msg="/sbin/iptables, [--wait -I DOCKER-ISOLATION -i docker_gwbridge -o docker0 -j DROP]" 
time="2016-07-29T08:57:37.132411557Z" level=debug msg="Watch triggered with 8 nodes" discovery=consul 
time="2016-07-29T08:57:37.147044574Z" level=debug msg="2016/07/29 08:57:37 [DEBUG] serf: messageJoinType: ip-10-3-0-92\n" 
time="2016-07-29T08:57:37.159530405Z" level=debug msg="Starting container c98ab340c9e07355c89f2237f64ab03c9e2faae03720db2f2b3a556152e17034" 
time="2016-07-29T08:57:37.159686605Z" level=debug msg="Starting container 65e7a632bbd8813ad9b18b18dfd75579e1afb57809e13a7fcb3fb4804bfcfd6f" 
time="2016-07-29T08:57:37.159920100Z" level=debug msg="Starting container 65ff1b44494ae51257a684169336db7b0085c51bc74cfb0a6e5fbe2e2474042c" 
time="2016-07-29T08:57:37.171579661Z" level=debug msg="container mounted via layerStore: /home/ubuntu/hgdata/deployments/docker/overlay/4e29d7ecebfd17faf2f6a7d8a830220fc2ed58acd03ba50f8688ae06031d5352/merged" 
time="2016-07-29T08:57:37.172686652Z" level=debug msg="container mounted via layerStore: /home/ubuntu/hgdata/deployments/docker/overlay/940b01df699784d9ff0a66d9b04560a78d7ca5c82fa4a0bf13b7b0931aa2d90c/merged" 
time="2016-07-29T08:57:37.179409896Z" level=debug msg="container mounted via layerStore: /home/ubuntu/hgdata/deployments/docker/overlay/52650b1bf550396875643f5772399750c6e12eddbc8b4ef008351333548aa846/merged" 
time="2016-07-29T08:57:37.215739479Z" level=debug msg="Assigning addresses for endpoint httpd-b's interface on network backbone2" 
time="2016-07-29T08:57:37.215845287Z" level=debug msg="RequestAddress(GlobalDefault/10.0.0.0/24, <nil>, map[])" 
time="2016-07-29T08:57:37.217445392Z" level=debug msg="Assigning addresses for endpoint haproxy-ops-o1-b's interface on network backbone2" 
time="2016-07-29T08:57:37.217537454Z" level=debug msg="RequestAddress(GlobalDefault/10.0.0.0/24, <nil>, map[])" 
time="2016-07-29T08:57:37.225728621Z" level=debug msg="Assigning addresses for endpoint api-ops-o1-b-blue's interface on network backbone2" 
time="2016-07-29T08:57:37.225776649Z" level=debug msg="RequestAddress(GlobalDefault/10.0.0.0/24, <nil>, map[])" 
time="2016-07-29T08:57:37.234783387Z" level=debug msg="Assigning addresses for endpoint httpd-b's interface on network backbone2" 
time="2016-07-29T08:57:37.240746864Z" level=debug msg="Assigning addresses for endpoint api-ops-o1-b-blue's interface on network backbone2" 
time="2016-07-29T08:57:37.257392189Z" level=debug msg="Assigning addresses for endpoint haproxy-ops-o1-b's interface on network backbone2" 
time="2016-07-29T08:57:37.347124347Z" level=debug msg="2016/07/29 08:57:37 [DEBUG] serf: messageJoinType: ip-10-3-0-92\n" 
time="2016-07-29T08:57:37.370529481Z" level=debug msg="Releasing addresses for endpoint httpd-b's interface on network backbone2" 
time="2016-07-29T08:57:37.370691252Z" level=debug msg="ReleaseAddress(GlobalDefault/10.0.0.0/24, 10.0.0.2)" 
time="2016-07-29T08:57:37.401833645Z" level=debug msg="Releasing addresses for endpoint haproxy-ops-o1-b's interface on network backbone2" 
time="2016-07-29T08:57:37.401903119Z" level=debug msg="ReleaseAddress(GlobalDefault/10.0.0.0/24, 10.0.0.4)" 
time="2016-07-29T08:57:37.408772774Z" level=warning msg="failed to cleanup ipc mounts:\nfailed to umount /home/ubuntu/hgdata/deployments/docker/containers/65ff1b44494ae51257a684169336db7b0085c51bc74cfb0a6e5fbe2e2474042c/shm: invalid argument" 
time="2016-07-29T08:57:37.416758906Z" level=debug msg="Releasing addresses for endpoint api-ops-o1-b-blue's interface on network backbone2" 
time="2016-07-29T08:57:37.416850279Z" level=debug msg="ReleaseAddress(GlobalDefault/10.0.0.0/24, 10.0.0.3)" 
time="2016-07-29T08:57:37.423211346Z" level=error msg="Failed to start container 65ff1b44494ae51257a684169336db7b0085c51bc74cfb0a6e5fbe2e2474042c: network sandbox join failed: could not get network sandbox (oper true): failed get network namespace \"\": no such file or directory" 
time="2016-07-29T08:57:37.425044859Z" level=warning msg="failed to cleanup ipc mounts:\nfailed to umount /home/ubuntu/hgdata/deployments/docker/containers/65e7a632bbd8813ad9b18b18dfd75579e1afb57809e13a7fcb3fb4804bfcfd6f/shm: invalid argument" 
time="2016-07-29T08:57:37.431759608Z" level=warning msg="failed to cleanup ipc mounts:\nfailed to umount /home/ubuntu/hgdata/deployments/docker/containers/c98ab340c9e07355c89f2237f64ab03c9e2faae03720db2f2b3a556152e17034/shm: invalid argument" 
time="2016-07-29T08:57:37.434924754Z" level=error msg="Failed to start container 65e7a632bbd8813ad9b18b18dfd75579e1afb57809e13a7fcb3fb4804bfcfd6f: network sandbox join failed: could not get network sandbox (oper true): failed get network namespace \"\": no such file or directory" 
time="2016-07-29T08:57:37.442971692Z" level=error msg="Failed to start container c98ab340c9e07355c89f2237f64ab03c9e2faae03720db2f2b3a556152e17034: network sandbox join failed: could not get network sandbox (oper true): failed get network namespace \"\": no such file or directory" 
time="2016-07-29T08:57:37.443081837Z" level=info msg="Daemon has completed initialization" 
time="2016-07-29T08:57:37.443102973Z" level=info msg="Docker daemon" commit=8eab29e graphdriver=overlay version=1.12.0 

thaJeztah added the kind/bug and area/networking labels Jul 29, 2016
@thaJeztah (Member)

ping @mavenugo @mrjana PTAL


KramKroc commented Aug 3, 2016

Any updates here? Seeing something along the same lines, i.e.

  • upgrade to 1.12
  • consul
  • network sandbox join failed: could not get network sandbox

@thaJeztah (Member)

I see there's a work-in-progress pull request linked: moby/libnetwork#1369

thaJeztah added this to the 1.12.1 milestone Aug 3, 2016
@sebi-hgdata (Author)

Seems like once you get an error like this on a node, the only workaround I found is to recreate the overlay network (a rough sketch follows below)... which you can't really do in a prod env... it's an upgrade showstopper for us.
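
A minimal sketch of that workaround, assuming the network name and subnet from this report (backbone2, 10.0.0.0/24) and that every container attached to the network has first been stopped and disconnected:

$ sudo docker network rm backbone2
$ sudo docker network create -d overlay --subnet 10.0.0.0/24 backbone2
$ sudo docker start haproxy-ops-o1-b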

@KramKroc

Have to agree with @sebi-hgdata that a robust, non-service-affecting workaround is needed here. We would like to migrate to 1.12 for the live-restore feature, but cannot do so as a result of this bug.

@randunel

We are getting the same issue in production. None of the containers on one of the hosts can start.

docker start <container-id>

Error response from daemon: Error response from daemon: network sandbox join failed: could not get network sandbox (oper true): failed get network namespace "": no such file or directory
Error: failed to start containers: <container-id>

This doesn't happen often, but it never happened to us in 1.11.x.

Any production worthy workarounds?


sanimej commented Aug 11, 2016

@randunel Can you give some details about your setup and when this problem started? Did it happen after upgrading to 1.12? Was there any change on one of the nodes on which you see this error?


randunel commented Aug 12, 2016

We have a swarm setup with 1 manager host running consul and 2 other containers, and 7 other hosts. All the hosts are using different certificates signed by the same CA. Consul uses different certificates, signed by the same CA.

We "upgraded" from 1.11 to 1.12 by creating a new infrastructure (new machines), so this can't be an "upgrade" problem.

We keep having other network problems: at random times, one random container on a node-* host simply cannot reach one of the containers on the manager host (host unreachable) while still reaching all other containers, including others on the manager. This happens every other week, so as a workaround we have an automated script that recreates the whole infrastructure every week. This particular infrastructure is 2 days old.

This particular issue is new to us. It started with containers on workers-3 being unable to reach node-2, getting only Destination Host Unreachable:

docker exec 887 ping api

PING api (10.0.0.3) 56(84) bytes of data.
From 8875c7cc6ba4 (10.0.0.19) icmp_seq=1 Destination Host Unreachable

All 7 containers on workers-3 started exhibiting this behaviour at the same time: they could reach all the other containers except that one. We rebooted the workers-3 host, and that's when we noticed this new error when starting containers:

docker start <container-id>

Error response from daemon: Error response from daemon: network sandbox join failed: could not get network sandbox (oper true): failed get network namespace "": no such file or directory
Error: failed to start containers: <container-id>

Details about the 2 days old infrastructure:

docker info

Containers: 52
 Running: 51
 Paused: 0
 Stopped: 1
Images: 166
Server Version: swarm/1.2.4
Role: primary
Strategy: spread
Filters: health, port, containerslots, dependency, affinity, constraint
Nodes: 8
 manager: EDITED
  └ ID: NPAQ:3TMC:CSTH:RKA5:GRB5:T3EB:IZNP:UFSR:3UIT:KKIU:QJXS:ZIFC
  └ Status: Healthy
  └ Containers: 6 (6 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 2
  └ Reserved Memory: 1.465 GiB / 2.051 GiB
  └ Labels: kernelversion=4.4.0-31-generic, operatingsystem=Ubuntu 16.04.1 LTS, provider=digitalocean, storagedriver=aufs
  └ UpdatedAt: 2016-08-12T06:50:30Z
  └ ServerVersion: 1.12.0
 node-1: EDITED
  └ ID: UVQX:4H4A:KXVE:N4DJ:RYFS:HG7I:OW4I:HIJZ:3JV2:6RXL:MVCO:MZGP
  └ Status: Healthy
  └ Containers: 5 (5 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 2
  └ Reserved Memory: 400 MiB / 4.052 GiB
  └ Labels: kernelversion=4.4.0-31-generic, operatingsystem=Ubuntu 16.04.1 LTS, provider=digitalocean, storagedriver=aufs
  └ UpdatedAt: 2016-08-12T06:50:27Z
  └ ServerVersion: 1.12.0
 node-2: EDITED
  └ ID: Q53K:V3QQ:GAZJ:I4XX:XHOP:JI7P:26BP:IPDV:GO4P:EOFU:P3J2:7NXG
  └ Status: Healthy
  └ Containers: 5 (5 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 2
  └ Reserved Memory: 1.293 GiB / 4.052 GiB
  └ Labels: kernelversion=4.4.0-31-generic, operatingsystem=Ubuntu 16.04.1 LTS, provider=digitalocean, storagedriver=aufs
  └ UpdatedAt: 2016-08-12T06:50:10Z
  └ ServerVersion: 1.12.0
 node-3: EDITED
  └ ID: 26YF:3QPV:K2SS:Q35O:PVHL:PBYI:LEMJ:V544:5HE3:6CSR:J5RP:YLTS
  └ Status: Healthy
  └ Containers: 5 (5 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 2
  └ Reserved Memory: 700 MiB / 4.052 GiB
  └ Labels: kernelversion=4.4.0-31-generic, operatingsystem=Ubuntu 16.04.1 LTS, provider=digitalocean, storagedriver=aufs
  └ UpdatedAt: 2016-08-12T06:49:51Z
  └ ServerVersion: 1.12.0
 workers-1: EDITED
  └ ID: GDJO:VMGP:VHSL:PDEN:2ZL2:HH72:VIGA:NHEP:OWGX:3LA6:LV2W:EJ32
  └ Status: Healthy
  └ Containers: 6 (6 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 2
  └ Reserved Memory: 3.418 GiB / 4.052 GiB
  └ Labels: kernelversion=4.4.0-31-generic, operatingsystem=Ubuntu 16.04.1 LTS, provider=digitalocean, storagedriver=aufs
  └ UpdatedAt: 2016-08-12T06:49:45Z
  └ ServerVersion: 1.12.0
 workers-2: EDITED
  └ ID: AD63:YCAD:TX5W:YZMD:JRNC:YZKN:KALM:Z5BA:LZWJ:QN7V:OAM2:MEIX
  └ Status: Healthy
  └ Containers: 6 (6 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 2
  └ Reserved Memory: 3.418 GiB / 4.052 GiB
  └ Labels: kernelversion=4.4.0-31-generic, operatingsystem=Ubuntu 16.04.1 LTS, provider=digitalocean, storagedriver=aufs
  └ UpdatedAt: 2016-08-12T06:50:19Z
  └ ServerVersion: 1.12.0
 workers-3: EDITED <<------ THIS WORKER'S CONTAINERS FAIL TO START WITH THE ABOVE ERROR MESSAGE
  └ ID: Y2FM:U4YE:CFTC:GXGC:WGNA:OVMB:DCOO:OBM4:IPRB:KZRX:TNCT:2ZGB
  └ Status: Healthy
  └ Containers: 6 (5 Running, 0 Paused, 1 Stopped)
  └ Reserved CPUs: 0 / 2
  └ Reserved Memory: 3.418 GiB / 4.052 GiB
  └ Labels: kernelversion=4.4.0-34-generic, operatingsystem=Ubuntu 16.04.1 LTS, provider=digitalocean, storagedriver=aufs
  └ UpdatedAt: 2016-08-12T06:49:49Z
  └ ServerVersion: 1.12.0
 workers-4: EDITED
  └ ID: NTQS:IBHH:OLEY:QAXS:H7CL:FMBA:7N5K:TKI3:IQHP:OJHQ:GSM5:ONNJ
  └ Status: Healthy
  └ Containers: 13 (13 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 4
  └ Reserved Memory: 8.203 GiB / 8.186 GiB
  └ Labels: kernelversion=4.4.0-31-generic, operatingsystem=Ubuntu 16.04.1 LTS, provider=digitalocean, storagedriver=aufs
  └ UpdatedAt: 2016-08-12T06:49:55Z
  └ ServerVersion: 1.12.0
Plugins: 
 Volume: 
 Network: 
Kernel Version: 4.4.0-31-generic
Operating System: linux
Architecture: amd64
CPUs: 18
Total Memory: 34.55 GiB
Name: EDITED
Docker Root Dir: 
Debug mode (client): false
Debug mode (server): false
WARNING: No kernel memory limit support

Other details (might or might not be relevant): we provision the machines with --swarm-discovery consul://manager:8500, --swarm-opt="discovery-opt=\"kv.{cacertfile,certfile,keyfile}=/path/to/certificate\"", --engine-opt="cluster-store=..." and --engine-opt="cluster-advertise=eth0:port". A rough reconstruction follows below.
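
A rough, hypothetical reconstruction of such a provisioning command with docker-machine (the driver, node name, certificate path and ports below are placeholders, not taken from the actual setup):

$ docker-machine create -d digitalocean \
    --swarm --swarm-discovery consul://manager:8500 \
    --swarm-opt 'discovery-opt=kv.cacertfile=/path/to/certificate' \
    --engine-opt 'cluster-store=consul://manager:8500' \
    --engine-opt 'cluster-advertise=eth0:2376' \
    node-1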

@sebi-hgdata (Author)

@randunel off topic... when you have the networking issues, can you try setting up serf and doing a reachability test? I've seen that it resolves the issue (probably by forcing a serf state sync), and you can also check your cluster membership:

wget https://releases.hashicorp.com/serf/0.7.0/serf_0.7.0_linux_amd64.zip
unzip serf_0.7.0_linux_amd64.zip
./serf agent -bind=$YOUR_IP:7947 -node=serf_tester &
./serf join $YOUR_IP:7946 # join docker's serf cluster
./serf reachability -verbose # you should see an ack for all nodes in your cluster 


vingrad commented Aug 20, 2016

I have this error after upgrading to 1.12 too. I use Compose with several overlay networks and docker swarm. The error appears on only one network; several other overlay networks work well on the same setup.


vingrad commented Aug 20, 2016

I have this problem with version 1.12.1 too.


groyee commented Aug 20, 2016

I have the same issue


vingrad commented Aug 22, 2016

@groyee, do you use swarm with consul, or just the new docker 1.12 with internal discovery?


groyee commented Aug 22, 2016

Swarm with Consul


gfyrag commented Aug 23, 2016

Same issue.
No swarm, just two machines with consul on each, docker on top using Consul as K/V storage, an overlay network and some containers.
After a hard reboot of a machine, no container can be started on this network.

PR moby/libnetwork#1369 works for me:
Aug 23 16:09:17 runner docker[880]: time="2016-08-23T16:09:17.739501019+02:00" level=warning msg="Failure during overlay endpoints restore: restore network sandbox failed: could not get network sandbox (oper true): failed get network namespace "": no such file or directory"
Aug 23 16:09:17 runner docker[880]: time="2016-08-23T16:09:17.739536673+02:00" level=info msg="resetting init error and once variable for network 65483310fc01c510be6e26f47cba5a9253d91b776c4748c33113471257ccf4d8 after unsuccesful endpoint restore: could not get network sandbox (oper true): failed get network namespace "": no such file or directory"

Any chance to get this merged soon?


sanimej commented Aug 23, 2016

@gfyrag Did you update the unit file along with docker 1.12 and then reboot the machine? Were there containers still running when you rebooted? What was the restart policy of the containers?


gfyrag commented Aug 24, 2016

@sanimej No need to update the unit file (the installed version was v1.12.1, so I kept the same unit file). I just recompiled the "dockerd" binary, replaced the original binary and restarted docker. After that, the overlay network started working again. Then I also hard-rebooted the machine (the case where it didn't work before), and it still works.

The general scheme of my unit files is:
ExecStartPre=/usr/bin/docker pull XXX
ExecStartPre=-/usr/bin/docker rm -f YYY
ExecStart=/usr/bin/docker run --rm --name YYY XXX
ExecStop=/usr/bin/docker stop YYY

There is no restart policy; a clean container is started each time.


vingrad commented Aug 24, 2016

This is a critical issue, because many containers cannot be started. Please fix this ASAP.


byrnedo commented Aug 26, 2016

Is this fixed in 1.12.1 then?


vingrad commented Aug 26, 2016

No, this bug is still present in 1.12.1

tiborvass modified the milestones: 1.12.1, 1.12.2 Aug 30, 2016

gfyrag commented Sep 2, 2016

Compile your own version (the repository comes with the necessary tools) using the PR moby/libnetwork#1369. A rough sketch of the process follows below.
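
A rough sketch of that approach, assuming the docker/docker build toolchain of the 1.12 era (the checkout tag is illustrative, and the vendoring step is only described, not spelled out):

$ git clone https://github.com/docker/docker.git && cd docker
$ git checkout v1.12.1
# vendor a libnetwork revision containing the moby/libnetwork#1369 changes
# under vendor/src/github.com/docker/libnetwork, then build:
$ make binary
# copy the resulting dockerd binary from bundles/ over the installed one, then:
$ sudo systemctl restart docker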


sanimej commented Sep 2, 2016

I am able to recreate it consistently with the following steps:

  • 1.12.x daemon with live-restore flag and some containers running with no restart policy
  • hard reset the host
  • when the daemon comes up after the host restart, the overlay driver tries to restore the saved endpoints. Since the host has been rebooted, the namespace paths have been cleaned up and the restore fails.
  • any subsequent container start on that network fails because of the earlier error.

Given this, docker/libnetwork#1369 is the right fix for this issue. It will be available in the next patch release.
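
A minimal command-line sketch of those repro steps (the KV-store address, network name, container name and image are illustrative):

$ dockerd --live-restore --cluster-store=consul://localhost:8500 --cluster-advertise=eth0:2375 &
$ docker network create -d overlay ov-net
$ docker run -d --name web --net ov-net nginx   # note: no restart policy
# hard-reset the host; once it comes back up and the daemon has restored state:
$ docker start web
Error response from daemon: network sandbox join failed: could not get network sandbox (oper true): failed get network namespace "": no such file or directory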


vingrad commented Sep 6, 2016

How long do we have to wait for the next patch release? Containers cannot be started. This is a critical issue.

aboch (Contributor) commented Sep 14, 2016

The libnetwork fix was brought into docker master via vendoring in #25962.
This issue can now be closed.

@sebi-hgdata (Author)

@aboch I see that #25962 has milestone 1.13, but this issue has 1.12.2.
In what release will the fix be included?

aboch (Contributor) commented Sep 14, 2016

@sebi-hgdata If a 1.12.2 is going to be released, we'll make sure the docker/libnetwork#1369 changes are part of it.

@mavenugo (Contributor)

Closing ...

@randunel

@mavenugo has a fix been released? Is it production-ready, even by docker standards, or at least present in the binaries?

@mavenugo (Contributor)

@randunel the fix (moby/libnetwork#1369) is merged into docker master. It will be cherry-picked into the 1.12.2 branch when the bump branch is pulled. So yes, it will be available in the next docker release.

@randunel

The question is "has a fix been released", and the answer "yes, it will be available in the next docker release" means "no, it hasn't been released".

You shouldn't close this issue until the fix is released (and tested). But anyway, let's continue hiding issues; it's no different from having docs for an unreleased version on the website for 3 months.

aboch (Contributor) commented Sep 15, 2016

@randunel

You shouldn't close this issue until the fix is released (and tested).

The common workflow on GitHub is that an issue gets closed when the PR containing the fix for it has been merged. This step is even automated when the PR description contains the word "fixes" along with the issue number. The user then looks at the fix PR and derives which release contains the fix.

In this case, the PR which brought the fix into docker/docker did not reference this issue, which is why I suggested closing it manually.

Regarding testing, it is not always possible to recreate the exact scenario the user was in.
In this specific case, though, we knew from the logs that the root cause was the missing namespace path; a scenario with a missing namespace path was recreated and the fix was tested against it.

But anyway, let's continue hiding issues,

I understand you may feel this way because certain error messages recur in different issues opened at different times and across different docker versions, sometimes months apart. Most of the time, at least from what I have witnessed so far, they come from very different scenarios and different exercised code paths which happen to lead to similar (not always the same) error messages.

This is so common that when developers see two issues with the same error message, they reject a priori the idea (I'd call it a temptation) that the two issues have the same root cause, and focus instead on what was happening when the issue was hit.

Cheers

icecrime added the priority/P1 label Sep 19, 2016
@schmunk42 (Contributor)

Releasing a fix for this issue is also getting really urgent for us. We hit this on a newly set up staging swarm with overlay networking. When the above error occurs, the whole swarm gets into an uncontrollable state; the only workaround so far is to restart all engines in the swarm. Absolutely unusable for production.


groyee commented Sep 27, 2016

Can anyone provide some workaround until this fix is released?

This happens for us all the time in production. Absolutely frustrating. Currently the only option I know of is to physically remove the VM from the cloud and create a new one.

@thaJeztah (Member)

The change was cherry-picked into the 1.12.x branch (https://github.com/docker/docker/blob/1.12.x/vendor/src/github.com/docker/libnetwork/drivers/overlay/overlay.go#L97-L106) through #26879. We will be releasing docker 1.12.2-rc1 today or tomorrow for testing.


groyee commented Sep 27, 2016

Thank You! Can't wait for this release :-)

@thaJeztah (Member)

1.12.2-rc1 was released for testing: https://github.com/docker/docker/releases/tag/v1.12.2-rc1

schmunk42 (Contributor) commented Sep 30, 2016

I gave upgrading a try...

I now ran into an issue where overlay networks were created multiple times, with the same name but different IDs.
docker network rm NAME was not able to delete the networks unless I specified all IDs manually (a sketch follows below).
I also needed to run docker-compose down && docker-compose up -d for the stack that threw these errors.
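
A small sketch of removing such duplicates by ID rather than by name (NAME is a placeholder for the duplicated network name):

$ docker network ls --filter name=NAME                     # lists several IDs with the same name
$ docker network ls -q --filter name=NAME | xargs -n1 docker network rm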

@thaJeztah (Member)

@schmunk42 are you using compose with swarm mode (and to create the overlay networks)?

@schmunk42 (Contributor)

@thaJeztah No, we're using docker/swarm machines created with docker-machine


sanimej commented Sep 30, 2016

@schmunk42 I have to recreate the duplicate overlay network issue to identify the root cause. You had a 1.12.1 daemon running with some overlay networks, and restarting the daemon with 1.12.2-RC1 resulted in all overlay networks being created again with different IDs?


sanimej commented Sep 30, 2016

Also, after the upgrade were you able to confirm that the network sandbox join failed errors are not happening any more? Earlier, what was the trigger that resulted in the error?

@schmunk42 (Contributor)

@sanimej Yes, but not all overlay networks were duplicated; only one that a co-worker was redeploying.

I haven't seen the Network sandbox join failed issues so far.


sanimej commented Sep 30, 2016

@schmunk42 Is there any difference between the overlay network that got recreated and the ones that didn't? Maybe there were containers with a restart policy on one but not the other?


sanimej commented Sep 30, 2016

@groyee Can you try 1.12.2-RC1 and confirm whether this issue has been fixed in your setup?


groyee commented Sep 30, 2016

I am gonna give it a try now


groyee commented Sep 30, 2016

OK.

So, I am editing my previous post.

Here are the results:

1.12.2-RC1 definitely fixed the issue. I don't see this error anymore.

There is only one big issue: it takes almost 2 minutes for every container to start. I looked at the logs, and most of the time is spent trying to connect to the overlay network.

This command takes the same amount of time to return:

sudo docker network ls

What is going on there? Why is it taking so long? It used to be a matter of seconds, even less.

If it makes any difference, I have about 150 containers connected to this overlay network.
