Docker daemon crashes/logs full of network not found #20694

Closed
vlerenc opened this issue Feb 25, 2016 · 13 comments · Fixed by #21019

@vlerenc

vlerenc commented Feb 25, 2016


BUG REPORT INFORMATION (may be related to #19988)

Output of docker version:

Docker version 1.9.0, build 76d6bc9
# and also
Docker version 1.10.2, build c3959b1

Output of docker info:

# none, I can't reanimate the daemon anymore and all my 230+ containers are basically down

Provide additional environment details (AWS, VirtualBox, physical, etc.):

OpenStack VM, 4 cores, 8 GB mem, Ubuntu 14.04.4 LTS, Linux 2b320682-1875-4548-a1e2-29d15b11b5a8 3.19.0-49-generic #55~14.04.1hf1533043v20160201b1-Ubuntu SMP Mon Feb 1 20:41:00 UT x86_64 x86_64 x86_64 GNU/Linux.

List the steps to reproduce the issue:

The daemon has been started by script for months without problems, but is now failing. It also fails after a Docker update and an OS update. Here is the launch command we use (230+ containers, all started with restart policy = always):

docker daemon --api-cors-header=false --debug=false --disable-legacy-registry=false --group vcap --graph /var/vcap/store/docker/docker --host unix:///var/vcap/sys/run/docker/docker.sock --icc=true --ip-forward=true --ip-masq=true --iptables=true --ipv6=false --log-level=info --pidfile /var/vcap/sys/run/docker/docker.pid --selinux-enabled=false --host tcp://0.0.0.0:4243 --tls=false --userland-proxy=true
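For illustration, the individual containers are started along these lines; the image name, container name, and port are placeholders, not our exact invocation:

docker run -d --restart=always --name service-instance-example -p 40001:5432 example/service-image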

Describe the results you received:

INFO[0000] API listen on /var/vcap/sys/run/docker/docker.sock 
INFO[0000] API listen on [::]:4243                      
INFO[0000] [graphdriver] using prior storage driver "aufs" 
ERRO[0000] getNetworkFromStore for nid d48a6206205146ee2e1ca9ae0ce6f5ea1c6b2b1a64d739f771bccfd0ac5970e1 failed while trying to build sandbox for cleanup: network d48a6206205146ee2e1ca9ae0ce6f5ea1c6b2b1a64d739f771bccfd0ac5970e1 not found 
ERRO[0000] getEndpointFromStore for eid 8fc41dcb6bf31938e79b193d01ce84acd7f660d8ae82afb6f31c8bf191e1663f failed while trying to build sandbox for cleanup: could not find endpoint 8fc41dcb6bf31938e79b193d01ce84acd7f660d8ae82afb6f31c8bf191e1663f: Key not found in store 
WARN[0000] Failed detaching sandbox 569c2a0eefb84fb14b3e3c6cbb554163b480331b996e2ef9f7b2fa9a4e033672 from endpoint c71f3a95931e93968f28f0ab27bef203d10aa1d48b2c18b36ca56eaf023fcc10: failed to get network from store during leave: network d48a6206205146ee2e1ca9ae0ce6f5ea1c6b2b1a64d739f771bccfd0ac5970e1 not found

WARN[0000] Failed deleting endpoint c71f3a95931e93968f28f0ab27bef203d10aa1d48b2c18b36ca56eaf023fcc10: failed to get network during Delete: network d48a6206205146ee2e1ca9ae0ce6f5ea1c6b2b1a64d739f771bccfd0ac5970e1 not found

WARN[0000] Failed detaching sandbox 569c2a0eefb84fb14b3e3c6cbb554163b480331b996e2ef9f7b2fa9a4e033672 from endpoint 8fc41dcb6bf31938e79b193d01ce84acd7f660d8ae82afb6f31c8bf191e1663f: failed to get endpoint from store during leave: could not find endpoint 8fc41dcb6bf31938e79b193d01ce84acd7f660d8ae82afb6f31c8bf191e1663f: Key not found in store

WARN[0000] Failed deleting endpoint 8fc41dcb6bf31938e79b193d01ce84acd7f660d8ae82afb6f31c8bf191e1663f: failed to get endpoint from store during Delete: could not find endpoint 8fc41dcb6bf31938e79b193d01ce84acd7f660d8ae82afb6f31c8bf191e1663f: Key not found in store

ERRO[0000] failed to delete sandbox 569c2a0eefb84fb14b3e3c6cbb554163b480331b996e2ef9f7b2fa9a4e033672 while trying to cleanup: could not cleanup all the endpoints in container 27f42f1c8f2792639d4c96f621408b1551a9121c184f2721b6aa17444c96177f / sandbox 569c2a0eefb84fb14b3e3c6cbb554163b480331b996e2ef9f7b2fa9a4e033672

...

ERRO[0025] getNetworkFromStore for nid d48a6206205146ee2e1ca9ae0ce6f5ea1c6b2b1a64d739f771bccfd0ac5970e1 failed while trying to build sandbox for cleanup: network d48a6206205146ee2e1ca9ae0ce6f5ea1c6b2b1a64d739f771bccfd0ac5970e1 not found
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x168 pc=0x87f3d9]

goroutine 1 [running]:
github.com/docker/libnetwork.epHeap.Less(0xc20864d570, 0x2, 0x2, 0x1, 0x0, 0xc208033a70)
        /go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/sandbox.go:968 +0x159
github.com/docker/libnetwork.(*epHeap).Less(0xc208854630, 0x1, 0x0, 0x2)
        <autogenerated>:50 +0xb5
container/heap.up(0x7f835b7a4438, 0xc208854630, 0x1)
        /usr/local/go/src/container/heap/heap.go:93 +0x5f
container/heap.Push(0x7f835b7a4438, 0xc208854630, 0x13ce9e0, 0xc208033a70)
        /usr/local/go/src/container/heap/heap.go:54 +0x7e
github.com/docker/libnetwork.(*controller).sandboxCleanup(0xc2080a0370)
        /go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/sandbox_store.go:226 +0xc06
github.com/docker/libnetwork.New(0xc208765bc0, 0x5, 0x8, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/controller.go:200 +0x6cb
github.com/docker/docker/daemon.(*Daemon).initNetworkController(0xc208071040, 0xc2080eb180, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/docker/docker/daemon/daemon_unix.go:364 +0xfd
github.com/docker/docker/daemon.NewDaemon(0xc2080eb180, 0xc20803a430, 0x0, 0x0, 0x0)
        /go/src/github.com/docker/docker/daemon/daemon.go:770 +0x2149
main.(*DaemonCli).CmdDaemon(0xc2084f5240, 0xc20800a020, 0x16, 0x16, 0x0, 0x0)
        /go/src/github.com/docker/docker/docker/daemon.go:269 +0x1722
reflect.callMethod(0xc2084d7f50, 0xc2084dbce8)
        /usr/local/go/src/reflect/value.go:605 +0x179
reflect.methodValueCall(0xc20800a020, 0x16, 0x16, 0x1, 0xc2084d7f50, 0x0, 0x0, 0xc2084d7f50, 0x0, 0x4dcb9f, ...)
        /usr/local/go/src/reflect/asm_amd64.s:29 +0x36
github.com/docker/docker/cli.(*Cli).Run(0xc2084d7e90, 0xc20800a010, 0x17, 0x17, 0x0, 0x0)
        /go/src/github.com/docker/docker/cli/cli.go:89 +0x38e
main.main()
        /go/src/github.com/docker/docker/docker/docker.go:65 +0x418

goroutine 5 [syscall]:
os/signal.loop()
        /usr/local/go/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
        /usr/local/go/src/os/signal/signal_unix.go:27 +0x35

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:2232 +0x1

goroutine 12 [chan receive]:
github.com/docker/docker/pkg/listenbuffer.(*defaultListener).Accept(0xc208504380, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/docker/docker/pkg/listenbuffer/buffer.go:73 +0xb6
net/http.(*Server).Serve(0xc2080428a0, 0x7f835b79a568, 0xc208504380, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1728 +0x92
github.com/docker/docker/api/server.(*HTTPServer).Serve(0xc208504460, 0x0, 0x0)
        /go/src/github.com/docker/docker/api/server/server.go:113 +0x4d
github.com/docker/docker/api/server.func·006(0xc208504460)
        /go/src/github.com/docker/docker/api/server/server.go:86 +0x163
created by github.com/docker/docker/api/server.(*Server).ServeAPI
        /go/src/github.com/docker/docker/api/server/server.go:90 +0x144

goroutine 10 [chan receive]:
github.com/docker/docker/api/server.(*Server).ServeAPI(0xc208037900, 0x0, 0x0)
        /go/src/github.com/docker/docker/api/server/server.go:94 +0x1b6
main.func·007()
        /go/src/github.com/docker/docker/docker/daemon.go:255 +0x3b
created by main.(*DaemonCli).CmdDaemon
        /go/src/github.com/docker/docker/docker/daemon.go:261 +0x1571

goroutine 11 [chan receive]:
github.com/docker/docker/daemon.func·030()
        /go/src/github.com/docker/docker/daemon/debugtrap_unix.go:17 +0x5c
created by github.com/docker/docker/daemon.setupDumpStackTrap
        /go/src/github.com/docker/docker/daemon/debugtrap_unix.go:20 +0x18e

goroutine 13 [chan receive]:
github.com/docker/docker/pkg/listenbuffer.(*defaultListener).Accept(0xc2085044a0, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/docker/docker/pkg/listenbuffer/buffer.go:73 +0xb6
net/http.(*Server).Serve(0xc208042a20, 0x7f835b79a568, 0xc2085044a0, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:1728 +0x92
github.com/docker/docker/api/server.(*HTTPServer).Serve(0xc208504580, 0x0, 0x0)
        /go/src/github.com/docker/docker/api/server/server.go:113 +0x4d
github.com/docker/docker/api/server.func·006(0xc208504580)
        /go/src/github.com/docker/docker/api/server/server.go:86 +0x163
created by github.com/docker/docker/api/server.(*Server).ServeAPI
        /go/src/github.com/docker/docker/api/server/server.go:90 +0x144

goroutine 14 [select]:
github.com/docker/libnetwork.(*controller).watchLoop(0xc2080a0370)
        /go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/store.go:399 +0x13f
created by github.com/docker/libnetwork.(*controller).startWatch
        /go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/store.go:413 +0xf0

goroutine 39 [select]:
github.com/docker/libnetwork/osl.removeUnusedPaths()
        /go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/osl/namespace_linux.go:73 +0x48b
created by github.com/docker/libnetwork/osl.createBasePath
        /go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/osl/namespace_linux.go:58 +0xb1

Describe the results you expected:

Daemon starts, all containers start.

Provide additional info you think is important:

We haven't restarted the Docker daemon for two weeks. Maybe there are too many containers deployed now? But we have already had daemons with 300+ containers running.
Maybe something went wrong during shutdown, or the VM was terminated? The VM gets automatically recreated/reprovisioned and the former volume gets reattached. Could there be some issue with a hard kill?
In any case, the crash when relaunching the daemon doesn't look too good and does look like a real issue.

Do you have any short-term remedy proposals? Is there a way to force a cleanup? Or some way to start the daemon, but not all of the containers (all were started with restart policy = always)?

In short, we see tons of:

* getNetworkFromStore for nid failed while trying to build sandbox for cleanup: network not found 
* failed getting network for ep during sandbox delete: network not found 
* failed to delete sandbox while trying to cleanup: could not cleanup all the endpoints in container / sandbox

... and then it crashes in:

* github.com/docker/libnetwork.epHeap.Less (Docker v1.9) or
* github.com/docker/libnetwork.(*sandbox).clearNetworkResources (Docker 1.10)

@mavenugo
Contributor

@vlerenc it seems like the local db (/var/lib/docker/network/files/local-kv.db) is corrupted. Could you please take a backup of that file (and share it with us)? Once it's backed up, please remove this file and restart the daemon.
We have to identify how and when the file got corrupted.
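The steps above, roughly, as a sketch (paths assume the default data root; with a custom --graph the file lives under <graph>/network/files/ instead, and stop/start the daemon however you normally do):

# stop the daemon first
service docker stop
# back up the network store, then remove it
cp /var/lib/docker/network/files/local-kv.db /tmp/local-kv.db.backup
rm /var/lib/docker/network/files/local-kv.db
# restart the daemon with your usual flags
service docker start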

@vlerenc
Author

vlerenc commented Feb 25, 2016

Yes, it seems so. After removing the file, I could restart the daemon again.

You don't know how much you saved my day. Thank you very much!

What are the consequences of removing the file? Does it impact anything?

The second question is why it happened all of a sudden. The VMs were regularly updated, i.e. the daemon was shut down every couple of weeks. Usually, with that many containers, the daemon kills most of the containers, right? Most likely, the VM updater even killed the daemon itself. Docker was (for the past couple of months) on the same 1.9 release, so no change there.

Besides the question on the impact of the now deleted network DB file, how can I make the corrupted file available to you?

And thank you again for your prompt help!

@mavenugo
Contributor

What are the consequences of removing the file? Does it impact anything?

If you are using custom bridge networks (other than the default docker0 network), those need to be recreated manually. But it seems that is not the case for you, so you won't see any functionality issues. Other than that, this file holds the resources to be cleaned up if they were not cleaned up properly when the daemon was brought down ungracefully (which seems to be your case).
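If you ever do need to recreate a custom bridge network later, it is just a matter of re-running the create command and reconnecting the containers, e.g. (network and container names are only examples):

docker network create --driver bridge my-bridge
docker network connect my-bridge my-container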

The second question is why it happened all of a sudden.

It's a file corruption due to some event (could be timing related). The best way to find out would be to look at the daemon logs (preferably with debug enabled via the -D flag). Thanks to the config reload feature in 1.10, one can enable the daemon debug flag for a running daemon. PTAL: https://github.com/docker/docker/blob/master/docs/reference/commandline/daemon.md#configuration-reloading & https://github.com/docker/docker/blob/master/docs/reference/commandline/daemon.md#daemon-configuration-file
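For example, a minimal sketch (this assumes the default configuration file location /etc/docker/daemon.json and the pidfile from your launch command above; adjust both for your setup):

# write (or extend) the daemon configuration file
cat > /etc/docker/daemon.json <<'EOF'
{
    "debug": true
}
EOF
# tell the running daemon to reload its configuration
kill -SIGHUP $(cat /var/vcap/sys/run/docker/docker.pid)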

You don't know how much you saved my day. Thank you very much!

You don't know how much you are helping us by reporting the issue so that we can resolve it in the next release. Thank you very much. OSS FTW!

@thaJeztah
Member

@mavenugo @vlerenc shall we close this issue, or do you want to keep it open for tracking? (I think it's already on your radar, but just to be sure.)

@mavenugo
Contributor

@vlerenc

how can I make the corrupted file available to you?

Easiest would be to put it in Dropbox/Box or another file-sharing site and share the link here, or send the link to my email madhu@docker.com

Also, can you please confirm whether you have created a custom docker network using docker network create and are using it in your containers?
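One quick way to check is to list the networks the daemon knows about; anything beyond the default bridge, host, and none entries would be a user-created network:

docker network ls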

@mavenugo
Contributor

@thaJeztah this is already on the radar, yes, but we don't have a solid use case to reproduce it yet. We can close this, but let's dupe it to the other open issue appropriately.

@vlerenc
Author

vlerenc commented Feb 26, 2016

@mavenugo I sent the link to your e-mail address. Thank you for looking into it!

@mavenugo
Contributor

@vlerenc can you please answer this question as well:

Also, can you please confirm whether you have created a custom docker network using docker network create and are using it in your containers?

@vlerenc
Author

vlerenc commented Feb 26, 2016

@mavenugo No, we haven't.

@mavenugo
Contributor

mavenugo commented Mar 3, 2016

@vlerenc thanks for sharing the corrupted local-kv.db. I analyzed it and found that there are/were a bunch of containers running in an overlay network. Now that you have replaced the local-kv.db, you may not see that anymore. But I see a bunch of sandboxes attached to an overlay network and a corresponding KV store.
All the network d48a6206205146ee2e1ca9ae0ce6f5ea1c6b2b1a64d739f771bccfd0ac5970e1 not found
errors are expected if the daemon doesn't have connectivity to the external KV store you might have once used. The panic is a different story, though.
At least, this helps answer my question:

Also, can you please confirm whether you have created a custom docker network using docker network create and are using it in your containers?

and the answer is yes. Can you please confirm whether there is a disconnect?

@mavenugo
Contributor

mavenugo commented Mar 3, 2016

@vlerenc can you please attach the traceback for

github.com/docker/libnetwork.(*sandbox).clearNetworkResources (Docker 1.10) ?

The easiest thing to do is to (a rough shell sketch follows the list):

  1. Stop the daemon.
  2. Take a backup of your /var/lib/docker/network/files/local-kv.db.
  3. Replace it with the corrupted one that you sent me.
  4. Restart the daemon & capture the trace log.
  5. Restore the backed-up local-kv.db.
  6. Restart the daemon.
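Something along these lines; the paths are placeholders, so use the location that matches your --graph setting, and stop/start the daemon however you normally do:

# with the daemon stopped, keep the current, working store safe
cp /var/lib/docker/network/files/local-kv.db /tmp/local-kv.db.good
# put the corrupted store back in its place
cp /path/to/corrupted-local-kv.db /var/lib/docker/network/files/local-kv.db
# start the daemon with debug enabled and capture the traceback
docker daemon -D [your usual flags] 2>&1 | tee /tmp/docker-trace.log
# once the panic is captured, stop the daemon, restore the working store, and start it again
cp /tmp/local-kv.db.good /var/lib/docker/network/files/local-kv.db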

@mavenugo
Contributor

mavenugo commented Mar 3, 2016

@vlerenc never mind, I found the issue and am fixing it now.

For confirmation, this is not a file corruption issue. Rather, it is a scenario in which a global KV store was used to try out multi-host networking, followed by an unclean daemon shutdown and a restart of the daemon with the KV store down or not configured.

Thanks for reporting and sharing all the requested information.

@vlerenc
Author

vlerenc commented Mar 4, 2016

@mavenugo Glad you found it! As for your question, I am sorry: we are using https://github.com/cloudfoundry-community/cf-containers-broker, which uses https://github.com/swipely/docker-api, and when checking I didn't see any "special network" code in https://github.com/cloudfoundry-community/cf-containers-broker/blob/master/app/models/docker_manager.rb. So I am actually puzzled by your finding. Anyhow, we have hundreds of users and containers, and after deleting the DB and restarting, nobody complained; the existing and now restarted containers run just fine. They all run on their own, reached only through the exposed port by the applications that use them.
