Native Swarm in 1.12 - panic: runtime error: index out of range #25608

Closed
b0ch3nski opened this issue Aug 11, 2016 · 33 comments · Fixed by #25987 or #31091
Labels
area/networking area/swarm kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. version/1.12

@b0ch3nski

Output of docker version:

Client:
 Version:      1.12.0
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   8eab29e
 Built:        
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.0
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   8eab29e
 Built:        
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 1
 Running: 0
 Paused: 0
 Stopped: 1
Images: 2
Server Version: 1.12.0
Storage Driver: devicemapper
 Pool Name: docker-253:0-1970762-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 10.46 GB
 Data Space Total: 107.4 GB
 Data Space Available: 28.07 GB
 Metadata Space Used: 6.238 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.141 GB
 Thin Pool Minimum Free Space: 10.74 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /mnt/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
 Metadata loop file: /mnt/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.107-RHEL7 (2016-06-09)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: overlay bridge null host
Swarm: active
 NodeID: 2vyiapx1ngzr5s0lj87x3nqxd
 Is Manager: false
 Node Address: 172.28.128.18
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 3.10.0-327.22.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 993 MiB
Name: dt-dev-2
ID: 7IDM:SHKD:MMTJ:USU6:U7BD:CZAK:P3QO:IXFM:UCM3:S2LB:OIJP:GW6P
Docker Root Dir: /mnt/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):

CentOS7 @ VirtualBox (via Vagrant)

Additional information:

I had a Swarm set up with 2 nodes (and 2 different services started). I restarted one of the nodes and was surprised to see that the restarted node now hosts both services (they were auto-balanced before).
It turns out that Docker on the machine that was NOT restarted died - this is the stack trace:

Aug 11 06:13:28 dt-dev-2 dockerd[17100]: time="2016-08-11T06:13:28.138941967Z" level=warning msg="cannot find proper key indices while processing key upda
Aug 11 08:31:24 dt-dev-2 dockerd[17100]: time="2016-08-11T08:31:24.552409257Z" level=error msg="agent: session failed" error="rpc error: code = 13 desc = 
Aug 11 08:31:24 dt-dev-2 dockerd[17100]: time="2016-08-11T08:31:24Z" level=info msg="Firewalld running: false"
Aug 11 08:31:24 dt-dev-2 dockerd[17100]: time="2016-08-11T08:31:24Z" level=info msg="Firewalld running: false"
Aug 11 08:31:27 dt-dev-2 dockerd[17100]: time="2016-08-11T08:31:27.274591339Z" level=info msg="2016/08/11 08:31:27 [INFO] memberlist: Suspect dt-dev-1 has
Aug 11 08:31:29 dt-dev-2 dockerd[17100]: time="2016-08-11T08:31:29.274701535Z" level=info msg="2016/08/11 08:31:29 [INFO] memberlist: Suspect dt-dev-1 has
Aug 11 08:31:29 dt-dev-2 dockerd[17100]: time="2016-08-11T08:31:29.274859714Z" level=error msg="periodic bulk sync failure for network 1ms79jc1lu6z29l9ra0
Aug 11 08:31:29 dt-dev-2 dockerd[17100]: time="2016-08-11T08:31:29.625202513Z" level=error msg="agent: session failed" error="session initiation timed out
Aug 11 08:31:31 dt-dev-2 dockerd[17100]: time="2016-08-11T08:31:31.274195993Z" level=info msg="2016/08/11 08:31:31 [INFO] memberlist: Suspect dt-dev-1 has
Aug 11 08:31:32 dt-dev-2 dockerd[17100]: time="2016-08-11T08:31:32.274488016Z" level=warning msg="cannot find proper key indices while processing key upda
Aug 11 08:31:32 dt-dev-2 dockerd[17100]: time="2016-08-11T08:31:32.275195889Z" level=info msg="2016/08/11 08:31:32 [INFO] memberlist: Suspect dt-dev-1 has
Aug 11 08:31:32 dt-dev-2 dockerd[17100]: time="2016-08-11T08:31:32.275233408Z" level=info msg="2016/08/11 08:31:32 [INFO] memberlist: Marking dt-dev-1 as 
Aug 11 08:31:32 dt-dev-2 dockerd[17100]: time="2016-08-11T08:31:32.286804614Z" level=error msg="container status unavailable" error="context canceled" mod
Aug 11 08:43:35 dt-dev-2 dockerd[17100]: time="2016-08-11T08:43:35.005839511Z" level=error msg="agent: session failed" error="rpc error: code = 13 desc = 
Aug 11 08:43:40 dt-dev-2 dockerd[17100]: time="2016-08-11T08:43:40.990620593Z" level=error msg="agent: session failed" error="session initiation timed out
Aug 11 08:43:42 dt-dev-2 dockerd[17100]: time="2016-08-11T08:43:42.520821682Z" level=warning msg="cannot find proper key indices while processing key upda
Aug 11 08:43:42 dt-dev-2 dockerd[17100]: time="2016-08-11T08:43:42.555826038Z" level=error msg="container status unavailable" error="context canceled" mod
Aug 11 08:47:25 dt-dev-2 dockerd[17100]: time="2016-08-11T08:47:25.336033855Z" level=error msg="agent: session failed" error="rpc error: code = 13 desc = 
Aug 11 08:47:31 dt-dev-2 dockerd[17100]: time="2016-08-11T08:47:31.229495064Z" level=error msg="agent: session failed" error="session initiation timed out
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: panic: runtime error: index out of range
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: goroutine 870 [running]:
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: panic(0x1a6fa00, 0xc82000e020)
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: /usr/local/go/src/runtime/panic.go:481 +0x3e6
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: github.com/docker/libnetwork/drivers/overlay.updateNodeKey(0xc820c300f0, 0x10, 0x10, 0xc820c30100, 0x10, 0x10, 0x
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/drivers/overlay/encryption.go:496 +0x1
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: github.com/docker/libnetwork/drivers/overlay.(*driver).updateKeys.func1(0xc820970e20, 0xd, 0xc820be9400, 0x3, 0x3
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/drivers/overlay/encryption.go:444 +0x1
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: github.com/docker/libnetwork/drivers/overlay.(*driver).secMapWalk(0xc820082e00, 0xc821b9cfc0, 0x0, 0x0)
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/drivers/overlay/encryption.go:382 +0x1
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: github.com/docker/libnetwork/drivers/overlay.(*driver).updateKeys(0xc820082e00, 0xc820aba1e0, 0xc820aba200, 0xc82
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/drivers/overlay/encryption.go:445 +0x8
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: github.com/docker/libnetwork/drivers/overlay.(*driver).DiscoverNew(0xc820082e00, 0x4, 0x1b9a4e0, 0xc821442c00, 0x
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/drivers/overlay/overlay.go:351 +0xbbc
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: github.com/docker/libnetwork.(*controller).handleKeyChange.func1(0x1d53098, 0x7, 0x7f7542cd07e8, 0xc820082e00, 0x
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/agent.go:151 +0x94
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: github.com/docker/libnetwork/drvregistry.(*DrvRegistry).WalkDrivers(0xc8201cf1d0, 0xc821b9d6e0)
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/drvregistry/drvregistry.go:107 +0x2ce
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: github.com/docker/libnetwork.(*controller).handleKeyChange(0xc8203f43c0, 0xc8210b9840, 0x6, 0x8, 0x0, 0x0)
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/agent.go:156 +0x836
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: github.com/docker/libnetwork.(*controller).SetKeys(0xc8203f43c0, 0xc8210b9840, 0x6, 0x8, 0x0, 0x0)
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/controller.go:275 +0x35f
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: github.com/docker/docker/daemon.(*Daemon).SetNetworkBootstrapKeys(0xc8203656c0, 0xc8210b9840, 0x6, 0x8, 0x0, 0x0)
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/network.go:185 +0x6d
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: github.com/docker/docker/daemon/cluster/executor/container.(*executor).SetNetworkBootstrapKeys(0xc820c10fb0, 0xc8
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/cluster/executor/container/executo
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: github.com/docker/swarmkit/agent.(*Agent).handleSessionMessage(0xc8209c6850, 0x7f7542cc6a98, 0xc820bb9dd0, 0xc820
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/swarmkit/agent/agent.go:272 +0x5ca
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: github.com/docker/swarmkit/agent.(*Agent).run(0xc8209c6850, 0x7f7542cc6a98, 0xc820bb9dd0)
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/swarmkit/agent/agent.go:170 +0xbe6
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: created by github.com/docker/swarmkit/agent.(*Agent).Start
Aug 11 08:47:32 dt-dev-2 dockerd[17100]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/swarmkit/agent/agent.go:81 +0x37d
Aug 11 08:47:32 dt-dev-2 systemd[1]: docker.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 11 08:47:32 dt-dev-2 systemd[1]: Unit docker.service entered failed state.
Aug 11 08:47:32 dt-dev-2 systemd[1]: docker.service failed.
@thaJeztah
Member

Thanks for reporting

ping @LK4D4 @cpuguy83 PTAL

@thaJeztah thaJeztah added this to the 1.12.1 milestone Aug 11, 2016
@thaJeztah thaJeztah added kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. priority/P1 Important: P1 issues are a top priority and a must-have for the next release. area/swarm labels Aug 11, 2016
@thaJeztah
Member

/cc @mrjana see libnetwork in there as well

@sanimej

sanimej commented Aug 11, 2016

@b0ch3nski Did you create the cluster with the 1.12 release of Docker, or did you have an earlier 1.12 RC running and upgrade to the 1.12 release binary?

@dperny
Contributor

dperny commented Aug 11, 2016

Yeah, this is a known bug in the upgrade from 1.12 RC(3? I think? Maybe 4?) to 1.12 GA.

@b0ch3nski
Author

@sanimej @dperny - no, this was a brand-new cluster initialized on a clean install of the Docker 1.12 mainline version on a fresh CentOS 7.
My test environment is set up by a Vagrant + Ansible combo, so I can easily recreate it if you need additional information.

@aboch
Contributor

aboch commented Aug 13, 2016

@b0ch3nski
If possible, can you please reproduce after starting the daemon with the -D option? I need to check a few related DEBUG-level logs to get more context. Also, please make sure the log lines are not truncated when you post again.

An earlier log line shows something that should not have happened:

Aug 11 08:43:42 dt-dev-2 dockerd[17100]: time="2016-08-11T08:43:42.520821682Z" level=warning msg="cannot find proper key indices while processing key upda

It seems related to what @sanimej and @dperny pointed out, but you confirmed this is from a fresh install. I am not sure; maybe some stale state from a previous run with the RC images was left in the /var/lib/docker/swarm directory.

After you reproduce, it would also be worth checking whether the issue is still reproducible when starting the cluster after first removing the above directory on all the cluster nodes.

Thanks.
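For reference, a minimal sketch of one way to enable the requested debug logging on a systemd host such as these CentOS 7 VMs (passing -D on the dockerd command line is equivalent; the daemon.json path is the stock default, and the echo below would overwrite any existing file):

echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json   # caution: replaces any existing daemon.json
sudo systemctl restart docker
docker info | grep -i 'debug mode'                            # should now report "Debug Mode (server): true"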

@b0ch3nski
Author

@aboch I will try to reproduce tomorrow, with -D set.
I'm sorry for the truncated logs - I didn't notice that. It was a copy-paste from journalctl -u docker.

@b0ch3nski
Author

@aboch Well, I've recreated the same environment with debugging enabled - I've tried restarting the Docker service on both machines, joining/leaving the Swarm, etc., but I couldn't reproduce that crash...
Do you have any idea how I could trigger it again? I'm 100% sure the env is the same, as it was set up by Ansible + Vagrant.

@tiborvass tiborvass modified the milestones: 1.12.1, 1.12.2 Aug 30, 2016
@aboch
Contributor

aboch commented Aug 31, 2016

@b0ch3nski
I tried multiple times today with the debug logs enabled and was not able to reproduce.
But eventually I hit the issue when I was not expecting it, and I had no debug logs enabled.
I am testing with the latest code, which includes a change in the keymanager that increases the key rotation frequency.
ATM, I still don't know how to trigger the issue.

@b0ch3nski
Author

@aboch
Unfortunately, I was also not able to reproduce the issue again.
I now keep all my environments in debug mode, hoping to see it again.

@fabiop

fabiop commented Sep 8, 2016

Hi, not sure if this is related or should be filed elsewhere.

I'm running a 1.12.1 swarm cluster with 5 managers and 2 nodes, hosting one service.

Containers are normally deployed all across the cluster, between managers and nodes.

After a random number of minutes, Docker on one or both of the nodes goes down. The rest of the cluster (meaning all the managers) is fine; the containers that were hosted on the dead node(s) are rescheduled onto the managers, so the total number of containers in the cluster remains the same (once the moved ones become available).

'docker node ls' on the managers shows the node(s) status as "Down".

root@xxxxxxx-public-manager-i-77acb8fb:~# docker node ls
ID                           HOSTNAME                           STATUS  AVAILABILITY  MANAGER STATUS
3s0u1btbzvh9qcoesxiwforff    xxxxxxx-public-manager-i-f2a5007f  Ready   Active        Reachable
7ycxht9cas1lhqai20deg70r5    xxxxxxx-public-worker-i-15862298   Down    Active
9c3eyuqde1gxz1ubfwl4pogx6    xxxxxxx-public-manager-i-47c86dca  Ready   Active        Reachable
b39d59z7g6m0dz00fep2jxbwx    xxxxxxx-public-worker-i-0d5cbd3c   Down    Active
bftd5su9tn5stcyxv2z7q67zr    xxxxxxx-public-manager-i-0ba1403a  Ready   Active        Reachable
c7sz4lhvauybbihhd3bg8yfzn *  xxxxxxx-public-manager-i-77acb8fb  Ready   Active        Leader
ex9c362g7q9ed5vb0hnf1oqht    xxxxxxx-public-manager-i-8daf4ebc  Ready   Active        Reachable
root@xxxxxxx-public-manager-i-77acb8fb:~#

SSHing into the nodes and running 'docker ps' hangs for a couple of seconds; that then wakes Docker up again and it outputs the docker ps header with no containers.

The outcome above has been reproduced a number of times, on brand-new clusters deployed on different AWS instances (same config, scripted via Terraform).

The context is:
AWS, 1 ASG for managers, 1 ASG for nodes, Ubuntu 16.04.1 LTS on all instances.

Log output and debug info:

This is the log on one of the nodes after going down:

root@xxxxxxx-public-worker-i-d3cb6e5e:~# journalctl -u docker
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c dockerd[5511]: panic: runtime error: slice bounds out of range
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c dockerd[5511]: goroutine 64552 [running]:
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c dockerd[5511]: panic(0x1a84380, 0xc82000e050)
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c dockerd[5511]:         /usr/local/go/src/runtime/panic.go:481 +0x3e6
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c dockerd[5511]: bytes.(*Buffer).grow(0xc821601980, 0x1d9, 0xc821e3bcb0)
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c dockerd[5511]:         /usr/local/go/src/bytes/buffer.go:112 +0x141
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c dockerd[5511]: bytes.(*Buffer).Write(0xc821601980, 0xc8211ec000, 0x1d9, 0x8000, 0x19, 0x0, 0x0)
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c dockerd[5511]:         /usr/local/go/src/bytes/buffer.go:134 +0x4b
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c dockerd[5511]: github.com/docker/docker/daemon.(*limitedBuffer).Write(0xc821601980, 0xc8211ec000, 0x1d9, 0x8000, 0x1d9, 0x0, 0x0)
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c dockerd[5511]:         /usr/src/docker/.gopath/src/github.com/docker/docker/daemon/health.go:283 +0xae
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c dockerd[5511]: io.copyBuffer(0x7f4c2dd09270, 0xc821601980, 0x7f4c2dd08d68, 0xc820def540, 0xc8211ec000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c dockerd[5511]:         /usr/local/go/src/io/io.go:382 +0x2c9
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c dockerd[5511]: io.Copy(0x7f4c2dd09270, 0xc821601980, 0x7f4c2dd08d68, 0xc820def540, 0xc820def540, 0x0, 0x0)
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c dockerd[5511]:         /usr/local/go/src/io/io.go:350 +0x64
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c dockerd[5511]: github.com/docker/docker/container.AttachStreams.func2(0x1d6a480, 0x6, 0x7f4c2dd09270, 0xc821601980, 0x7f4c2dd08d38, 0xc820def540)
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c dockerd[5511]:         /usr/src/docker/.gopath/src/github.com/docker/docker/container/container.go:433 +0x1cc
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c dockerd[5511]: created by github.com/docker/docker/container.AttachStreams
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c dockerd[5511]:         /usr/src/docker/.gopath/src/github.com/docker/docker/container/container.go:450 +0x37e
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c systemd[1]: docker.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c systemd[1]: docker.service: Unit entered failed state.
Sep 08 10:21:46 xxxxxxx-public-worker-i-0d5cbd3c systemd[1]: docker.service: Failed with result 'exit-code'.

Docker version:

root@xxxxxxx-public-worker-i-0d5cbd3c:~# docker version
Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:33:38 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:33:38 2016
 OS/Arch:      linux/amd64
root@xxxxxxx-public-worker-i-0d5cbd3c:~#

Docker info:

root@xxxxxxx-public-worker-i-0d5cbd3c:~# docker info
Containers: 9
 Running: 0
 Paused: 0
 Stopped: 9
Images: 1
Server Version: 1.12.1
Storage Driver: devicemapper
 Pool Name: docker-202:1-183116-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 138 MB
 Data Space Total: 107.4 GB
 Data Space Available: 7.866 GB
 Metadata Space Used: 962.6 kB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.147 GB
 Thin Pool Minimum Free Space: 10.74 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.110 (2015-10-30)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null bridge overlay host
Swarm: active
 NodeID: b39d59z7g6m0dz00fep2jxbwx
 Is Manager: false
 Node Address: <redacted>
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-36-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.303 GiB
Name: xxxxxxx-public-worker-i-0d5cbd3c
ID: OQVN:WWEN:54EJ:CMHS:DSHA:CW6E:IS7J:LD5H:HLHF:HSWU:H3BH:EFN7
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8
root@xxxxxxx-public-worker-i-0d5cbd3c:~#

Please let me know if any other debug info is needed, and thanks for helping!

@cpuguy83
Member

cpuguy83 commented Sep 8, 2016

@fabiop Totally different. See #26384

@fabiop

fabiop commented Sep 8, 2016

@cpuguy83 thanks

@aluzzardi
Member

Hey @aboch - assigning to you since you have the most context on this one

@mmh36

mmh36 commented Sep 19, 2016

Hi,

Have seen a similar error in our swarm cluster. Please point me in the right direction if this is not related.

The context is:
Running two nodes, both of which are managers. Both servers are running RHEL 7. The server that crashed is not the leader.

$ docker node ls 
ID                           HOSTNAME    STATUS  AVAILABILITY  MANAGER STATUS
44nxys0oe1n1uoxl3bld7e283 *  *****04173  Ready   Active        Leader
4gtiufe3o778iw9sc1lgxf3gr    *****04175  Ready   Active        Unreachable

Linux kernel:

$ uname -a 
Linux deala04175 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

docker version:

Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        
 OS/Arch:      linux/amd64

docker info:

$ docker info 
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 10
Server Version: 1.12.1
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null host overlay
Swarm: active
 NodeID: 4gtiufe3o778iw9sc1lgxf3gr
 Is Manager: true
 ClusterID: e6resiaf7j120vggk7iz2w5nj
 Managers: 2
 Nodes: 2
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.41.72.186
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 3.10.0-327.22.2.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.2 (Maipo)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.51 GiB
Name: deala04175
ID: ZB4V:EKRK:73ZA:RJY7:EQKZ:G6WP:J2UN:IK4M:LMNM:N4TZ:M4QL:QFPH
Docker Root Dir: /******/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
 127.0.0.0/8

Last few lines from journalctl -u docker

Sep 16 12:35:58 *****04175 docker[19668]: time="2016-09-16T12:35:58.842968592+01:00" level=error msg="Bulk sync to node *****04173 timed out"
Sep 16 12:36:28 *****04175 docker[19668]: time="2016-09-16T12:36:28.843918642+01:00" level=error msg="Bulk sync to node *****04173 timed out"
Sep 16 12:36:58 *****04175 docker[19668]: time="2016-09-16T12:36:58.844953459+01:00" level=error msg="Bulk sync to node *****04173 timed out"
Sep 16 12:37:10 *****04175 docker[19668]: panic: runtime error: index out of range
Sep 16 12:37:10 *****04175 docker[19668]: goroutine 320 [running]:
Sep 16 12:37:10 *****04175 docker[19668]: panic(0x1a7edc0, 0xc82000e020)
Sep 16 12:37:10 *****04175 docker[19668]: /usr/local/go/src/runtime/panic.go:481 +0x3e6
Sep 16 12:37:10 *****04175 docker[19668]: github.com/docker/libnetwork/drivers/overlay.updateNodeKey(0xc821d85bd0, 0x10, 0x10, 0xc821d85be0, 0x10, 0x10, 0xc8219be930, 0x5, 0x5, 0xc821c9b680, ...)
Sep 16 12:37:10 *****04175 docker[19668]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/drivers/overlay/encryption.go:496 +0x13a1
Sep 16 12:37:10 *****04175 docker[19668]: github.com/docker/libnetwork/drivers/overlay.(*driver).updateKeys.func1(0xc822d9e850, 0xc, 0xc8219be930, 0x5, 0x5, 0x0, 0x0, 0x0, 0x66e466)
Sep 16 12:37:10 *****04175 docker[19668]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/drivers/overlay/encryption.go:444 +0x174
Sep 16 12:37:10 *****04175 docker[19668]: github.com/docker/libnetwork/drivers/overlay.(*driver).secMapWalk(0xc82044e700, 0xc82116ed98, 0x0, 0x0)
Sep 16 12:37:10 *****04175 docker[19668]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/drivers/overlay/encryption.go:382 +0x140
Sep 16 12:37:10 *****04175 docker[19668]: github.com/docker/libnetwork/drivers/overlay.(*driver).updateKeys(0xc82044e700, 0xc82376fbe0, 0xc82376fc00, 0xc82376fd00, 0x0, 0x0)
Sep 16 12:37:10 *****04175 docker[19668]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/drivers/overlay/encryption.go:445 +0x863
Sep 16 12:37:10 *****04175 docker[19668]: github.com/docker/libnetwork/drivers/overlay.(*driver).DiscoverNew(0xc82044e700, 0x4, 0x1ba9ca0, 0xc821c9b620, 0x0, 0x0)
Sep 16 12:37:10 *****04175 docker[19668]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/drivers/overlay/overlay.go:351 +0xbbc
Sep 16 12:37:10 *****04175 docker[19668]: github.com/docker/libnetwork.(*controller).handleKeyChange.func1(0x1d62ff8, 0x7, 0x7f2ae71792a0, 0xc82044e700, 0x1d605d8, 0x6, 0x0)
Sep 16 12:37:10 *****04175 docker[19668]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/agent.go:157 +0x94
Sep 16 12:37:10 *****04175 docker[19668]: github.com/docker/libnetwork/drvregistry.(*DrvRegistry).WalkDrivers(0xc8203bdf80, 0xc82116f4d0)
Sep 16 12:37:10 *****04175 docker[19668]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/drvregistry/drvregistry.go:107 +0x2ce
Sep 16 12:37:10 *****04175 docker[19668]: github.com/docker/libnetwork.(*controller).handleKeyChange(0xc8204faff0, 0xc822250540, 0x6, 0x8, 0x0, 0x0)
Sep 16 12:37:10 *****04175 docker[19668]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/agent.go:162 +0x8a6
Sep 16 12:37:10 *****04175 docker[19668]: github.com/docker/libnetwork.(*controller).SetKeys(0xc8204faff0, 0xc822250540, 0x6, 0x8, 0x0, 0x0)
Sep 16 12:37:10 *****04175 docker[19668]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/controller.go:287 +0x755
Sep 16 12:37:10 *****04175 docker[19668]: github.com/docker/docker/daemon.(*Daemon).SetNetworkBootstrapKeys(0xc82035ba00, 0xc822250540, 0x6, 0x8, 0x0, 0x0)
Sep 16 12:37:10 *****04175 docker[19668]: /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/network.go:185 +0x6d
Sep 16 12:37:10 *****04175 docker[19668]: github.com/docker/docker/daemon/cluster/executor/container.(*executor).SetNetworkBootstrapKeys(0xc820b63570, 0xc822127dc0, 0x6, 0x8, 0x0, 0x0)
Sep 16 12:37:10 *****04175 docker[19668]: /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/cluster/executor/container/executor.go:144 +0x2e1
Sep 16 12:37:10 *****04175 docker[19668]: github.com/docker/swarmkit/agent.(*Agent).handleSessionMessage(0xc82042ee70, 0x7f2ae71c5d70, 0xc820bea870, 0xc821102b90, 0x0, 0x0)
Sep 16 12:37:10 *****04175 docker[19668]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/swarmkit/agent/agent.go:268 +0x5ca
Sep 16 12:37:10 *****04175 docker[19668]: github.com/docker/swarmkit/agent.(*Agent).run(0xc82042ee70, 0x7f2ae71c5d70, 0xc820bea870)
Sep 16 12:37:10 *****04175 docker[19668]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/swarmkit/agent/agent.go:170 +0xbe6
Sep 16 12:37:10 *****04175 docker[19668]: created by github.com/docker/swarmkit/agent.(*Agent).Start
Sep 16 12:37:10 *****04175 docker[19668]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/swarmkit/agent/agent.go:81 +0x37d
Sep 16 12:37:10 *****04175 systemd[1]: docker.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Sep 16 12:37:10 *****04175 systemd[1]: Unit docker.service entered failed state.
Sep 16 12:37:10 *****04175 systemd[1]: docker.service failed.
Sep 16 22:20:50 *****04175 systemd[1]: Starting Docker Application Container Engine...
Sep 16 22:20:50 *****04175 docker[12668]: time="2016-09-16T22:20:50.335424793+01:00" level=info msg="libcontainerd: previous instance of containerd still alive (19677)"

Service status before it was restarted

$ sudo systemctl status docker 
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─docker.conf
   Active: activating (start) since Fri 2016-09-16 22:20:50 BST; 2 days ago
     Docs: https://docs.docker.com
 Main PID: 12668 (dockerd)
   Memory: 7.6M
   CGroup: /system.slice/docker.service
           └─12668 dockerd -s overlay --graph=/*****/docker

Sep 16 22:20:50 *****04175 systemd[1]: Starting Docker Application Container Engine...
Sep 16 22:20:50 *****04175 docker[12668]: time="2016-09-16T22:20:50.335424793+01:00" level=info msg="libcontainerd: previous instance of container... (19677)"
Hint: Some lines were ellipsized, use -l to show in full.

Please let me know if you require any further information

@aboch
Contributor

aboch commented Sep 19, 2016

@mmh36 @evanp If you still have the daemon logs, can you please check for the presence of log lines like
cannot find proper key indices [...] and post the logs before and after those lines? Thanks.

I found an issue and was finally able to find a way to reproduce the panic; I am working on the fix now.
But I am trying to understand whether your problem falls into the same category.

My issue arises when a node leaves the cluster but is not shut down. It then reconnects to the cluster after the cluster has rotated the encryption keys, and a service task is placed on this node on an encrypted network. At the next (or the one after that) cluster encryption key rotation, this node will panic.
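For anyone trying to reproduce this, a rough sketch of that sequence using only stock CLI commands; the network/service names and the nginx image are placeholders, and the temporary iptables block of the gossip port is only an approximation of a node "leaving" the cluster while its daemon keeps running:

# on a manager: an encrypted overlay network with a service attached to it
docker network create --driver overlay --opt encrypted encnet
docker service create --name web --replicas 2 --network encnet nginx

# on the worker under test: drop swarm gossip traffic so the other nodes mark it as failed
sudo iptables -I INPUT -p tcp --dport 7946 -j DROP
sudo iptables -I INPUT -p udp --dport 7946 -j DROP
# ...wait long enough for the managers to rotate the encryption keys...

# restore connectivity so the node reconnects with its stale key state
sudo iptables -D INPUT -p tcp --dport 7946 -j DROP
sudo iptables -D INPUT -p udp --dport 7946 -j DROP

# get a task of the encrypted-network service scheduled onto this node, then wait for
# the next key rotation(s) while watching the daemon log
docker service scale web=3
journalctl -fu docker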

@evanp

evanp commented Sep 20, 2016

@aboch it was in the issue I posted before.

https://github.com/docker/docker/files/479181/dockerfail.txt

@b0ch3nski
Author

Well, it appears that the same issue came back to me today, so I think we can reopen this thread...

Output of docker version:

Client:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 4
 Running: 2
 Paused: 0
 Stopped: 2
Images: 3
Server Version: 1.12.3
Storage Driver: devicemapper
 Pool Name: docker-253:0-202363761-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 548.8 MB
 Data Space Total: 107.4 GB
 Data Space Available: 38.1 GB
 Metadata Space Used: 1.561 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.146 GB
 Thin Pool Minimum Free Space: 10.74 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /mnt/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
 Metadata loop file: /mnt/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.107-RHEL7 (2016-06-09)
Logging Driver: journald
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null overlay host bridge
Swarm: active
 NodeID: 5fv1tcpiq2k5g9f6fbt0s3ryu
 Error: rpc error: code = 2 desc = raft: no elected cluster leader
 Is Manager: true
 ClusterID: 
 Managers: 0
 Nodes: 0
 Orchestration:
  Task History Retention Limit: 0
 Raft:
  Snapshot Interval: 0
  Heartbeat Tick: 0
  Election Tick: 0
 Dispatcher:
  Heartbeat Period: Less than a second
 CA Configuration:
  Expiry Duration: Less than a second
 Node Address: 10.10.10.10
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 3.10.0-327.36.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 993 MiB
Name: dt-dev-1
ID: A4GT:Q55F:N6YM:4JWP:QM75:ZFSM:AZLW:KQV6:PXD6:QR7T:QR45:ME6O
Docker Root Dir: /mnt/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 52
 Goroutines: 132
 System Time: 2016-11-23T17:46:22.940399748+01:00
 EventsListeners: 2
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
 127.0.0.0/8

The issue log (not truncated this time, with debug enabled):

Nov 23 17:34:47 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:47.608149188+01:00" level=info msg="2016/11/23 17:34:47 [INFO] memberlist: Marking dt-dev-1-c72d7b17a7d2 as failed, suspect timeout reached\n"
Nov 23 17:34:47 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:47.686187695+01:00" level=debug msg="2016/11/23 17:34:47 [DEBUG] memberlist: Initiating push/pull sync with: 10.10.10.10:7946\n"
Nov 23 17:34:48 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:48.219604088+01:00" level=info msg="7e17076cacfa6748 [term: 11] received a MsgVote message with higher term from 448f5c508a943760 [term: 12]"
Nov 23 17:34:48 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:48.219671643+01:00" level=info msg="7e17076cacfa6748 became follower at term 12"
Nov 23 17:34:48 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:48.219705584+01:00" level=info msg="7e17076cacfa6748 [logterm: 11, index: 1146, vote: 0] rejected vote from 448f5c508a943760 [logterm: 11, index: 1145] at term 12"
Nov 23 17:34:48 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:48.219727221+01:00" level=info msg="raft.node: 7e17076cacfa6748 lost leader 25d57d20c23c3a8 at term 12"
Nov 23 17:34:48 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:48.350927299+01:00" level=error msg="agent: session failed" error="rpc error: code = 13 desc = transport is closing" module=agent
Nov 23 17:34:48 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:48.351049511+01:00" level=debug msg="agent: rebuild session" module=agent
Nov 23 17:34:48 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:48.382878993+01:00" level=debug msg="(*session).start" module=agent
Nov 23 17:34:48 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:48.438804084+01:00" level=error msg="agent: session failed" error="rpc error: code = 8 desc = more than one redirect to leader from: [10.10.10.11:39464]" module=agent
Nov 23 17:34:48 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:48.438964412+01:00" level=debug msg="agent: rebuild session" module=agent
Nov 23 17:34:48 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:48.485737131+01:00" level=debug msg="Initiating bulk sync with node dt-dev-1-c72d7b17a7d2 after reconnect"
Nov 23 17:34:48 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:48.485775163+01:00" level=debug msg="dt-dev-2-f43ab5ac0b83: Initiating bulk sync with nodes [dt-dev-1-c72d7b17a7d2]"
Nov 23 17:34:48 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:48.485790495+01:00" level=debug msg="dt-dev-2-f43ab5ac0b83: Initiating unsolicited bulk sync for networks [1h4seoywxmyz7wcq0vbqb0aq4 5w1gmnd45fa16l3c4alvehbik 61ggflb4vns5t94hlx9s09614] with node dt-dev-1-c72d7b17a7d2"
Nov 23 17:34:48 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:48.730998585+01:00" level=debug msg="(*session).start" module=agent
Nov 23 17:34:49 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:49.185940403+01:00" level=debug msg="2016/11/23 17:34:49 [DEBUG] memberlist: Failed UDP ping: dt-dev-1-c72d7b17a7d2 (timeout reached)\n"
Nov 23 17:34:49 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:49.686162656+01:00" level=warning msg="2016/11/23 17:34:49 [ERR] memberlist: Failed TCP fallback ping: read tcp 10.10.10.11:41330->10.10.10.10:7946: i/o timeout\n"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.648925328+01:00" level=info msg="7e17076cacfa6748 is starting a new election at term 12"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.649162745+01:00" level=info msg="7e17076cacfa6748 became candidate at term 13"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.649222741+01:00" level=info msg="7e17076cacfa6748 received vote from 7e17076cacfa6748 at term 13"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.649267758+01:00" level=info msg="7e17076cacfa6748 [logterm: 11, index: 1146] sent vote request to 448f5c508a943760 at term 13"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.649300426+01:00" level=info msg="7e17076cacfa6748 [logterm: 11, index: 1146] sent vote request to 25d57d20c23c3a8 at term 13"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.784164719+01:00" level=info msg="7e17076cacfa6748 received vote from 25d57d20c23c3a8 at term 13"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.784199919+01:00" level=info msg="7e17076cacfa6748 [quorum:2] has received 2 votes and 0 vote rejections"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.784224705+01:00" level=info msg="7e17076cacfa6748 became leader at term 13"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.784241513+01:00" level=info msg="raft.node: 7e17076cacfa6748 elected leader 7e17076cacfa6748 at term 13"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.801180908+01:00" level=debug msg="RequestPool(GlobalDefault, 10.255.0.0/16, , map[], false)"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.801273062+01:00" level=debug msg="RequestAddress(GlobalDefault/10.255.0.0/16, 10.255.0.1, map[])"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.846292340+01:00" level=debug msg="Root CA updated successfully" cluster.id=7mggmaddjbg30ur9wx6mpinv9 method="(*Server).updateCluster" module=ca
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.873088170+01:00" level=debug msg="RequestPool(GlobalDefault, 10.0.0.0/24, , map[], false)"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.873180594+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.0.0/24, 10.0.0.1, map[])"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.887202908+01:00" level=debug msg="RequestPool(GlobalDefault, 11.11.11.0/27, , map[], false)"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.887294309+01:00" level=debug msg="RequestAddress(GlobalDefault/11.11.11.0/27, 11.11.11.1, map[])"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.898967071+01:00" level=debug msg="7e17076cacfa6748 received msgApp rejection(lastindex: 1145) from 448f5c508a943760 for index 1146"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.899168621+01:00" level=debug msg="7e17076cacfa6748 decreased progress of 448f5c508a943760 to [next = 1146, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.903147367+01:00" level=debug msg="RequestAddress(GlobalDefault/10.255.0.0/16, 10.255.0.4, map[])"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.905588555+01:00" level=error msg="agent: session failed" error="rpc error: code = 13 desc = transport is closing" module=agent
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.905803629+01:00" level=debug msg="agent: rebuild session" module=agent
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.906787475+01:00" level=debug msg="2016/11/23 17:34:51 [DEBUG] memberlist: TCP connection from=10.10.10.10:52154\n"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.916101024+01:00" level=debug msg="checkEncryption(5w1gmnd, 10.10.10.12, 256, false)"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.917915624+01:00" level=debug msg="dt-dev-2-f43ab5ac0b83: Bulk sync to node dt-dev-1-c72d7b17a7d2 took 3.42983902s"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.934073209+01:00" level=debug msg="RequestAddress(GlobalDefault/10.255.0.0/16, 10.255.0.3, map[])"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.952329222+01:00" level=debug msg="RequestAddress(GlobalDefault/10.255.0.0/16, 10.255.0.5, map[])"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.971631061+01:00" level=debug msg="RequestAddress(GlobalDefault/11.11.11.0/27, 11.11.11.2, map[])"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.971718293+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.0.0/24, 10.0.0.2, map[])"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.991867326+01:00" level=debug msg="RequestAddress(GlobalDefault/10.255.0.0/16, 10.255.0.2, map[])"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.992110826+01:00" level=debug msg="RequestAddress(GlobalDefault/11.11.11.0/27, 11.11.11.5, map[])"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.992153799+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.0.0/24, 10.0.0.5, map[])"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.992311130+01:00" level=debug msg="RequestAddress(GlobalDefault/11.11.11.0/27, 11.11.11.3, map[])"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.992341911+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.0.0/24, 10.0.0.3, map[])"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.992503424+01:00" level=debug msg="RequestAddress(GlobalDefault/11.11.11.0/27, 11.11.11.6, map[])"
Nov 23 17:34:51 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:51.992531447+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.0.0/24, 10.0.0.6, map[])"
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:52.333474224+01:00" level=debug msg="(*session).start" module=agent
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:52.453527191+01:00" level=debug method="(*Dispatcher).Tasks" node.id=231a1htt5qlhcpulvxdxrse8o node.session=bhlx6etuh5ibmpeisajc4awxh
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:52.601988188+01:00" level=debug msg="(*session).listen" module=agent session.id=6kms48vcs1gsv6kes26xnf82z
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:52.602376457+01:00" level=debug msg="Adding key 5c2cb"
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:52.602405081+01:00" level=debug msg="Primary Key 8ea6c"
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:52.602414418+01:00" level=debug msg="Remove Key 67c07"
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:52.602441568+01:00" level=debug msg="Updating Keys. New: (key: 3d97f, tag: 0xf92d), Primary: (key: 626f8, tag: 0xf92b), Pruned: (key: 0858f, tag: 0xf927)"
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:52.602536009+01:00" level=debug msg="Current: [(key: 43747, tag: 0xf921) (key: ea3d6, tag: 0xf91f) (key: 0858f, tag: 0xf927) (key: 6d689, tag: 0xf929) (key: 626f8, tag: 0xf92b)]"
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:52.602563103+01:00" level=debug msg="Updating keys for node: 10.10.10.10 (5,4,2)"
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:52.602577078+01:00" level=debug msg="Current: [SPI(FWD: 0xa0ff04a4, REV: 0x8453e7b0) SPI(FWD: 0xac52a172, REV: 0xed43f876) SPI(FWD: 0xd63c3dca, REV: 0x446f36fe) SPI(FWD: 0xb601eefc, REV: 0xc3610b58)]"
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:52.605748653+01:00" level=debug msg="Removing: rSA{Dst: 10.10.10.11, Src: 10.10.10.10, Proto: esp, Mode: transport, SPI: 0x446f36fe, ReqID: 0x0, ReplayWindow: 0, Mark: <nil>, Auth: <nil>, Crypt: <nil>, Aead: <nil>,Encap: <nil>}"
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:52.605874114+01:00" level=debug msg="(*Agent).run exited" module=agent
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: panic: runtime error: index out of range
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: goroutine 180 [running]:
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: panic(0x1a90b60, 0xc82000e020)
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: /usr/local/go/src/runtime/panic.go:481 +0x3e6
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: github.com/docker/libnetwork/drivers/overlay.updateNodeKey(0xc820899330, 0x10, 0x10, 0xc820899340, 0x10, 0x10, 0xc821dbe840, 0x4, 0x4, 0xc8219b4ba0, ...)
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/drivers/overlay/encryption.go:495 +0x13a1
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: github.com/docker/libnetwork/drivers/overlay.(*driver).updateKeys.func1(0xc8218d4b60, 0xb, 0xc821dbe840, 0x4, 0x4, 0x0, 0x0, 0x0, 0x66e616)
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/drivers/overlay/encryption.go:443 +0x174
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: github.com/docker/libnetwork/drivers/overlay.(*driver).secMapWalk(0xc82000bb20, 0xc821008d68, 0x0, 0x0)
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/drivers/overlay/encryption.go:382 +0x140
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: github.com/docker/libnetwork/drivers/overlay.(*driver).updateKeys(0xc82000bb20, 0xc821dbe460, 0xc821dbe480, 0xc821dbe4a0, 0x0, 0x0)
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/drivers/overlay/encryption.go:444 +0x772
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: github.com/docker/libnetwork/drivers/overlay.(*driver).DiscoverNew(0xc82000bb20, 0x4, 0x1bbc680, 0xc8208fede0, 0x0, 0x0)
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/drivers/overlay/overlay.go:366 +0xcb5
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: github.com/docker/libnetwork.(*controller).handleKeyChange.func1(0x1d76a60, 0x7, 0x7f727fca89b8, 0xc82000bb20, 0x1d74028, 0x6, 0x0)
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/agent.go:158 +0x94
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: github.com/docker/libnetwork/drvregistry.(*DrvRegistry).WalkDrivers(0xc820699830, 0xc8210094d0)
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/drvregistry/drvregistry.go:107 +0x2ce
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: github.com/docker/libnetwork.(*controller).handleKeyChange(0xc8201ec690, 0xc821577480, 0x6, 0x8, 0x0, 0x0)
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/agent.go:163 +0x8a6
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: github.com/docker/libnetwork.(*controller).SetKeys(0xc8201ec690, 0xc821577480, 0x6, 0x8, 0x0, 0x0)
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/controller.go:290 +0x755
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: github.com/docker/docker/daemon.(*Daemon).SetNetworkBootstrapKeys(0xc82030fa00, 0xc821577480, 0x6, 0x8, 0x0, 0x0)
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/network.go:185 +0x6d
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: github.com/docker/docker/daemon/cluster/executor/container.(*executor).SetNetworkBootstrapKeys(0xc820af2730, 0xc821576f40, 0x6, 0x8, 0x0, 0x0)
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/cluster/executor/container/executor.go:144 +0x2e1
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: github.com/docker/swarmkit/agent.(*Agent).handleSessionMessage(0xc820ad1f10, 0x7f727fc9ead0, 0xc8209f1440, 0xc821531c70, 0x0, 0x0)
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/swarmkit/agent/agent.go:267 +0x5ca
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: github.com/docker/swarmkit/agent.(*Agent).run(0xc820ad1f10, 0x7f727fc9ead0, 0xc8209f1440)
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/swarmkit/agent/agent.go:170 +0xbe6
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: created by github.com/docker/swarmkit/agent.(*Agent).Start
Nov 23 17:34:52 dt-dev-2 dockerd[5277]: /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/swarmkit/agent/agent.go:81 +0x37d
Nov 23 17:34:52 dt-dev-2 systemd[1]: docker.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Nov 23 17:34:52 dt-dev-2 systemd[1]: Unit docker.service entered failed state.
Nov 23 17:34:52 dt-dev-2 systemd[1]: docker.service failed.

@thaJeztah
Member

Let me reopen

@aboch
Contributor

aboch commented Nov 23, 2016

Thanks @b0ch3nski
The crash happens on dataplane key rotation triggered by the manager node being elected leader.
It should not make a difference which event triggers the rotation, but we need to see what could have messed up the dataplane key set. I see you have debug logs enabled; it would be great if you could attach all of them to this issue.

When I say the dataplane key set is incorrect, I refer to the following:

Nov 23 17:34:52 dt-dev-2 dockerd[5277]: time="2016-11-23T17:34:52.602536009+01:00" level=debug msg="Current: [(key: 43747, tag: 0xf921) (key: ea3d6, tag: 0xf91f) (key: 0858f, tag: 0xf927) (key: 6d689, tag: 0xf929) (key: 626f8, tag: 0xf92b)]"

because it should never contain more than 3 keys

@aboch
Contributor

aboch commented Nov 23, 2016

@b0ch3nski Please also provide some more context, for example how you ended up with a cluster that has no elected leader:

 Error: rpc error: code = 2 desc = raft: no elected cluster leader

Most importantly, I need to see the full daemon log; I need to understand what triggered the spurious dataplane key updates.

@b0ch3nski
Author

@aboch I will do my best to describe the context.

I'm running 3 VMs on my local machine, each with 1 CPU and 1 GB RAM, running CentOS and Docker in native Swarm mode. All of those nodes are in the manager role.

Right before this issue, I created a service that was a little bit too memory-greedy :)

This caused OOM on the machines - the first symptom was an unresponsive docker command, and after 2-3 minutes I noticed that the docker service was down on 2 of those 3 machines. One node survived, but after ~30 minutes it also died for the same reason.

I could provide complete logs from those machines, but I have the --log-driver=journald parameter enabled (all logs from my services go to the journal), so I'd have to figure out how to filter those...

@aboch
Contributor

aboch commented Nov 24, 2016

Thanks @b0ch3nski for the extra info.
So IIUC, the first two mgrs died because of OOM, and the 3rd mgr ended up with no leader and, after ~30 minutes, eventually crashed because of (you say the same reason) the key rotation panic. Is my understanding correct?

If that is the case, I'd mostly need the logs for the 3rd mgr, the one which experienced the key rotation panic. Later you can attach the logs for the other one.

Thanks for your help.

@b0ch3nski
Author

@aboch Sorry for the late response - I've been quite busy lately...
I've figured out a way to filter the logs - it might also be useful for someone else:
journalctl -u docker --since "2016-11-23 16:00:00" --until "2016-11-23 18:10:00" -o json --no-tail --no-pager | grep -v CONTAINER_ID | jq '.MESSAGE' > /tmp/$(hostname).txt

Here are the logs from all 3 hosts @ my Google Drive

@aboch
Contributor

aboch commented Nov 30, 2016

Thank you @b0ch3nski for the logs. They are very helpful.
Likely because of the mgr crashes, there are a lot of unusual things going on. We are trying to make sense of it. Thanks.

@b0ch3nski
Author

@aboch Good to hear that!
If there is anything else I might be able to clarify, feel free to ping me.

@sanimej

sanimej commented Dec 5, 2016

@b0ch3nski Thanks for the logs. All the manager nodes going into an OOM state can cause cluster availability issues, because the OOM killer might terminate the daemon process on all or a majority of the nodes. From the logs we can see multiple raft leader changes happening frequently. That, combined with the unpredictable state after OOM handling, leads to the key handling going out of sync. Can you try setting up a cluster with both worker and manager nodes and scheduling the tasks only on workers? Please see https://docs.docker.com/engine/swarm/admin_guide/.
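A minimal CLI sketch of that suggested setup, with placeholder node and service names:

# keep the manager nodes free of workload
docker node update --availability drain manager-1

# or pin individual services to worker nodes with a placement constraint
docker service create --name web --constraint 'node.role == worker' nginx

# a drained manager can be returned to scheduling later with
docker node update --availability active manager-1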

@mavenugo
Contributor

mavenugo commented Dec 5, 2016

@sanimej @aboch @b0ch3nski Since the original problem that this issue was opened for is resolved via #26879 and the new issue is very specific to the OOM case with multiple managers, I think we should reduce the priority of this issue.

If you can reproduce this issue by scheduling the resource-intensive services on worker nodes and leaving the managers to manage the cluster, then we can increase the priority again. In order to try this, please drain the managers as suggested by @sanimej.

@mavenugo mavenugo removed the priority/P1 Important: P1 issues are a top priority and a must-have for the next release. label Dec 5, 2016
@b0ch3nski
Author

@aboch @sanimej @mavenugo I'm aware of the concept of Swarm managers vs. workers - the environment where this issue happened is my local machine with small VMs set up just for simple testing, so OOM was not really an unexpected problem. I've never seen this issue in environments with plenty of resources and a correct balance between workers and managers, so I'm completely fine with reducing the priority.

@mavenugo
Contributor

mavenugo commented Dec 6, 2016

Thanks @b0ch3nski. We will work on a reasonable solution for graceful handling of this scenario.

@mavenugo mavenugo assigned sanimej and unassigned aboch Dec 12, 2016
yongtang added a commit to yongtang/docker that referenced this issue Feb 16, 2017
This fix tries to update the SwarmKit from
ed384f3
to
6bc357e

The following is the list of docker related changes:
1. Took long time for Docker Swarm service turn desired state from Ready to Running (Issue moby#28291)
2. Native Swarm in 1.12 - panic: runtime error: index out of range (Issue moby#25608)
3. Global mode target replicas keep increasing (Issue moby#30854)
4. Creating service with publish mode=host and without published port crashes swarm manager (Issue moby#30938)
5. Define signals used to stop containers for updates (Issue moby#25696) (PR moby#30754)

This fix fixes moby#28291, moby#25608, moby#30854, moby#30938.
This fix is required by PR moby#30754.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>