Native Swarm in 1.12 - panic: runtime error: index out of range #25608
Comments
/cc @mrjana see libnetwork in there as well |
@b0ch3nski Did you create the cluster with the 1.12 Docker release, or did you have an earlier 1.12 RC running and upgrade to the 1.12 release binary? |
Yeah, this is a known bug in the upgrade from 1.12 RC(3? I think? Maybe 4?) to 1.12 GA. |
@b0ch3nski A previous log shows something that should not have happened:
It seems related to what @sanimej and @dperny have pointed out, but you confirmed this is from a fresh install. I am not sure; maybe some stale state from previous RC-image runs was left behind. After you reproduce, it could be worth checking whether the issue is still reproducible when starting the cluster after first removing the above directory on all the cluster nodes. Thanks. |
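For anyone trying this, the cleanup suggested above amounts to something like the sketch below. The directory reference did not survive in the comment, so `/var/lib/docker/swarm` (the default swarm state directory) is an assumption here. Run it on every node before re-forming the cluster:

```bash
# Sketch: wipe stale swarm state left over from earlier RC runs.
# NOTE: /var/lib/docker/swarm is assumed; the path in the original comment was elided.
sudo systemctl stop docker
sudo rm -rf /var/lib/docker/swarm
sudo systemctl start docker

# Then re-form the cluster:
#   docker swarm init                  # on the first manager
#   docker swarm join --token ... ...  # on the remaining nodes
```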
@aboch I will try to reproduce tomorrow, with |
@aboch Well, I've recreated the same environment with debugging enabled - I've tried restarting Docker service on both machines, joining/leaving Swarm, etc. but I couldn't reproduce that crash... |
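(For reference, daemon debug logging as mentioned above is typically enabled in one of two ways; the reporter's exact method was not shown, so this is just a sketch:)

```bash
# Option 1: start the daemon with the debug flag.
dockerd -D

# Option 2: enable it in the daemon config and restart.
# NOTE: this overwrites any existing daemon.json; merge by hand if one exists.
echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker
```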
@b0ch3nski |
@aboch |
Hi, not sure if this is related or should be filed elsewhere. I'm running a 1.12.1 swarm cluster with 5 managers and 2 nodes, hosting one service. Containers are normally deployed all across the cluster, between managers and nodes. After a random number of minutes, Docker on one or both of the nodes goes down. The rest of the cluster (meaning all the managers) is fine; the containers that were hosted on the dead node(s) are rescheduled on the managers, so the total number of containers in the cluster remains the same (once the moved ones become available). 'docker node ls' on the managers shows the node(s) state as "Down".
SSHing into the nodes and running 'docker ps' hangs for a couple of seconds; that then wakes Docker up again, and it outputs the 'docker ps' header with no containers. The outcome above has been replicated a number of times, on brand-new clusters deployed on different AWS instances (same config, scripted via Terraform). The context is:
Log output and debug info:
This is the log on one of the nodes after going down:
Docker version:
Docker info:
Please let me know if any other debug info are needed and thanks for helping! |
@cpuguy83 thanks |
Hey @aboch - assigning to you since you have the most context on this one |
Hi, I have seen a similar error in our swarm cluster. Please point me in the right direction if this is not related. The context is:
Linux kernel:
docker version:
docker info:
Last few lines from `journalctl -u docker`:
Service status before it was restarted
Please let me know if you require any further information |
@mmh36 @evanp If you still have the daemon logs, can you please verify the presence of logs like the following? I found an issue, and I was finally able to find a way to reproduce the panic; working on the fix now. The issue arises when a node leaves the cluster but is not shut down. It then reconnects to the cluster after the cluster has rotated the encryption keys. Then a service task is placed on this node on an encrypted network. At the next (or next-next) cluster encryption key rotation, this node will panic. |
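Translated into commands, the reproduction sequence described above looks roughly like this (a sketch; key rotations are cluster-internal, so the wait is approximate, and the network/service names are illustrative):

```bash
# On the node that will eventually panic: leave the cluster but keep the
# daemon running (add --force if the node is a manager).
docker swarm leave

# ...wait long enough for the cluster to rotate its encryption keys...

# Rejoin the cluster (token and address as printed by `docker swarm join-token worker`).
docker swarm join --token <worker-token> <manager-ip>:2377

# On a manager: create an encrypted overlay network and place a service
# task on the rejoined node.
docker network create --driver overlay --opt encrypted mynet
docker service create --name mysvc --network mynet nginx

# The panic is then reported at the next (or next-next) key rotation.
```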
@aboch it was in the issue I posted before. https://github.com/docker/docker/files/479181/dockerfail.txt |
Well, it appears that the same issue came back to me today, so I think we can reopen this thread... Output of
Output of
The issue log (not truncated this time, with debug enabled):
|
Let me reopen |
Thanks @b0ch3nski When I say the dataplane key set is incorrect, I refer to the following:
because it should never contain more than 3 keys |
@b0ch3nski Please also provide some more context, for example how you got into a cluster with no leader.
Most importantly, I need to see the full daemon log, I need to understand what triggered the spurious dataplane key updates. |
@aboch I will do my best to describe the context. I'm running 3 VMs on my local machine, each with 1 CPU and 1 GB RAM, with CentOS and Docker in native Swarm mode. All of those nodes are in the manager role. Right before this issue, I created a service that was a little bit too memory-greedy :) This caused OOM on the machines - the first symptom was an unresponsive
I could provide complete logs from those machines, but I have |
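As an aside, a memory-greedy service can be fenced in at creation time so a runaway task cannot OOM the host; a sketch, with illustrative names and sizes:

```bash
# Cap the per-task memory so the kernel OOM killer targets the task,
# not the docker daemon ("mysvc", "myimage" and the limits are made up).
docker service create \
  --name mysvc \
  --reserve-memory 128m \
  --limit-memory 256m \
  myimage
```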
Thanks @b0ch3nski for the extra info. If that is the case, I'd mostly need the logs for the 3rd mgr, the one which experienced the key rotation panic. Later you can attach the logs for the other one. Thanks for your help. |
@aboch Sorry for late response - I was quite busy lately... Here are the logs from all 3 hosts @ my Google Drive |
Thank you @b0ch3nski for the logs. They are very helpful. |
@aboch Good to hear that! |
@b0ch3nski Thanks for the logs. All the manager nodes going into an OOM state can cause cluster availability issues, because the OOM killer might terminate the daemon process on all or a majority of the nodes. From the logs we can see multiple raft leader changes happening frequently. That, combined with the unpredictable state after OOM handling, leads to key handling going out of sync. Can you try setting up a cluster with some worker and manager nodes and scheduling the tasks only on workers, along the lines of the sketch below? Please see https://docs.docker.com/engine/swarm/admin_guide/. |
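Concretely, that setup looks something like this (node and service names are illustrative):

```bash
# On a manager: drain each manager node so no tasks are scheduled on it.
docker node update --availability drain <manager-node-name>

# And/or pin services to worker nodes explicitly with a placement constraint.
docker service create \
  --name mysvc \
  --constraint 'node.role == worker' \
  myimage
```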
@sanimej @aboch @b0ch3nski Since the original problem that this issue was opened for is resolved via #26879, and the new issue is very specific to the OOM case on multiple managers, I think we should reduce the priority of this issue. If you can reproduce this issue by scheduling the resource-intensive services on worker nodes and leaving the managers to manage the cluster, then we can increase the priority again. In order to try this, please drain the managers as suggested by @sanimej |
@aboch @sanimej @mavenugo I'm aware of the concept of Swarm managers vs. workers - the environment where this issue happened is my local machine, with small VMs set up just for simple testing, so OOM was not really an unexpected problem. I've never seen this issue on environments with plenty of resources and a correct balance between workers and managers, so I'm completely fine with reducing the priority. |
Thanks @b0ch3nski. We will work on a reasonable solution for graceful handling of this scenario. |
This fix updates SwarmKit from ed384f3 to 6bc357e. The following is the list of Docker-related changes:
1. Took a long time for a Docker Swarm service to turn desired state from Ready to Running (Issue moby#28291)
2. Native Swarm in 1.12 - panic: runtime error: index out of range (Issue moby#25608)
3. Global mode target replicas keep increasing (Issue moby#30854)
4. Creating a service with publish mode=host and without a published port crashes the swarm manager (Issue moby#30938)
5. Define signals used to stop containers for updates (Issue moby#25696) (PR moby#30754)
This fix fixes moby#28291, moby#25608, moby#30854, moby#30938. This fix is required by PR moby#30754.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
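The vendoring bump itself is a one-line change to the engine's `vendor.conf`, shown here schematically (the real file pins the full 40-character commit hashes; only the short forms from the message above are known here):

```diff
-github.com/docker/swarmkit ed384f3...
+github.com/docker/swarmkit 6bc357e...
```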
Output of `docker version`:
Output of `docker info`:
Additional environment details (AWS, VirtualBox, physical, etc.):
CentOS 7 @ VirtualBox (via Vagrant)
Additional information:
I had a Swarm setup with 2 nodes (and 2 different services started). I restarted one of the nodes and was surprised that the restarted node now hosted both services (they were auto-balanced before).
It turns out that Docker on the machine that was NOT restarted died - this is the stack trace: