grpc: the connection is unavailable and load balancer broken #31043
Comments
Error response from daemon: grpc: the connection is unavailable.
ping @mlaventure @sanimej The completion issue was a bad cherry-pick that will be fixed in .2, sorry about that.
@toutougabi For the load balancer issue, can you provide your env details? How many managers/workers, and what kind of service are you running? Are the clients inside the cluster or external? Was everything working ok with 1.13.0, and are you seeing the load balancer issues only with 1.13.1?
thanks @cpuguy83, np. @sanimej yes, the issues are new since 1.13.1; 1.13.0 was very stable, as was the last stable 1.12. I have 8 nodes, all managers. We have 3 reverse proxies on the local network that call the different containers, so all access is internal to the network but external to the cluster.
As a side note, earlier this week we had a crash on 50% of the hosts, but they rebooted fine and the cluster restarted normally. On reboot all was good, but since we had the maintenance page up for our sites we decided to go ahead with the update. About 1 hr after the restart we cycled each host to update to 1.13.1 (from 1.13.0), and that's when the load balancer started working only 50% of the time, with a no route to host on the other 50% of requests.
@toutougabi After half of the hosts rebooted, did you get a chance to verify the LB behavior before upgrading to 1.13.1? On the 50% loss, can you narrow it down a bit, for example: do you see requests failing only to tasks on certain hosts? Do you see this failure only for services created before the upgrade, or is the pattern seen for new ones as well?
Yes, the system looked stable, but to be honest we didn't test for long before we did the update. For the 50%, the test was simple: we were noticing failed requests on the reverse proxies, so we went to a machine on the local network and did wget to the cluster. That's when we discovered the no route to host on 50% of the requests; wget x.x.x.x:xxxx/app would fail 50% of the time, and it hasn't recovered since. Services that were not updated after the failure were stable; as soon as we pushed updates (docker service update) to bring new images online for some of the services, they started showing the 50%. We noticed the issue once most of the system was updated, so we had to revert to manual docker runs on another set of hosts to have a stable system.
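For reference, a minimal sketch of the kind of probe described above, assuming a hypothetical node IP and published port (10.0.0.10:8080/app stands in for the real x.x.x.x:xxxx/app):

    # hit the published swarm port repeatedly and count ok/fail to quantify the ~50% loss
    for i in $(seq 1 20); do
      wget -q -O /dev/null -T 5 http://10.0.0.10:8080/app && echo ok || echo fail
    done | sort | uniq -c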
So it looks like it's not just the daemon upgrade to 1.13.1; it's the service update that triggered the issue. I will give it a try. To narrow this down some more, can you try creating a new service and see if the LB works ok for that?
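A minimal sketch of such a test, assuming a hypothetical throwaway service name, published port, and node IP:

    # create a throwaway service and probe its published port from the local network
    docker service create --name lbtest --replicas 4 --publish 8081:80 nginx:alpine
    curl -sS -m 5 -o /dev/null -w '%{http_code}\n' http://10.0.0.10:8081/
    # clean up afterwards
    docker service rm lbtest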
No, any new service has the same issue; that was my first thought, so I re-created some of the services and it didn't help...
I'd add that we routinely updated containers on 1.13.0 without any issue.
On our end, we have 6 docker nodes using swarm on 1.13.0, and we see these errors as well on one of the nodes. We have 3 managers, all drained, and 3 workers. On worker01, we have this kind of issue:
When I try to jump into a container on that node, I always get this error, no matter which container, unhealthy or not.
I also found this error on another container, still on worker01. On worker02 & worker03 we have a few unhealthy containers as well, but we can "docker exec" into all containers without error, and we don't have any RPC error / transport error on these two workers. Last discovery: we seem to have an issue with our available disk space on that specific node (worker01); we are using thinpools. I just found this error:
We will get this fixed, drain the node, and see if the system becomes more stable. Hope this might help. Edit (what happened when we tried to rejoin worker01 to the cluster): we drained worker01, and the containers weren't being kicked off it, so I guess the docker service really crashed. We reloaded it and then set the node back to availability=active. Since the managers do not rebalance the containers automatically, I tried to force an update. The manager tried to deploy the container 4 times on our idle worker01 without success, then shipped it to worker03 as a fallback. The error noticed on the 4 failing containers was the same. I guess we crashed the whole thing lol! Edit2: I see that these errors happened in the past on Linux kernels below 3.15.
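For reference, the drain / reload / reactivate / force-update sequence described above roughly corresponds to the commands below; the node and service names are assumptions, and the node update commands have to be run from a manager:

    docker node update --availability drain worker01
    sudo systemctl restart docker       # the "reload" of the docker service (assuming a systemd host)
    docker node update --availability active worker01
    # swarm does not rebalance existing tasks on its own, so force a reschedule
    docker service update --force my-service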
@JnMik if you're still running into that issue, please open a new issue with details. W.r.t "We seem to have an issue with our available diskspace on that specific node" that can definitely be related; device mapper running out of disk space can result in lots of "interesting" things. |
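As a rough sketch, one way to check devicemapper thin pool usage on the affected node (the exact fields shown depend on how the storage driver and thin pool were set up):

    docker info 2>/dev/null | grep -E 'Data Space|Metadata Space'
    sudo lvs -o lv_name,data_percent,metadata_percent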
@thaJeztah Hello Sir, thanks for your comment. We tried to reload the docker service, and the error still persisted. I'm not going to open a new issue for the moment because I no longer have the error to play with, but isn't it weird that the networking does not recover on a docker service reload? Cheers!
Hi team: docker version Server: |
Still around: Server:
We are experiencing this issue running 1.13.1 on Photon. Randomly the swarm loses connectivity to approximately half of the containers, which makes the swarm useless. Let me know if I can provide anything to help with troubleshooting.
@thomsonac 17.06-rc3 is out now. Many fixes in the networking control-plane have gone into 17.06. Can you give it a try and see if it fixes the issues you are seeing?
Ok, I will.
@thomsonac @sanimej let me know if you see improvements; we reverted our deployment to a custom scheduler and custom nginx configuration until we see higher availability from the platform.
I don't see a 17.06 release tag. Either way, this is a production cluster and not something that we're particularly keen on playing with. We've also had some issues in the past manually updating docker outside of the VMware Photon repository. We have a fairly simple setup: 3 dedicated manager nodes, 6 workers, only running selenium hubs and browser nodes, with approximately 100 containers total. We didn't experience this problem on an older version (IIRC 1.12.6).
@thomsonac 17.06 is not out yet, or at least not a final. |
I am using 17.09.0-ce on Ubuntu 16.04.3 LTS and still have this issue. |
Let me close this ticket for now, as it looks like it went stale. |
Description
Since the last update (docker 1.13.0 to 1.13.1) I've had major inconsistencies.
The load balancing is now broken: 50% of the requests end with a no route to host, the other 50% work.
We often get grpc: the connection is unavailable when trying to do a docker exec on some of the containers.
Tab completion when looking for container names is completely broken, producing nonsense.
Here I typed "docker service ps P" to search for containers starting with P:
docker service ps P__docker_daemon_is_experimental: command not found
rod___docker_daemon_is_experimental: command not found
__docker_daemon_is_experimental: command not found
but as you can see below, the version of docker is not experimental.
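A quick way to double-check that, as a sketch:

    docker version | grep -i experimental
    docker info 2>/dev/null | grep -i experimental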
Those 2 issues are new since we updated yesterday and make the whole platform barely usable.
Steps to reproduce the issue:
Describe the results you received:
No route to host on 50% of the requests
grpc: the connection is unavailable for docker exec on lots of containers
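A hedged sketch of how the exec failures could be narrowed down on an affected host (assuming a systemd host where dockerd manages its own docker-containerd, as in 1.13):

    # pull recent grpc/containerd-related daemon log lines and confirm containerd is still running
    sudo journalctl -u docker --since "1 hour ago" | grep -iE 'grpc|containerd' | tail -n 50
    ps aux | grep '[d]ocker-containerd'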
Output of docker version:
Output of docker info:
We are hosted in Azure
Hosts are on Ubuntu 16.04.2 LTS
kernel 4.4.0-62-generic
Fully up to date