grpc: the connection is unavailable and load balancer broken #31043

Closed
toutougabi opened this issue Feb 15, 2017 · 25 comments

@toutougabi

Description

Since the last update (Docker 1.13.0 to 1.13.1) I've had major inconsistencies.

  • The load balancing is now broken: 50% of the requests end with "no route to host", the other 50% work.

  • We often get "grpc: the connection is unavailable" when trying to do a docker exec on some of the containers.

  • Tab completion for container names is completely broken and produces nonsense.
    Here I typed "docker service ps P" to search for containers starting with P:

docker service ps P__docker_daemon_is_experimental: command not found
rod___docker_daemon_is_experimental: command not found
__docker_daemon_is_experimental: command not found

but as you can see below, the version of docker is not experimental.
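
For what it's worth, __docker_daemon_is_experimental is a helper defined by Docker's bash completion script, so "command not found" suggests the completion file shipped with 1.13.1 is broken or only partially loaded. A quick way to check, purely as a sketch (the completion path varies by install and is only an example):

# Is the completion helper actually defined in this shell?
type __docker_daemon_is_experimental
# Re-source the completion script and try tab completion again (path is an example)
source /usr/share/bash-completion/completions/docker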

These 2 issues are new since we updated yesterday and make the whole platform barely usable.

Steps to reproduce the issue:

  1. Upgrade Docker from 1.13.0 to 1.13.1 in swarm mode

Describe the results you received:

  • "No route to host" on 50% of the requests

  • "grpc: the connection is unavailable" for docker exec on lots of containers

Output of docker version:

Client:
 Version:      1.13.1
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   092cba3
 Built:        Wed Feb  8 06:50:14 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.13.1
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   092cba3
 Built:        Wed Feb  8 06:50:14 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 116
 Running: 6
 Paused: 0
 Stopped: 110
Images: 133
Server Version: 1.13.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 1509
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: bk34jzemg6u4eq7bdjqsq6u69
 Is Manager: true
 ClusterID: 0apmbfyv7tr52j046zpefpgpn
 Managers: 7
 Nodes: 7
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.0.0.6
 Manager Addresses:
  10.0.0.10:2377
  10.0.0.11:2377
  10.0.0.4:2377
  10.0.0.6:2377
  10.0.0.7:2377
  10.0.0.8:2377
  10.0.0.9:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: N/A (expected: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1)
runc version: 9df8b306d01f59d3a8029be411de015b7304dd8f
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-62-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 3.359 GiB
Name: SWLNCANLS01
ID: 5POZ:4Q7W:OKMN:BTKQ:B7K3:UPRO:J5PA:3QMA:KMAQ:DM6L:7RDW:2LHL
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: 
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Labels:
 type=Small
 AzureType=D1_V2
 Name=Small01
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

We are hosted in Azure
Hosts are on Ubuntu 16.04.2 LTS
kernel 4.4.0-62-generic
Fully up to date

toutougabi changed the title from "grpc: the connection is unavailable" to "grpc: the connection is unavailable and load balancer broken" on Feb 15, 2017
@toutougabi
Author

The "Error response from daemon: grpc: the connection is unavailable" error has now propagated and I cannot start containers on some of my hosts. How can a stable update break my system this much?

@cpuguy83
Member

ping @mlaventure @sanimej

The completion issue was a bad cherry-pick that will be fixed in 1.13.2, sorry about that.

@sanimej

sanimej commented Feb 16, 2017

@toutougabi For the load balancer issue, can you provide your environment details? How many managers/workers, and what kind of service are you running? Are the clients inside the cluster or external? Was everything working OK with 1.13.0, and are you seeing the load balancer issues only with 1.13.1?

@toutougabi
Author

thanks @cpuguy83 np

@sanimej yes, the issues are new since 1.13.1; 1.13.0 was very stable, as was the last stable 1.12.

I have 8 nodes, all managers.
The deployment is quite small (1 CPU and 3.5 GB of RAM for all but 2 hosts) and usage is generally around 30%. We decided to make them all managers because in early 1.12, in a 3-node configuration, an issue with 1 host would bring the whole cluster down (a Raft issue that I think was fixed in 1.12.3). But like I said, it had been pretty stable for at least 2 or 3 versions.

We have 3 reverse proxies on the local network that call the different containers, so all access is internal but external to the cluster.

@toutougabi
Author

toutougabi commented Feb 16, 2017

As a side note, earlier this week we had a crash on 50% of the hosts, but they rebooted fine and the cluster came back up normally. After the reboot everything looked good, but since we already had the maintenance page up for our sites, we decided to go ahead with the update. About 1 hour after the restart we cycled each host to upgrade from 1.13.0 to 1.13.1, and that's when the load balancer started working only 50% of the time, with "no route to host" on the other 50% of requests.
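
(For context, a typical way to cycle a host like this, just as a sketch and not the exact commands we ran; the node name and package version string are placeholders.)

# Drain the node so its tasks get rescheduled elsewhere
docker node update --availability drain <node>
# Upgrade the engine on that host (package/version string is only an example)
apt-get install docker-engine=1.13.1-0~ubuntu-xenial
# Put the node back into rotation
docker node update --availability active <node>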

@sanimej

sanimej commented Feb 16, 2017

@toutougabi After half of the hosts rebooted, did you get a chance to verify the LB behavior before upgrading to 1.13.1?

On the 50% loss, can you narrow it down a bit? For example, do you see requests failing only for tasks on certain hosts? Do you see this failure only for services created before the upgrade, or is the pattern seen for new ones as well?

@toutougabi
Author

Yes, the system looked stable, but to be honest we didn't test for long before we did the update.

For the 50%, the test was simple: we were noticing failed requests on the reverse proxies, so we went on the local network and ran wget against the cluster. That's when we discovered the "no route to host" on 50% of the requests. So wget x.x.x.x:xxxx/app would fail 50% of the time, and it hasn't recovered since.
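
Roughly, the check was something like this (address and port are placeholders, as above):

# Hit the published service endpoint repeatedly; about half the attempts fail with "no route to host"
for i in $(seq 1 20); do
  wget -q -O /dev/null -T 5 http://x.x.x.x:xxxx/app && echo "ok" || echo "failed"
done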

Services that were not updated after the failure were stable; as soon as we pushed updates (docker service update) to get new images online for some of the services, the 50% failures started. We noticed the issue only once most of the system had been updated, so we had to revert to manual docker runs on another set of hosts to have a stable system.

@sanimej

sanimej commented Feb 16, 2017

So it looks like it's the service update, more than the daemon upgrade to 1.13.1, that triggered the issue. I will give it a try. To narrow this down some more, can you try creating a new service and see if the LB works OK for that?
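
For example, something along these lines (service name, image, and port are only examples):

# Create a throwaway service with a published port
docker service create --name lbtest --replicas 3 --publish 8080:80 nginx:alpine
# Probe the published port from a node a few times to see whether the routing mesh balances correctly
for i in $(seq 1 10); do curl -s -o /dev/null -w "%{http_code}\n" http://<node-ip>:8080/; done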

@toutougabi
Author

No, any new service has the same issue. That was my first thought, so I re-created some of the services and it didn't help...

@toutougabi
Author

I'd add that we routinely updated containers on 1.13.0 without any issue.

@JnMik

JnMik commented Mar 3, 2017

On our end, we have 6 Docker nodes using swarm on 1.13.0, and we see these errors as well on one of the nodes.

We have 3 managers (all drained) and 3 workers.

On worker01, we have this kind of issue:

dockerd: time="2017-03-02T16:09:15.599186507-05:00" level=error msg="agent: session failed" error="rpc error: code = 14 desc = grpc: the connection is unavailable" module="node/agent"

When I try to jump into a container on that node, I always get this error (no matter which container, unhealthy or not):

sudo docker exec -ti 3b9664d165da bash
rpc error: code = 14 desc = grpc: the connection is unavailable

I also found this error on another container, still on worker01:
"starting container failed: transport is closing"

On worker02 & worker03 we have a few unhealthy containers as well, but we can "docker exec" into all containers without error, and we don't see any RPC/transport errors on those two workers.

Last discovery :

We seem to have an issue with the available disk space on that specific node (worker01); we are using thin pools.

I just found this error:

"Err": "devmapper: Thin Pool has 9440 free data blocks which is less than minimum required 9565 free data blocks. Create more free space in thin pool or use dm.min_free_space option to change behavior",

We will get this fixed, drain the node, and see if the system becomes more stable.
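
(For anyone else hitting this, a rough way to check the pool usage and, if needed, change the threshold; the values and paths below are examples, not our exact config.)

# devicemapper pool usage shows up in docker info under the "Data Space" / "Metadata Space" fields
docker info | grep -i space
# If the pool is an LVM thin pool, lvs shows the data/metadata usage percentages
sudo lvs
# The threshold itself is the dm.min_free_space storage option (default 10%), e.g. in /etc/docker/daemon.json:
# { "storage-opts": [ "dm.min_free_space=5%" ] }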

Hope this might help

Edit: (what happened when we tried to rejoin worker01 to the cluster)

We drained worker01, but the containers weren't being kicked off it, so I guess the Docker service really crashed. We reloaded it, and then set the node back to availability=active.

Since the managers do not rebalance containers automatically, I tried to force an update with
sudo docker service update -f <some_service>

The manager tried to deploy the container 4 times on our idle worker01 without success, then shipped it to worker03 as a fallback.

The error noticed on the 4 failing containers was:
Err": "starting container failed: subnet sandbox join failed for "X.X.X.X/XX": error creating vxlan interface: file exists

I guess we crashed the whole thing lol !

Edit 2: I see that these errors happened in the past on Linux kernels below 3.15.
We are running 3.10.0 on CentOS.
But since that patch, I guess it should not happen, right? moby/libnetwork#821
Anyway, I don't want to hijack your thread with another issue!

@thaJeztah
Member

@JnMik if you're still running into that issue, please open a new issue with details. W.r.t "We seem to have an issue with our available diskspace on that specific node" that can definitely be related; device mapper running out of disk space can result in lots of "interesting" things.

@JnMik

JnMik commented Mar 6, 2017

@thaJeztah Hello, thanks for your comment.

We tried reloading the Docker service; the error persisted.
So I started reading the libnetwork source code and figured out it was probably "ip link" related.
We tried removing all the vxlan interfaces (via ip link); the error still persisted.
So we finally rebooted the node and it became stable again.
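
For reference, this is roughly the kind of inspection/cleanup we attempted before the reboot (the interface name is only an example):

# List vxlan interfaces visible in the host namespace
ip -d link show type vxlan
# Delete a stale one that overlaps the overlay subnet, then let the daemon recreate it
sudo ip link delete <stale-vxlan-interface>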

I'm not going to open a new issue for the moment because I no longer have the error to play with, but isn't it weird that the networking does not recover on a Docker service reload?

Cheers !

@zffocus

zffocus commented Mar 15, 2017

Hi team:
I also had the same issue recently. When I tried to start a container with "docker run -d ...",
it showed: "docker: Error response from daemon: grpc: the connection is unavailable."

docker version
Client:
Version: 1.13.1
API version: 1.26
Go version: go1.7.5
Git commit: 092cba3
Built: Wed Feb 8 06:36:34 2017
OS/Arch: linux/amd64

Server:
Version: 1.13.1
API version: 1.26 (minimum version 1.12)
Go version: go1.7.5
Git commit: 092cba3
Built: Wed Feb 8 06:36:34 2017
OS/Arch: linux/amd64
Experimental: false

@DamionWaltermeyer

DamionWaltermeyer commented Mar 22, 2017

Still around:
Client:
Version: 17.03.0-ce
API version: 1.26
Go version: go1.7.5
Git commit: 60ccb22
Built: Thu Feb 23 10:57:47 2017
OS/Arch: linux/amd64

Server:
Version: 17.03.0-ce
API version: 1.26 (minimum version 1.12)
Go version: go1.7.5
Git commit: 60ccb22
Built: Thu Feb 23 10:57:47 2017
OS/Arch: linux/amd64
Experimental: false

time="2017-03-22T17:43:14.260489169Z" level=error msg="Create container failed with error: grpc: the connection is unavailable" 
time="2017-03-22T17:43:14.514016752Z" level=error msg="Handler for POST /containers/3ce7df11cccb723a0d632acb85e7721316dfd13ae24ecd3ec705d5357e4b7f09/start returned error: grpc: the connection is unavailable" 
time="2017-03-22T17:43:26.567430637Z" level=error msg="stream copy error: reading from a closed fifo\ngithub.com/docker/docker/vendor/github.com/tonistiigi/fifo.(*fifo).Read\n\t/usr/src/docker/.gopath/src/github.com/docker/docker/vendor/github.com/tonistiigi/fifo/fifo.go:142\nbufio.(*Reader).fill\n\t/usr/local/go/src/bufio/bufio.go:97\nbufio.(*Reader).WriteTo\n\t/usr/local/go/src/bufio/bufio.go:472\nio.copyBuffer\n\t/usr/local/go/src/io/io.go:380\nio.Copy\n\t/usr/local/go/src/io/io.go:360\ngithub.com/docker/docker/pkg/pools.Copy\n\t/usr/src/docker/.gopath/src/github.com/docker/docker/pkg/pools/pools.go:60\ngithub.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1\n\t/usr/src/docker/.gopath/src/github.com/docker/docker/container/stream/streams.go:119\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:2086" 
time="2017-03-22T17:43:26.567669433Z" level=error msg="Create container failed with error: grpc: the connection is unavailable" 
time="2017-03-22T17:43:26.567693163Z" level=error msg="stream copy error: reading from a closed fifo\ngithub.com/docker/docker/vendor/github.com/tonistiigi/fifo.(*fifo).Read\n\t/usr/src/docker/.gopath/src/github.com/docker/docker/vendor/github.com/tonistiigi/fifo/fifo.go:142\nbufio.(*Reader).fill\n\t/usr/local/go/src/bufio/bufio.go:97\nbufio.(*Reader).WriteTo\n\t/usr/local/go/src/bufio/bufio.go:472\nio.copyBuffer\n\t/usr/local/go/src/io/io.go:380\nio.Copy\n\t/usr/local/go/src/io/io.go:360\ngithub.com/docker/docker/pkg/pools.Copy\n\t/usr/src/docker/.gopath/src/github.com/docker/docker/pkg/pools/pools.go:60\ngithub.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1\n\t/usr/src/docker/.gopath/src/github.com/docker/docker/container/stream/streams.go:119\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:2086" 
time="2017-03-22T17:43:26.810115561Z" level=error msg="Handler for POST /containers/84c166a4a4da700779e076ff800c352b90da435d3eca9c474cf9096679c4fd21/start returned error: grpc: the connection is unavailable" 
# docker info
Containers: 706
 Running: 685
 Paused: 0
 Stopped: 21
Images: 39
Server Version: 17.03.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 1514
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: N/A (expected: 977c511eda0925a723debdc94d09459af49d082a)
runc version: a01dafd48bc1c7cc12bdb01206f9fea7dd6feb70
init version: 949e6fa
Security Options:
 apparmor
Kernel Version: 4.4.0-66-generic
Operating System: Ubuntu 14.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.797 GiB
Name: REMOVED
ID: 2L6N:3VHZ:ZW5O:2VL4:BBP2:EKNE:33VQ:BOOB:Z7D3:PHUF:WJ77:COVB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

@thomsonac

We are experiencing this issue running 1.13.1 on Photon. Randomly, the swarm loses connectivity to approximately half of the containers, which makes the swarm useless. Let me know if I can provide anything to help with troubleshooting.

@sanimej

sanimej commented Jun 14, 2017

@thomsonac 17.06-rc3 is out now. Many fixes in the networking control plane have gone into 17.06. Can you give it a try and see if it fixes the issues you are seeing?

@cmhokej

cmhokej commented Jun 14, 2017

Ok

@cmhokej

cmhokej commented Jun 14, 2017

I will

@cmhokej

cmhokej commented Jun 14, 2017

Ok

@toutougabi
Author

@thomsonac @sanimej let me know if you see improvements; we reverted our deployment to a custom scheduler and custom nginx configuration until we see higher availability from the platform.

@thomsonac

I don't see a 17.06 release tag. Either way, this is a production cluster and not something that we're particularly keen on experimenting with. We've also had some issues in the past manually updating Docker outside of the VMware Photon repository.

We have a fairly simple setup: 3 dedicated manager nodes and 6 workers, only running Selenium hubs and browser nodes, with approximately 100 containers total. We didn't experience this problem on an older version (IIRC 1.12.6).

@cpuguy83
Member

@thomsonac 17.06 is not out yet, or at least not as a final release.
I would recommend upgrading your production cluster to 17.03, which has no new features over 1.13 but has many significant bug fixes.

@noomz

noomz commented Oct 5, 2017

I am using 17.09.0-ce on Ubuntu 16.04.3 LTS and still have this issue.

@thaJeztah
Member

Let me close this ticket for now, as it looks like it went stale.

thaJeztah closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 17, 2023