
1.12.0-rc4 - Random network issues on service with published port. #24789

Closed
eran-totango opened this issue Jul 19, 2016 · 19 comments
Labels
area/networking priority/P3 Best effort: those are nice to have / minor issues. version/1.12
Milestone

Comments

@eran-totango

Output of docker version:

Client:
 Version:      1.12.0-rc4
 API version:  1.24
 Go version:   go1.6.2
 Git commit:   e4a0dbc
 Built:        Wed Jul 13 03:54:54 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.0-rc4
 API version:  1.24
 Go version:   go1.6.2
 Git commit:   e4a0dbc
 Built:        Wed Jul 13 03:54:54 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 4
Server Version: 1.12.0-rc4
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 9
 Dirperm1 Supported: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge overlay host null
Swarm: pending
 NodeID: 5vgjx6gx8gvkotr6tx5gpndg4
 IsManager: Yes
 Managers: 0
 Nodes: 0
Runtimes: runc
Default Runtime: runc
Security Options: apparmor
Kernel Version: 3.13.0-91-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.954 GiB
Name: swarm-manager-test-11998
ID: RPKB:WGGU:C5DZ:GSXI:6XLE:5IRX:F75X:J3SV:JFWW:QLRY:777H:TUP7
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):
AWS EC2.

Steps to reproduce the issue:

  1. Set up a swarm cluster.
  2. Create a network: docker network create -d overlay --subnet 172.0.0.0/24 mynet
  3. Create a service: docker service create --name my_service --network mynet --registry-auth -p 31001:3008 --replicas 1 --update-delay 10s --update-parallelism 1 <image_name>
  4. Start changing the number of replicas for this service. The update command used is:
    docker service update --registry-auth --replicas X my_service

To query my_service, I have a DNS entry that points to an ELB, and the ELB points only to the swarm managers. In some scenarios, querying the service becomes really slow, or it doesn't respond at all and I get a timeout from my_service.

For my tests, I created a few different clusters:
(1 manager 0 workers, 1 manager 1 worker, 1 manager 2 workers, 2 managers 2 workers, 2 managers 0 workers, 3 managers 0 workers)
On each cluster setup, I created the network and the service with 1 replica, queried the service, then increased the replicas one at a time from 1 to 8 (querying the service after each change), and then decreased the replicas from 8 to 1 (also querying the service after each change).
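
Roughly, that test loop looks like the sketch below (the endpoint is a placeholder for my DNS entry in front of the ELB, and the 30-second wait is only illustrative):

#!/bin/bash
# Scale up from 1 to 8 replicas and back down, querying the published
# port a few times after each change.
ENDPOINT="http://<dns-entry>:31001"   # placeholder

for replicas in 1 2 3 4 5 6 7 8 7 6 5 4 3 2 1; do
  docker service update --registry-auth --replicas "$replicas" my_service
  sleep 30   # give the tasks some time to converge
  for i in $(seq 1 5); do
    curl -sS --max-time 8 "$ENDPOINT" || echo "replicas=$replicas request=$i: failed"
  done
done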

Describe the results you received:
I investigated this for a long time but could not find a definite pattern for reproducing the issue. I can reproduce it, but I can't say exactly in which scenario it happens.
This is what I found:

1 NODE: 1 manager and 0 workers
Everything always works.
2 NODES: 1 manager and 1 worker.
1 replica on manager. (answers fast)
2 replicas. 1 on manager and one on worker. (answers but slowly)
3 replicas. 1 on manager and 2 on worker. (answers but way slower)
4 replicas. 2 on manager and 2 on worker. (answers slowly but sometimes fast)
5 replicas. 3 on manager and 2 on worker. (2 fast answers and one slow answer)
6 replicas. 3 on manager and 3 on worker. (1 fast answers 2 slow answers)
7 replicas. 4 on manager 3 on worker. (no certain pattern)
8 replicas. 4 on each node. (1 fast answer 1 slow answer)
7 replicas. 4 manager 3 worker. (1 fast answer 1 slow answer)
6 replicas. 3 manager 3 worker. (most of the answers were slow)
5 replicas. 3 manager 2 worker. (most of the answers were slow)
4 replicas. 3 manager 1 worker. (no certain pattern)
3 replicas. 2 manager 1 worker. (always fast)
2 replicas. 1 manager one worker. (always fast)
1 replica. (always fast)
3 NODES: 1 manager and 2 workers.
1 replica on manager. (answers fast)
2 replicas. 1 manager and 1 worker. (slow)
3 replicas. 1 on each node. (slow)
4 replicas. 2 on worker and one on each manager. (slow)
5 replicas. 2 manager, 2 on worker and 1 on other worker. (slow)
6 replicas. 2 on each node. (slow)
7 replicas. 3 on worker and 2 on each other node. (slow)
8 replicas. 2 on manager and 3 on each worker. (slow)
7 replicas. 2 on manager and worker, 3 on other worker. (slow)
6 replicas. (slow)
5 replicas. (slow)
4 replicas. (slow, but sometimes better)
3 replicas. (slow)
2 replicas. (really slow)
1 replica on WORKER. (timeout)
1 replica on MANAGER. (answers fast again!!)
4 NODES: 2 managers and 2 workers.
I tested this scenario twice. The first time, everything worked properly; there were no slow answers at all.
The second time, I sometimes got slow answers but could not find a definite pattern for it.
2 NODES: 2 managers 0 workers.
It worked for most of my tests, except when I decreased the replicas from 3 to 2.
After that, the service responded really slowly.
I noticed that when I started the service it was on MANAGER A, and when I decreased from 3 to 2 replicas, the replicas ended up only on MANAGER B.
I tried draining MANAGER B (making the replicas move to MANAGER A) and it worked again.

I really hope this information will be useful in investigating this issue.
Is there any chance it's related to the network I created?

Describe the results you expected:
No slow answers at all, no matter how many replicas the service has, and no matter how many managers or workers the cluster has.

Additional information you deem important (e.g. issue happens only occasionally):

@cpuguy83 cpuguy83 added this to the 1.12.0 milestone Jul 19, 2016
@cpuguy83 cpuguy83 added the priority/P3 Best effort: those are nice to have / minor issues. label Jul 19, 2016
@bluepuma77

+1 for 1.12.0-rc4

Two hardware servers, 1 overlay network, 2 services, 2 containers on each server.

docker ps does not return on the swarm master; it just hangs forever.

I have since removed all services, removed the overlay network, and had the 2nd server leave the swarm, but docker ps is still stuck on the swarm leader server.

@cpuguy83
Member

@bluepuma77 I'm not sure how a non-responsive docker ps relates to this; can you explain further? Are you seeing slow network responses for the container?

@eran-totango
Author

@cpuguy83 Hey :)
Any update on this issue?

@thaJeztah
Member

ping @sanimej, do you know if this was resolved in the latest libnetwork bump?

@sanimej

sanimej commented Jul 25, 2016

@thaJeztah No, there wasn't any specific change for this issue.

@eran-totango To eliminate some variables, can you confirm that the slowness happens only after you do service updates to increase or decrease the number of replicas?

@eran-totango
Author

eran-totango commented Jul 26, 2016

@sanimej I now see the slowness even with only 1 replica on only 1 node in the swarm cluster.

After a service update, it sometimes works and sometimes doesn't.
The update is done using this command:
docker service update --registry-auth --image X <service_name>

@alontorres

@sanimej Is there any way we can help diagnose the underlying problem? I'm guessing something in the swarm routing is the culprit.
This is a show-stopper for us; Swarm is essentially unusable for us right now due to this issue.

@thaJeztah thaJeztah modified the milestones: 1.12.1, 1.12.0 Jul 28, 2016
@oraboy

oraboy commented Jul 28, 2016

Any update on this issue? We're using a pretty basic swarm configuration but are hitting this network performance inconsistency which is blocking us from moving swarm to production.

Is there a workaround of sorts we can use until the issue is looked at? Any suggestions would be much appreciated!

@mavenugo
Contributor

@eran-totango @alontorres @oraboy we fixed a few issues in rc5 that resolve some IP-address conflict issues and others. Were you able to give rc5 a try?
When a timeout happens at the routing-mesh level, it typically indicates that a particular load-balancing call (towards the backend) is failing. The failure could be either IPVS misprogramming or an issue with forwarding on the overlay network for that particular container/IP address.

If you get a chance to try rc5 and it happens again, please give us the exact reproduction steps, including whether you had restarted the daemon with running services, etc.

Also start the daemon with debug logging ("-D") and pass on the daemon logs.
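
As a rough sketch of collecting that information (the ingress_sbox namespace name is an assumption about how the 1.12 routing mesh names its load-balancer sandbox, and ipvsadm must be installed on the host):

# On Ubuntu 14.04 the flag is usually set via DOCKER_OPTS in /etc/default/docker;
# running the daemon by hand with debug logging would be:
dockerd -D

# To check the IPVS programming used by the routing mesh on a node
# (assumes the ingress load-balancer netns is called ingress_sbox):
ls /var/run/docker/netns
nsenter --net=/var/run/docker/netns/ingress_sbox ipvsadm -ln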

@eran-totango
Author

@thaJeztah @mavenugo

We tried Docker 1.12.0 and this is what we experienced (it can be reproduced):

  1. Set up a cluster with 1 manager and 2 workers.
  2. Set the manager's availability to drain.
  3. Configure an AWS ELB for the swarm workers, and configure a DNS record in Route 53 that points to the ELB.
  4. Create a network with this command:
    docker network create -d overlay --subnet 137.0.0.0/24 mynet
  5. Create a service running simple Node.js code that returns the date and hostname when you curl it:
    docker service create --name pingapp --network mynet -p 31001:3000 --replicas 1 erantotango/pingapp:latest
  6. Create a basic bash script that, every 5 minutes, changes the service scale to a random number from 1 to 14 (using docker service scale pingapp=X) and then queries the service 10 times in a row on port 31001 using the DNS entry we created (a sketch of such a script is shown after this list):
    curl -sS --max-time 8 swarm-sndbx.<domain-name>:31001
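
A rough sketch of such a script (the DNS name stays as the placeholder above; only the scale range and timing come from the description):

#!/bin/bash
# Every 5 minutes: scale pingapp to a random replica count between 1 and 14,
# then query the published port 10 times through the DNS entry.
ENDPOINT="swarm-sndbx.<domain-name>:31001"   # placeholder

while true; do
  replicas=$(( (RANDOM % 14) + 1 ))
  docker service scale pingapp=$replicas
  for i in $(seq 1 10); do
    curl -sS --max-time 8 "$ENDPOINT" || echo "replicas=$replicas request=$i: failed"
  done
  sleep 300
done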

Attached is the build history from our Jenkins job that automated step 6.
You can see it's not working well; most of the time the service didn't respond.
[screenshot: Jenkins build history, 2016-08-03 10:38 AM]

(NOTE: when the service didn't answer, we checked and we did have running containers for our pingapp.)

@mrjana
Contributor

mrjana commented Aug 3, 2016

This should possibly be fixed by moby/libnetwork#1370 and should be available in the next patch release.

mavenugo added a commit to mavenugo/docker that referenced this issue Aug 11, 2016
* Fixes moby#25236
* Fixes moby#24789
* Fixes moby#25340
* Fixes moby#25130
* Fixes moby/libnetwork#1387
* Fix external DNS responses > 512 bytes getting dropped
* Fix crash when remote plugin returns empty address string
* Make service LB work from self
* Fixed a few race-conditions

Signed-off-by: Madhu Venugopal <madhu@docker.com>
tiborvass pushed a commit to tiborvass/docker that referenced this issue Aug 12, 2016
* Fixes moby#25236
* Fixes moby#24789
* Fixes moby#25340
* Fixes moby#25130
* Fixes moby/libnetwork#1387
* Fix external DNS responses > 512 bytes getting dropped
* Fix crash when remote plugin returns empty address string
* Make service LB work from self
* Fixed a few race-conditions

Signed-off-by: Madhu Venugopal <madhu@docker.com>
(cherry picked from commit 6645ff8)
Signed-off-by: Tibor Vass <tibor@docker.com>
@alontorres

alontorres commented Aug 15, 2016

@thaJeztah @mavenugo
I'd like to update that we've created a new swarm cluster running version 1.12.1-rc1 with 1 drained manager and 2 workers. After again running the Jenkins job @eran-totango mentioned above for approximately 30 hours without any issues, we tried adding 2 more workers to the swarm and the network issues returned. Some of the service tasks fail randomly with a "network <network_name> not found" error, even though the network exists on the node. Most of the time, all the tasks start running as expected, but most of the curl requests still time out.

Some of the errors I've found repeating in the logs:

Failed to delete real server 10.255.0.18 for vip 10.255.0.5 fwmark 256: no such file or directory

[INFO] memberlist: Suspect <hostname> has failed, no acks received\n

network <network name> remove failed: network <network name> not found

The reproduction steps are exactly as @eran-totango described before. This time, adding more workers after the service was already up triggered the issue for us. We will try to create a new cluster tomorrow and attempt to reproduce the issue again - I'll update with the results.

@mrjana
Contributor

mrjana commented Aug 15, 2016

@alontorres The gossip layer in your cluster seems to be having a problem, which may be caused by network congestion. If you have gossip issues in your cluster, service load balancing is bound to have issues. Do you know at what time this message

[INFO] memberlist: Suspect <hostname> has failed, no acks received\n

appeared in your logs? Did it happen when you added more nodes to the cluster or when you created services?

@alontorres

@mrjana - this error has been happening randomly every few hours since the swarm was created about two days ago.

@alontorres

alontorres commented Aug 17, 2016

@mrjana @thaJeztah @mavenugo We've tried opening up all TCP and UDP ports, using different instance types, and playing with different numbers of managers and workers, but we are still getting the same issues. After a while, scaling a service causes intermittent timeouts. We are getting many errors, now including msg="fatal task error" error="Unable to complete atomic operation, key modified" module=taskmanager

@mrjana
Contributor

mrjana commented Aug 17, 2016

@alontorres Is there a consistent set of reproduction steps? Would you mind attaching your whole docker daemon logs here?

@alontorres

alontorres commented Aug 17, 2016

@mrjana
The steps were already described by @eran-totango - he even added a nice screenshot from Jenkins :)

Here's a shorter version:

  1. Create a swarm cluster with a manager and a few workers.
  2. Create a simple service that listens on a published port.
  3. Make a timed script that runs every few minutes, resizes the service, and tries to query one of the nodes in the swarm a few times using the service's published port.
  4. After a few hours, the script should start getting timeouts.
  5. If it doesn't, adding more workers seems to make the issues start occurring sooner.

I've also attached the logs as requested. The worker logs are extremely long and very repetitive, so I attached the first and last 100k lines. I also attached a manager log.

worker-log-start.txt
worker-log-end.txt
manager-log.txt

thanks

@alontorres

@mrjana @thaJeztah @mavenugo

More failures, this time with 1.12.1 stable.
This is exactly what I did a few minutes ago:

  1. Created a new swarm from scratch, using 1.12.1 stable, with 1 drained manager and 2 workers.
  2. Created an ingress network with subnet 137.0.0.0/24.
  3. Created a service using the test image we discussed before in this issue (a simple Node.js app that returns the date and hostname), publishing container port 3000 on port 31001.
  4. Scaled the replicas up to 14 - curled 10 times and all requests succeeded.
  5. Scaled down to 11 - curled and immediately started getting timeouts.

One of the workers answers curl localhost:31001 consistently. The requests return hostnames of tasks that reside on both workers, not just the one I'm curling, which means the routing mesh is working there.
The other worker keeps failing on curl localhost:31001. I had debug logs enabled, and the only error I found on that worker was: level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=cjo2id913p6y7y6hyu9yxq0g6

All TCP and UDP ports are open between the various nodes. They are all configured identically.
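
To compare the two workers, this is roughly what was run (standard 1.12 commands, nothing specific to this setup beyond the pingapp service name and port 31001):

# On a manager: see where the tasks actually landed
docker node ls
docker service ps pingapp

# On each worker: hit the published port directly to check whether the
# routing mesh on that node forwards to any backend at all
for i in $(seq 1 10); do
  curl -sS --max-time 8 localhost:31001 || echo "request $i failed"
done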

@eran-totango
Author

@mrjana @thaJeztah @mavenugo
any update on this issue?

resouer pushed a commit to resouer/docker that referenced this issue Oct 24, 2016
* Fixes moby#25236
* Fixes moby#24789
* Fixes moby#25340
* Fixes moby#25130
* Fixes moby/libnetwork#1387
* Fix external DNS responses > 512 bytes getting dropped
* Fix crash when remote plugin returns empty address string
* Make service LB work from self
* Fixed a few race-conditions

Signed-off-by: Madhu Venugopal <madhu@docker.com>