
1.12.0-rc4 - Random network issues on service with published port. #24789

Closed
eran-totango opened this issue Jul 19, 2016 · 19 comments
Labels
area/networking priority/P3 Best effort: those are nice to have / minor issues. version/1.12
Milestone

Comments

@eran-totango

Output of docker version:

Client:
 Version:      1.12.0-rc4
 API version:  1.24
 Go version:   go1.6.2
 Git commit:   e4a0dbc
 Built:        Wed Jul 13 03:54:54 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.0-rc4
 API version:  1.24
 Go version:   go1.6.2
 Git commit:   e4a0dbc
 Built:        Wed Jul 13 03:54:54 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 4
Server Version: 1.12.0-rc4
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 9
 Dirperm1 Supported: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge overlay host null
Swarm: pending
 NodeID: 5vgjx6gx8gvkotr6tx5gpndg4
 IsManager: Yes
 Managers: 0
 Nodes: 0
Runtimes: runc
Default Runtime: runc
Security Options: apparmor
Kernel Version: 3.13.0-91-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.954 GiB
Name: swarm-manager-test-11998
ID: RPKB:WGGU:C5DZ:GSXI:6XLE:5IRX:F75X:J3SV:JFWW:QLRY:777H:TUP7
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):
AWS EC2.

Steps to reproduce the issue:

  1. Set up a swarm cluster.
  2. Create a network: docker network create -d overlay --subnet 172.0.0.0/24 mynet
  3. Create a service: docker service create --name my_service --network mynet --registry-auth -p 31001:3008 --replicas 1 --update-delay 10s --update-parallelism 1 <image_name>
  4. Start changing the number of replicas for this service. The update command used is:
    docker service update --registry-auth --replicas X my_service

To query my_service, I have a DNS entry that points to an ELB, and the ELB points only to the swarm managers. In some scenarios, querying the service becomes really slow, or it doesn't respond at all and I get a timeout from my_service.

For my tests, I created a few different clusters:
(1 manager 0 workers, 1 manager 1 worker, 1 manager 2 workers, 2 managers 2 workers, 2 managers 0 workers, 3 managers 0 workers)
On each cluster setup, I created the network and the service with 1 replica, queried the service, then increased the replicas one at a time from 1 to 8 (querying the service after each change), and then decreased the replicas from 8 to 1 (also querying the service after each change).
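
Roughly, that test loop looks like the sketch below (the endpoint is a placeholder for my DNS entry in front of the ELB, and the 30-second wait is only illustrative):

#!/bin/bash
# Scale up from 1 to 8 replicas and back down, querying the published
# port a few times after each change.
ENDPOINT="http://<dns-entry>:31001"   # placeholder

for replicas in 1 2 3 4 5 6 7 8 7 6 5 4 3 2 1; do
  docker service update --registry-auth --replicas "$replicas" my_service
  sleep 30   # give the tasks some time to converge
  for i in $(seq 1 5); do
    curl -sS --max-time 8 "$ENDPOINT" || echo "replicas=$replicas request=$i: failed"
  done
done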

Describe the results you received:
I investigated this for a long time but could not find a definite pattern for reproducing the issue. I can reproduce it, but I can't say exactly in which scenario it happens.
This is what I found:

1 NODE: 1 manager and 0 workers
Everything always works.
2 NODES: 1 manager and 1 worker.
1 replica on manager. (answers fast)
2 replicas. 1 on manager and one on worker. (answers but slowly)
3 replicas. 1 on manager and 2 on worker. (answers but way slower)
4 replicas. 2 on manager and 2 on worker. (answers slowly but sometimes fast)
5 replicas. 3 on manager and 2 on worker. (2 fast answers and one slow answer)
6 replicas. 3 on manager and 3 on worker. (1 fast answers 2 slow answers)
7 replicas. 4 on manager 3 on worker. (no certain pattern)
8 replicas. 4 on each node. (1 fast answer 1 slow answer)
7 replicas. 4 manager 3 worker. (1 fast answer 1 slow answer)
6 replicas. 3 manager 3 worker. (most of the answers were slow)
5 replicas. 3 manager 2 worker. (most of the answers were slow)
4 replicas. 3 manager 1 worker. (no certain pattern)
3 replicas. 2 manager 1 worker. (always fast)
2 replicas. 1 manager one worker. (always fast)
1 replica. (always fast)
3 NODES: 1 manager and 2 workers.
1 replica on manager. (answers fast)
2 replicas. 1 manager and 1 worker. (slow)
3 replicas. 1 on each node. (slow)
4 replicas. 2 on worker and one on each manager. (slow)
5 replicas. 2 manager, 2 on worker and 1 on other worker. (slow)
6 replicas. 2 on each node. (slow)
7 replicas. 3 on worker and 2 on each other node. (slow)
8 replicas. 2 on manager and 3 on each worker. (slow)
7 replicas. 2 on manager and worker, 3 on other worker. (slow)
6 replicas. (slow)
5 replicas. (slow)
4 replicas. (slow, but sometimes better)
3 replicas. (slow)
2 replicas. (really slow)
1 replica on WORKER. (timeout)
1 replica on MANAGER. (answers fast again!!)
4 NODES: 2 managers and 2 workers.
I tested this scenario twice. The first time, everything worked properly; there were no slow answers at all.
The second time, I sometimes got slow answers but could not find a definite pattern for it.
2 NODES: 2 managers 0 workers.
It worked for most of my tests, except when I decreased the replicas from 3 to 2.
After that, the service responded really slowly.
I noticed that when I started the service it was on MANAGER A, and when I decreased from 3 to 2 replicas, the replicas ended up only on MANAGER B.
I tried draining MANAGER B (making the replicas move to MANAGER A) and it worked again.

I really hope this information will be useful in investigating this issue.
Is there any chance it's related to the network I created?

Describe the results you expected:
No slow answers at all, no matter how many replicas the service has, and no matter how many managers or workers the cluster has.

Additional information you deem important (e.g. issue happens only occasionally):

@cpuguy83 cpuguy83 added this to the 1.12.0 milestone Jul 19, 2016
@cpuguy83 cpuguy83 added the priority/P3 Best effort: those are nice to have / minor issues. label Jul 19, 2016
@bluepuma77

+1 for 1.12.0-rc4

Two hardware servers, 1 overlay network, 2 services, 2 containers on each server.

docker ps does not return on the swarm master; it just hangs forever.

I have since removed all services, removed the overlay network, and had the 2nd server leave the swarm, but docker ps is still stuck on the swarm leader server.

@cpuguy83
Member

@bluepuma77 I'm not sure how a non-responsive docker ps relates to this; can you explain further? Are you seeing slow network responses for the container?

@eran-totango
Author

@cpuguy83 Hey :)
Any update on this issue?

@thaJeztah
Member

ping @sanimej, do you know if this was resolved in the latest libnetwork bump?

@sanimej

sanimej commented Jul 25, 2016

@thaJeztah No, there wasn't any specific change for this issue.

@eran-totango To eliminate some variables, can you confirm that the slowness happens only after you do service updates to increase or decrease the number of replicas?

@eran-totango
Author

eran-totango commented Jul 26, 2016

@sanimej I now see the slowness even with only 1 replica on only 1 node in the swarm cluster.

After a service update, it sometimes works and sometimes doesn't.
The update is done using this command:
docker service update --registry-auth --image X <service_name>

@alontorres

@sanimej Is there any way we can help diagnose the underlying problem? I'm guessing something in the swarm routing is the culprit.
This is a show-stopper for us; Swarm is essentially unusable for us right now due to this issue.

@thaJeztah thaJeztah modified the milestones: 1.12.1, 1.12.0 Jul 28, 2016
@oraboy

oraboy commented Jul 28, 2016

Any update on this issue? We're using a pretty basic swarm configuration but are hitting this network performance inconsistency which is blocking us from moving swarm to production.

Is there a workaround of sorts we can use until the issue is looked at? Any suggestions would be much appreciated!

@mavenugo
Contributor

@eran-totango @alontorres @oraboy we fixed a few issues in rc5 that resolve some IP-address conflict issues and others. Were you able to give rc5 a try?
When a timeout happens at the routing-mesh level, it typically indicates that a particular load-balancing call (towards the backend) is failing. The failure could be either IPVS misprogramming or an issue with forwarding on the overlay network for that particular container/IP address.

If you get a chance to try rc5 and it happens again, please give us the exact reproduction steps, including whether you had restarted the daemon with running services, etc.

Also start the daemon with debug logging ("-D") and pass on the daemon logs.
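
As a rough sketch of collecting that information (the ingress_sbox namespace name is an assumption about how the 1.12 routing mesh names its load-balancer sandbox, and ipvsadm must be installed on the host):

# On Ubuntu 14.04 the flag is usually set via DOCKER_OPTS in /etc/default/docker;
# running the daemon by hand with debug logging would be:
dockerd -D

# To check the IPVS programming used by the routing mesh on a node
# (assumes the ingress load-balancer netns is called ingress_sbox):
ls /var/run/docker/netns
nsenter --net=/var/run/docker/netns/ingress_sbox ipvsadm -ln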

@eran-totango
Author

@thaJeztah @mavenugo

We tried Docker 1.12.0 and this is what we experienced (it can be reproduced):

  1. Set up a cluster with 1 manager and 2 workers.
  2. Set the manager's availability to drain.
  3. Configure an AWS ELB for the swarm workers, and configure a DNS record in Route 53 that points to the ELB.
  4. Create a network with this command:
    docker network create -d overlay --subnet 137.0.0.0/24 mynet
  5. Create a service running simple Node.js code that returns the date and hostname when you curl it:
    docker service create --name pingapp --network mynet -p 31001:3000 --replicas 1 erantotango/pingapp:latest
  6. Create a basic bash script that, every 5 minutes, changes the service scale to a random number from 1 to 14 (using docker service scale pingapp=X) and then queries the service 10 times in a row on port 31001 using the DNS entry we created (a sketch of such a script is shown after this list):
    curl -sS --max-time 8 swarm-sndbx.<domain-name>:31001
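
A rough sketch of such a script (the DNS name stays as the placeholder above; only the scale range and timing come from the description):

#!/bin/bash
# Every 5 minutes: scale pingapp to a random replica count between 1 and 14,
# then query the published port 10 times through the DNS entry.
ENDPOINT="swarm-sndbx.<domain-name>:31001"   # placeholder

while true; do
  replicas=$(( (RANDOM % 14) + 1 ))
  docker service scale pingapp=$replicas
  for i in $(seq 1 10); do
    curl -sS --max-time 8 "$ENDPOINT" || echo "replicas=$replicas request=$i: failed"
  done
  sleep 300
done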

Attached is the build history from our Jenkins job that automated step 6.
You can see it's not working well; most of the time the service didn't respond.
[screenshot: Jenkins build history, 2016-08-03 10:38 AM]

(NOTE: when the service didn't answer, we checked and we did have running containers for our pingapp.)

@mrjana
Contributor

mrjana commented Aug 3, 2016

This should possibly be fixed by moby/libnetwork#1370 and should be available in the next patch release.

mavenugo added a commit to mavenugo/docker that referenced this issue Aug 11, 2016
* Fixes moby#25236
* Fixes moby#24789
* Fixes moby#25340
* Fixes moby#25130
* Fixes moby/libnetwork#1387
* Fix external DNS responses > 512 bytes getting dropped
* Fix crash when remote plugin returns empty address string
* Make service LB work from self
* Fixed a few race-conditions

Signed-off-by: Madhu Venugopal <madhu@docker.com>
tiborvass pushed a commit to tiborvass/docker that referenced this issue Aug 12, 2016
* Fixes moby#25236
* Fixes moby#24789
* Fixes moby#25340
* Fixes moby#25130
* Fixes moby/libnetwork#1387
* Fix external DNS responses > 512 bytes getting dropped
* Fix crash when remote plugin returns empty address string
* Make service LB work from self
* Fixed a few race-conditions

Signed-off-by: Madhu Venugopal <madhu@docker.com>
(cherry picked from commit 6645ff8)
Signed-off-by: Tibor Vass <tibor@docker.com>
@alontorres

alontorres commented Aug 15, 2016

@thaJeztah @mavenugo
I'd like to update that we've created a new swarm cluster running version 1.12.1-rc1 with 1 drained manager and 2 workers. After again running the Jenkins job @eran-totango mentioned above for approximately 30 hours without any issues, we tried adding 2 more workers to the swarm and the network issues returned. Some of the service tasks fail randomly with a "network <network_name> not found" error, even though the network exists on the node. Most of the time, all the tasks start running as expected, but most of the curl requests still time out.

Some of the errors I've found repeating in the logs:

Failed to delete real server 10.255.0.18 for vip 10.255.0.5 fwmark 256: no such file or directory

[INFO] memberlist: Suspect <hostname> has failed, no acks received\n

network <network name> remove failed: network <network name> not found

The reproduction steps are exactly as @eran-totango described before. This time, adding more workers after the service was already up triggered the issue for us. We will try to create a new cluster tomorrow and attempt to reproduce the issue again - I'll update with the results.

@mrjana
Contributor

mrjana commented Aug 15, 2016

@alontorres The gossip layer in your cluster seems to be having a problem, which may be caused by network congestion. If you have gossip issues in your cluster, service load balancing is bound to have issues. Do you know at what time this message

[INFO] memberlist: Suspect <hostname> has failed, no acks received\n

appeared in your logs? Did it happen when you added more nodes to the cluster or when you created services?

@alontorres

@mrjana - this error has been happening randomly every few hours since the swarm was created about two days ago.

@alontorres

alontorres commented Aug 17, 2016

@mrjana @thaJeztah @mavenugo We've tried opening up all TCP and UDP ports, using different instance types, and playing with different numbers of managers and workers, but we are still getting the same issues. After a while, scaling a service causes intermittent timeouts. We are getting many errors, now including msg="fatal task error" error="Unable to complete atomic operation, key modified" module=taskmanager

@mrjana
Contributor

mrjana commented Aug 17, 2016

@alontorres Is there a consistent set of reproduction steps? Would you mind attaching your whole docker daemon logs here?

@alontorres

alontorres commented Aug 17, 2016

@mrjana
The steps were already described by @eran-totango - he even added a nice screenshot from Jenkins :)

Here's a shorter version:

  1. Create a swarm cluster with a manager and a few workers.
  2. Create a simple service that listens on a published port.
  3. Make a timed script that runs every few minutes, resizes the service, and tries to query one of the nodes in the swarm a few times using the service's published port.
  4. After a few hours, the script should start getting timeouts.
  5. If it doesn't, adding more workers seems to make the issues start occurring sooner.

I've also attached the logs as requested. The worker logs are extremely long and very repetitive, so I attached the first and last 100k lines. I also attached a manager log.

worker-log-start.txt
worker-log-end.txt
manager-log.txt

thanks

@alontorres

@mrjana @thaJeztah @mavenugo

More failures, this time with 1.12.1 stable.
This is exactly what I did a few minutes ago:

  1. Created a new swarm from scratch, using 1.12.1 stable, with 1 drained manager and 2 workers.
  2. Created an ingress network with subnet 137.0.0.0/24.
  3. Created a service using the test image we discussed before in this issue (a simple Node.js app that returns the date and hostname), publishing container port 3000 on port 31001.
  4. Scaled the replicas up to 14 - curled 10 times and all requests succeeded.
  5. Scaled down to 11 - curled and immediately started getting timeouts.

One of the workers answers curl localhost:31001 consistently. The requests return hostnames of tasks that reside on both workers, not just the one I'm curling, which means the routing mesh is working there.
The other worker keeps failing on curl localhost:31001. I had debug logs enabled, and the only error I found on that worker was: level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=cjo2id913p6y7y6hyu9yxq0g6

All TCP and UDP ports are open between the various nodes. They are all configured identically.
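
To compare the two workers, this is roughly what was run (standard 1.12 commands, nothing specific to this setup beyond the pingapp service name and port 31001):

# On a manager: see where the tasks actually landed
docker node ls
docker service ps pingapp

# On each worker: hit the published port directly to check whether the
# routing mesh on that node forwards to any backend at all
for i in $(seq 1 10); do
  curl -sS --max-time 8 localhost:31001 || echo "request $i failed"
done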

@eran-totango
Author

@mrjana @thaJeztah @mavenugo
any update on this issue?

resouer pushed a commit to resouer/docker that referenced this issue Oct 24, 2016
* Fixes moby#25236
* Fixes moby#24789
* Fixes moby#25340
* Fixes moby#25130
* Fixes moby/libnetwork#1387
* Fix external DNS responses > 512 bytes getting dropped
* Fix crash when remote plugin returns empty address string
* Make service LB work from self
* Fixed a few race-conditions

Signed-off-by: Madhu Venugopal <madhu@docker.com>