1.12.0-rc4 - Random network issues on service with published port. #24789
Comments
+1. Two hardware servers, 1 overlay network, 2 services, 2 containers on each server.
Now I have removed all services, removed the overlay network, and the 2nd server has left the swarm, but still …
@bluepuma77 I'm not sure how a non-responsive …
@cpuguy83 Hey :)
ping @sanimej, do you know if this was resolved in the latest libnetwork bump?
@thaJeztah No, there wasn't any specific change for this issue. @eran-totango To eliminate some variables, can you confirm the slowness happens only after you do service updates to increase or decrease the number of replicas?
@sanimej I now see the slowness even with only 1 replica on only 1 node in the swarm cluster. After a service update, it sometimes works and sometimes doesn't.
@sanimej is there any way to help diagnose the underlying problem? I'm guessing something in the swarm routing is the culprit.
Any update on this issue? We're using a pretty basic swarm configuration but are hitting this network performance inconsistency, which is blocking us from moving swarm to production. Is there a workaround of sorts we can use until the issue is looked at? Any suggestions would be much appreciated!
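Until there's a fix, one way to narrow this down is to rule out blocked swarm ports and to hit the published port on each node directly. A minimal sketch, assuming Linux hosts with netcat and curl available; the 10.0.0.2 address and the node1/node2/node3 hostnames are placeholders:

```sh
# Swarm mode needs these ports open between every pair of nodes:
#   TCP 2377      - cluster management traffic
#   TCP/UDP 7946  - gossip among the nodes
#   UDP 4789      - VXLAN overlay network traffic
nc -zv  10.0.0.2 2377
nc -zv  10.0.0.2 7946
nc -zvu 10.0.0.2 7946

# The routing mesh publishes the port on every node, so test each node
# directly to see which ones stall (31001 is the published port from this issue):
for host in node1 node2 node3; do
  time curl -sS -m 5 -o /dev/null "http://$host:31001/" \
    && echo "$host ok" || echo "$host slow/failed"
done
```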
@eran-totango @alontorres @oraboy We fixed a few issues in RC5 that resolve some IP-address conflicts, among other things. Were you able to give RC5 a try? If you get a chance to try RC5 and it happens again, please give us the exact reproduction steps, including whether you had restarted the daemon with running services, etc. Also start the daemon with debug logs ("-D") and pass on the daemon logs.
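For reference, a minimal sketch of collecting those debug logs, assuming a systemd-based host (merge the "debug" key into /etc/docker/daemon.json if the file already exists, rather than overwriting it as below):

```sh
# Enable daemon debug logging; starting dockerd directly with -D does the same.
echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker

# Dump the daemon logs to attach to the issue.
sudo journalctl -u docker.service --no-pager > daemon-debug.log
```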
This should possibly be fixed by moby/libnetwork#1370 and should be available in the next patch release. |
* Fixes moby#25236
* Fixes moby#24789
* Fixes moby#25340
* Fixes moby#25130
* Fixes moby/libnetwork#1387
* Fix external DNS responses > 512 bytes getting dropped
* Fix crash when remote plugin returns empty address string
* Make service LB work from self
* Fixed a few race-conditions

Signed-off-by: Madhu Venugopal <madhu@docker.com>
* Fixes moby#25236
* Fixes moby#24789
* Fixes moby#25340
* Fixes moby#25130
* Fixes moby/libnetwork#1387
* Fix external DNS responses > 512 bytes getting dropped
* Fix crash when remote plugin returns empty address string
* Make service LB work from self
* Fixed a few race-conditions

Signed-off-by: Madhu Venugopal <madhu@docker.com>
(cherry picked from commit 6645ff8)
Signed-off-by: Tibor Vass <tibor@docker.com>
@thaJeztah @mavenugo Some of the errors I've found repeating in the logs: …
Reproduction steps are exactly like @eran-totango described before. This time, adding more workers after the service was already up triggered the issue for us. We will try to create a new cluster tomorrow and attempt to reproduce the issue again - I'll update with the results.
@alontorres The gossip layer in your cluster seems to be having a problem, which may be caused by network congestion in your cluster. If you have gossip issues in your cluster, service load balancing is bound to have issues. Do you know at what time this message … appeared in your logs? Did it happen when you added more nodes to the cluster or when you created services?
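There's no dedicated CLI for gossip health, but as a rough first check you can watch node availability and grep the daemon logs for messages from memberlist, the gossip library libnetwork uses; a sketch, assuming systemd hosts:

```sh
# Nodes flapping between Ready and Down point at gossip/connectivity trouble.
watch -n 5 docker node ls

# Pull gossip-layer messages out of the daemon logs.
sudo journalctl -u docker.service | grep -i memberlist | tail -n 50
```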
@mrjana - this error has been happening randomly every few hours since the swarm was created about two days ago.
@mrjana @thaJeztah @mavenugo We've tried opening up all TCP and UDP ports, using different instance types, and playing with different numbers of managers and workers, but we are still getting the same issues. After a while, scaling a service causes intermittent timeouts. We are getting many errors, now including …
@alontorres Is there a consistent set of reproduction steps? Would you mind attaching your whole docker daemon logs here?
@mrjana Here's a shorter version: …
I've also attached logs as requested. The worker logs are extremely long and largely repetitive, so I attached the first and last 100k lines. I also attached a manager log. worker-log-start.txt Thanks.
More failures. This time with 1.12.1 stable.
One of the workers returns … All TCP and UDP ports are open between the various nodes. They are all configured identically.
@mrjana @thaJeztah @mavenugo …
Output of `docker version`:
Output of `docker info`:
Additional environment details (AWS, VirtualBox, physical, etc.):
AWS EC2.
Steps to reproduce the issue:
1. `docker network create -d overlay --subnet 172.0.0.0/24 mynet`
2. `docker service create --name my_service --network mynet --registry-auth -p 31001:3008 --replicas 1 --update-delay 10s --update-parallelism 1 <image_name>`
3. `docker service update --registry-auth --replicas X my_service`
To query my_service, I have a DNS entry that points to an ELB, and the ELB points only to the swarm managers. What happens is that in some scenarios querying the service becomes really slow, or it doesn't respond at all and I get a timeout from my_service.
For my tests, I created a few different clusters:
(1 manager 0 workers, 1 manager 1 worker, 1 manager 2 workers, 2 managers 2 workers, 2 managers 0 workers, 3 managers 0 workers)
On each cluster setup, I created the network and the service with 1 replica, queried the service, then increased the replicas one at a time from 1 to 8 (querying the service after each change), and then decreased the replicas from 8 back to 1 (also querying the service after each change).
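That test loop, as a rough script; the 15-second settle time is an assumption, and localhost:31001 stands in for the ELB endpoint described above:

```sh
#!/bin/sh
# Scale the service 1..8 and back down to 1, timing a request after each step.
for n in 1 2 3 4 5 6 7 8 7 6 5 4 3 2 1; do
  docker service update --registry-auth --replicas "$n" my_service
  sleep 15   # assumed time for the update to settle
  time curl -sS -m 10 -o /dev/null "http://localhost:31001/" \
    && echo "replicas=$n ok" || echo "replicas=$n slow/timeout"
done
```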
Describe the results you received:
I've been investigating this for a long time but could not find a definite pattern for reproducing the issue. I can reproduce it, but I can't say exactly in which scenario it happens.
This is what I found: …
I really hope this information is useful for investigating this issue.
Is there any chance it's related to the network I created?
Describe the results you expected:
No slow responses at all, no matter how many replicas a service has, and no matter how many managers or workers are in the cluster.
Additional information you deem important (e.g. issue happens only occasionally):