kube-aws: Networking slower using 0.7 vs. 0.4 #533
Comments
@four43 which 0.4 vs. 0.7 versions are you referring to? |
The tags of this repo. We are using multi-node on AWS: https://github.com/coreos/coreos-kubernetes/tree/master/multi-node/aws |
There are a lot of variables here but we tried to keep things consistent. We have the same app, running on the same class of servers, in the same VPC, just with the kube-aws created subnet. |
@four43 Please can you clarify what you mean by "with and without Calico"; why do you suspect this to be a Calico issue if you see the performance degradation without Calico? |
Calico was new to us (it wasn't around for the Kubernetes 1.1.8 version of this tool), so we tried running both with and without it to rule it out.
|
If you set useCalico to false then I think Calico isn't even installed.
|
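For reference, a quick way to confirm whether Calico is actually running on a cluster (a sketch; it assumes Calico would show up either as pods or as a calico-node systemd unit on the hosts):

```bash
# Check for Calico pods in any namespace; prints a fallback message if none exist.
kubectl get pods --all-namespaces | grep -i calico || echo "no calico pods"

# On CoreOS hosts, calico-node may also run as a systemd unit rather than a pod.
systemctl status calico-node 2>/dev/null || echo "no calico-node unit"
```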
@four43 "our cluster's performance degrades on simple network operations when there is a slight volume increase". Can you put this in numbers, say with IPerf ? If using flannel, what's your networking mode set to (vxlan/host-gw/aws-vpc)? |
@fasaxc - I agree, I just wanted to throw that out there to show that I had tried it. @somejfn - I'll pull some numbers with iperf. I haven't used that tool before, thanks for pointing me in the right direction. Also, I'm not sure how to check flannel's networking mode. I'm using this tool to create my environment (multi-node/aws). The generated cloud-config shows a flannel section and specifies interface and etcd_endpoints. There is a drop-in on the controller that specifies "vxlan":
- name: flanneld.service
  drop-ins:
    - name: 10-etcd.conf
      content: |
        [Service]
        ExecStartPre=/usr/bin/curl --silent -X PUT -d \
          "value={\"Network\" : \"{{.PodCIDR}}\", \"Backend\" : {\"Type\" : \"vxlan\"}}" \
          http://localhost:2379/v2/keys/coreos.com/network/config?prevExist=false
There is nothing of the sort in the worker setup. Could that be causing issues, or will flannel contact the controller and etcd for that info? Thanks guys, I really appreciate it. Looking into iperf now...
EDIT: The flannel config generated by this repo's v0.4 tag (the controller artifact) had no specification of a backend. Interesting. |
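As an aside, one way to check which backend flannel is actually using is to read the config back out of etcd from the controller, i.e. the same key the drop-in above writes to (a sketch; the endpoint is an assumption and may differ on your cluster):

```bash
# Read flannel's network config from etcd (etcd v2 API, same key as the drop-in above).
curl -s http://localhost:2379/v2/keys/coreos.com/network/config
# The stored value should contain something like
#   {"Network": "...", "Backend": {"Type": "vxlan"}}
# If "Backend" is absent, flannel falls back to its default backend (udp).
```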
Bandwidth Checks (Good)
iperf running on Kubernetes hosts on AWS, same VPC, different availability zones. Running m4.xlarge servers.
Raw EC2 -> EC2:
Container -> Container (same host):
Container -> Container (different hosts):
That seems pretty dang good. Bandwidth seems fine between everything. Moving on to latency checks... |
Latency Checks (Good)
Latency tests using ping. Same setup as above.
Ping Container -> Container (different hosts):
Again, fast! Let's try curl with some fancy output and different paths (to avoid app caching):
Okay, a little different there. The time_namelookup and time_connect seem high.
Running that same request in quick succession:
Which is significantly faster. Looking into specifically what those curl variables mean...
Looks like a DNS issue?
None of the containers in the DNS pods seem particularly chatty... Those log messages are from when we started our cluster. Let's try without DNS:
Nice! So it seems the Service DNS record format (my-svc.namespace) is slow to resolve to an IP. |
Sorry @four43, I thought this issue was in the Calico repo. |
Your iperf numbers look good. DNS resolution does seem a bit slow (have you tried curling the service IP directly to take DNS out of the picture?).
|
@somejfn - I actually tried curl using the service IP there at the end. Sorry, it was an edit, so you wouldn't have seen it via email. I didn't want to spam this thread and crush you all with notifications. Please see those results above; they were much faster. We are using SkyDNS as provided by this tool (multi-node/aws). |
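To repeat that service-IP comparison, the ClusterIP can be looked up and curled directly, which takes cluster DNS out of the path entirely (service and namespace names here are placeholders):

```bash
# Find the service's ClusterIP.
kubectl describe svc my-svc --namespace=my-namespace | grep '^IP:'

# Hit the service by IP instead of by DNS name (substitute the ClusterIP found above).
curl -o /dev/null -s -w "total: %{time_total}\n" http://<cluster-ip>/_v1/ping
```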
External DNS Tests (Bad)
Inside a pod -> external address:
From a worker -> external address:
My cluster DNS is slow even when resolving outside addresses. The host, however, is very fast. |
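A sketch of that pod-vs-host comparison (the pod name and target hostname are assumptions, and the pod's image needs nslookup available):

```bash
# Time an external lookup from inside a pod (goes through the cluster DNS service).
kubectl exec some-pod -- sh -c 'time nslookup google.com'

# Time the same lookup directly on a worker node (uses the host's resolver).
time nslookup google.com
```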
Solution - Scale up DNS
So after all this poking around I found a simple solution: add more DNS pods. By scaling up the DNS RC I saw much more consistent, faster lookups. Thanks all for your help.
TL;DR - If your DNS is slow, add more DNS pods. |
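A minimal sketch of that fix, assuming the 1.2-era kube-dns replication controller (the RC name varies by release, e.g. kube-dns-v11, so check what your cluster actually runs):

```bash
# Find the DNS replication controller's actual name.
kubectl --namespace=kube-system get rc

# Scale it up; 6 replicas is what worked for the cluster described in this thread.
kubectl --namespace=kube-system scale rc kube-dns-v11 --replicas=6
```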
This issue was a great read, thanks a bunch @four43 for the detailed debugging breakdown. Can I ask what kind of scale you're working at that DNS became an issue? I wonder if we should up the cpu limits on the DNS pod by default. Maybe DNS should run as a daemonset? I just googled that idea and found this issue: kubernetes/kubernetes#26707, so it looks like it might be a fairly common problem. |
@cgag - We aren't running a huge load on this little cluster. We have a cluster of 3 workers (m4.xlarge) currently with this test load. We are servicing about 8,000 req/min. Each of those requests has anywhere from 2 to 10 services it will contact both internal to the cluster and externally. We found a scale of 6 DNS pods to provide stable, relatively quick DNS results in that 0.015s range. EDIT: I tossed a comment over there to hopefully help build some interest. Good find on that issue, @cgag |
We have run into similar issues in a number of ways now, both with the number of pods being too low and with the CPU/memory limits being too low on kube-dns / healthz. We have also seen the problem get exponentially worse if we had more DNS pods than worker nodes. A DaemonSet does seem to be a good solution; we are busy converting our test and production clusters over and will post findings afterwards. |
I'm curious if just setting the resource limits higher on the DNS RC helps.
|
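If anyone wants to test that, a sketch for inspecting and raising the limits (RC name and namespace are assumptions, as above):

```bash
# Show the current CPU/memory requests and limits on the DNS pods' containers.
kubectl --namespace=kube-system describe rc kube-dns-v11 | grep -iA2 limits

# Open the RC in an editor to raise the limits; existing pods must be recreated
# (e.g. deleted so the RC replaces them) before the new limits take effect.
kubectl --namespace=kube-system edit rc kube-dns-v11
```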
Hey all,
This is going to be unfortunately vague, but we are seeing significantly slower networking performance from 0.7 (using Kubernetes 1.2.4, with and without Calico) compared to 0.4 (using Kubernetes 1.1.8). We've had a heck of a time tracking down the issue, but our cluster's performance degrades on simple network operations when there is a slight increase in volume. cAdvisor doesn't show anything out of the ordinary with regard to networking, from what I can tell. Could it be something with the Kubernetes Network Policy? Is there something I can look at or check?
I'm watching latency from pods to outside connections, and those seem slow. Connections via ping to /_v1/ping seem quick, however. There were previous issues in the Kubernetes repo about that. It seems almost like a bandwidth limitation or a limitation in open sockets? nofiles seems set correctly, however. I'm out of ideas.
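For completeness, a sketch of the quick checks behind the "nofiles seems set correctly" observation (the process name is just an example):

```bash
# File-descriptor limit for the current shell.
ulimit -n

# File-descriptor limit for a running process, e.g. kube-proxy.
cat /proc/$(pgrep -f kube-proxy | head -n1)/limits | grep -i 'open files'
```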