kube-aws: Networking slower using 0.7 vs. 0.4 #533

Closed
four43 opened this Issue Jun 7, 2016 · 19 comments

@four43
four43 commented Jun 7, 2016 edited

Hey all,

This is going to be unfortunately vague, but we are seeing significantly slower networking performance on 0.7 (Kubernetes 1.2.4, with and without Calico) than on 0.4 (Kubernetes 1.1.8). We've had a heck of a time tracking down the issue, but our cluster's performance degrades on simple network operations when there is a slight increase in volume. As far as I can tell, cAdvisor doesn't show anything out of the ordinary for networking. Could it be something with the Kubernetes Network Policy? Is there something I can look at or check?

I'm watching latency from pods to outside connections, and those seem slow. A connection via ping to /_v1/ping seems quick, however. There were previous issues in the Kubernetes repo about that. It almost seems like a bandwidth limitation or a limit on open sockets? nofile seems to be set correctly, however. I'm out of ideas.

@idvoretskyi
Contributor

@four43 which 0.4 and 0.7 versions are you referring to?

@four43
four43 commented Jun 7, 2016

The tags of this repo. We are using multi-node on AWS: https://github.com/coreos/coreos-kubernetes/tree/master/multi-node/aws

@four43
four43 commented Jun 7, 2016

There are a lot of variables here but we tried to keep things consistent. We have the same app, running on the same class of servers, in the same VPC, just with the kube-aws created subnet.

@fasaxc
fasaxc commented Jun 8, 2016

@four43 Please can you clarify what you mean by "with and without Calico"; why do you suspect this to be a Calico issue if you see the performance degradation without Calico?

@four43
four43 commented Jun 8, 2016

Calico was new to us (since it wasn't around for the 1.1.8 version of
Kubernetes). There is a portion of the YAML config for the kube-aws tool
that has "useCalico". We set it to false and deployed, saw slowness. We
also set it to true, deployed, and saw similar slowness.

-Seth

@fasaxc
fasaxc commented Jun 8, 2016

If you set useCalico to false then I think calico isn't even installed.
Unless I'm misunderstanding something, the problem has to be elsewhere.

@somejfn
somejfn commented Jun 8, 2016

@four43 "our cluster's performance degrades on simple network operations when there is a slight volume increase". Can you put this in numbers, say with iperf? If you're using flannel, what's your networking mode set to (vxlan/host-gw/aws-vpc)?

@four43
four43 commented Jun 8, 2016 edited

@fasaxc - I agree, I just wanted to throw that out there to show that I had tried it.

@somejfn - I'll pull some numbers with iperf. I haven't used that tool before, so thanks for the pointer in the right direction. Also, I'm not sure how to check flannel's networking mode. I'm using this tool to create my environment (multi-node/aws). The generated cloud-config shows a flannel section and specifies interface and etcd_endpoints. There is a drop-in on the controller that specifies "vxlan":

- name: flanneld.service
  drop-ins:
    - name: 10-etcd.conf
      content: |
        [Service]
        ExecStartPre=/usr/bin/curl --silent -X PUT -d \
        "value={\"Network\" : \"{{.PodCIDR}}\", \"Backend\" : {\"Type\" : \"vxlan\"}}" \
        http://localhost:2379/v2/keys/coreos.com/network/config?prevExist=false

Nothing of the sort on the worker setup.

Could that be causing issues or will flannel contact the controller and etcd for that info?

Thanks guys, I really appreciate it. Looking into IPerf now...


EDIT: The previous flannel config from this repo's v0.4 tag output something like this (artifact for controller):

RES=$(curl --silent -X PUT -d "value={\"Network\":\"$POD_NETWORK\"}" "$ACTIVE_ETCD/v2/keys/coreos.com/network/config?prevExist=false")

No specification of backend. Interesting.
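To make the difference between the two tags concrete, here is a minimal Python sketch that builds both etcd values and reports which backend each one names. The helper names and the pod CIDR are placeholders invented for illustration, not values from this cluster. flannel's documentation describes udp as the default when no Backend is given, which is worth verifying against the flannel version in use.

```python
import json


def flannel_config(pod_cidr, backend_type=None):
    """Build the JSON value stored under coreos.com/network/config in etcd."""
    config = {"Network": pod_cidr}
    if backend_type is not None:
        config["Backend"] = {"Type": backend_type}
    return json.dumps(config)


def backend_of(config_json):
    """Return the backend type named in a config value, or None if absent."""
    return json.loads(config_json).get("Backend", {}).get("Type")


POD_CIDR = "10.2.0.0/16"  # placeholder, not taken from the cluster above

v07_value = flannel_config(POD_CIDR, "vxlan")  # shape of the v0.7 drop-in
v04_value = flannel_config(POD_CIDR)           # shape of the v0.4 script

print(backend_of(v07_value))  # vxlan
print(backend_of(v04_value))  # None
```

flanneld reads this key from etcd, so workers pick up the same value through their etcd_endpoints setting. The live value can be read on a controller with a plain GET against the same URL the drop-in writes to: `curl --silent http://localhost:2379/v2/keys/coreos.com/network/config`.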

@four43
four43 commented Jun 8, 2016 edited

Bandwidth Checks (Good)

IPerf running on Kubernetes hosts on AWS, same VPC, different regions. Running m4.xlarge servers.

Raw EC2 -> EC2:

# iperf -c 10.0.1.150
------------------------------------------------------------
Client connecting to 10.0.1.150, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  3] local 10.0.4.188 port 60816 connected with 10.0.1.150 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.19 GBytes  1.02 Gbits/sec

Container -> Container (Same host):

# iperf -c 10.2.2.7
------------------------------------------------------------
Client connecting to 10.2.2.7, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 10.2.2.9 port 36878 connected with 10.2.2.7 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  25.5 GBytes  21.9 Gbits/sec

Container -> Container (Different hosts):

# iperf -c 10.2.6.4
------------------------------------------------------------
Client connecting to 10.2.6.4, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 10.2.2.9 port 50854 connected with 10.2.6.4 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   891 MBytes   747 Mbits/sec

That seems pretty dang good. Bandwidth seems fine between everything. Moving on to latency checks...
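For a rough sense of the overlay cost, the two cross-host results can be compared directly. The sketch below parses the iperf summary lines pasted above and prints the container-to-container throughput as a share of the raw EC2 throughput. The function name and regex are illustrative, and a single 10-second run on each path is only a ballpark figure.

```python
import re

# Multipliers for the units iperf prints in its summary line.
UNITS = {"Kbits/sec": 1e3, "Mbits/sec": 1e6, "Gbits/sec": 1e9}


def bandwidth_bps(summary_line):
    """Extract the bandwidth from an iperf summary line, in bits per second."""
    match = re.search(r"([\d.]+)\s+([KMG]bits/sec)\s*$", summary_line)
    if match is None:
        raise ValueError(f"no bandwidth found in: {summary_line!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


raw = bandwidth_bps("[  3]  0.0-10.0 sec  1.19 GBytes  1.02 Gbits/sec")
overlay = bandwidth_bps("[  3]  0.0-10.0 sec   891 MBytes   747 Mbits/sec")

ratio = overlay / raw
print(f"cross-host containers reach {ratio:.0%} of raw EC2 throughput")  # 73%
```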

@four43
four43 commented Jun 8, 2016 edited

Latency Checks (Good)

Latency tests using ping. Same setup as above:

Ping Container -> Container (different hosts)

# ping 10.2.6.4
PING 10.2.6.4 (10.2.6.4): 56 data bytes
64 bytes from 10.2.6.4: icmp_seq=0 ttl=62 time=0.960 ms
64 bytes from 10.2.6.4: icmp_seq=1 ttl=62 time=1.728 ms
64 bytes from 10.2.6.4: icmp_seq=2 ttl=62 time=0.928 ms
64 bytes from 10.2.6.4: icmp_seq=3 ttl=62 time=0.931 ms
64 bytes from 10.2.6.4: icmp_seq=4 ttl=62 time=1.004 ms
64 bytes from 10.2.6.4: icmp_seq=5 ttl=62 time=0.877 ms
64 bytes from 10.2.6.4: icmp_seq=6 ttl=62 time=0.957 ms
64 bytes from 10.2.6.4: icmp_seq=7 ttl=62 time=0.910 ms
64 bytes from 10.2.6.4: icmp_seq=8 ttl=62 time=0.979 ms
64 bytes from 10.2.6.4: icmp_seq=9 ttl=62 time=0.898 ms
^C--- 10.2.6.4 ping statistics ---
10 packets transmitted, 10 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.877/1.017/1.728/0.240 ms

Again, fast!

Let's try curl with some fancy output (https://josephscott.org/archives/2011/10/timing-details-with-curl/), using different paths to avoid app caching:

# curl -w "@curl-format.txt" -o /dev/null http://[my-web-service.namespace]

            time_namelookup:  0.126
               time_connect:  0.127
            time_appconnect:  0.000
           time_pretransfer:  0.127
              time_redirect:  0.000
         time_starttransfer:  0.128
                            ----------
                 time_total:  0.128

Okay, a little different there. The time_namelookup and time_connect seem high.
Trying the full DNS name (with .svc.cluster.local):

# curl -w "@curl-format.txt" -o /dev/null http://[my-web-service.namespace.svc.cluster.local]

            time_namelookup:  0.126
               time_connect:  0.127
            time_appconnect:  0.000
           time_pretransfer:  0.127
              time_redirect:  0.000
         time_starttransfer:  0.131
                            ----------
                 time_total:  0.131

Running that same request in quick succession:

# curl -w "@curl-format.txt" -o /dev/null http://[my-web-service.namespace]/different-path

            time_namelookup:  0.029
               time_connect:  0.031
            time_appconnect:  0.000
           time_pretransfer:  0.031
              time_redirect:  0.000
         time_starttransfer:  0.034
                            ----------
                 time_total:  0.034

Which is significantly faster. Looking into specifically what those curl variables mean...
Reference: https://curl.haxx.se/docs/manpage.html#-w

Looks like a DNS issue?
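curl's timing variables are cumulative (each is measured from the start of the request), so subtracting neighbours gives the time spent in each phase. A small sketch using the numbers from the first and the repeated request above; the phase names are my own labels, not curl's.

```python
def phases(t):
    """Split curl's cumulative timings (seconds) into per-phase durations."""
    return {
        "dns": t["time_namelookup"],
        "tcp_connect": t["time_connect"] - t["time_namelookup"],
        "pre_transfer": t["time_pretransfer"] - t["time_connect"],
        "wait_first_byte": t["time_starttransfer"] - t["time_pretransfer"],
        "transfer": t["time_total"] - t["time_starttransfer"],
    }


first = {
    "time_namelookup": 0.126, "time_connect": 0.127, "time_pretransfer": 0.127,
    "time_starttransfer": 0.128, "time_total": 0.128,
}
repeat = {
    "time_namelookup": 0.029, "time_connect": 0.031, "time_pretransfer": 0.031,
    "time_starttransfer": 0.034, "time_total": 0.034,
}

for label, timings in (("first request", first), ("repeat request", repeat)):
    dns_share = phases(timings)["dns"] / timings["time_total"]
    print(f"{label}: DNS is {dns_share:.0%} of the total")
# first request: DNS is 98% of the total
# repeat request: DNS is 85% of the total
```

Seen this way, time_connect is only high because it includes the lookup: the TCP connect itself takes 1 to 2 ms in both runs.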

None of the containers in the dns pods seem particularly chatty...

~$ kubectl logs -f kube-dns-v11-ya5ur -c skydns --namespace=kube-system
2016/06/05 20:47:15 skydns: falling back to default configuration, could not read from etcd: 100: Key not found (/skydns/config) [21]
2016/06/05 20:47:15 skydns: ready for queries on cluster.local. for tcp://0.0.0.0:53 [rcache 0]
2016/06/05 20:47:15 skydns: ready for queries on cluster.local. for udp://0.0.0.0:53 [rcache 0]

Those log messages are from when we started our cluster.

Let's try without DNS:

# curl -w "@curl-format.txt" -o /dev/null http://[service-ip]/another-different-path

            time_namelookup:  0.000                             
               time_connect:  0.001                             
            time_appconnect:  0.000                             
           time_pretransfer:  0.002                             
              time_redirect:  0.000                             
         time_starttransfer:  0.005                             
                            ----------                          
                 time_total:  0.005                             

Nice! So it seems that resolving a Service DNS record (my-svc.namespace format) to an IP is the slow part.
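To take curl out of the picture and measure name resolution on its own, a lookup can be timed directly from inside a pod. This is a minimal sketch, assuming Python is available in the container; the script name used below is hypothetical, and the host names are passed on the command line.

```python
import socket
import sys
import time


def time_lookup(host, port=80, attempts=5):
    """Return the wall-clock seconds taken by each getaddrinfo call."""
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        samples.append(time.perf_counter() - start)
    return samples


if __name__ == "__main__":
    # Hosts come from the command line; with no arguments nothing is resolved.
    for host in sys.argv[1:]:
        samples = time_lookup(host)
        print(f"{host}: min {min(samples):.3f}s, max {max(samples):.3f}s")
```

Running it as, say, `python time_lookup.py my-web-service.namespace www.google.com` (both names are stand-ins) gives a spread per name. Passing a literal IP gives a baseline, since no DNS query is needed for that case.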

@four43 four43 changed the title from "Networking slower using 0.7 vs. 0.4" to "kube-aws: Networking slower using 0.7 vs. 0.4" Jun 8, 2016
@fasaxc
fasaxc commented Jun 9, 2016

Sorry @four43, I thought this issue was in the calico-kubernetes repo but I just misread coreos-kubernetes; that's why I was confused and thought this was about Calico!

@somejfn
somejfn commented Jun 9, 2016

Your IPerf numbers look good. DNS resolution does seem a bit slow
(time_namelookup), but is that served by K8s (SkyDNS) or by AWS's VPC?
What about testing with IPs to see the difference?

@four43
four43 commented Jun 9, 2016

@somejfn - I actually tried curl using the service IP there at the end. Sorry, it was an edit, so you wouldn't have seen it via email. I didn't want to spam this thread and crush you all with notifications. Please see those results above; they were much faster. We are using SkyDNS as provided by this tool (multi-node/aws).

@four43
four43 commented Jun 10, 2016 edited

External DNS Tests (Bad)

Inside a pod -> external address:

# curl -sw "@curl-format.txt" -o /dev/null http://www.google.com

            time_namelookup:  0.254   
               time_connect:  0.256   
            time_appconnect:  0.000   
           time_pretransfer:  0.256   
              time_redirect:  0.000   
         time_starttransfer:  0.304   
                            ----------
                 time_total:  0.304   

From a worker -> external address:

# curl -sw "@curl-format.txt" -o /dev/null http://www.google.com


            time_namelookup:  0.006
               time_connect:  0.009 
            time_appconnect:  0.000   
           time_pretransfer:  0.009 
              time_redirect:  0.000   
         time_starttransfer:  0.060 
                            ----------
                 time_total:  0.060                                            

My cluster DNS is slow even resolving outside addresses. The host however is very fast.

@four43
four43 commented Jun 10, 2016

Solution - Scale up DNS

So after all this poking around I found a simple solution: add more DNS pods. By scaling up the DNS RC I saw time_namelookup drop to a very reasonable 0.015s range. Maybe "throw more hardware (virtually) at it" should have been the first thing I tried, but I didn't think DNS was constrained that much (100 CPU units) on our fairly chatty microservice-style service.

Thanks all for your help.

TL;DR - If your DNS is slow, add more DNS pods
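For anyone wanting to script this, here is a minimal sketch that assembles the scale command. The RC name kube-dns-v11 is an assumption inferred from the pod name kube-dns-v11-ya5ur in the logs above, so check it with `kubectl get rc --namespace=kube-system` first. The replica count is an example value to tune for your own load.

```python
import shlex


def scale_dns_command(replicas, rc_name="kube-dns-v11", namespace="kube-system"):
    """Build the kubectl argv that scales the DNS replication controller.

    rc_name is assumed from the pod name seen in the logs, not confirmed.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return [
        "kubectl", "scale", "rc", rc_name,
        f"--replicas={replicas}",
        f"--namespace={namespace}",
    ]


cmd = scale_dns_command(6)
print(shlex.join(cmd))
# kubectl scale rc kube-dns-v11 --replicas=6 --namespace=kube-system
```

The list can be passed to `subprocess.run(cmd, check=True)` to execute it, or the printed line can be pasted into a shell.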

@four43 four43 closed this Jun 10, 2016
@cgag
Member
cgag commented Jun 10, 2016

This issue was a great read, thanks a bunch @four43 for the detailed debugging breakdown.

Can I ask what kind of scale you're working at that DNS became an issue? I wonder if we should up the cpu limits on the DNS pod by default. Maybe DNS should run as a daemonset? I just googled that idea and found this issue: kubernetes/kubernetes#26707, so it looks like it might be a fairly common problem.

@four43
four43 commented Jun 10, 2016 edited

@cgag - We aren't running a huge load on this little cluster. We have a cluster of 3 workers (m4.xlarge) currently with this test load. We are servicing about 8,000 req/min. Each of those requests has anywhere from 2 to 10 services it will contact both internal to the cluster and externally. We found a scale of 6 DNS pods to provide stable, relatively quick DNS results in that 0.015s range.

EDIT: I tossed a comment over there to hopefully help build some interest. Good find on that issue, @cgag
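A back-of-the-envelope estimate of the DNS load those figures imply is sketched below. It assumes one lookup per service contact, no client-side caching, and an even spread across the DNS pods. None of that was measured here, so the real query rate may be higher or lower.

```python
REQUESTS_PER_MIN = 8000        # from the comment above
CONTACTS_PER_REQUEST = (2, 10)  # low and high end, from the comment above
DNS_PODS = 6                   # replica count that gave stable results


def lookups_per_second(requests_per_min, contacts_per_request):
    """Estimate lookups/s, assuming one DNS lookup per service contact."""
    return requests_per_min * contacts_per_request / 60


low, high = (lookups_per_second(REQUESTS_PER_MIN, c) for c in CONTACTS_PER_REQUEST)

print(f"cluster-wide: {low:.0f} to {high:.0f} lookups/s")
print(f"per DNS pod:  {low / DNS_PODS:.0f} to {high / DNS_PODS:.0f} lookups/s")
# cluster-wide: 267 to 1333 lookups/s
# per DNS pod:  44 to 222 lookups/s
```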

@jdn-za
jdn-za commented Oct 20, 2016

We have run into similar issues in a number of ways now, both with the number of pods being too low and with the CPU/memory limits on kube-dns / healthz being too low.

We have also seen the problem get exponentially worse when we had more DNS pods than worker nodes.

A DaemonSet does seem to be a good solution. We are busy converting our test and production clusters over and will post findings afterwards.

@four43
four43 commented Oct 25, 2016

I'm curious if just setting the resource limits higher on the DNS RC helps.
It seems creating tons and tons of DNS pods all having to sync with each
other isn't an efficient solution. Anyone have a good test setup for that?
Deleting DNS pods always seems to lead to some timeouts/dropped
connections/instability for us.

-Seth
