
request spike creates memory spike #2593

Open
szuecs opened this Issue Feb 21, 2019 · 67 comments


szuecs commented Feb 21, 2019

To show the numbers from our tests, generated with https://github.com/mikkeloscar/go-dnsperf:

RPS (graph)

Memory (graph)

During an outage we had 18k RPS and 800MB memory consumption per CoreDNS instance. 800MB is what Grafana showed, but I expect the peak was much higher, because we had to increase the memory limit from 1GB to 2GB to survive. Before the outage we had 3.5k RPS and 64MB memory consumption per CoreDNS instance.

The usage pattern in both the test and the outage is to resolve a couple of external (not cluster.local) DNS names.
CoreDNS configuration
CoreDNS deployment
/etc/resolv.conf is the kubernetes default with ndots 5 and search default.svc.cluster.local svc.cluster.local cluster.local eu-central-1.compute.internal. A call to www.example.org will therefore result in 5 x 2 DNS queries.
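
For illustration, each of the five names below is tried for both A and AAAA (the search list is tried first because www.example.org has fewer dots than ndots):

    www.example.org.default.svc.cluster.local.
    www.example.org.svc.cluster.local.
    www.example.org.cluster.local.
    www.example.org.eu-central-1.compute.internal.
    www.example.org.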

More careful and detailed load tests, with CoreDNS as a daemonset and dnsmasq in front of that daemonset, show these numbers:

  • CoreDNS with 100Mi could handle ~5-6k RPS (beyond that CoreDNS crashes)
  • CoreDNS with 1000Mi could handle ~10-11k RPS (beyond that CoreDNS crashes)
  • with dnsmasq in front, 100Mi could handle 35k RPS without crashing

old setup

We used this config in our old setup:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
  labels:
    application: coredns
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            upstream
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        proxy . /etc/resolv.conf
        cache 30
        reload
    }

If you need the version from the outage and start params of the deployment: https://github.com/zalando-incubator/kubernetes-on-aws/blob/dc008aa07ae480d9ba25dc9f6ca8d9d56aa813f4/cluster/manifests/coredns/deployment-coredns.yaml

new setup

Tests were running with CoreDNS 1.2, daemonset:
https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/cluster/manifests/coredns-local/daemonset-coredns.yaml
configmap: https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/cluster/manifests/coredns-local/configmap-local.yaml


fturib commented Feb 21, 2019

@szuecs : Thank you for isolating this problem in its own issue.

QUESTION: I guess the outage you experienced is the one reported in issue #2554.

I read that post-mortem outage description but did not realize it was the description of THIS outage.

Let me verify I understand the different scenarios you went through:
1- you experienced an outage in production

> During an outage we had 18k RPS and 800MB memory consumption per CoreDNS instance. 800MB is what Grafana showed, but I expect the peak was much higher, because we had to increase the memory limit from 1GB to 2GB to survive. Before the outage we had 3.5k RPS and 64MB memory consumption per CoreDNS instance.

During this outage, you think you queried at most a hundred or so different domains.

2- you created a setup to show the problem using the "old setup", and the result is what is visible in the graphs here

The test is just always querying the same upstream domain (www.example.org).

3- you ran 3 use cases of a similar test, just changing the config of CoreDNS and dnsmasq, using the "new setup", and modifying the maximum memory allocation.

> More careful and detailed results by load tests with CoreDNS as daemonset and dnsmasq in front of that daemonset show these numbers:

QUESTION: in this latter case (3), are you still using the same tool for sending load to CoreDNS?

I guess you modified the options of the test until finding the crashing point of CoreDNS. Am I correct?
I mean, modifying the deployment-xxxx.yaml file.

       -names=example.org
       -rps=100   <= here going up to 10000
       -timeout=10s
       -enable-logging=true

Without running the test yet .. I came to the same conclusions as here.

@rajansandeep is proposing to reproduce the same configuration locally so we can investigate what is really happening (and validate the above hypothesis or not).
In progress ....

rajansandeep self-assigned this Feb 21, 2019


szuecs commented Feb 22, 2019

> @szuecs : Thank you for isolating this problem in its own issue.
>
> QUESTION: I guess the outage you experienced is the one reported in issue #2554.
>
> I read that post-mortem outage description but did not realize it was the description of THIS outage.
>
> Let me verify I understand the different scenarios you went through:
> 1- you experienced an outage in production
>
> > During an outage we had 18k RPS and 800MB memory consumption per CoreDNS instance. 800MB is what Grafana showed, but I expect the peak was much higher, because we had to increase the memory limit from 1GB to 2GB to survive. Before the outage we had 3.5k RPS and 64MB memory consumption per CoreDNS instance.
>
> During this outage, you think you queried at most a hundred or so different domains.

Yes.

> 2- you created a setup to show the problem using the "old setup", and the result is what is visible in the graphs here
>
> The test is just always querying the same upstream domain (www.example.org).

No, it was requesting not one but multiple names (up to 100).

> 3- you ran 3 use cases of a similar test, just changing the config of CoreDNS and dnsmasq, using the "new setup", and modifying the maximum memory allocation.
>
> > More careful and detailed results by load tests with CoreDNS as daemonset and dnsmasq in front of that daemonset show these numbers:
>
> QUESTION: in this latter case (3), are you still using the same tool for sending load to CoreDNS?

Yes.

> I guess you modified the options of the test until finding the crashing point of CoreDNS. Am I correct?
> I mean, modifying the deployment-xxxx.yaml file.
>
>        -names=example.org
>        -rps=100   <= here going up to 10000
>        -timeout=10s
>        -enable-logging=true

No, we increased the replicas and set the same 100 names on all of them to make it more similar to one nodejs application with 150 replicas hitting the DNS setup.

> Without running the test yet .. I came to the same conclusions as here.

I think the best would be to use perf and maybe pprof to identify the memory peak.

> @rajansandeep is proposing to reproduce the same configuration locally so we can investigate what is really happening (and validate the above hypothesis or not).
> In progress ....

+1 for trying to reproduce it locally; sometimes that's hard, but if you succeed it will be much easier to pinpoint the cause.
If you are not able to do this locally, it might make sense to create a cluster setup, run the test in that isolated environment, and use cssh to run pprof/perf to get the data you need.
If you see nothing in pprof (Go runtime inspection) that shows it, then check perf (kernel view, dump to file, check later: socket, tcp, udp send/recv queues).
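
For reference, a couple of standard Go commands that can pull those profiles if the pprof plugin is enabled (pprof :6060 in the Corefile); the pod address below is a placeholder:

    go tool pprof http://<coredns-pod>:6060/debug/pprof/heap
    go tool pprof "http://<coredns-pod>:6060/debug/pprof/profile?seconds=30"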


miekg commented Feb 24, 2019

I'm (once again) coming from the other side, with a bare minimal setup and going from there. This is running on Packet, with a dst and a src machine that query via the network. Both dst and src run CoreDNS, and the one on src forwards to dst.

A bare forward clause (not testing with proxy because I will announce that it will be removed in the next-next release), tested with dnsperf (https://github.com/DNS-OARC/dnsperf), does about 22K qps for forwarded traffic on these machines:

  Response codes:       NOERROR 224089 (100.00%)
  Average packet size:  request 29, response 91
  Run time (s):         10.005022
  Queries per second:   22397.651899

Adding more plugins, e.g. with prometheus and errors enabled, then drops this to 20K qps; depending on the outcome of this issue we may also want to look into that; there is still a defer (IIRC) in errors that can be removed.

The following Prometheus metrics are from a half-hour run, with this outcome:

Statistics:

  Queries sent:         19745884
  Queries completed:    19745884 (100.00%)
  Queries lost:         0 (0.00%)

  Response codes:       NOERROR 19745884 (100.00%)
  Average packet size:  request 29, response 93
  Run time (s):         1000.003402
  Queries per second:   19745.816825

  Average Latency (s):  0.004773 (min 0.000158, max 0.040260)
  Latency StdDev (s):   0.002272

screenshot from 2019-02-24 17-06-02
screenshot from 2019-02-24 17-06-19


miekg commented Feb 24, 2019

Adding all plugins except kubernetes (because that's too hard to test outside k8s, see #2575) drops a few qps. Memory according to Prometheus (right-most graph).
screenshot from 2019-02-24 17-21-33

If it's easy to swap out, can you try a new CoreDNS and change proxy to forward?


miekg commented Feb 24, 2019

OK, now the dst CoreDNS forwards to 1.1.1.1, 8.8.8.8 and 8.8.4.4, and I'm running the Alexa top-1k list. Running this again for 65 seconds (which expunges the cache at least once); and because of the second forwarding hop this is actually internet data.

DNS Performance Testing Tool
Version 2.2.1

[Status] Command line: dnsperf -s 127.0.0.1 -p 1053 -l 65 -d ./top-1k.dnsperf
[Status] Sending queries (to 127.0.0.1)
[Status] Started at: Sun Feb 24 18:08:38 2019
[Status] Stopping after 65.000000 seconds
[Status] Testing complete (time limit)

Statistics:

  Queries sent:         1808564
  Queries completed:    1808564 (100.00%)
  Queries lost:         0 (0.00%)

  Response codes:       NOERROR 1804946 (99.80%), SERVFAIL 3618 (0.20%)
  Average packet size:  request 29, response 101
  Run time (s):         65.071978
  Queries per second:   27793.284538

  Average Latency (s):  0.003284 (min 0.000057, max 2.007104)
  Latency StdDev (s):   0.010106

process_resident_memory_bytes also hovers in the 40/60 MB range.

So this rules out everything except the k8s plugin.


miekg commented Feb 25, 2019

Note that this Corefile

.:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            upstream
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        proxy . /etc/resolv.conf
        cache 30
        reload
    }

is inefficient: the entire reverse trees are tunneled through k8s, and only if they are NXDOMAIN (fallthrough) are they resolved on the internet. In the original configmap there was also a rewrite, meaning you apply a regexp on every request as well.

Much better would be to split this up into multiple servers and specify a more specific reverse for k8s:

cluster.local 10.x/16 ::1/16 {  # or whatever the reverse v6 is
    errors
    health
    kubernetes {
        pods insecure
        upstream
    }
    cache 30  # even this is borderline, because of internal k8s caching
}

. {
    errors
    prometheus :9153
    proxy . /etc/resolv.conf
    cache 30
    reload
}

miekg commented Feb 25, 2019

I think what we need to do is perf just the k8s plugin and check where memory is being used.


szuecs commented Feb 25, 2019

@miekg the problem with all dnsperf tools is that they do not create load that makes sense for the general case. https://github.com/mikkeloscar/go-dnsperf uses the /etc/resolv.conf settings to generate queries. We got completely different results when we used other tools to create load.



chrisohaver commented Feb 25, 2019

What @szuecs is saying, if I'm not mistaken, is that most DNS performance tools (for good reason) send requests directly (as a single request), whereas go-dnsperf appends search domains from /etc/resolv.conf. With the k8s "search-path/ndots:5" situation, this multiplies the actual number of queries being sent by a large amount (4-5x). So while from the client POV it's only making 1000 RPS, the server sees 4000-5000 RPS.


szuecs commented Feb 25, 2019

@chrisohaver exactly, but it's (1 + number of search domains) * 2, since A and AAAA records are requested separately.
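
Worked out for the search path in the issue description (four search domains), purely as an illustration:

    (1 + 4 search domains) * 2 record types = 10 DNS queries per application lookup
    e.g. 1,000 lookups/s from clients  ->  ~10,000 queries/s arriving at CoreDNS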

@miekg while you are correct, it's not, because the TLD nameservers are different, caching might be different, .... Details really matter in this case.



chrisohaver commented Feb 25, 2019

@szuecs, are there a large number of reverse lookups in your load?


chrisohaver commented Feb 25, 2019

> any reverse lookup will give atrocious performance.

Actually, any reverse lookup of an IP outside the cluster would have bad performance. For reverse lookups inside the cluster, there would be no performance penalty.


miekg commented Feb 25, 2019

Re: mem usage
There is one obvious candidate for unbounded memory growth, and that's the kube-cache that caches things. Figuring out what exactly requires navigating the client-go libs again (*sigh*, as these are complex and opaque).


rajansandeep commented Feb 25, 2019

> The usage pattern in both the test and the outage is to resolve a couple of external (not cluster.local) DNS names.

@szuecs So the queries in the test are all external names or are there internal name queries as well?


chrisohaver commented Feb 25, 2019

@szuecs, can you share the list of names you tested with?


szuecs commented Feb 25, 2019

We don’t use PTR records and the last time I saw something like a reverse lookup was when Apache did a reverse lookup for every access log. :D

The host name pattern looks like this:

svcname.clustername.example.com, and we often also have cross-cluster calls. If you take 5 different clusters and 10 different services and multiply, that should be fine for the test. These hostnames were all external names. They had just started to move workloads into this cluster.

The idea we are currently testing to make it even better is ndots:2 plus caching with dnsmasq in front of CoreDNS.


chrisohaver commented Feb 25, 2019

OK, thanks. Due to the ndots/search path thing, svcname.clustername.example.com results in about 60% of the query load being destined for the kubernetes plugin, the rest being forwarded upstream. The queries that go to the k8s plugin, though, get rejected pretty early, mostly during qname parsing (in parseRequest()), before diving into the k8s go-client cache.

Regarding the go-client, there is the k8s API watch (asynchronous from queries), but the resource usage there should not be correlated with the RPS load, i.e. we should not expect the client-go watch to start taking up more resources when the RPS ramps up.

Of course there is also the response cache (cache plugin), which means that at these high RPS loads practically 100% of queries are actually being served from cache; and with a small set of distinct query names (5 or so), the cache should not be write-locked very much at all during the test.


miekg commented Feb 28, 2019

@szuecs do you have a graph of the number of goroutines and CPU?

Also, I'm not sure anymore that this is kubernetes plugin related.


rajansandeep commented Feb 28, 2019

I think I have reproduced the memory issue.
TL;DR: CoreDNS gets OOMKilled at high RPS.

Setup

I used the perf-tool used by @szuecs from https://github.com/mikkeloscar/go-dnsperf to check performance of CoreDNS in Kubernetes.

  • 4-node Kubernetes v1.13.3 (1 master and 3 worker nodes)
  • CoreDNS v1.3.1 with 2 replicas deployed on the master node, with the following default ConfigMap Corefile:
 Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           upstream
           fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        proxy . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }

  • My /etc/resolv.conf is as follows:
cat /etc/resolv.conf 
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

Case 1

Number of client replicas deployed: 50
RPS of each replica: 100

CoreDNS was able to handle the requests, with memory consumption peaking at 220 MiB and going down slightly as time went by, stabilizing at around 157Mi.

image

Case 2

Number of client replicas deployed: 90
RPS of each replica: 100

CoreDNS gets OOMKilled constantly and is not able to handle all the requests from the clients.

Logs from one of the client replicas:

2019/02/28 18:04:23 [ERROR] lookup kubernetes.io on 10.96.0.10:53: dial udp 10.96.0.10:53: i/o timeout
2019/02/28 18:04:23 [ERROR] lookup example.org on 10.96.0.10:53: dial udp 10.96.0.10:53: i/o timeout
2019/02/28 18:04:23 [ERROR] lookup google.com on 10.96.0.10:53: dial udp 10.96.0.10:53: i/o timeout

Looking at the memory consumption of CoreDNS, it seems to take up around 1.2 GiB of memory before getting OOMKilled and restarting.
I do not understand yet why it gets OOMKilled, since the Memory Limit is set at 1.66GiB.

image

CPU Usage:

image

Requests handled: CoreDNS is unable to keep up.

image

Cache Hitrate: Looks like we are hitting the cache as expected.

image
Cache Size:

image

I will be continuing my investigation further.



rajansandeep commented Mar 5, 2019

Continuing my investigation, on the same setup as #2593 (comment),

  • I have 1 instance of CoreDNS on the Master node.
  • RPS of each DNS client replica: 100
  • All queries to CoreDNS were external queries.
  • Initially, the number of DNS client replicas was kept at 25 and was increased until I observed OOMKills in the CoreDNS pod (which happened at 70 DNS client replicas)

Observations made during the test:

  • The maximum incoming request rate the CoreDNS pod could handle was ~21.5k pps, at 25 DNS client replicas.

  • When the client replicas were increased beyond 25, CoreDNS continued to serve a maximum of ~21.5k pps.

  • For every step that I increased the client replicas, the memory used by CoreDNS kept increasing (possibly due to the number of goroutines increasing).

  • This continued until I increased the client replicas to 70, after which CoreDNS started to get OOMKilled repeatedly and couldn't recover. This is because as the pod restarts, it is flooded with requests from all 70 replicas at the same time. CoreDNS handles requests better when the load increases incrementally rather than arriving as one burst.

  • When I decreased the replicas to 60, CoreDNS was able to recover, serving the same ~21.5k pps at considerably higher memory.

  • After the recovery, the number of goroutines was constant at ~30k.

Further test analysis:

  • At 25 replicas, when CoreDNS is able to process all requested queries, memory is stable at around 200MiB, with goroutines at around ~4k.
  • The goroutines (the server's workers) go up until CoreDNS is able to process that quantity.
  • Throughout the test, it seems there are always ~5k pps processed in < 25ms.
  • How long the extra queries take to be processed depends on the client QPS: if pressure is high, these queries are processed more slowly, and as the number of goroutines increases, memory increases too.
  • When we reach the limit of 70 client replicas, CoreDNS starts to crash; the goroutine count goes up to 75k-85k and memory blows up.

I have attached the metrics (Can be zoomed in for better readability) in the following order:

  • Total requests processed by CoreDNS
  • Goroutines
  • Memory
  • CPU
  • Query response time

(stitched metrics graphs)

Also attaching pprof:

pprof.coredns.samples.cpu.007.pb.gz
pprof.coredns.alloc_objects.alloc_space.inuse_objects.inuse_space.007.pb.gz


szuecs commented Mar 5, 2019

Very interesting observations and data!

You reproduced the same thing we saw in our production outage.
The memory spikes happen when a CoreDNS instance crashes, similar to our outage, and this is why we had to set the memory limit super high to make sure it survives the start (maybe caused by the first flood of requests?).
Are the pprof files from the time of the spikes?


fturib commented Mar 7, 2019

> you can easily check out a version before 98a1ef45 (of miekg/dns) and see how that goes; although you miss a whole bunch of optimization that happened

The change proposed by @tmthrgd removes the worker code entirely and keeps all the optimizations that happened after ... the result is the test @rajansandeep commented on here.



fturib commented Mar 8, 2019

> The health plugin exports coredns_health_request_duration_seconds that should
> indicate how busy the process is.

Great idea! I did not think of it. I guess that can help us understand why we get a "liveness probe issue".

@rajansandeep : I am surprised the last test (no workers) was finally much better than what @szuecs experienced initially and also than what we saw in early investigations. I am wondering if there is a difference between adding the QPS load gradually and adding it in one shot.

Can you run a test going from 1 directly to 70 client replicas, to simulate a "one shot" big burst, and stay at 70 until some stabilization? (for both images: no-worker and capped-goroutines)


fturib commented Mar 8, 2019

NOTE: There is also a proposal for a pool of workers here: miekg/dns#924
I will build an image so we can compare too.


fturib commented Mar 9, 2019

@rajansandeep : for the test with no workers here, did you collect the pprof for mem and cpu when under pressure from 90 client replicas?

For information:
1 client replica generates 900 req/sec for the DNS server
90 clients could generate up to 81k qps (if all of these requests reach the CoreDNS pod)


rajansandeep commented Mar 11, 2019

@fturib, yes, I did collect.
I've updated the comment to include the pprof data.


rajansandeep commented Mar 12, 2019

Since changes have been made, I tested CoreDNS on the master branch, which uses miekg/dns 1.1.6.

The test conducted is the same as the previous cases where the DNS client replicas are increased from 25->30->50->60->70.

Test results:

  • CoreDNS gets OOMKilled when there are 70 replicas running, due to CoreDNS taking up very high memory.
State:          Running
      Started:      Tue, 12 Mar 2019 11:19:46 -0400
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 12 Mar 2019 11:18:39 -0400
      Finished:     Tue, 12 Mar 2019 11:19:31 -0400
    Ready:          True
    Restart Count:  2

  • The CPU usage is very high.
  • The number of goroutines and the memory consumed by CoreDNS are proportional.
  • I had to reduce the replicas to 60 to prevent CoreDNS from crashing continuously.

In short, CoreDNS performance on current master is the same as CoreDNS v1.3.1.

Metrics to back up the test results:

(stitched metrics graphs)

Pprof data when there were 60 client replicas (Couldn't take pprof at 70 replicas as CoreDNS kept crashing):

pprof.coredns.alloc_objects.alloc_space.inuse_objects.inuse_space.004.pb.gz
pprof.coredns.samples.cpu.004.pb.gz


szuecs commented Mar 12, 2019

Goroutines are cheap, but they don't come for free: as far as I know it's ~4kB per goroutine, which is 320MB for 80k goroutines. There must be some other data being created that adds up to the 1.5GB, and we need to find it.




fturib commented Mar 13, 2019

@rajansandeep : can you try with the limitation implemented in the CoreDNS plugin chain?
The image is: ftur/coredns:1.4.0-ftur-throttle
It will just drop the query if the number of concurrent requests is higher than a limit.

You will need to add this stanza to the Corefile (to stay in the same use case as the test above), so we have a basis for comparison:

throttle 1000 10ms

You get a new metric about the limitation: coredns_throttle_queries
It provides the number of queries accepted and the number of queries dropped.
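
The actual plugin code isn't shown in this thread, but the general idea (rejecting work once too many requests are in flight, instead of letting goroutines pile up) can be sketched as a CoreDNS plugin that uses a buffered channel as a semaphore. Names and the refuse behaviour below are illustrative only, not the real throttle implementation, and the 10ms wait argument is left out:

// throttle.go: illustrative sketch only, not the real plugin.
package throttle

import (
	"context"

	"github.com/coredns/coredns/plugin"
	"github.com/miekg/dns"
)

// Throttle refuses queries once more than cap(slots) requests are in flight.
type Throttle struct {
	Next  plugin.Handler
	slots chan struct{} // buffered channel used as a semaphore
}

func New(next plugin.Handler, limit int) *Throttle {
	return &Throttle{Next: next, slots: make(chan struct{}, limit)}
}

func (t *Throttle) ServeDNS(ctx context.Context, w dns.ResponseWriter, r *dns.Msg) (int, error) {
	select {
	case t.slots <- struct{}{}: // got a slot, process normally
		defer func() { <-t.slots }()
		return plugin.NextOrFailure(t.Name(), t.Next, ctx, w, r)
	default: // over the limit: answer REFUSED (or simply drop) instead of queueing
		m := new(dns.Msg)
		m.SetRcode(r, dns.RcodeRefused)
		w.WriteMsg(m)
		return dns.RcodeRefused, nil
	}
}

func (t *Throttle) Name() string { return "throttle" }

This does not stop the server from spawning a goroutine per request; it only makes the over-limit ones return almost immediately.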


fturib commented Mar 13, 2019

The panic counter is panic_count_total.
I understand it counts the number of times we recover, in ServeDNS, from a panic happening in CoreDNS (panic as in an interruption of the Go program).



rajansandeep commented Mar 14, 2019

Continuing the testing as @miekg and @fturib suggested, I split my Corefile into a Kubernetes section and all other zones.

Since the DNS query is now specific, e.g. example.com., the number of DNS client replicas has been increased 4x to keep the incoming requests the same as in the previous tests.

The number of DNS client replicas is increased from 0->240, giving a burst of queries to CoreDNS.

Case 1, on CoreDNS master:

Corefile:

cluster.local {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    pprof :6060
    cache 30
    proxy . /etc/resolv.conf
    reload
}

.:53 {
    prometheus :9153
}

CoreDNS is unable to handle the requests and frequently gets OOMKilled, unable to recover unless I scale down the number of replicas back to 0.

Note: I tested the same thing with the prometheus plugin removed and it results in the same OOMKills.

(stitched metrics graphs)

Case 2, with @fturib's throttle plugin:

The Corefile has an additional throttle plugin set up as per @fturib's description:

> it will just drop the query if the number of concurrent requests is higher than a limit.

Corefile:

cluster.local {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    pprof :6060
    cache 30
    proxy . /etc/resolv.conf
    reload
}

.:53 {
    prometheus :9153
    throttle 1000 10ms
}

CoreDNS does not crash, memory consumption stays at ~50MiB, and the number of goroutines is also around ~1k.

(stitched metrics graphs)

The throttle plugin gives an additional metric of coredns_throttle_queries as below:

(coredns_throttle_queries graph)

I am continuing testing on this.


fturib commented Mar 14, 2019

Some comments on the results provided above by @rajansandeep:

  1. There is no visible change whether the limitation on goroutines is implemented at creation of the goroutines (in miekg/dns) or as a plugin in CoreDNS.

  2. If there is no throttle/limitation on the goroutines then, whatever plugin is used, or no plugin at all, we can have a huge number of goroutines (more than 5k) and the memory consumed increases more or less in proportion.

  3. The creation of goroutines per received msg is not always the same. I mean that if the distribution of incoming DNS msgs is well balanced, then fewer than 100 goroutines are enough to manage the workload. If the distribution is unbalanced, then it can go up to 100k goroutines .. and the memory blows up.

  4. Whatever the number of goroutines created to process the incoming DNS requests, the output rate of processed DNS requests reaches a limit (in our tests, this limit is between 22k and 28k qps).

  5. For the purpose of avoiding the OOM crash, it is OK to have the goroutine limitation implemented as a plugin (the first plugin in the chain). It means that the goroutines are created anyway, but dropping them in the throttle plugin is fast enough to avoid the goroutines piling up.

My feeling, after all the tests done over the last week, is that when the goroutines that process the DNS msgs are piling up, it is only because they are waiting to write the response. It does not depend on the plugins used; what seems slow is the throughput of the pod.

NOTE: We still do not know what memory is stuck in the goroutines; my guess is that it is the basic 2k per goroutine + the stack + at least the DNS msgs (query/response).




fturib commented Mar 15, 2019

> Wouldn't dropping open requests after 6 seconds help here as well?

It seems to be all about reacting quickly enough to stop the goroutines from piling up.
I cannot say whether 6 seconds is good enough or not; we will give it a try.

> I also want to figure out why forward can't be faster.

Maybe ... but this is not tied to any plugin. Whatever the speed of forward, this "burst" sensitivity will be there. We did tests with cache, erratic, and only prometheus.
We need to try with no plugin at all .. but in that case it will only be a validation of whether the pod crashes or not, as we cannot see any metrics.


fturib commented Mar 15, 2019

> It's not RRL that is still an open issue, ...

@johnbelamaric: FYI, Chris implemented a first version of RRL here. But I do not think it is applicable to this issue right now.


fturib commented Mar 15, 2019

> I wonder though - is there a way to autotune it? If we measure throughput
> per go-routine, and we see it declining, we could start throttling.

Maybe by monitoring the time of the write itself (not taking into account the time spent collecting or computing the response).
That means the time spent in the ResponseWriter .. not sure how feasible that is.
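
A rough sketch of how that measurement could be wired up, using only the dns.ResponseWriter interface; the histogram name here is made up for illustration:

package timing

import (
	"time"

	"github.com/miekg/dns"
	"github.com/prometheus/client_golang/prometheus"
)

// writeDuration is a hypothetical metric; a real plugin would register it
// through the prometheus plugin.
var writeDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "coredns_dns_response_write_duration_seconds",
	Help: "Time spent only in ResponseWriter.WriteMsg.",
})

// timedWriter wraps a dns.ResponseWriter and measures just the write,
// independent of how long the plugin chain took to build the response.
type timedWriter struct {
	dns.ResponseWriter
}

func (t *timedWriter) WriteMsg(m *dns.Msg) error {
	start := time.Now()
	err := t.ResponseWriter.WriteMsg(m)
	writeDuration.Observe(time.Since(start).Seconds())
	return err
}

A plugin would then pass &timedWriter{ResponseWriter: w} down the chain from its ServeDNS.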


fturib commented Mar 15, 2019

@johnbelamaric :

> One question - you put the throttle in the .:53 stanza. So, we are not throttling requests if they are for cluster.local?

Exactly. If it is a plugin, it works per plugin chain, which means per server key.

Let's say you write:

domainA, domainB: {
    throttle 1000
}
. {
  throttle 2000
}

That means you could serve a total of 1000+1000+2000 simultaneous queries if the incoming domains are properly distributed.

I do not see that as a benefit ... because what we want to avoid is the memory peak when a burst comes in .. and memory is shared by the whole CoreDNS process.


chrisohaver commented Mar 15, 2019

Regarding RRL (response rate limiting): RRL is a very specialized tool that exclusively aims to mitigate DNS reflection attacks. It's made specifically for that purpose. In a nutshell, for each packet received, RRL receives and processes the request fully and analyzes the result before sending a response to the client or dropping it. This is done because response accounting is done separately per response code (and we need to know what the response code is before we can do the accounting).

For that reason, I think it's the wrong tool to try to prevent too many packets in flight.



miekg commented Mar 22, 2019

Are we getting closer to a root cause here?
To throw in (another) theory: proxy can't process queries fast enough, hence the buildup.


chrisohaver commented Mar 22, 2019

I understand that the issue is also reproducible with only the erratic plugin, no cache or forwarding.


miekg commented Mar 22, 2019

That would be really nice, because then we can actually write a unit test against it.


chrisohaver commented Mar 22, 2019

In theory - I think it would be reproducible with no plugins at all - the client just receiving REFUSED for every request. But I don't think this has actually been tested, whereas solo erratic has.


fturib commented Mar 22, 2019

Yes, it was reproduced with a configuration with no plugins at all.
However, in order to show results, we need to add the prometheus plugin (to get request_total .. and now the info provided by throttle).

I will just finalize the PR on throttle.
