request spike creates memory spike #2593

Closed
szuecs opened this issue Feb 21, 2019 · 164 comments
@szuecs

szuecs commented Feb 21, 2019

To show the numbers from our tests, created with https://github.com/mikkeloscar/go-dnsperf:

RPS (graph omitted)

Memory (graph omitted)

During an outage we had 18k RPS and 800MB memory consumption per CoreDNS instance. 800MB is what Grafana showed, but I expect the peak was much higher, because we had to increase memory from 1GB to 2GB to survive. Before the outage we had 3.5k RPS and 64MB memory consumption per CoreDNS instance.

The usage pattern in both the test and the outage is to resolve a couple of external (not cluster.local) DNS names.
CoreDNS configuration
CoreDNS deployment
/etc/resolv.conf is the Kubernetes default, with ndots 5 and search default.svc.cluster.local svc.cluster.local cluster.local eu-central-1.compute.internal. A call to www.example.org will therefore result in 5 x 2 DNS queries, as illustrated below.
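For illustration (this expansion is derived from the resolv.conf above, not captured traffic): www.example.org has fewer than 5 dots, so the resolver tries every search domain before the absolute name, and asks for A and AAAA each time:

    www.example.org.default.svc.cluster.local.        A + AAAA
    www.example.org.svc.cluster.local.                A + AAAA
    www.example.org.cluster.local.                    A + AAAA
    www.example.org.eu-central-1.compute.internal.    A + AAAA
    www.example.org.                                  A + AAAA

That is 5 x 2 = 10 queries for a single application-level lookup.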

More careful and detailed load-test results, with CoreDNS as a daemonset and dnsmasq in front of that daemonset, show these numbers:

  • CoreDNS with 100Mi could handle ~5-6k RPS (beyond that, CoreDNS crashes)
  • CoreDNS with 1000Mi could handle ~10-11k RPS (beyond that, CoreDNS crashes)
  • With dnsmasq in front, we can handle 35k RPS with 100Mi without a crash

old setup

We used this config in our old setup:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
  labels:
    application: coredns
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            upstream
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        proxy . /etc/resolv.conf
        cache 30
        reload
    }

If you need the version from the outage and start params of the deployment: https://github.com/zalando-incubator/kubernetes-on-aws/blob/dc008aa07ae480d9ba25dc9f6ca8d9d56aa813f4/cluster/manifests/coredns/deployment-coredns.yaml

new setup

Tests were run with CoreDNS 1.2, daemonset:
https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/cluster/manifests/coredns-local/daemonset-coredns.yaml
configmap: https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/cluster/manifests/coredns-local/configmap-local.yaml

@fturib
Contributor

fturib commented Feb 21, 2019

@szuecs: Thank you for isolating this problem in its own issue.

QUESTION: I guess the outage you experienced is the one reported in issue #2554.

I read that post-mortem outage description but did not realize it was describing THIS outage.

Let me verify that I understand the different scenarios you went through:
1- you experienced an outage in production

During an outage we had 18k RPS and 800MB memory consumption per CoreDNS instance. 800MB is what Grafana showed, but I expect the peak was much higher, because we had to increase memory from 1GB to 2GB to survive. Before the outage we had 3.5k RPS and 64MB memory consumption per CoreDNS instance.

During this outage, you think you queried at most a hundred or so different domains.

2- you created a setup to show the problem using the "old setup", and the result is what is visible in the graphs here

The test just keeps querying the same upstream domain (www.example.org).

3- you ran 3 use cases of a similar test, just changing the config of CoreDNS and dnsmasq, using the "new setup", and modifying the maximum memory allocation.

More careful and detailed load-test results, with CoreDNS as a daemonset and dnsmasq in front of that daemonset, show these numbers:

QUESTION: in this latter case (3), are you still using the same tool for sending load to CoreDNS?

I guess you modified the test options until you found the crashing point of CoreDNS. Am I correct?
I mean, by modifying the deployment-xxxx.yaml file.

        -names=example.org
        -rps=100   <= here going up to 10000
        -timeout=10s
        -enable-logging=true

Without running the test yet, I came to the same conclusions as here.

@rajansandeep is proposing to reproduce the same configuration locally so we can investigate what is really happening (and validate the above hypothesis or not).
In progress ....

@rajansandeep rajansandeep self-assigned this Feb 21, 2019
@szuecs
Author

szuecs commented Feb 22, 2019

@szuecs: Thank you for isolating this problem in its own issue.

QUESTION: I guess the outage you experienced is the one reported in issue #2554.

I read that post-mortem outage description but did not realize it was describing THIS outage.

Let me verify that I understand the different scenarios you went through:
1- you experienced an outage in production

During an outage we had 18k RPS and 800MB memory consumption per CoreDNS instance. 800MB is what Grafana showed, but I expect the peak was much higher, because we had to increase memory from 1GB to 2GB to survive. Before the outage we had 3.5k RPS and 64MB memory consumption per CoreDNS instance.

During this outage, you think you queried at most a hundred or so different domains.

Yes

2- you created a setup to show the problem using the "old setup", and the result is what is visible in the graphs here

The test just keeps querying the same upstream domain (www.example.org).

No, it was requesting not one but multiple names (up to 100).

3- you ran 3 use cases of a similar test, just changing the config of CoreDNS and dnsmasq, using the "new setup", and modifying the maximum memory allocation.

More careful and detailed load-test results, with CoreDNS as a daemonset and dnsmasq in front of that daemonset, show these numbers:

QUESTION: in this latter case (3), are you still using the same tool for sending load to CoreDNS?

Yes

I guess you modified the test options until you found the crashing point of CoreDNS. Am I correct?
I mean, by modifying the deployment-xxxx.yaml file.

        -names=example.org
        -rps=100   <= here going up to 10000
        -timeout=10s
        -enable-logging=true

No, we increased the replicas and set the same 100 names on all of them, to make it more similar to one Node.js application with 150 replicas hitting the DNS setup.

Without running the test yet, I came to the same conclusions as here.

I think the best would be to use perf and maybe pprof to identify the memory peak.

@rajansandeep is proposing to reproduce the same configuration locally so we can investigate what is really happening (and validate the above hypothesis or not).
In progress ....

+1 for trying to reproduce it locally; sometimes it's hard, but if you succeed it will be much easier to pinpoint the cause.
If you are not able to do this locally, then it might make sense to create a cluster setup, run the test in this isolated environment, and use cssh to run pprof/perf to get the data you need to find this.
If you see nothing in pprof (Go runtime inspection) that shows it, then check perf (kernel view; dump to file, check later: socket, tcp, udp send/recv queues).
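A rough sketch of how the pprof side could look, assuming CoreDNS's built-in pprof plugin is enabled (it serves the standard Go /debug/pprof endpoints, by default on localhost:6053):

    # Corefile: add the pprof plugin to the server block
    .:53 {
        pprof
        # ... existing plugins
    }

    # then, while the load test runs, capture heap and CPU profiles from the node
    go tool pprof http://localhost:6053/debug/pprof/heap
    go tool pprof 'http://localhost:6053/debug/pprof/profile?seconds=30'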

@miekg
Member

miekg commented Feb 24, 2019

I'm (once again) coming from the other side with a bare minimal setup and going from there. This is running on Packet, with a dst and a src machine that query via the network. Both dst and src run coredns, and the one on src forwards to dst.

A bare forward clause (not testing with proxy because I will announce that it will be removed in the next-next release), tested with dnsperf (https://github.com/DNS-OARC/dnsperf), does about 22K qps for forwarded traffic on these machines:

  Response codes:       NOERROR 224089 (100.00%)
  Average packet size:  request 29, response 91
  Run time (s):         10.005022
  Queries per second:   22397.651899

Adding more plugins as we go, with prometheus and errors enabled, drops this to 20K qps; depending on the outcome of this issue we may also want to look into that; there is still a defer (IIRC) in errors that can be removed.

The following prometheus metrics are from a half-hour run, with this outcome:

Statistics:

  Queries sent:         19745884
  Queries completed:    19745884 (100.00%)
  Queries lost:         0 (0.00%)

  Response codes:       NOERROR 19745884 (100.00%)
  Average packet size:  request 29, response 93
  Run time (s):         1000.003402
  Queries per second:   19745.816825

  Average Latency (s):  0.004773 (min 0.000158, max 0.040260)
  Latency StdDev (s):   0.002272

(screenshots of the Prometheus graphs omitted)

@miekg
Member

miekg commented Feb 24, 2019

Adding all plugins except kubernetes, because that's too hard to test outside k8s (see #2575), drops a few qps. Memory according to Prometheus (right-most graph).
(screenshot omitted)

If it's easy to swap out, can you try a new coredns and change proxy to forward?
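For reference, that swap is a one-line change in the Corefile (a sketch, with options left at their defaults):

    # before
    proxy . /etc/resolv.conf
    # after
    forward . /etc/resolv.conf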

@miekg
Member

miekg commented Feb 24, 2019

Ok, now I have the dst coredns forwarding to 1.1.1.1, 8.8.8.8 and 8.8.4.4, and I'm running the Alexa top 1k. Running this again for 65 seconds (which expunges the cache at least once); also, because of the 2nd forwarding, this is actually internet data.

DNS Performance Testing Tool
Version 2.2.1

[Status] Command line: dnsperf -s 127.0.0.1 -p 1053 -l 65 -d ./top-1k.dnsperf
[Status] Sending queries (to 127.0.0.1)
[Status] Started at: Sun Feb 24 18:08:38 2019
[Status] Stopping after 65.000000 seconds
[Status] Testing complete (time limit)

Statistics:

  Queries sent:         1808564
  Queries completed:    1808564 (100.00%)
  Queries lost:         0 (0.00%)

  Response codes:       NOERROR 1804946 (99.80%), SERVFAIL 3618 (0.20%)
  Average packet size:  request 29, response 101
  Run time (s):         65.071978
  Queries per second:   27793.284538

  Average Latency (s):  0.003284 (min 0.000057, max 2.007104)
  Latency StdDev (s):   0.010106

process_resident_memory_bytes also hovers in the 40/60 MB range.

So this rules out everything except the k8s plugin.

@miekg
Member

miekg commented Feb 25, 2019

Note this Corefile

.:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            upstream
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        proxy . /etc/resolv.conf
        cache 30
        reload
    }

is inefficient: the entire reverse trees are tunneled through k8s, and only if they are NXDOMAIN (fallthrough) are they resolved on the internet. In the original configmap there was also a rewrite, meaning you apply a regexp on every request as well.

Much better would be to split this up into multiple servers and specify a more specific reverse for k8s:

cluster.local 10.x/16 ::1/16 {  # or whatever the reverse v6 is
    errors
    health
    kubernetes {
        pods insecure
        upstream
    }
    cache 30  # even this is borderline, because of internal k8s caching
}

. {
    errors
    prometheus :9153
    proxy . /etc/resolv.conf
    cache 30
    reload
}

@miekg
Member

miekg commented Feb 25, 2019

I think what we need to do is perf just the k8s plugin and check where memory is being used.

@szuecs
Author

szuecs commented Feb 25, 2019

@miekg the problem with all dnsperf tools is that they do not create load that makes sense for the general case. https://github.com/mikkeloscar/go-dnsperf uses the /etc/resolv.conf settings to generate queries. We got completely different results when we used other tools to create load.

@miekg
Member

miekg commented Feb 25, 2019 via email

@chrisohaver
Member

What @szuecs is saying, if I'm not mistaken, is that most DNS performance tools (for good reason) send requests directly (as a single request), whereas go-dnsperf appends search domains from /etc/resolv.conf. With the k8s "search-path/ndots:5" situation, this multiplies the actual number of queries being sent by a large amount (x4-5). So while, from the client's POV, it's only making 1000 RPS, the server sees 4000-5000 RPS.

@szuecs
Author

szuecs commented Feb 25, 2019

@chrisohaver exactly, but it's (1 + number of search paths) * 2 (A and AAAA records are requested separately).

@miekg while you are correct, it's not, because the TLD nameservers are different, caching might be different, and so on. Details really matter in this case.
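Worked out as an illustration with the resolv.conf shown at the top of this issue (4 search domains):

    (1 + 4 search domains) * 2 record types = 10 server-side queries per external-name lookup (worst case)
    so ~1k client lookups/s can show up at CoreDNS as roughly 10k RPS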

@miekg
Member

miekg commented Feb 25, 2019 via email

@chrisohaver
Member

@szuecs, are there a large number of reverse lookups in your load?

@chrisohaver
Member

any reverse lookup will give atrocious performance.

Actually, any reverse lookup of an IP outside the cluster would have bad performance. For reverse lookups inside the cluster, there would be no performance penalty.

@miekg
Member

miekg commented Feb 25, 2019

re: mem usage
there is one obvious candidate for unbounded memory growth and that's the kube-cache that caches things. Figuring out what exactly requires navigating the client-go libs again (*sigh*, as these are complex and opaque).

@rajansandeep
Member

The usage pattern in test and outage are to have a couple of external (not cluster.local) DNS names to resolve.

@szuecs So are the queries in the test all external names, or are there internal name queries as well?

@chrisohaver
Member

@szuecs, can you share the list of names you tested with?

@szuecs
Author

szuecs commented Feb 25, 2019

We don’t use PTR records and the last time I saw something like a reverse lookup was when Apache did a reverse lookup for every access log. :D

The host name pattern looks like this:

svcname.clustername.example.com, and we often also have cross-cluster calls. If you take 5 different clusters and 10 different services and multiply, it should be fine for the test. These hostnames were all external names. They had just started to move workloads into this cluster.

Our currently tested idea to make it even better is ndots:2 and caching with dnsmasq in front of coredns.

@chrisohaver
Member

chrisohaver commented Feb 25, 2019

OK, thanks. Due to the ndots/search path thing, svcname.clustername.example.com results in about 60% of the query load being destined for the kubernetes plugin, the rest being forwarded upstream. The queries that go to the k8s plugin, though, get rejected pretty early, mostly during qname parsing (in parseRequest()), before diving into the k8s go-client cache.

Regarding the go-client, there is the k8s api watch (asynchronous from queries), but the resource usage there should not be correlated to the RPS load. i.e. we should not expect the client-go watch to start taking up more resources when the RPS ramps up.

Of course there is also the response cache (cache plugin), which means that at these high RPS loads practically 100% of queries are actually being served from cache; and with a small set of distinct query names (5 or so), the cache should not be write-locked very much at all during the test.
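(As a side note, the cache behaviour can be confirmed from the prometheus endpoint configured above; exact metric names can vary by version, but something along these lines:)

    curl -s http://localhost:9153/metrics | grep -E 'coredns_cache_(hits|misses)_total'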

@miekg
Member

miekg commented Feb 28, 2019

@szuecs do you have a graph of the number of goroutines and cpu?

also not sure anymore if this is kubernetes plugin related
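(If a Grafana panel isn't handy: the prometheus plugin also exposes the standard Go runtime metrics, so the goroutine count can be pulled directly, e.g.:)

    curl -s http://localhost:9153/metrics | grep '^go_goroutines'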

@rajansandeep
Member

I think I have reproduced the memory issue.
TL;DR: CoreDNS gets OOMKilled at high RPS.

Setup

I used the perf tool used by @szuecs, https://github.com/mikkeloscar/go-dnsperf, to check the performance of CoreDNS in Kubernetes.

  • 4-node Kubernetes v1.13.3 (1 master and 3 worker nodes)
  • CoreDNS v1.3.1 with 2 replicas deployed on the master node, with the following default ConfigMap Corefile:
 Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           upstream
           fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        proxy . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }

  • My /etc/resolv.conf is as follows:
cat /etc/resolv.conf 
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

Case 1

Number of client replicas deployed: 50
RPS of each replica: 100

CoreDNS was able to handle the requests, with memory consumption peaking at 220 MiB and going down slightly as time went by, stabilizing at around 157Mi.

(memory graph omitted)

Case 2

Number of client replicas deployed: 90
RPS of each replica: 100

CoreDNS gets OOMKilled constantly and is not able to handle all the requests from the clients.

Logs from one of the client replicas:

2019/02/28 18:04:23 [ERROR] lookup kubernetes.io on 10.96.0.10:53: dial udp 10.96.0.10:53: i/o timeout
2019/02/28 18:04:23 [ERROR] lookup example.org on 10.96.0.10:53: dial udp 10.96.0.10:53: i/o timeout
2019/02/28 18:04:23 [ERROR] lookup google.com on 10.96.0.10:53: dial udp 10.96.0.10:53: i/o timeout

Looking at the memory consumption of CoreDNS, it seems to take up around 1.2 GiB of memory before getting OOMKilled and restarting.
I do not understand yet why it gets OOMKilled, since the Memory Limit is set at 1.66GiB.

(memory graph omitted)

CPU Usage: (graph omitted)

Requests handled: CoreDNS is unable to keep up. (graph omitted)

Cache Hitrate: Looks like we are hitting the cache as expected. (graph omitted)

Cache Size: (graph omitted)

I will be continuing my investigation further.

@miekg
Member

miekg commented Mar 1, 2019 via email

@rajansandeep
Member

rajansandeep commented Mar 5, 2019

Continuing my investigation, on the same setup as #2593 (comment),

  • I have 1 instance of CoreDNS on the Master node.
  • RPS of each DNS client replica: 100
  • All queries to CoreDNS were external queries.
  • Initially, the number of DNS client replicas was kept at 25 and then increased until I observed OOMKills in the CoreDNS pod (which happened at 70 DNS client replicas)

Observations made during the test:

  • The maximum incoming request rate the CoreDNS pod could handle was ~21.5k pps at 25 DNS client replicas.

  • When the client replicas were increased beyond 25, CoreDNS continued to serve a maximum of ~21.5k pps.

  • For every step that I increased the client replicas, the CoreDNS memory use kept increasing (possibly due to the number of goroutines increasing).

  • This continued until I increased the client replicas to 70, after which CoreDNS started to get OOMKilled repeatedly and couldn't recover. This is because, as the pod restarts, it is flooded with requests from all 70 replicas at the same time. CoreDNS handles requests better when the load is incremental rather than a burst of requests flooding it.

  • When I decreased the replicas to 60, CoreDNS was able to recover, serving the same ~21.5k pps at considerably higher memory.

  • After the recovery, the number of goroutines was constant at ~30k.

Further test analysis:

  • At 25 replicas, when CoreDNS is able to process all requested queries, memory is stable at around 200MiB, with goroutines at around ~4k.
  • The goroutines (the server's workers) go up until CoreDNS is able to process that quantity.
  • Throughout the test, it seems there are always 5k pps processed in < 25ms.
  • The processing time of the extra queries depends on the client QPS: if pressure is high, these queries are processed more slowly, and as the number of goroutines increases, memory increases too.
  • When we reach the limit of 70 client replicas, CoreDNS starts to crash, the goroutine count goes up to 75k-85k, and memory blows up.

I have attached the metrics (they can be zoomed in for better readability) in the following order:

  • Total requests processed by CoreDNS
  • Goroutines
  • Memory
  • CPU
  • Query response time

(attached metrics image omitted)

Also attaching pprof:

pprof.coredns.samples.cpu.007.pb.gz
pprof.coredns.alloc_objects.alloc_space.inuse_objects.inuse_space.007.pb.gz
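(For anyone who wants to dig into these profiles, they can be inspected with the standard Go tooling, for example:)

    go tool pprof -top pprof.coredns.samples.cpu.007.pb.gz
    go tool pprof -top pprof.coredns.alloc_objects.alloc_space.inuse_objects.inuse_space.007.pb.gz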

@szuecs
Author

szuecs commented Mar 5, 2019

Very interesting observations and data!

You reproduced the same thing we saw in our production outage.
The memory spikes happen when a coredns instance crashed, similar to our outage, and this is why we had to set the memory limit super high to make sure it survives the start (maybe caused by the first flood of requests?).
Are the pprof files from the time of the spikes?

@szuecs
Author

szuecs commented Nov 6, 2019

@tommyulfsparre good catch, I could not disclose it before, because we had to create a fix for our infrastructure first. The underlying issue is much bigger than I expected.
I also wrote to security@golang.org on 2019-10-21. It was not treated as a security issue, but I was asked to create a public issue, which is fine.

@miekg this is probably helpful: it's not DNS, but it will crash any Go http/proxy that has unbounded growth of goroutines, and this leads to memory spikes and an OOM kill if you run in a memory-limited cgroup.

golang/go#35407

@miekg miekg changed the title request spike creates memory spike in docker request spike creates memory spike Nov 15, 2019
@miekg
Member

miekg commented Nov 15, 2019

Thanks @szuecs for filing that. As Go DNS closely mimics how net/http does these kinds of things, I wonder what they will implement. Meanwhile we need to do something in the forward plugin, or more generically in miekg/dns.
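Not the eventual CoreDNS or miekg/dns change, just a minimal Go sketch of the general idea being discussed: bound the number of in-flight upstream queries with a counting semaphore so a request flood fails fast instead of growing goroutines (and memory) without limit. The limit value, names, and stubbed upstream here are made up for illustration.

package main

import (
	"errors"
	"fmt"
)

var errTooManyQueries = errors.New("too many concurrent queries")

// limiter is a counting semaphore; its capacity bounds in-flight upstream work.
// The value 1000 is arbitrary for this sketch.
var limiter = make(chan struct{}, 1000)

// resolveUpstream reserves a slot before doing the (stubbed) upstream query.
// If no slot is free it returns an error immediately, instead of piling up
// ever more goroutines waiting on a slow or overloaded upstream.
func resolveUpstream(qname string, upstream func(string) (string, error)) (string, error) {
	select {
	case limiter <- struct{}{}:
		defer func() { <-limiter }()
		return upstream(qname)
	default:
		return "", errTooManyQueries
	}
}

func main() {
	answer, err := resolveUpstream("www.example.org.", func(q string) (string, error) {
		return "192.0.2.1", nil // stand-in for a real upstream exchange
	})
	fmt.Println(answer, err)
}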

mikkeloscar added a commit to zalando-incubator/kubernetes-on-aws that referenced this issue Nov 26, 2019
This introduces improvements to the CoreDNS configuration as suggested
in coredns/coredns#2593 (comment)
The change is to use multiple server directives to avoid expensive
lookup from Kubernetes plugin in terms of reverse DNS lookup or
expensive regex matching for `ingress.cluster.local` names.

* Use the `ready` plugin for readinessProbe
  https://github.com/coredns/coredns/tree/master/plugin/ready

Signed-off-by: Mikkel Oscar Lyderik Larsen <mikkel.larsen@zalando.de>
@miekg
Member

miekg commented Feb 4, 2020

See #3640 for a potential fix. You need to manually fiddle with the new max_concurrent setting, though.
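For anyone finding this later, a sketch of what that looks like in the Corefile once a release containing #3640 is deployed (the limit value here is arbitrary and needs tuning for your cluster):

    forward . /etc/resolv.conf {
        max_concurrent 1000
    }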

@szuecs
Author

szuecs commented Feb 5, 2020

@miekg thanks!
https://github.com/coredns/coredns/pull/3640/files#diff-e01203f369c90be1ca31ffd87006062fR53
and https://github.com/coredns/coredns/pull/3640/files#diff-e01203f369c90be1ca31ffd87006062fR87 do not seem to be aligned on the name: max_queries vs. max_concurrent. As far as I read the PR, max_queries should be changed to max_concurrent.

@chrisohaver
Member

seems not to be aligned on the name

Thanks - we changed the name during the review, and I missed a place.

@szuecs
Author

szuecs commented Mar 5, 2020

Since we now have the possibility to fix the issue, can we have a release and close this issue?
:)

@miekg
Member

miekg commented Mar 5, 2020 via email

@szuecs
Author

szuecs commented Apr 15, 2020

Since https://github.com/coredns/coredns/releases/tag/v1.6.9 we can set a concurrency limit.

@szuecs szuecs closed this as completed Apr 15, 2020
@willzgli

any reverse lookup will give atrocious performance.

Actually, any reverse lookup of an IP outside the cluster would have bad performance. For reverse lookups inside the cluster, there would be no performance penalty.

@chrisohaver why? Could you please explain it to me? Thanks.

@chrisohaver
Member

@rootdeep, in the default Kubernetes deployment, all reverse lookups are intercepted by the kubernetes plugin which searches for the IP address in the Service and Endpoints indexed object cache. If no IPs match, then the request is passed to the next plugin, forward, which forwards the request upstream. Thus any reverse lookup of an IP outside the cluster results in extra work (e.g. parsing IP from qname, and two indexed object lookups) before the request is forwarded upstream. This could be avoided by more precisely defining the reverse zones for the kubernetes plugin in the Corefile so they match the actual Cluster IP and Pod IP subnets. However, it is not trivial to automatically determine those subnets during an install of Kubernetes, hence the default behavior.

@erwbgy

erwbgy commented Apr 7, 2021

@chrisohaver Could you provide an example optimal Corefile somewhere with placeholders for the cluster IP and pod subnets? Then we could substitute in the values for our clusters and have better performance.

@chrisohaver
Member

It would be as per the default kubernetes CoreDNS configuration, with the kubernetes plugin replaced with, for example ...

kubernetes cluster.local 8.9.10.in-addr.arpa 0.172.in-addr.arpa

... for a cluster with a ClusterIP subnet 10.9.8.0/24 and a Pod subnet 172.0.0.0/16. Note that there should not be a fallthrough in-addr.arpa ip6.arpa in the kubernetes stanza.
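(The reverse zone names follow the usual in-addr.arpa convention: the fixed octets of the subnet, reversed, one label per octet:)

    10.9.8.0/24  ->  8.9.10.in-addr.arpa
    172.0.0.0/16 ->  0.172.in-addr.arpa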

@erwbgy

erwbgy commented Apr 7, 2021

Ok, so replace:

        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }

with:

        kubernetes cluster.local 8.9.10.in-addr.arpa 0.172.in-addr.arpa {
            pods insecure
        }

Perfect. Thank you @chrisohaver.
