
coredns doesn't perform better despite having more cores #5595

Open
gpl opened this issue Sep 5, 2022 · 22 comments

Comments

@gpl

gpl commented Sep 5, 2022

We are running CoreDNS 1.9.3 (retrieved from the official releases on GitHub), and have been having difficulty with increasing performance of a single instance of coredns.

With GOMAXPROCS set to 1, we observe ~60k qps and full utilization of one core.

With GOMAXPROCS set to 2, we seem to hit a performance limit of ~90-100k qps, but it consumes almost entirely two cores.

With GOMAXPROCS set to 4, we observe that coredns will use all 4 cores - but throughput does not increase, and latency seems to be the same.

With GOMAXPROCS set to 8-64, we observe the same CPU usage and throughput.

We have the following corefile:

.:55 {
  file db.example.org example.org
  cache 100
  whoami
}

db.example.org

$ORIGIN example.org.
@       3600 IN SOA sns.dns.icann.org. noc.dns.icann.org. 2017042745 7200 3600 1209600 3600
        3600 IN NS a.iana-servers.net.
        3600 IN NS b.iana-servers.net.

www     IN A     127.0.0.1
        IN AAAA  ::1

We are using dnsperf: https://github.com/DNS-OARC/dnsperf

And the following command:

  dnsperf -d test.txt -s 127.0.0.1 -p 55 -Q 10000000 -c 1 -l 10000000 -S .1 -t 8

test.txt:

www.example.com AAAA

Is there anything we could be missing?

Thanks!

@gpl gpl added the question label Sep 5, 2022

@Tantalor93
Collaborator

Hello, if you could collect profiling data (a CPU profile) exposed by the pprof plugin, that would greatly benefit the investigation @gpl

@gpl
Author

gpl commented Sep 5, 2022

Attaching profiles for gomaxprocs 1,2,4,8,16.
coredns-gomaxprocs.zip

@Tantalor93
Collaborator

Tantalor93 commented Sep 8, 2022

Based on a quick look at the profiles, it seems that most of the CPU time serving DNS requests was spent writing responses to the client:

[CPU profile screenshot]

Are you running the dnsperf tool on the same machine as the CoreDNS instance? If so, CoreDNS might be influenced by dnsperf, since they share resources such as UDP sockets; the OS might have trouble servicing both, so a lot of time is spent in syscalls. But this is only my wild guess.

@johnbelamaric
Member

Could be something like that.

Generally, if giving it more CPU doesn't fix it, you are hitting other bottlenecks. The question is whether those are in the CoreDNS code (for example, some mutex contention) or in the underlying OS or hardware. In this case it looks like writing to the UDP socket. Look into tuning UDP performance on your kernel; you may want to check your UDP write buffer sizes, for example.

@gpl
Author

gpl commented Sep 8, 2022

Hmm, I don't believe either of those is the issue here --

We had previously adjusted various kernel parameters and haven't seen any significant deviation in performance. Additionally, from our telemetry I don't believe we're seeing any issues on that front.

Notably, the following values were adjusted on all hosts involved in this test:

net.core.rmem_default=262144
net.core.wmem_default=262144
net.core.rmem_max=262144
net.core.wmem_max=262144

We've also adjusted net.ipv4.ip_local_port_range to 1024-65k, just in case.

The tests were also run from various combinations of hosts, and we observed the same results when the tests and server were on different hosts (identical hardware).

@lobshunter
Contributor

With GOMAXPROCS set to 8-64, we observe the same CPU usage and throughput.

Does the same CPU usage mean CoreDNS uses up all 8-64 cores? If so, have you checked whether that CPU usage was all from CoreDNS? For instance, other processes or system services can steal some of that CPU time.

Another idea is to measure CPU time in different categories (user, system, softirq, etc.). That can help find the bottleneck.

@gpl
Author

gpl commented Sep 12, 2022

Sorry for the lack of clarity - CoreDNS doesn't consume more than 4-5 cores.

I rebuilt CoreDNS with symbols and ran perf instead of pprof:

[perf flame graph: coredns-c02-flamegraph]

@lobshunter
Contributor

I tried off-CPU analysis; the off-CPU flame graph looks similar to perf's. With more than 4 CPUs assigned to CoreDNS, time spent in serveUDP increased significantly. I haven't got a clue yet, though.

@Lobshunter86

I did more digging after that; it seems the bottleneck is the network I/O pattern.

CoreDNS starts 1 listener goroutine for each server instance and creates 1 new goroutine for each incoming request. So we have a single-producer (reads request packets), multi-consumer (handles requests and writes response packets) workflow.

With more CPUs assigned to the CoreDNS process, the consumers' processing speed scales correspondingly, but the producer's cannot. And when the Corefile uses only light plugins, the consumers' work is simple enough that handling doesn't need much CPU time. Under high load we hit the producer's limit: with only 1 goroutine, it cannot utilize more than 1 CPU core.

I ran some tests with the following Corefile on my laptop:

.:55 {
  file db.example.org example.org
  cache 100
  whoami
}

.:56 {
  file db.example.org example.org
  cache 100
  whoami
}

tests:

  1. 1 dnsperf process, sent requests to :55, total QPS ~110k.
  2. 2 dnsperf processes, both sent requests to :55, total QPS ~110k.
  3. 2 dnsperf processes, sent requests to :55 and :56 respectively, total QPS ~190k(and CoreDNS's CPU usage increased significantly).

@johnbelamaric
Member

Interesting. Any proposal for improvement?

@lobshunter
Contributor

I could try to find a way. But I do agree with the Redis team's philosophy: scaling horizontally is paramount, and CoreDNS scales horizontally pretty well. So the lack of vertical scaling is not a critical issue.

PS: @Lobshunter86 is me, too.

@horahoradev

I'm also interested in this issue; I haven't contributed, but this sounds fun to work on. Could I claim this?

@lobshunter
Contributor

I'm also interested in this issue; I haven't contributed, but this sounds fun to work on. Could I claim this?

Please go ahead and have fun😉. I have been occupied at work recently.

@horahoradev

horahoradev commented Dec 31, 2022

I threw the UDP message read within miekg/dns into a goroutine pool. The results are OK.
Without my change:

  Queries sent:         13988687
  Queries completed:    13988288 (100.00%)
  Queries lost:         300 (0.00%)
  Queries interrupted:  99 (0.00%)

  Response codes:       NOERROR 13988288 (100.00%)
  Average packet size:  request 32, response 100
  Run time (s):         105.433637
  Queries per second:   132673.863845

  Average Latency (s):  0.000627 (min 0.000023, max 0.022370)
  Latency StdDev (s):   0.000151

CPU utilization ~420%

With my change:

  Queries sent:         5735429
  Queries completed:    5735336 (100.00%)
  Queries lost:         0 (0.00%)
  Queries interrupted:  93 (0.00%)

  Response codes:       NOERROR 5735336 (100.00%)
  Average packet size:  request 32, response 100
  Run time (s):         29.624372
  Queries per second:   193601.943697

  Average Latency (s):  0.000483 (min 0.000014, max 0.009152)
  Latency StdDev (s):   0.000336

CPU utilization ~560%

So notably the CPU utilization went up, but QPS rose ~50% and average latency dropped noticeably.
I'll have to do some profiling later. I wonder why the latency stddev went up 🤔
golang/go#45886 should help if UDPConn's ReadMsg is the bottleneck

@lobshunter
Contributor

In my understanding, golang/go#45886 should improve the performance of long-lived UDP connections (i.e. reading lots of data from the same UDP socket, like QUIC). Would it help a DNS workload, since every DNS request/response belongs to a different socket?

@rrrix

rrrix commented Mar 11, 2023

This Cloudflare blog post seems keenly relevant to this issue: Go, don't collect my garbage

The author describes a performance puzzle very similar to the one in the first post - namely, 1-4 cores work well, with quickly diminishing returns at higher concurrency. He achieved vastly improved performance by experimenting with Go garbage-collection tuning via the GOGC environment variable (a.k.a. the runtime/debug.SetGCPercent function).

SetGCPercent sets the garbage collection target percentage: a collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. SetGCPercent returns the previous setting. The initial setting is the value of the GOGC environment variable at startup, or 100 if the variable is not set. This setting may be effectively reduced in order to maintain a memory limit. A negative percentage effectively disables garbage collection, unless the memory limit is reached. See SetMemoryLimit for more details.

Before GOGC tuning (# of goroutines vs. ops/s): [chart]

After GOGC tuning (# of goroutines vs. ops/s): [chart]

One caveat is that his benchmark ran for only 10 seconds, which may have skewed the results in unexpected ways.

The challenge is that the optimal value is highly hardware-dependent, so there's no one "right answer" for GOGC that fits every user in every scenario.

@gpl perhaps tuning the GOGC environment variable in the same manner as the blog post above would yield positive results?

P.S.
Excellent additional/background reading on Go Garbage Collection: A Guide to the Go Garbage Collector

@lobshunter
Contributor

A memo: I found an interesting approach that uses SO_REUSEPORT and multiple net.ListenUDP calls. According to the author's benchmark, it outperforms a single listener with multiple ReadFromUDP readers.

I shall give it a try when I get time.

@iyashu
Contributor

iyashu commented Mar 22, 2023

Yes @lobshunter, that is correct. I think the LWN article explains the improvements and a few caveats (esp. with TCP) of the SO_REUSEPORT option. Last week, I validated the improvement by simply starting multiple servers on the same port (we already set that option at ListenPacket, as seen here) after making the following code changes:

diff --git a/core/dnsserver/register.go b/core/dnsserver/register.go
index 8de55906..ac581eca 100644
--- a/core/dnsserver/register.go
+++ b/core/dnsserver/register.go
@@ -3,6 +3,8 @@ package dnsserver
 import (
  "fmt"
  "net"
+ "os"
+ "strconv"
  "time"

  "github.com/coredns/caddy"
@@ -157,36 +159,43 @@ func (h *dnsContext) MakeServers() ([]caddy.Server, error) {
  }
  // then we create a server for each group
  var servers []caddy.Server
- for addr, group := range groups {
- // switch on addr
- switch tr, _ := parse.Transport(addr); tr {
- case transport.DNS:
- s, err := NewServer(addr, group)
- if err != nil {
- return nil, err
- }
- servers = append(servers, s)

- case transport.TLS:
- s, err := NewServerTLS(addr, group)
- if err != nil {
- return nil, err
- }
- servers = append(servers, s)
+ numSock, err := strconv.ParseInt(os.Getenv("NUM_SOCK"), 10, 64)
+ if err != nil {
+ numSock = 1
+ }
+ for i := 0; i < int(numSock); i++ {
+ for addr, group := range groups {
+ // switch on addr
+ switch tr, _ := parse.Transport(addr); tr {
+ case transport.DNS:
+ s, err := NewServer(addr, group)
+ if err != nil {
+ return nil, err
+ }
+ servers = append(servers, s)

- case transport.GRPC:
- s, err := NewServergRPC(addr, group)
- if err != nil {
- return nil, err
- }
- servers = append(servers, s)
+ case transport.TLS:
+ s, err := NewServerTLS(addr, group)
+ if err != nil {
+ return nil, err
+ }
+ servers = append(servers, s)

- case transport.HTTPS:
- s, err := NewServerHTTPS(addr, group)
- if err != nil {
- return nil, err
+ case transport.GRPC:
+ s, err := NewServergRPC(addr, group)
+ if err != nil {
+ return nil, err
+ }
+ servers = append(servers, s)
+
+ case transport.HTTPS:
+ s, err := NewServerHTTPS(addr, group)
+ if err != nil {
+ return nil, err
+ }
+ servers = append(servers, s)
  }
- servers = append(servers, s)
  }
  }

Essentially, I've just exposed an env var NUM_SOCK representing the number of sockets (and thereby servers) to use for serving requests. To validate the improvement, I used a Corefile similar to the one in the issue description above:

.:55 {
  file db.example.org example.org
  cache 100
  whoami
}

1. With a single listen socket, I achieve ~130K qps throughput from dnsperf on a private cloud instance.

$ NUM_SOCK=1 taskset -c 2-35 ./coredns-fix
.:55
CoreDNS-1.10.1
linux/amd64, go1.19.3
$ taskset -c 38-71 dnsperf -d test.txt -s 127.0.0.1 -p 55 -c 1000 -l 100000 -S .1 -T 16
  Queries sent:         5919568
  Queries completed:    5919470 (100.00%)
  Queries lost:         0 (0.00%)
  Queries interrupted:  98 (0.00%)

  Response codes:       NOERROR 5919470 (100.00%)
  Average packet size:  request 33, response 103
  Run time (s):         45.693927
  Queries per second:   129546.099200

  Average Latency (s):  0.000756 (min 0.000016, max 0.006743)
  Latency StdDev (s):   0.000400
CoreDNS CPU Utilization: 275%
DNS Perf CPU Utilization: 480%

2. With two listen sockets, I achieve ~235K qps throughput from dnsperf.

$ NUM_SOCK=2 taskset -c 2-35 ./coredns-fix
.:55
.:55
CoreDNS-1.10.1
linux/amd64, go1.19.3
$ ss -u -a | grep 55
UNCONN 0      0                      *:55                *:*
UNCONN 0      0                      *:55                *:*
$ taskset -c 38-71 dnsperf -d test.txt -s 127.0.0.1 -p 55 -c 1000 -l 100000 -S .1 -T 16
  Queries sent:         17760093
  Queries completed:    17759997 (100.00%)
  Queries lost:         0 (0.00%)
  Queries interrupted:  96 (0.00%)

  Response codes:       NOERROR 17759997 (100.00%)
  Average packet size:  request 33, response 103
  Run time (s):         75.404526
  Queries per second:   235529.588768

  Average Latency (s):  0.000411 (min 0.000018, max 0.006754)
  Latency StdDev (s):   0.000379
CoreDNS CPU Utilization: 570%
DNS Perf CPU Utilization: 780%

3. With 4 listen sockets, I achieve ~400K qps throughput from dnsperf.

$ NUM_SOCK=4 taskset -c 2-35 ./coredns-fix
.:55
.:55
.:55
.:55
CoreDNS-1.10.1
linux/amd64, go1.19.3
$ ss -u -a | grep 55
UNCONN 0      0                        *:55                *:*
UNCONN 0      0                        *:55                *:*
UNCONN 0      0                        *:55                *:*
UNCONN 0      0                        *:55                *:*
$ taskset -c 38-71 dnsperf -d test.txt -s 127.0.0.1 -p 55 -c 1000 -l 100000 -S .1 -T 16
  Queries sent:         20535534
  Queries completed:    20535443 (100.00%)
  Queries lost:         0 (0.00%)
  Queries interrupted:  91 (0.00%)

  Response codes:       NOERROR 20535443 (100.00%)
  Average packet size:  request 33, response 103
  Run time (s):         51.342591
  Queries per second:   399968.965337

  Average Latency (s):  0.000235 (min 0.000020, max 0.003655)
  Latency StdDev (s):   0.000197
CoreDNS CPU Utilization: 1371%
DNS Perf CPU Utilization: 1191%

So, I think the bottleneck was indeed the throughput limitation of a single socket, and we can scale throughput almost linearly as we increase the number of listen sockets. I'll create a pull request after validating TCP traffic (non-TLS) once I get some more time. Thanks.

@lobshunter
Contributor

@iyashu Excellent productivity 👍.

@crliu3227
Contributor

@iyashu Really looking forward to this PR
