
coredns doesn't perform better despite having more cores #5595

Open
gpl opened this issue Sep 5, 2022 · 22 comments

Comments

@gpl

gpl commented Sep 5, 2022

We are running CoreDNS 1.9.3 (retrieved from the official releases on GitHub), and have been having difficulty with increasing performance of a single instance of coredns.

With GOMAXPROCS set to 1, we observe ~60k qps and full utilization of one core.

With GOMAXPROCS set to 2, we seem to hit a performance limit of ~90-100k qps, but it consumes almost entirely two cores.

With GOMAXPROCS set to 4, we observe that coredns will use all 4 cores - but throughput does not increase, and latency seems to be the same.

With GOMAXPROCS set to 8-64, we observe the same CPU usage and throughput.

We have the following corefile:

.:55 {
  file db.example.org example.org
  cache 100
  whoami
}

db.example.org

$ORIGIN example.org.
@       3600 IN SOA sns.dns.icann.org. noc.dns.icann.org. 2017042745 7200 3600 1209600 3600
        3600 IN NS a.iana-servers.net.
        3600 IN NS b.iana-servers.net.

www     IN A     127.0.0.1
        IN AAAA  ::1

We are using dnsperf: https://github.com/DNS-OARC/dnsperf

And the following command:

  dnsperf -d test.txt -s 127.0.0.1 -p 55 -Q 10000000 -c 1 -l 10000000 -S .1 -t 8

test.txt:

www.example.com AAAA

Is there anything we could be missing?

Thanks!

@gpl gpl added the question label Sep 5, 2022

@Tantalor93
Collaborator

Hello, if you could collect profiling data (a CPU profile) exposed by the pprof plugin, that would greatly benefit the investigation @gpl

@gpl
Author

gpl commented Sep 5, 2022

Attaching profiles for gomaxprocs 1,2,4,8,16.
coredns-gomaxprocs.zip

@Tantalor93
Collaborator

Tantalor93 commented Sep 8, 2022

Based on a quick look at the profiles, it seems that most of the CPU time serving DNS requests was spent writing responses to the client:

[CPU profile screenshot]

Are you running the dnsperf tool on the same machine as the CoreDNS instance? If so, CoreDNS might be influenced by dnsperf, since they share resources such as UDP sockets; the OS might have trouble servicing both, so a lot of time is spent in syscalls. But this is only my wild guess.

@johnbelamaric
Member

Could be something like that.

Generally, if giving it more CPU doesn't fix it, you are hitting other bottlenecks. The question is whether those are in the CoreDNS code (for example, some mutex contention) or in the underlying OS or hardware. In this case it looks like writing to the UDP socket. Look into tuning UDP performance on your kernel; you may want to check your UDP write buffer sizes, for example.

@gpl
Author

gpl commented Sep 8, 2022

Hmm, I don't believe either of those is the issue here --

We had previously adjusted various kernel parameters and haven't seen any significant deviation in performance. Additionally, from our telemetry I don't believe we're seeing any issues on that front.

Notably, the following values were adjusted on all hosts involved in this test:

net.core.rmem_default=262144
net.core.wmem_default=262144
net.core.rmem_max=262144
net.core.wmem_max=262144

We've also adjusted net.ipv4.ip_local_port_range to 1024-65k, just in case.

The tests were also run from various combinations of hosts, and we observed the same results when the tests and server were on different hosts (identical hardware).

@lobshunter
Contributor

With GOMAXPROCS set to 8-64, we observe the same CPU usage and throughput.

Does the same CPU usage mean CoreDNS uses up all 8-64 cores? If so, have you checked whether that CPU usage was all from CoreDNS? For instance, other processes or system services can steal some of that CPU time.

Another idea is to measure CPU time in different categories (user, system, softirq, etc.). That can help find the bottleneck.

@gpl
Author

gpl commented Sep 12, 2022

Sorry for the lack of clarity - CoreDNS doesn't consume more than 4-5 cores.

I rebuilt CoreDNS with symbols and ran perf instead of pprof:

[perf flame graph: coredns-c02-flamegraph]

@lobshunter
Contributor

I tried off-CPU analysis; the off-CPU flame graph looks similar to perf's. With more than 4 CPUs assigned to CoreDNS, time spent in serveUDP increased significantly. I haven't got a clue yet, though.

@Lobshunter86

I did more digging after that; it seems the bottleneck is the network I/O pattern.

CoreDNS starts 1 listener goroutine for each server instance and creates 1 new goroutine for each incoming request. So we have a single-producer (reads request packets), multi-consumer (handles requests and writes response packets) workflow.

With more CPUs assigned to the CoreDNS process, the consumers' processing speed scales correspondingly, but the producer's cannot. And when the Corefile uses only light plugins, the consumers' work is simple enough that handling doesn't need much CPU time. Under high load we hit the producer's limit: with only 1 goroutine, it cannot utilize more than 1 CPU core.

I ran some tests with the following Corefile on my laptop:

.:55 {
  file db.example.org example.org
  cache 100
  whoami
}

.:56 {
  file db.example.org example.org
  cache 100
  whoami
}

tests:

  1. 1 dnsperf process, sent requests to :55, total QPS ~110k.
  2. 2 dnsperf processes, both sent requests to :55, total QPS ~110k.
  3. 2 dnsperf processes, sent requests to :55 and :56 respectively, total QPS ~190k(and CoreDNS's CPU usage increased significantly).

@johnbelamaric
Member

Interesting. Any proposal for improvement?

@lobshunter
Contributor

I could try to find a way. But I do agree with the Redis team's philosophy: scaling horizontally is paramount, and CoreDNS scales horizontally pretty well. So the lack of vertical scaling is not a critical issue.

PS: @Lobshunter86 is me, too.

@horahoradev

I'm also interested in this issue; I haven't contributed, but this sounds fun to work on. Could I claim this?

@lobshunter
Contributor

I'm also interested in this issue; I haven't contributed, but this sounds fun to work on. Could I claim this?

Please go ahead and have fun😉. I have been occupied at work recently.

@horahoradev

horahoradev commented Dec 31, 2022

I threw the UDP message read within miekg/dns into a goroutine pool. The results are OK.
Without my change:

  Queries sent:         13988687
  Queries completed:    13988288 (100.00%)
  Queries lost:         300 (0.00%)
  Queries interrupted:  99 (0.00%)

  Response codes:       NOERROR 13988288 (100.00%)
  Average packet size:  request 32, response 100
  Run time (s):         105.433637
  Queries per second:   132673.863845

  Average Latency (s):  0.000627 (min 0.000023, max 0.022370)
  Latency StdDev (s):   0.000151

CPU utilization ~420%

With my change:

  Queries sent:         5735429
  Queries completed:    5735336 (100.00%)
  Queries lost:         0 (0.00%)
  Queries interrupted:  93 (0.00%)

  Response codes:       NOERROR 5735336 (100.00%)
  Average packet size:  request 32, response 100
  Run time (s):         29.624372
  Queries per second:   193601.943697

  Average Latency (s):  0.000483 (min 0.000014, max 0.009152)
  Latency StdDev (s):   0.000336

CPU utilization ~560%

So notably the CPU utilization went up, but QPS rose ~50% and average latency dropped noticeably.
I'll have to do some profiling later. I wonder why the latency stddev went up 🤔
golang/go#45886 should help if UDPConn's ReadMsg is the bottleneck

@lobshunter
Contributor

In my understanding, golang/go#45886 should improve the performance of long-lived UDP connections (i.e. reading lots of data from the same UDP socket, like QUIC). Would it help a DNS workload, since every DNS request/response belongs to a different socket?

@rrrix

rrrix commented Mar 11, 2023

This Cloudflare blog post seems keenly relevant to this issue: Go, don't collect my garbage

The author describes a performance puzzle very similar to the one in the first post - namely, 1-4 cores work well, with quickly diminishing returns at higher concurrency. He achieved vastly improved performance by experimenting with Go garbage-collection tuning via the GOGC environment variable (a.k.a. the runtime/debug.SetGCPercent function).

SetGCPercent sets the garbage collection target percentage: a collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. SetGCPercent returns the previous setting. The initial setting is the value of the GOGC environment variable at startup, or 100 if the variable is not set. This setting may be effectively reduced in order to maintain a memory limit. A negative percentage effectively disables garbage collection, unless the memory limit is reached. See SetMemoryLimit for more details.

Before GOGC tuning (# of goroutines vs. ops/s): [chart]

After GOGC tuning (# of goroutines vs. ops/s): [chart]

One caveat is that his benchmark ran for only 10 seconds, which may have skewed the results in unexpected ways.

The challenge is that the optimal value is highly hardware-dependent, so there's no one "right answer" for GOGC that fits every user in every scenario.

@gpl perhaps tuning the GOGC environment variable in the same manner as the blog post above would yield positive results?

P.S.
Excellent additional/background reading on Go Garbage Collection: A Guide to the Go Garbage Collector

@lobshunter
Contributor

A memo: I found an interesting approach that uses SO_REUSEPORT and multiple net.ListenUDP calls. According to the author's benchmark, it outperforms a single listener with multiple ReadFromUDP readers.

I shall give it a try when I get time.

@iyashu
Contributor

iyashu commented Mar 22, 2023

Yes @lobshunter, that is correct. I think the LWN article explains the improvements and a few caveats (esp. with TCP) of the SO_REUSEPORT option. Last week, I validated the improvement by simply starting multiple servers on the same port (we already set that option at ListenPacket, as seen here) after making the following code changes:

diff --git a/core/dnsserver/register.go b/core/dnsserver/register.go
index 8de55906..ac581eca 100644
--- a/core/dnsserver/register.go
+++ b/core/dnsserver/register.go
@@ -3,6 +3,8 @@ package dnsserver
 import (
  "fmt"
  "net"
+ "os"
+ "strconv"
  "time"

  "github.com/coredns/caddy"
@@ -157,36 +159,43 @@ func (h *dnsContext) MakeServers() ([]caddy.Server, error) {
  }
  // then we create a server for each group
  var servers []caddy.Server
- for addr, group := range groups {
- // switch on addr
- switch tr, _ := parse.Transport(addr); tr {
- case transport.DNS:
- s, err := NewServer(addr, group)
- if err != nil {
- return nil, err
- }
- servers = append(servers, s)

- case transport.TLS:
- s, err := NewServerTLS(addr, group)
- if err != nil {
- return nil, err
- }
- servers = append(servers, s)
+ numSock, err := strconv.ParseInt(os.Getenv("NUM_SOCK"), 10, 64)
+ if err != nil {
+ numSock = 1
+ }
+ for i := 0; i < int(numSock); i++ {
+ for addr, group := range groups {
+ // switch on addr
+ switch tr, _ := parse.Transport(addr); tr {
+ case transport.DNS:
+ s, err := NewServer(addr, group)
+ if err != nil {
+ return nil, err
+ }
+ servers = append(servers, s)

- case transport.GRPC:
- s, err := NewServergRPC(addr, group)
- if err != nil {
- return nil, err
- }
- servers = append(servers, s)
+ case transport.TLS:
+ s, err := NewServerTLS(addr, group)
+ if err != nil {
+ return nil, err
+ }
+ servers = append(servers, s)

- case transport.HTTPS:
- s, err := NewServerHTTPS(addr, group)
- if err != nil {
- return nil, err
+ case transport.GRPC:
+ s, err := NewServergRPC(addr, group)
+ if err != nil {
+ return nil, err
+ }
+ servers = append(servers, s)
+
+ case transport.HTTPS:
+ s, err := NewServerHTTPS(addr, group)
+ if err != nil {
+ return nil, err
+ }
+ servers = append(servers, s)
  }
- servers = append(servers, s)
  }
  }

Essentially, I've just exposed an env var NUM_SOCK representing the number of sockets (and thereby servers) to use for serving requests. To validate the improvement, I used a Corefile similar to the one in the issue description above:

.:55 {
  file db.example.org example.org
  cache 100
  whoami
}

1. With a single listen socket, I achieve ~130K qps throughput from dnsperf on a private cloud instance.

$ NUM_SOCK=1 taskset -c 2-35 ./coredns-fix
.:55
CoreDNS-1.10.1
linux/amd64, go1.19.3
$ taskset -c 38-71 dnsperf -d test.txt -s 127.0.0.1 -p 55 -c 1000 -l 100000 -S .1 -T 16
  Queries sent:         5919568
  Queries completed:    5919470 (100.00%)
  Queries lost:         0 (0.00%)
  Queries interrupted:  98 (0.00%)

  Response codes:       NOERROR 5919470 (100.00%)
  Average packet size:  request 33, response 103
  Run time (s):         45.693927
  Queries per second:   129546.099200

  Average Latency (s):  0.000756 (min 0.000016, max 0.006743)
  Latency StdDev (s):   0.000400
CoreDNS CPU Utilization: 275%
DNS Perf CPU Utilization: 480%

2. With two listen sockets, I achieve ~235K qps throughput from dnsperf.

$ NUM_SOCK=2 taskset -c 2-35 ./coredns-fix
.:55
.:55
CoreDNS-1.10.1
linux/amd64, go1.19.3
$ ss -u -a | grep 55
UNCONN 0      0                      *:55                *:*
UNCONN 0      0                      *:55                *:*
$ taskset -c 38-71 dnsperf -d test.txt -s 127.0.0.1 -p 55 -c 1000 -l 100000 -S .1 -T 16
  Queries sent:         17760093
  Queries completed:    17759997 (100.00%)
  Queries lost:         0 (0.00%)
  Queries interrupted:  96 (0.00%)

  Response codes:       NOERROR 17759997 (100.00%)
  Average packet size:  request 33, response 103
  Run time (s):         75.404526
  Queries per second:   235529.588768

  Average Latency (s):  0.000411 (min 0.000018, max 0.006754)
  Latency StdDev (s):   0.000379
CoreDNS CPU Utilization: 570%
DNS Perf CPU Utilization: 780%

3. With 4 listen sockets, I achieve ~400K qps throughput from dnsperf.

$ NUM_SOCK=4 taskset -c 2-35 ./coredns-fix
.:55
.:55
.:55
.:55
CoreDNS-1.10.1
linux/amd64, go1.19.3
$ ss -u -a | grep 55
UNCONN 0      0                        *:55                *:*
UNCONN 0      0                        *:55                *:*
UNCONN 0      0                        *:55                *:*
UNCONN 0      0                        *:55                *:*
$ taskset -c 38-71 dnsperf -d test.txt -s 127.0.0.1 -p 55 -c 1000 -l 100000 -S .1 -T 16
  Queries sent:         20535534
  Queries completed:    20535443 (100.00%)
  Queries lost:         0 (0.00%)
  Queries interrupted:  91 (0.00%)

  Response codes:       NOERROR 20535443 (100.00%)
  Average packet size:  request 33, response 103
  Run time (s):         51.342591
  Queries per second:   399968.965337

  Average Latency (s):  0.000235 (min 0.000020, max 0.003655)
  Latency StdDev (s):   0.000197
CoreDNS CPU Utilization: 1371%
DNS Perf CPU Utilization: 1191%

So, I think the bottleneck was indeed the throughput limitation of a single socket, and we can scale throughput almost linearly as we increase the number of listen sockets. I'll create a pull request after validating TCP traffic (non-TLS) once I get some more time. Thanks.

@lobshunter
Contributor

@iyashu Excellent productivity 👍.

@crliu3227
Contributor

@iyashu Really looking forward to this PR
