CoreDNS maxes out CPU / Possible cascading problem #2624

Open · WinstonPrivacy opened this issue Feb 27, 2019 · 23 comments · 6 participants
WinstonPrivacy commented Feb 27, 2019

We've noticed an occasional problem with using CoreDNS on a low-end ARM64 board (1 GB RAM, dual-core 1.0 GHz CPU).

If another process temporarily maxes out CPU, CoreDNS CPU spikes to 80%+ and stays there indefinitely, even after the other process has been terminated.

The issue only appears if there is a lot of DNS activity during the period of peak CPU usage. My guess is that the DNS queries pile up and CoreDNS is never able to catch up with them for some reason. As an example, here is a screenshot of CoreDNS log output taken a few minutes after terminating a high-CPU process. This continues indefinitely; in fact, the response times almost seem to get worse.

The problem goes away once CoreDNS is restarted. Happy to try and help diagnose this but I'm not sure where to start.

[screenshot: coredns failure]

miekg (Member) commented Feb 27, 2019

(This comment has been minimized.)

johnbelamaric (Member) commented Feb 27, 2019

(This comment has been minimized.)

miekg (Member) commented Feb 27, 2019

(This comment has been minimized.)

WinstonPrivacy (Author) commented Feb 27, 2019

A configurable threshold limit for inbound requests would be excellent. Reaching it could trigger some kind of cleanup logic. For instance, in this catastrophic case, many of the requests are duplicates, so resolving each unique query once for all of them might help a great deal.

Dropping 80% of the queue is another simplistic but effective solution. In our case it really doesn't matter, because everything is resolved with SERVFAIL, so nothing is getting through; dropping the entire queue would even work.

A LIFO approach would be good: the oldest requests get dropped first, because their clients are the most likely to have given up on them already.
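A drop-oldest queue along those lines is easy to sketch. This is illustrative Go only, not CoreDNS code; `boundedQueue`, its `limit`, and `push` are invented names for the idea described above:

```go
package main

import "fmt"

// boundedQueue is a hypothetical drop-oldest request queue: once it is
// full, the oldest pending request is discarded, since its client has
// most likely timed out and given up on the answer already.
type boundedQueue struct {
	items   []string
	limit   int
	dropped int
}

func (q *boundedQueue) push(req string) {
	if len(q.items) == q.limit {
		// Drop the oldest request rather than refusing the newest.
		q.items = q.items[1:]
		q.dropped++
	}
	q.items = append(q.items, req)
}

func main() {
	q := &boundedQueue{limit: 3}
	for i := 1; i <= 5; i++ {
		q.push(fmt.Sprintf("query-%d", i))
	}
	// With limit 3 and 5 pushes, the two oldest queries are dropped.
	fmt.Println(q.items, "dropped:", q.dropped)
}
```

A real server would hold *dns.Msg values and pair this with the duplicate-coalescing idea, but the eviction policy is the same.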

miekg (Member) commented Feb 27, 2019

(This comment has been minimized.)

willwp commented Mar 3, 2019

I have reproduced this on CoreDNS, same build and same platform as @WinstonPrivacy: spike the CPU, then bombard CoreDNS with queries. Attached are Prometheus screenshots. Notice that goroutines spike to > 2000 when reproducing. I cannot fetch pprof data while reproducing the issue, even with the plugin enabled in the Corefile.

[screenshot: go_routines]

[screenshot: process_cpu_seconds_total]

fturib (Member) commented Mar 3, 2019

  • For the goroutines, I think that is expected: there is no limit on the number of goroutines spawned to process DNS messages; the limit only applies to keeping a routine alive to process subsequent messages.

See: https://github.com/miekg/dns/blob/092d7745b47c1a19a6533514ff7a05387b5c5bf9/server.go#L244-L255

I agree that we should somehow limit the number of goroutines running in parallel anyway, and ensure we do not go above a reasonable amount of memory.
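A cap like that is commonly built with a buffered channel used as a counting semaphore. A minimal sketch, not the miekg/dns implementation; `runCapped`, `maxInflight`, and the handler body are invented stand-ins:

```go
package main

import (
	"fmt"
	"sync"
)

// runCapped starts n handler goroutines but never lets more than
// maxInflight run at once: sending on the buffered channel blocks
// once the buffer is full, and each finished handler frees one slot.
// It returns the peak concurrency actually observed.
func runCapped(n, maxInflight int) int {
	sem := make(chan struct{}, maxInflight)
	var wg sync.WaitGroup
	var mu sync.Mutex
	peak, inflight := 0, 0

	for i := 0; i < n; i++ {
		sem <- struct{}{} // blocks while maxInflight handlers are running
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-sem }() // release the slot

			mu.Lock()
			inflight++
			if inflight > peak {
				peak = inflight
			}
			mu.Unlock()

			// ... a real server would serve the DNS message here ...

			mu.Lock()
			inflight--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	// 100 simulated messages, at most 4 in flight at any moment.
	fmt.Println("peak concurrent handlers:", runCapped(100, 4))
}
```

The trade-off is that once the semaphore is exhausted, new queries wait (or could be dropped), which bounds both goroutine count and memory.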

  • What is concerning is that we do not release the memory after the spike of messages. It looks like once the system is stressed, it keeps its memory.

@willwp: can you share the Corefile for this test, and the version of the image used?

willwp commented Mar 4, 2019

Image: CoreDNS-1.2.5, linux/arm64, go1.11.2, with this fix merged in: #2319

Corefile:

:53 {
    cache 3600
    forward . tls://1.1.1.1 tls://1.0.0.1 {
        tls_servername cloudflare-dns.com
        health_check 5s
        max_fails 1
        expire 90s
    }
    alternate SERVFAIL . tls://9.9.9.9 tls://149.112.112.112 {
        tls_servername dns.quad9.net
        health_check 5s
        max_fails 1
        expire 90s
    }
}
.:54 {
    cache 3600
    forward . tls://1.1.1.1 tls://1.0.0.1 {
        tls_servername cloudflare-dns.com
        health_check 5s
        max_fails 1
        expire 90s
    }
    alternate SERVFAIL . tls://9.9.9.9 tls://149.112.112.112 {
        tls_servername dns.quad9.net
        health_check 5s
        max_fails 1
        expire 90s
    }
}

fturib (Member) commented Mar 5, 2019

@willwp: the case you propose does not reproduce the same situation as @WinstonPrivacy, IMO.

He is describing high CPU utilisation after a spike of queries, as if CoreDNS could never catch up with the incoming queries, returning SERVFAIL with very long query times.

In your case, you have a spike of incoming queries and the number of goroutines goes high (2k), but after the peak the goroutines return to normal, along with the CPU (we can see the slope of process_cpu_seconds_total going flat) and the total requests processed.

You are showing an issue, I think: I do not see why the memory used stays that high after the burst of queries.

=> Can you open a separate issue for this case? (I mean, keep the high CPU of this issue separate from the high memory use of your case.)

willwp commented Mar 5, 2019

@fturib I'm on the same team as @WinstonPrivacy and followed the same steps to reproduce, on the same hardware/software stack. CoreDNS gets into the high-CPU state after a spike of queries, when another process temporarily maxes out the CPU; in my case CoreDNS is maxing out the CPU at > 80-100% and staying there. I cannot fetch anything with pprof from CoreDNS while this CPU spike occurs. Is there a recommended method for me to properly collect pprof data? Are there other stats from Prometheus that would be helpful?

fturib (Member) commented Mar 5, 2019

Hmm... if you max out the CPU and stay there, why does the graph of process_cpu_seconds_total go flat after the burst? To me that means we are (almost) not consuming CPU anymore after the burst...

When you say CoreDNS is maxing out the CPU and staying there: what tool gives you the CPU used by CoreDNS?

What is your environment here? Are you in a Kubernetes cluster, or is CoreDNS running as a plain process?

willwp commented Mar 5, 2019

I'm running the same platform as @WinstonPrivacy, a low-end ARM64 board (1 GB RAM, dual-core 1.0 GHz CPU), with CoreDNS running as a process. I run htop while reproducing this problem and see the CPU for coredns pegged at > 80%. Since pprof was not able to fetch any data, is it possible that the Prometheus client/plugin was not able to provide accurate CPU data to the Prometheus server?

fturib (Member) commented Mar 5, 2019

I run htop during reproducing this problem and see CPU for coredns pegged > 80%.

  • You are not in a container environment? (Kubernetes, or Docker directly?)
  • Are you running CoreDNS from the Linux command line, or as a Linux service?

Since pprof was not able to fetch any data, is it possible that the Prometheus client/plugin was not able to provide accurate CPU data for the Prometheus server?

The data seems very clean... it is not random points.

  • Could Prometheus here be pointing at another instance of CoreDNS? (I guess not, because we can see the burst.)
  • Could the coredns entry in htop be measuring something bigger than ONLY CoreDNS?

willwp commented Mar 5, 2019

* You are not in a container environment ? (Kubernetes ? or Docker directly ?)

No, I'm not in a container environment.

* You are running a CoreDNS from cmd line of linux ? (or a linux service ?)

I'm running CoreDNS as a daemon, a Linux service (/etc/init.d/coredns).

The data seems very nice .. it is not random points.

* Can it happen that Prometheus here is pointing another instance of CoreDNS ? (I guess not, because we can see the burst).

It is not pointing to another instance of CoreDNS. There are no other instances of CoreDNS to monitor in my network or setup, and no other Prometheus clients running on this system, only the CoreDNS plugin.

* Can it happen that the `htop coredns` is monitoring something bigger than ONLY CoreDNS ?

htop displays the CPU usage of each process running on the system, and it shows the CoreDNS process itself utilizing > 80%.

SmashGuy2 commented Mar 6, 2019

The issue is easy to reproduce. Is there logging code we can add to narrow down what's happening?

fturib (Member) commented Mar 6, 2019

Sorry guys, I was busy on several other projects, and on the memory issue #2593.

But now, re-reading, and after the test @rajansandeep did for that memory issue, I am thinking that if this is easy to reproduce, the best would be:

1- add the pprof plugin to the Corefile
2- reproduce the state with permanently high CPU (80%), I mean after the burst, when the goroutines are back to almost zero
3- activate collection of pprof data for the CPU
4- look at the result: we should see immediately in which procedure the CPU is spent, and most likely understand the problem

Would that be feasible?

WinstonPrivacy (Author) commented Mar 7, 2019

The pprof plugin apparently isn't working when we reproduce this. We're going to rebuild and manually try to generate a pprof CPU profile.

willwp commented Mar 7, 2019

@fturib: Is the solution outlined in #2593 predicted to address what we're seeing in this ticket?

willwp commented Mar 7, 2019

@fturib: I followed the change proposed by @tmthrgd in his comment.
Improvements noticed:

  • CoreDNS recovered from > 80% CPU within 2-5 minutes after the other resource-hungry process was done using CPU and memory.

  • CoreDNS was no longer stuck in SERVFAIL, and within 2-5 minutes started resolving DNS queries.

Before @tmthrgd's change it would take CoreDNS > 2 hours to recover, even after the other resource-hungry processes were stopped.

fturib (Member) commented Mar 7, 2019

@willwp: I do not relate #2593 to the description of this issue, although both face a behavior problem after a burst of queries.

  • In #2593, the CPU is not high after the query burst, but memory is an issue during the burst.
  • In the current one, you are facing a memory spike that does not resolve after the burst, and you are also facing high CPU **after** the burst, I mean after the goroutines have gone down.

The difference may come from differences in the environment, and maybe in the test:

  • Kubernetes pod for #2593 versus a direct Linux process in the current issue.
  • Test: #2593 always queries the same 10 or 100 FQDNs... what about your case?

For the configuration (Corefile), it is different, but in both cases the main part of the query processing is just a cache hit.

I am happy to see that the proposition with no workers is working for you... but I have no idea why. I do not understand yet where the CPU is being used in your case.

fturib (Member) commented Mar 7, 2019

@willwp: I am curious to see a description of the kind of stress you are exercising during your test. I mean, what is the QPS, and are the queried domains always the same (in which case we hit the cache most often) or very diverse (in which case we hit the forward plugin more often)?

willwp commented Mar 7, 2019

@fturib: Here is how I run the test; the domains are diverse.

  • Run the coredns command line with logging enabled: ./coredns -conf Corefile

  • dnsperf -s x.x.x.x -p 53 -d queryfile-example-10million-201202 -l 300 -c 30 -Q 2000

  • Run some devices (mobile, streaming, apps) that make a lot of queries.

  • On the device, build a substantial Go program, such as coredns, to stress the memory and CPU.

I plan on removing the change from #2593 and reproducing with pprof compiled into coredns.go, because I cannot fetch anything from the pprof plugin during the failure.

coredns.go, enabling CPU profiling:

package main

//go:generate go run directives_generate.go
import (
        "github.com/coredns/coredns/coremain"

        // Plug in CoreDNS
        _ "github.com/coredns/coredns/core/plugin"

        // Profiling, remove before shipping
        "github.com/pkg/profile"
)

func main() {
        // CPU profiling; the profile is written when the process exits
        defer profile.Start().Stop()
        coremain.Run()
}

miekg (Member) commented Mar 8, 2019

(This comment has been minimized.)
