reload plugin to check Corefile before reloading #2339

Closed
sp-joseluis-ledesma opened this issue Nov 27, 2018 · 19 comments
@sp-joseluis-ledesma

Right now, if you are using the reload plugin and the Corefile is modified with a syntax error, CoreDNS will crash.

It would be nice if the reload plugin checked the Corefile before reloading and just complained if the Corefile is not valid.

@chrisohaver
Member

I thought it already behaved this way... or perhaps it used to, and has changed.

As part of the seamless restart design, a new server instance is created to read the new Corefile while the current server keeps operating. If the new config is invalid (i.e. results in an error during setup), the current server remains active and the new server is stopped.

I think there were some issues with that; specifically, plugins that bind to ports during setup would step on each other. For example, this could cause the health plugin to stop working after trying to load an invalid config (which would result in a forced restart in k8s). I don't know whether those issues remain or have been fixed.

@fturib

@chrisohaver
Member

For example, this could cause the health plugin to stop working after trying to load an invalid config (which would result in a forced restart in k8s).

... and after the forced restart, the new (invalid) config is used, since the old one no longer exists.

@fturib
Contributor

fturib commented Nov 27, 2018

@sp-joseluis-ledesma : yes, that is the current behavior - we try to load the new Corefile before switching to the new configuration.

Can you clarify what you mean by CoreDNS "crashing"?
Is the crash visible in the CoreDNS logs, or does it happen at the Kubernetes level? In the latter case, maybe what happens is that health does not recover properly from a reload with an invalid file.

@sp-joseluis-ledesma
Author

sp-joseluis-ledesma commented Nov 28, 2018

Hi,
sorry, I think I was not specific enough in the issue. We use CoreDNS (image k8s.gcr.io/coredns:1.1.3) in Kubernetes with the following ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
  name: coredns
  namespace: kube-system
data:
  Corefile: |-
    .:53 {
        errors
        log
        health
        kubernetes cluster.local. in-addr.arpa ip6.arpa {
          pods verified
          upstream
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        autopath @kubernetes
        proxy . /etc/resolv.conf
        cache 30
        reload
    }

If we update the Corefile by adding:

file non.existing.file {
}

where non.existing.file does not exist, CoreDNS crashes with:

2018/11/28 13:10:15 [INFO] Reloading
2018/11/28 13:10:15 [ERROR] plugin/reload: Corefile changed but reload failed: plugin/file: open non.existing.file: no such file or directory
$ kubectl get pods |grep coredns
coredns-7946cb8c9d-hd95p                                 0/1       CrashLoopBackOff   8          13d

and it will never come back (obviously because the configuration is not valid).
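
For completeness, a typical way to apply and then inspect such a change (the commands are illustrative; the pod name is taken from the output above):

$ kubectl -n kube-system edit configmap coredns          # apply the Corefile change in place
$ kubectl -n kube-system logs coredns-7946cb8c9d-hd95p   # inspect why the pod keeps failing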

@sp-joseluis-ledesma
Author

I also tried setting an invalid parameter (kmatch instead of match) in the template plugin, with the same behavior:

2018/11/28 13:20:17 [ERROR] plugin/reload: Corefile changed but reload failed: plugin/template: /etc/coredns/Corefile:11 - Error during parsing: Wrong argument count or unexpected line ending after 'kmatch'
2018/11/28 13:21:05 [INFO] SIGTERM: Shutting down servers then terminating

@chrisohaver
Member

chrisohaver commented Nov 28, 2018

@sp-joseluis-ledesma, this is probably due to what I explained above. Let me be more verbose:

Order of events leading to the issue:

  1. coredns is running with a valid config.
  2. The coredns configmap is changed to be invalid.
  3. The old config no longer exists at this point, because it was changed in the configmap.
  4. coredns sees the config change and tries to use it by spinning up a new server in parallel with the existing server.
  5. The new server fails to start because the config is invalid, so coredns leaves the current server active; it is still running and was set up using the old config (when it still existed).
  6. Due to a bug, coredns now starts failing to respond to health checks from the k8s infrastructure.
  7. The k8s infrastructure kills coredns because it thinks coredns is unhealthy.
  8. coredns then restarts, using the current config (which is invalid).
  9. coredns fails to start because the config is invalid.
  10. k8s restarts coredns over and over again on the bad config, i.e. a crash loop.
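
The sequence above can be observed from outside the pod with standard kubectl commands (illustrative; the pod name is taken from the earlier comment):

# watch the pod transition to CrashLoopBackOff after the configmap change (steps 6-10)
$ kubectl -n kube-system get pods -w | grep coredns
# logs of the last terminated container: the reload failure followed by the forced shutdown (steps 5-7)
$ kubectl -n kube-system logs coredns-7946cb8c9d-hd95p --previous
# logs of the current container: the start-up failure on the now-invalid Corefile (steps 8-9)
$ kubectl -n kube-system logs coredns-7946cb8c9d-hd95p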

@chrisohaver
Member

@fturib, has the reload port collision problem been fixed? I don't see an issue open for it.

@chrisohaver
Member

Was it fixed by #1688? That was merged before the 1.1.2 release date, so it should be in 1.1.3...

@fturib
Contributor

fturib commented Nov 28, 2018

#1688 was not accepted as is.
So, right now, if the reload fails, we just lose the health endpoint.

There is now a new event in Caddy, implemented by @ekleiner, that allows restarting the listener when the reload fails (OnFailRestart).

However, I am not sure that will really help the k8s use case: the current CoreDNS will continue running, but the Corefile on disk is wrong. So the next restart of CoreDNS will also fail, and because that restart can be weeks later, nobody will notice the issue in the Corefile.

It is better to have a failure right at the time you change the ConfigMap (and therefore the Corefile), and to fix the issue right away, instead of leaving CoreDNS running in memory with a Corefile that will fail on the next start.

What would be more useful (IMO) is a mechanism to check the ConfigMap BEFORE updating the Corefile.
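
One way to approximate that check today, without any changes to CoreDNS, is to dry-run the candidate Corefile in a throwaway container before touching the ConfigMap. This is only a sketch under assumptions: docker is available, the image tag matches the cluster, and plugins that talk to external systems (e.g. kubernetes without a reachable API or kubeconfig) may fail start-up outside the cluster even with a valid Corefile, so treat it as a syntax/setup smoke test.

# pull the live Corefile out of the ConfigMap and edit a local copy
$ kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}' > Corefile.candidate
$ vi Corefile.candidate
# dry-run it: a broken Corefile makes CoreDNS exit non-zero almost immediately,
# while a good one keeps serving until timeout stops it (timeout exits with 124)
$ timeout 10 docker run --rm -v "$PWD/Corefile.candidate:/Corefile" k8s.gcr.io/coredns:1.1.3 -conf /Corefile
$ [ $? -eq 124 ] && echo "Corefile loads" || echo "Corefile failed to load"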

@miekg
Member

miekg commented Nov 28, 2018

I think this is WAI (working as intended) from a core perspective; it's a plugin throwing a fatal error, and those checks are done after a successful reload. We can't control all plugins, so the question becomes what is expected here.

Should the file plugin not error? Does that make sense from the viewpoint of that plugin?

I.e. you're giving it a valid config, but a plugin trips up after the reload. It looks to me like your provisioning framework should take care of this, not coredns.

@johnbelamaric
Member

johnbelamaric commented Nov 28, 2018 via email

@fturib
Contributor

fturib commented Nov 29, 2018

As said above, I think we can now easily fix the health endpoint by restarting it on the Caddy "OnFailedRestart" event.

But the Corefile will stay invalid on disk, and if the user does not take care, they will face the invalid Corefile later on, on the next start of CoreDNS.

@sp-joseluis-ledesma
Author

IMO, CoreDNS should not crash if a bad Corefile is applied, because that means no DNS resolution can happen in the cluster, which means all the applications running in that cluster stop working at the very same time. It is obviously not good to have a bad configuration in place, but you can monitor the CoreDNS logs (or a metric) for an error reloading the configuration, which IMO is much better than simply crashing.
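
As a stopgap, that reload failure is at least visible in the logs (the exact error line is quoted earlier in this issue), so it can be alerted on without CoreDNS crashing. A minimal sketch, assuming the common k8s-app=kube-dns label; adjust the selector and tail length to your deployment:

$ kubectl -n kube-system logs -l k8s-app=kube-dns --tail=200 \
    | grep 'Corefile changed but reload failed' \
    && echo "ALERT: an invalid Corefile was applied"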

@miekg
Member

miekg commented Nov 29, 2018 via email

@chrisohaver
Member

IMO, CoreDNS should not crash if a bad Corefile is applied,

Strictly speaking, CoreDNS is not "crashing", not initially. Kubernetes is forcibly restarting it. Once that happens, coredns tries to read the current config (which is now invalid) and exits. I tried to illustrate this in my comment above, which spells out step by step how that occurs.

IMO, we should fix the health check bug that causes Kubernetes to think coredns is unhealthy and kill it. But this won't fix everything...

It's critical to understand the mode of the failure because, as Francois points out, even if the health check interaction is fixed: once you change the configuration in the kubernetes configmap, all new instances of coredns will use that new config. If the new config is invalid, the new instances will fail to start. K8s can restart/reschedule pods for a number of reasons: for example, if a node fails and the pods need to be rescheduled elsewhere, if a coredns pod crashes due to some unrelated coredns bug, if coredns is OOM-killed on an oversubscribed node, or if you need to scale up the number of coredns pods.

If coredns is started with an invalid config, what else can it do other than exit?

From one perspective, fixing the health check bug would make the situation worse: it would change the current behavior of "fail within minutes of changing the configmap" to "fail at some unknown time in the future".

@chrisohaver
Member

chrisohaver commented Nov 29, 2018

I think a kubernetes-y way of rolling out a configmap change would be to disable reload in coredns and use a deployment rollout (a kubectl sketch follows below), i.e.

  1. create a new configmap (instead of editing the existing one)
  2. roll out a change to the coredns deployment so it uses the new configmap
  3. monitor the rollout and revert if necessary

But this feels cumbersome.
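
For reference, a rough sketch of that flow with kubectl; the ConfigMap/Deployment names and the volume index are assumptions about the manifest:

# 1. create a new, versioned ConfigMap instead of editing the existing one
$ kubectl -n kube-system create configmap coredns-v2 --from-file=Corefile=Corefile.candidate
# 2. point the deployment's config volume at the new ConfigMap (here assumed to be the first volume); this triggers a rolling update
$ kubectl -n kube-system patch deployment coredns --type=json \
    -p='[{"op":"replace","path":"/spec/template/spec/volumes/0/configMap/name","value":"coredns-v2"}]'
# 3. watch the rollout and revert if the new pods crash-loop
$ kubectl -n kube-system rollout status deployment/coredns
$ kubectl -n kube-system rollout undo deployment/coredns    # only if the rollout fails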

@sp-joseluis-ledesma
Author

I think it would be easier to have a metric reporting an error reloading the Corefile, or to monitor the CoreDNS logs, than to have everything fail at the same time.

If a CoreDNS pod is restarted, or a new one tries to start and fails, that is something we can monitor, and it will not produce any downtime. If all the CoreDNS pods fail at the same time, we will have downtime for sure.

There are services (Prometheus installed with the operator and Apache come to mind, but I'm sure there are more) that simply refuse to restart with an invalid config; it is up to the user to monitor when this happens, but they do not end up crashing (for whatever reason) and producing downtime.

@chrisohaver
Member

I think that it would be easy to have a metric reporting

I believe you can get that metric via the k8s deployment; it will show X/Y pods running.
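
i.e. something along these lines (deployment name as deployed in the cluster above):

$ kubectl -n kube-system get deployment coredns    # AVAILABLE vs. DESIRED is effectively that X/Y signal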

@miekg
Member

miekg commented Apr 1, 2019

duped into #2659

@miekg miekg closed this as completed Apr 1, 2019
@miekg miekg added the duplicate label Apr 1, 2019