reload plugin should check the Corefile before reloading #2339
Comments
I thought it already behaved this way... or perhaps it used to, and has changed. As part of the seamless restart design, a new server instance is created to read the new Corefile while the current server keeps serving. If the new config is invalid (it results in an error during setup), the current server remains active and the new server is stopped. I think there were some issues with that; specifically, plugins that bind to ports during setup would step on each other. For example, this could cause the ...
... and after the forced restart, the new (invalid) config is used, since the old one no longer exists.
@sp-joseluis-ledesma: yes, it is the current behavior - we try to load the new Corefile before switching to the new configuration. Can you clarify what you mean by CoreDNS "crashing"?
Hi,
If we update the Corefile adding:
where non.existing.file does not exist, CoreDNS crashes with:
and it will never come back (obviously, because the configuration is not valid).
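For illustration, a server block along these lines would reproduce what is being described; this is a hedged reconstruction, since the reporter's exact snippet and the crash output were not preserved in this thread:

```
example.org {
    file non.existing.file
}
```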
I also tried setting an invalid parameter (kmatch instead of match) in the template plugin, with the same behavior:
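Likewise, a hedged sketch of the template misconfiguration mentioned above, with kmatch as the deliberate typo for match (the regex and answer are illustrative):

```
.:53 {
    template IN A example.org {
        kmatch "^ip-(?P<a>[0-9]*)[.]example[.]org[.]$"
        answer "{{ .Name }} 60 IN A {{ .Group.a }}"
    }
}
```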
@sp-joseluis-ledesma, this is probably due to what I explained above. Let me be more verbose about the order of events leading to the issue...
@fturib, has the reload port collision problem been fixed? I don't see an issue open for it.
Was it fixed by #1688? This was merged before the 1.1.2 release date, so it should be in 1.1.3...
#1688 was not accepted as is. There is now a new event in Caddy, implemented by @ekleiner, that allows restarting the listener when the reload fails (OnFailRestart). However, I am not sure that really helps the k8s use case: the current CoreDNS will keep running, but the Corefile on disk is wrong. So the next restart of CoreDNS will fail... and because that restart can be weeks later, nobody will notice the issue in the Corefile. It is better to have the failure right at the time you change the ConfigMap (and therefore the Corefile), and to fix the issue right away, instead of leaving CoreDNS running in memory with a Corefile that will fail on the next start. What would be more useful (IMO) is a mechanism to check the ConfigMap BEFORE updating the Corefile.
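As a rough illustration of that idea, here is a hedged sketch of a pre-flight check run before pushing the ConfigMap change; the file names, port, and timeout handling are assumptions, and this is not an existing CoreDNS feature:

```sh
# Corefile.candidate is the new config you intend to put into the ConfigMap.
# Start a throwaway CoreDNS against it on a non-standard port: if plugin setup
# fails (bad directive, missing zone file, ...) CoreDNS exits non-zero; if setup
# succeeds it keeps serving until `timeout` stops it (exit code 124).
# Note: plugins that need cluster access (e.g. kubernetes) may fail here for
# reasons unrelated to the Corefile itself.
timeout 5 coredns -conf Corefile.candidate -dns.port 15353
rc=$?
if [ "$rc" -ne 0 ] && [ "$rc" -ne 124 ]; then
    echo "candidate Corefile failed to load, not updating the ConfigMap" >&2
    exit 1
fi

# Only now push the change into the cluster.
kubectl -n kube-system create configmap coredns --from-file=Corefile=Corefile.candidate \
    --dry-run=client -o yaml | kubectl apply -f -
```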
I think this is WAI from a core perspective; it's a plugin throwing a fatal error, and those checks are done after a successful reload. We can't control all plugins, so the question becomes what is expected here. Should the file plugin not error? Does that make sense from the viewpoint of that plugin? I.e. you're giving it a valid config, but a plugin trips up after the reload. It looks to me like your provisioning framework should take care of this, not CoreDNS.
No, it's not a valid config. It can't even load the Corefile. The problem is that the failed restart causes the health endpoint to stop responding, which in turn gets the process killed and restarted with the now-bad Corefile.
As said above, I think we can now easily fix the health endpoint by restarting it on the "OnFailedRestart" event of Caddy. But the Corefile will stay invalid on disk, and if the user does not take care, they will face an invalid Corefile later, on the next start of CoreDNS.
IMO, CoreDNS should not crash if a bad Corefile is applied, because this means no DNS resolution can be made in the cluster, which means all the applications running in that cluster stop working at the very same time. It is obviously not good to have a bad configuration in place, but you can monitor CoreDNS logs (or a metric) for errors while trying to reload the configuration, which is much better IMO than simply crashing.
> As said above, I think we can now easily fix the health endpoint by restarting it on the "OnFailedRestart" event of Caddy.
Ah, didn't know that was added. That might simplify a lot of hackery.
Strictly speaking, CoreDNS is not "crashing," at least not initially. Kubernetes is forcibly restarting it. Once this happens, CoreDNS tries to read the current config (which is now invalid), and it exits. I tried to illustrate this in my comment above, which spells out step by step how that occurs.

IMO, we should fix the health check bug that causes Kubernetes to think CoreDNS is unhealthy and kill it. But this won't fix everything. It's critical to understand the failure mode, because as Francois points out, even if the health check interaction is fixed: once you change the configuration in the Kubernetes ConfigMap, all new instances of CoreDNS will use that new config. If the new config is invalid, the new instances will fail to start. K8s can restart/reschedule pods for a number of reasons: for example, if a node fails and the pods need to be rescheduled elsewhere, if a CoreDNS pod crashes because of some unrelated CoreDNS bug, if CoreDNS is OOM-killed on an oversubscribed node, or if you need to scale up the number of CoreDNS pods. If CoreDNS is started with an invalid config, what else can it do other than exit?

From one perspective, fixing the health check bug will make the situation worse: it would change the current behavior of "fail within minutes of changing the ConfigMap" to "fail at some unknown time in the future".
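For context, the interaction described above comes from the liveness probe in the typical CoreDNS Deployment, which targets the health plugin's endpoint; roughly like this (exact values vary by distribution and version):

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 60
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 5
```

If the health endpoint stops answering after a failed reload, this probe fails, the kubelet restarts the container, and the restarted container then reads the now-invalid Corefile and exits.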
I think a kubernetes-y way of rolling out a ConfigMap change would be to disable reload in CoreDNS and use a deployment rollout, i.e. something like the sketch below.
But this feels cumbersome.
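For concreteness, a hedged sketch of the kind of flow meant here (the original example was not preserved in this thread, and the commands and resource names are illustrative):

```sh
# 1. Remove the reload plugin from the Corefile so running pods never pick up a
#    changed ConfigMap on their own.
# 2. Edit the ConfigMap with the new Corefile.
kubectl -n kube-system edit configmap coredns

# 3. Roll the Deployment: new pods start with the new config, and a pod that
#    fails to come up with a bad Corefile blocks the rollout instead of taking
#    down the existing, healthy pods.
kubectl -n kube-system rollout restart deployment coredns
kubectl -n kube-system rollout status deployment coredns
```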
I think it would be better to have a metric reporting an error reloading the Corefile, or to monitor the CoreDNS logs, than to have everything fail at the same time. A CoreDNS pod being restarted, or a new one failing to start, is something we can monitor and will not produce any downtime; if all the CoreDNS pods fail at the same time, we will have downtime for sure. There are services (Prometheus installed with the operator and Apache come to mind, but I'm sure there are more) that simply refuse to restart if the config is not valid, and it is up to the user to monitor when this happens, but they do not end up crashing and producing downtime.
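If a reload-failure metric is available, an alert along these lines would surface a bad Corefile without any downtime; the metric name below is an assumption, so check what your CoreDNS version actually exports:

```yaml
groups:
  - name: coredns
    rules:
      - alert: CoreDNSCorefileReloadFailed
        # Fires if any CoreDNS instance recently reported a failed Corefile reload.
        expr: increase(coredns_reload_failed_count[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "CoreDNS failed to reload its Corefile; the old config is still being served"
```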
I believe you can get that metric via the k8s Deployment; it will show X/Y pods running.
Duped into #2659.
Right now, if you are using the reload plugin and the Corefile is modified with a syntax error, CoreDNS will crash.
It would be nice if the reload plugin checked the Corefile before reloading, and just complained if the Corefile is not right.
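For reference, a kubernetes-style Corefile with the reload plugin enabled might look roughly like this (zone names and the interval are illustrative, not taken from the report):

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa
    forward . /etc/resolv.conf
    cache 30
    reload 30s
}
```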