
kubernetes: Verify k8s-API failure behavior #1967

Closed
chrisohaver opened this issue Jul 12, 2018 · 7 comments
Labels
kubernetes · plugin/kubernetes · not being worked on (No one is working on this issue. Might be reopened at some point)

Comments

@chrisohaver
Member

Recent question on Slack relating to kubernetes-api failover: during an API failure, CoreDNS replies with DNS errors (presumably NXDOMAIN) for Kubernetes records. Increasing TTLs could help, but we need to validate the behavior of the API connection during API outages...

We need to revisit/review the k8s client lib and verify:

- how it reconnects after the k8s API comes back
- how it maintains the k8s API cache while the API is down
- how it maintains the k8s API cache during a reconnect

Note: the "k8s API cache" is the cache maintained by the k8s client lib, not the CoreDNS cache plugin.
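
For context, a minimal sketch of the kind of client-go informer and local store involved, roughly what the plugin's controller wires up (illustrative only, not the plugin's actual code; the kubeconfig path and the choice of Services are assumptions):

```go
// Minimal sketch: a shared informer list-watches Services from the API and
// mirrors them into an in-memory store. That local store is the "k8s API
// cache" referred to above; lookups are answered from it, not from a live
// API call per DNS query.
package main

import (
	"fmt"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Resync period 0: rely on the watch to keep the local store current.
	factory := informers.NewSharedInformerFactory(client, 0)
	svcInformer := factory.Core().V1().Services().Informer()

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, svcInformer.HasSynced)

	// Dump the keys currently held in the local store.
	for _, key := range svcInformer.GetStore().ListKeys() {
		fmt.Println(key)
	}
	close(stop)
}
```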

chrisohaver self-assigned this Jul 12, 2018
@miekg
Member

miekg commented Jul 12, 2018 via email

@chrisohaver
Member Author

> Is this hard to test in the CI?

TBD. I think yes. I need to figure out how to test manually first. :)

@chrisohaver
Member Author

> NXDOMAIN... I hope not, SERVFAIL should hopefully be returned in that case.

Agreed, SERVFAIL would be the correct response. I don't know how we respond in this case. I'll see if I can reproduce it.

@miekg
Member

miekg commented Jul 12, 2018 via email

@chrisohaver
Member Author

I was just thinking I'd delete the kubernetes-api pod... and see how that goes.
The tricky part is, I think, observing the state of the API cache, and the timing of that observation, because Kubernetes will spin the API back up fairly quickly. I'd need to add temporary debug logging.

But making CoreDNS run external to the cluster would allow us to control the length of the outage using iptables as you suggest... and it would make it easier to time observations of the API cache state.

I haven't had time to play with it yet. It may be easier to automate than I fear.
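
Roughly what I'd run from outside the cluster while the API endpoint is blocked (e.g. with an iptables DROP rule on the apiserver port). Just a sketch against plain client-go, not the plugin's controller; the error handler and intervals are illustrative:

```go
// Illustrative observer: log watch errors so the start and end of the outage
// are visible, and poll the informer's local store to see whether it keeps
// its contents while the API is unreachable.
package main

import (
	"log"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	inf := factory.Core().V1().Services().Informer()

	// Surface list/watch failures so the outage window is visible in the log.
	inf.SetWatchErrorHandler(func(_ *cache.Reflector, err error) {
		log.Printf("watch error (API unreachable?): %v", err)
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, inf.HasSynced)

	// If this count holds steady during the outage and recovers afterwards,
	// the client-lib cache survives the API being down.
	for range time.Tick(5 * time.Second) {
		log.Printf("services in local cache: %d", len(inf.GetStore().ListKeys()))
	}
}
```

Deleting the apiserver pod instead should show up the same way in the log, just with a much shorter gap between the first watch error and recovery.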

@miekg
Member

miekg commented Jul 13, 2018

What is actually the problem? Reading through the issue, it says that if the control plane is down, we can't answer correctly.
Or do we not reconnect quickly enough?

@chrisohaver
Member Author

chrisohaver commented Jul 16, 2018

I don't know if there is a real problem here or not... I need to look at the mode of failure first hand.
At the least, I want to be able to document how CoreDNS behaves when the Kubernetes API fails.
If we find that the behavior is not desirable, then we'll need to look into fixing it if possible.

I have not validated this myself, but it sounds as if we lose the API cache too quickly when the API connection fails, perhaps immediately as the connection is lost. This is, IIUC, controlled by the Kubernetes client lib. Hopefully there is a connection option to add some persistence to it.
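
If I'm reading client-go right, the closest existing knob is the resync period on the informer factory, and that only replays objects already in the local store; whether the store itself is dropped when the connection fails is exactly what needs verifying. A sketch of where that gets set (package and function names are illustrative, not CoreDNS code):

```go
// Sketch of the client-go configuration surface relevant here. The resync
// period re-delivers objects already held in the local store; it is not, as
// far as I can tell, a persistence setting for API outages.
package apicache

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newFactory is an illustrative helper showing where the resync period and
// namespace scope are configured.
func newFactory(client kubernetes.Interface) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactoryWithOptions(
		client,
		5*time.Minute, // resync: replay from the local cache, not a fresh list
		informers.WithNamespace(metav1.NamespaceAll),
	)
}
```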

miekg added the "not being worked on" label Oct 18, 2018
miekg closed this as completed Oct 18, 2018