
kubernetes: Verify k8s-API failure behavior #1967

Closed
chrisohaver opened this issue Jul 12, 2018 · 7 comments
Labels
kubernetes · plugin/kubernetes · not being worked on (No one is working on this issue. Might be reopened at some point)

Comments

@chrisohaver
Member

Recent question on Slack relating to kubernetes-api failover: during an API failure, CoreDNS replies with DNS errors (presumably NXDOMAIN) for Kubernetes records. Increasing TTLs could help, but we need to validate the behavior of the API connection during API outages...

We need to revisit/review the k8s client lib and verify:

- how it reconnects after the k8s API comes back
- how it maintains the k8s API cache while the API is down
- how it maintains the k8s API cache during a reconnect

Note: the "k8s API cache" is the cache maintained by the k8s client lib, not the CoreDNS cache plugin.
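
For context, a minimal sketch of the kind of client-go informer and local store involved, roughly what the plugin's controller wires up (illustrative only, not the plugin's actual code; the kubeconfig path and the choice of Services are assumptions):

```go
// Minimal sketch: a shared informer list-watches Services from the API and
// mirrors them into an in-memory store. That local store is the "k8s API
// cache" referred to above; lookups are answered from it, not from a live
// API call per DNS query.
package main

import (
	"fmt"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Resync period 0: rely on the watch to keep the local store current.
	factory := informers.NewSharedInformerFactory(client, 0)
	svcInformer := factory.Core().V1().Services().Informer()

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, svcInformer.HasSynced)

	// Dump the keys currently held in the local store.
	for _, key := range svcInformer.GetStore().ListKeys() {
		fmt.Println(key)
	}
	close(stop)
}
```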

chrisohaver self-assigned this Jul 12, 2018
@miekg
Member

miekg commented Jul 12, 2018 via email

@chrisohaver
Member Author

> Is this hard to test in the CI?

TBD. I think yes. I need to figure out how to test manually first. :)

@chrisohaver
Member Author

> NXDOMAIN... I hope not, SERVFAIL should hopefully be returned in that case.

Agreed, SERVFAIL would be the correct response. I don't know how we respond in this case. I'll see if I can reproduce it.

@miekg
Member

miekg commented Jul 12, 2018 via email

@chrisohaver
Member Author

I was just thinking I'd delete the kubernetes-api pod... and see how that goes.
The tricky part is, I think, observing the state of the API cache, and the timing of that observation, because Kubernetes will spin the API back up fairly quickly. I'd need to add temporary debug logging.

But making CoreDNS run external to the cluster would allow us to control the length of the outage using iptables as you suggest... and it would make it easier to time observations of the API cache state.

I haven't had time to play with it yet. It may be easier to automate than I fear.
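
Roughly what I'd run from outside the cluster while the API endpoint is blocked (e.g. with an iptables DROP rule on the apiserver port). Just a sketch against plain client-go, not the plugin's controller; the error handler and intervals are illustrative:

```go
// Illustrative observer: log watch errors so the start and end of the outage
// are visible, and poll the informer's local store to see whether it keeps
// its contents while the API is unreachable.
package main

import (
	"log"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	inf := factory.Core().V1().Services().Informer()

	// Surface list/watch failures so the outage window is visible in the log.
	inf.SetWatchErrorHandler(func(_ *cache.Reflector, err error) {
		log.Printf("watch error (API unreachable?): %v", err)
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, inf.HasSynced)

	// If this count holds steady during the outage and recovers afterwards,
	// the client-lib cache survives the API being down.
	for range time.Tick(5 * time.Second) {
		log.Printf("services in local cache: %d", len(inf.GetStore().ListKeys()))
	}
}
```

Deleting the apiserver pod instead should show up the same way in the log, just with a much shorter gap between the first watch error and recovery.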

@miekg
Member

miekg commented Jul 13, 2018

What is actually the problem? Reading through the issue, it says that if the control plane is down, we can't answer correctly.
Or do we not reconnect quickly enough?

@chrisohaver
Member Author

chrisohaver commented Jul 16, 2018

I don't know if there is a real problem here or not... I need to look at the mode of failure first hand.
At the least, I want to be able to document how CoreDNS behaves when the Kubernetes API fails.
If we find that the behavior is not desirable, then we'll need to look into fixing it if possible.

I have not validated this myself, but it sounds as if we lose the API cache too quickly when the API connection fails, perhaps immediately as the connection is lost. This is, IIUC, controlled by the Kubernetes client lib. Hopefully there is a connection option to add some persistence to it.
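
If I'm reading client-go right, the closest existing knob is the resync period on the informer factory, and that only replays objects already in the local store; whether the store itself is dropped when the connection fails is exactly what needs verifying. A sketch of where that gets set (package and function names are illustrative, not CoreDNS code):

```go
// Sketch of the client-go configuration surface relevant here. The resync
// period re-delivers objects already held in the local store; it is not, as
// far as I can tell, a persistence setting for API outages.
package apicache

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newFactory is an illustrative helper showing where the resync period and
// namespace scope are configured.
func newFactory(client kubernetes.Interface) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactoryWithOptions(
		client,
		5*time.Minute, // resync: replay from the local cache, not a fresh list
		informers.WithNamespace(metav1.NamespaceAll),
	)
}
```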

miekg added the "not being worked on" label Oct 18, 2018
miekg closed this as completed Oct 18, 2018