Minimize downtime by ensuring ES Pods endpoints are removed before ES is stopped #1927
Comments
Regarding the 'do nothing' option and relying on retries: while I agree that any sane client will have retries, I think it's something unexpected (one might ask: why would dead nodes be in the pool?). In any case, clients will see timeouts/5xxs and/or increased latency. Given the number of nodes in smaller clusters and a non-default maxUnavailable config, this can lower the availability of the cluster dramatically for these short periods of time (e.g. with 2/5 nodes upgraded at a time, 40% of requests timing out). There will also be clients that hit this multiple times in a row, which affects their SLA/SLO and request times.

I think the preStop hook is a quick fix we should go for, and we should experiment with the right values for the sleep and the grace period. This will cover most cases. Depending on the number of cases it doesn't cover, we can evaluate whether we need a more complicated (although optimal) solution that makes sure the node is no longer in the pool before taking it down. I'm not sure what the behavior of the Service is for already established connections, but if those are preserved, we could also give guarantees about long-lived connections.
I think I like the preStop hook proposal because it is simple compared to the alternatives and requires zero coordination via the operator reconciliation loop. Having said that, a fixed 10 sec buffer is of course sub-optimal. IIUC the value should be somewhat dynamically tied to the time it takes for a terminating Pod to be dropped from the service's endpoints, which might vary depending on factors such as load on the k8s apiserver, the state of the k8s controller that manages services, etc.
A command we can run from the preStop hook to know when the IP disappears from the StatefulSet headless service DNS: `python -c 'import socket; socket.getaddrinfo("statefulset-name", 9200)'`. That does not mean the endpoint is removed from the public-facing HTTP service, though.
We could add an env var to the Pods with the HTTP service name, so we can check that one as well. I didn't realize we had python installed in the containers; I thought we might need to add dig or something to do the lookup.
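Purely as an illustration of how these ideas could be combined in a preStop hook, here is a rough sketch. Everything in it is an assumption rather than existing operator behaviour: the env var names (`HEADLESS_SERVICE_NAME`, `POD_IP`), port 9200, the ~30s timeout, and the availability of bash and python in the container.

```yaml
lifecycle:
  preStop:
    exec:
      command:
      - "bash"
      - "-c"
      - |
        # Hypothetical sketch: wait (up to ~30s) until this Pod's IP is no longer
        # returned for the headless service DNS name before ES receives SIGTERM.
        # Assumes POD_IP is injected via the downward API and HEADLESS_SERVICE_NAME via an env var.
        # If the name stops resolving entirely, the python call fails and the loop
        # simply runs until the timeout.
        for i in $(seq 1 30); do
          python -c "import os, socket, sys; ips = {a[4][0] for a in socket.getaddrinfo(os.environ['HEADLESS_SERVICE_NAME'], 9200)}; sys.exit(0 if os.environ['POD_IP'] not in ips else 1)" && exit 0
          sleep 1
        done
```

The same loop could also resolve the HTTP service name injected via another env var before giving up, to cover the public-facing service as well.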
Related ES issue discussing some of the SIGTERM behavior: elastic/elasticsearch#45236, and an earlier Helm issue. Thanks @Crazybus for the references.
I believe this would work too, without resorting to python, to find out which IPs are currently in the service's DNS record:
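The actual command isn't preserved in this thread; as a hypothetical stand-in (not the original suggestion), something like this lists the addresses behind a service name using only glibc's getent, with no python or dig required:

```sh
# Hypothetical example: print the unique IPs currently resolvable for the service name.
getent ahosts statefulset-name | awk '{print $1}' | sort -u
```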
I think we can optimize the shutdown of ES Pods to minimize downtime in the Service endpoints when removing or restarting ES Pods.
Endpoint controller code: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/endpoint/endpoints_controller.go
It seems to behave as I would expect: it watches Pod and Service resources, and creates/updates/deletes the corresponding Endpoints accordingly.
Which means that when we delete a Pod, there's a small time frame during which its IP address still appears in the Service endpoints, until the endpoint controller removes it.
In this time frame, a user that hits the service URL may get errors such as `connection refused`. I believe some of our E2E tests actually hit that sweet spot from time to time.
An interesting read: https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods.
We discussed in a meeting one way to optimize that behaviour:
We'd like to minimize the chances that a Pod endpoint appearing in the ES service points to an Elasticsearch node that is not accepting requests. Once a Pod is `Terminating`, it is removed from the Service endpoints. Which is fine, but in the meantime the Elasticsearch process may already have stopped accepting HTTP requests while still being part of the service endpoints.

If we set up a `preStop` hook to wait for a few seconds (eg. `sleep 10`) before the Elasticsearch container receives a TERM signal, then there's a good chance the corresponding endpoint disappears from the service before Elasticsearch stops listening for HTTP requests (a rough sketch is included below). A more complex version of that could be to make the readiness probe react to the preStop hook, but that doesn't seem necessary, since `Terminating` is enough for the endpoint to be removed.

Most ES clients already have some retry logic to handle random failures like that. Should we still optimize for it? (IMHO: yes)