Minimize downtime by ensuring ES Pods endpoints are removed before ES is stopped #1927

Closed
sebgl opened this issue Oct 8, 2019 · 6 comments · Fixed by #2233
Labels
>enhancement Enhancement of existing functionality

Comments

@sebgl
Contributor

sebgl commented Oct 8, 2019

I think we can optimize ES Pod shutdown to minimize downtime in Service endpoints when removing or restarting ES Pods.

Endpoint controller code: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/endpoint/endpoints_controller.go

It seems to behave as I would expect: it just watches Pod and Service resources and creates/updates/deletes Endpoints accordingly.

Which means that when we delete a Pod, there's a small time frame where its IP address still appears in the Service endpoints, until the endpoint controller removes it.
During that time frame, a user that hits the service URL may get errors such as connection refused.

I believe some of our E2E tests actually hit that window from time to time:

--- FAIL: TestMutationAndReversal/Cluster_UUID_should_have_been_preserved#01 (0.06s)
15:40:59          cluster_uuid.go:58: 
15:40:59                  Error Trace:    cluster_uuid.go:58
15:40:59                  Error:          Received unexpected error:
15:40:59                                  Get https://test-reverted-mutation-q5qp-es-http.e2e-px7t6-mercury.svc:9200/: dial tcp 10.7.244.230:9200: connect: connection refused
15:40:59                  Test:           TestMutationAndReversal/Cluster_UUID_should_have_been_preserved#01
15:40:59  FAIL

An interesting read: https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods.

In a meeting, we discussed one way to optimize that behaviour:

We'd like to minimize the chances that a Pod endpoint appearing in the ES service points to an Elasticsearch node that is not accepting requests. Once a Pod is Terminating, it is removed from the Service endpoints. Which is fine, but in the meantime the Elasticsearch process may already have stopped accepting HTTP requests while still being part of the service endpoints.
If we set up a preStop hook to wait for a few seconds (e.g. sleep 10) before the Elasticsearch container receives a TERM signal, then there's a chance the corresponding endpoint disappears from the service before Elasticsearch stops listening for HTTP requests.
A more complex version of that could be to make the readiness probe react to the preStop hook, but that doesn't seem necessary, since Terminating is enough for the endpoint to be removed.
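
For illustration, a minimal sketch of what that could look like in the Pod template (the sleep duration and the grace period values below are placeholders, not a tested recommendation):

# Sketch only: delay the TERM signal to the Elasticsearch container so the
# endpoint controller has time to drop the Pod from the Service endpoints.
spec:
  # must stay larger than the preStop sleep plus the time ES needs to shut down
  terminationGracePeriodSeconds: 120
  containers:
    - name: elasticsearch
      lifecycle:
        preStop:
          exec:
            # the kubelet runs this and waits for it to complete (or for the
            # grace period to expire) before sending TERM to the container
            command: ["bash", "-c", "sleep 10"]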

Most ES clients already have some retry logic to handle random failures like that. Should we still optimize for it? (IMHO: yes)

@sebgl sebgl added >flaky_test >enhancement Enhancement of existing functionality and removed >flaky_test labels Oct 8, 2019
@sebgl sebgl changed the title Check cluster service endpoints in E2E tests Minimize downtime by ensuring ES Pods endpoints are removed before ES is stopped Oct 8, 2019
@david-kow
Contributor

Regarding the 'do nothing' option and relying on retries: while I agree that any sane client will have retries, I think it's something unexpected (one might ask: why would dead nodes be in the pool?). In any case, clients will see timeouts/5xx responses and/or increased latency. Given the number of nodes in smaller clusters and a non-default maxUnavailable config, this can dramatically lower the availability of the cluster for these short periods of time (e.g. 2 out of 5 nodes upgraded at a time, 40% of requests timing out). Also, there will be clients that hit this multiple times in a row, which will affect their SLAs/SLOs and request times.

I think the preStop hook is a quick fix we should go for, and we should experiment with the right values for the sleep and the grace period. This will cover most cases.

Depending on the number of cases it doesn't cover, we should evaluate whether we need a more complicated (although optimal) solution that would make sure the node is not in the pool before taking it down. I'm not sure what the behavior of the Service is for already-established connections, but if they are preserved, we could also give guarantees about long-lived connections.

@pebrc
Collaborator

pebrc commented Oct 9, 2019

If we set up a preStop hook to wait for a few seconds (e.g. sleep 10) before the Elasticsearch container receives a TERM signal, then there's a chance the corresponding endpoint disappears from the service before Elasticsearch stops listening for HTTP requests.

I think I like the preStop hook proposal because it is simple compared to the alternatives and requires zero coordination via the operator reconciliation loop.

Having said that, a fixed 10 sec buffer is of course sub-optimal. IIUC the value should be somewhat dynamically tied to the time it takes for a terminating Pod to be dropped from the service's endpoints, which might vary depending on factors such as load on the k8s apiserver, the state of the k8s controller that manages the service endpoints, etc.

@sebgl
Contributor Author

sebgl commented Oct 15, 2019

A command we can run from the preStop hook to know when the IP disappears from the StatefulSet headless service DNS records:

python -c 'import socket; socket.getaddrinfo("statefulset-name", 9200)'

This does not mean the endpoint is removed from the public-facing HTTP service, though.
But it could be a good approximation, instead of relying on sleep X?
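
A rough sketch of how that check could be turned into a preStop wait loop (not an actual implementation; POD_IP and HEADLESS_SERVICE_NAME are hypothetical env vars that would need to be injected into the Pod):

# Sketch only: wait until this Pod's IP stops resolving via the headless
# service, with a hard cap of ~30s, instead of sleeping a fixed duration.
# POD_IP and HEADLESS_SERVICE_NAME are hypothetical env vars.
lifecycle:
  preStop:
    exec:
      command:
        - bash
        - -c
        - |
          for i in $(seq 1 30); do
            # list all IPs currently returned for the headless service name
            python -c "import socket; print('\n'.join(a[4][0] for a in socket.getaddrinfo('$HEADLESS_SERVICE_NAME', 9200)))" \
              | grep -qxF "$POD_IP" || exit 0
            sleep 1
          done
          # give up after ~30s and let the normal shutdown proceed
          exit 0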

@anyasabo
Contributor

We could add an env var to the Pods with the HTTP service name, so we can check that as well. I didn't realize we had Python installed in the containers and thought we might need to add dig or something to do dig $(SERVICE_NAME).$(NAMESPACE).svc +short | grep $(POD_IP), but it seems like reaching for Python makes sense here since we already have it.
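
Something along these lines could provide the values such a check needs (a sketch: HTTP_SERVICE_NAME is a hypothetical variable name the operator would have to set; POD_IP comes from the downward API):

# Sketch only: env vars a shutdown check could rely on.
env:
  - name: HTTP_SERVICE_NAME
    value: my-cluster-es-http        # hypothetical; the operator would set the real <cluster>-es-http name
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP      # downward API: the Pod's own IP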

@anyasabo
Contributor

Related ES issue discussing some of the SIGTERM behavior: elastic/elasticsearch#45236

and an earlier Helm issue:
elastic/helm-charts#190

Thanks @Crazybus for the references.

@pebrc
Collaborator

pebrc commented Nov 21, 2019

I believe this would also work, without resorting to python, to find out which IPs are currently in the service's DNS record:

getent hosts $SSET_NAME
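
The same wait loop as in the earlier sketch, with getent swapped in for python (again just an illustration; SSET_NAME and POD_IP are assumed to be available as env vars in the container):

# Sketch only: getent-based variant of the wait-for-DNS-removal loop.
lifecycle:
  preStop:
    exec:
      command:
        - bash
        - -c
        - |
          for i in $(seq 1 30); do
            # getent lists the IPs currently behind the headless service name
            getent hosts "$SSET_NAME" | grep -qF "$POD_IP" || exit 0
            sleep 1
          done
          exit 0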
