Minimize downtime by ensuring ES Pods endpoints are removed before ES is stopped #1927

Closed
sebgl opened this issue Oct 8, 2019 · 6 comments · Fixed by #2233
Labels
>enhancement Enhancement of existing functionality

Comments

@sebgl
Contributor

sebgl commented Oct 8, 2019

I think we can optimize ES Pod shutdown to minimize downtime in Service endpoints when removing or restarting ES Pods.

Endpoint controller code: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/endpoint/endpoints_controller.go

It seems to behave as I would expect: it just watches Pod and Service resources and creates/updates/deletes Endpoints accordingly.

Which means that when we delete a Pod, there's a small time frame where its IP address still appears in the Service endpoints, until the endpoint controller removes it.
During that time frame, a user that hits the service URL may get errors such as connection refused.

I believe some of our E2E tests actually hit that window from time to time:

--- FAIL: TestMutationAndReversal/Cluster_UUID_should_have_been_preserved#01 (0.06s)
15:40:59          cluster_uuid.go:58: 
15:40:59                  Error Trace:    cluster_uuid.go:58
15:40:59                  Error:          Received unexpected error:
15:40:59                                  Get https://test-reverted-mutation-q5qp-es-http.e2e-px7t6-mercury.svc:9200/: dial tcp 10.7.244.230:9200: connect: connection refused
15:40:59                  Test:           TestMutationAndReversal/Cluster_UUID_should_have_been_preserved#01
15:40:59  FAIL

An interesting read: https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods.

In a meeting, we discussed one way to optimize that behaviour:

We'd like to minimize the chances that a Pod endpoint appearing in the ES service points to an Elasticsearch node that is not accepting requests. Once a Pod is Terminating, it is removed from the Service endpoints. Which is fine, but in the meantime the Elasticsearch process may already have stopped accepting HTTP requests while still being part of the service endpoints.
If we set up a preStop hook to wait for a few seconds (e.g. sleep 10) before the Elasticsearch container receives a TERM signal, then there's a chance the corresponding endpoint disappears from the service before Elasticsearch stops listening for HTTP requests.
A more complex version of that could be to make the readiness probe react to the preStop hook, but that doesn't seem necessary, since Terminating is enough for the endpoint to be removed.
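
For illustration, a minimal sketch of what that could look like in the Pod template (the sleep duration and the grace period values below are placeholders, not a tested recommendation):

# Sketch only: delay the TERM signal to the Elasticsearch container so the
# endpoint controller has time to drop the Pod from the Service endpoints.
spec:
  # must stay larger than the preStop sleep plus the time ES needs to shut down
  terminationGracePeriodSeconds: 120
  containers:
    - name: elasticsearch
      lifecycle:
        preStop:
          exec:
            # the kubelet runs this and waits for it to complete (or for the
            # grace period to expire) before sending TERM to the container
            command: ["bash", "-c", "sleep 10"]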

Most ES clients already have some retry logic to handle random failures like that. Should we still optimize for it? (IMHO: yes)

@sebgl sebgl added >flaky_test >enhancement Enhancement of existing functionality and removed >flaky_test labels Oct 8, 2019
@sebgl sebgl changed the title Check cluster service endpoints in E2E tests Minimize downtime by ensuring ES Pods endpoints are removed before ES is stopped Oct 8, 2019
@david-kow
Contributor

Regarding the 'do nothing' option and relying on retries: while I agree that any sane client will have retries, I think it's something unexpected (one might ask: why would dead nodes be in the pool?). In any case, clients will see timeouts/5xx responses and/or increased latency. Given the number of nodes in smaller clusters and a non-default maxUnavailable config, this can dramatically lower the availability of the cluster for these short periods of time (e.g. 2 out of 5 nodes upgraded at a time, 40% of requests timing out). Also, there will be clients that hit this multiple times in a row, which will affect their SLAs/SLOs and request times.

I think the preStop hook is a quick fix we should go for, and we should experiment with the right values for the sleep and the grace period. This will cover most cases.

Depending on the number of cases it doesn't cover, we should evaluate whether we need a more complicated (although optimal) solution that would make sure the node is not in the pool before taking it down. I'm not sure what the behavior of the Service is for already-established connections, but if they are preserved, we could also give guarantees about long-lived connections.

@pebrc
Collaborator

pebrc commented Oct 9, 2019

If we set up a preStop hook to wait for a few seconds (e.g. sleep 10) before the Elasticsearch container receives a TERM signal, then there's a chance the corresponding endpoint disappears from the service before Elasticsearch stops listening for HTTP requests.

I think I like the preStop hook proposal because it is simple compared to the alternatives and requires zero coordination via the operator reconciliation loop.

Having said that, a fixed 10 sec buffer is of course sub-optimal. IIUC the value should be somewhat dynamically tied to the time it takes for a terminating Pod to be dropped from the service's endpoints, which might vary depending on factors such as load on the k8s apiserver, the state of the k8s controller that manages the service endpoints, etc.

@sebgl
Contributor Author

sebgl commented Oct 15, 2019

A command we can run from the preStop hook to know when the IP disappears from the StatefulSet headless service DNS records:

python -c 'import socket; socket.getaddrinfo("statefulset-name", 9200)'

This does not mean the endpoint is removed from the public-facing HTTP service, though.
But it could be a good approximation, instead of relying on sleep X?
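
A rough sketch of how that check could be turned into a preStop wait loop (not an actual implementation; POD_IP and HEADLESS_SERVICE_NAME are hypothetical env vars that would need to be injected into the Pod):

# Sketch only: wait until this Pod's IP stops resolving via the headless
# service, with a hard cap of ~30s, instead of sleeping a fixed duration.
# POD_IP and HEADLESS_SERVICE_NAME are hypothetical env vars.
lifecycle:
  preStop:
    exec:
      command:
        - bash
        - -c
        - |
          for i in $(seq 1 30); do
            # list all IPs currently returned for the headless service name
            python -c "import socket; print('\n'.join(a[4][0] for a in socket.getaddrinfo('$HEADLESS_SERVICE_NAME', 9200)))" \
              | grep -qxF "$POD_IP" || exit 0
            sleep 1
          done
          # give up after ~30s and let the normal shutdown proceed
          exit 0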

@anyasabo
Contributor

We could add an env var to the Pods with the HTTP service name, so we can check that as well. I didn't realize we had Python installed in the containers and thought we might need to add dig or something to do dig $(SERVICE_NAME).$(NAMESPACE).svc +short | grep $(POD_IP), but it seems like reaching for Python makes sense here since we already have it.
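
Something along these lines could provide the values such a check needs (a sketch: HTTP_SERVICE_NAME is a hypothetical variable name the operator would have to set; POD_IP comes from the downward API):

# Sketch only: env vars a shutdown check could rely on.
env:
  - name: HTTP_SERVICE_NAME
    value: my-cluster-es-http        # hypothetical; the operator would set the real <cluster>-es-http name
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP      # downward API: the Pod's own IP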

@anyasabo
Contributor

Related ES issue discussing some of the SIGTERM behavior: elastic/elasticsearch#45236

and an earlier Helm issue:
elastic/helm-charts#190

Thanks @Crazybus for the references.

@pebrc
Collaborator

pebrc commented Nov 21, 2019

I believe this would also work, without resorting to python, to find out which IPs are currently in the service's DNS record:

getent hosts $SSET_NAME
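
The same wait loop as in the earlier sketch, with getent swapped in for python (again just an illustration; SSET_NAME and POD_IP are assumed to be available as env vars in the container):

# Sketch only: getent-based variant of the wait-for-DNS-removal loop.
lifecycle:
  preStop:
    exec:
      command:
        - bash
        - -c
        - |
          for i in $(seq 1 30); do
            # getent lists the IPs currently behind the headless service name
            getent hosts "$SSET_NAME" | grep -qF "$POD_IP" || exit 0
            sleep 1
          done
          exit 0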
