
Remove non-functional pre-stop hook parts #4801

Merged · 4 commits · Sep 1, 2021

Conversation

pebrc
@pebrc (Collaborator) commented on Aug 31, 2021:

Fixes #4698

The pre-stop hook script intended to improve availability during rolling upgrades has not been doing what it advertises since ECK 1.3.0.

We are including non-ready Pods in the headless service, which we then try to use to determine when a terminating Pod has been removed from DNS (which, in this constellation, never happens). We have since added functionality in #3837 that relies on this setting for client-side node discovery (aka sniffing).
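For context, the setting in question is presumably the Service's `publishNotReadyAddresses` flag. A minimal sketch of a headless Service with it enabled, using hypothetical names rather than ECK's generated manifest:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: quickstart-es-default      # hypothetical; ECK derives the real name
spec:
  clusterIP: None                  # headless: DNS resolves straight to Pod IPs
  # With this flag set (added for sniffing in #3837), non-ready and
  # terminating Pods remain in the DNS records, so a pre-stop check that
  # waits for the Pod IP to disappear from DNS can never succeed.
  publishNotReadyAddresses: true
  selector:
    elasticsearch.k8s.elastic.co/cluster-name: quickstart
```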

This means that our pre-stop hook was always running for the full 20+30 seconds before terminating the Pod. I see two potential fixes for this problem:

  1. Go back to not publishing unready Pods and revert "Allow automatic Elasticsearch nodes discovery" (#3837).
  2. Simplify the pre-stop hook and turn it into a simple timeout, without trying to check DNS for the Pod IP (this PR; see the sketch below).
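As a sketch of what option 2 amounts to, the hook reduces to a plain wait driven by the existing `PRE_STOP_ADDITIONAL_WAIT_SECONDS` variable. This is illustrative only; ECK wires the real hook in as a script file mounted into the Pod rather than an inline command:

```yaml
lifecycle:
  preStop:
    exec:
      # No DNS polling any more: just give in-flight requests and clients
      # a fixed window to drain before the container receives SIGTERM.
      command: ["bash", "-c", "sleep ${PRE_STOP_ADDITIONAL_WAIT_SECONDS:-50}"]
```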

I have opted for sticking with 50 seconds of wait time (but I am open to cutting this down).

NOTE: I also kept the terminationGracePeriod at 180s to avoid a rolling restart on ECK upgrade.

@pebrc added the >bug (Something isn't working) and v1.8.0 labels on Aug 31, 2021
@sebgl (Contributor) left a comment:

LGTM
I actually like the simplicity of this more than our initial approximation that relies on the (unrelated) StatefulSet headless service DNS.

@david-kow (Contributor) left a comment:

I'm also +1 on this change. Should we mark it as breaking, though? For users that set PRE_STOP_ADDITIONAL_WAIT_SECONDS to 0, this change will mean going from a somewhat safe 20-second wait to no waiting at all.
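For reference, here is how such a user override typically looks, sketched as a podTemplate environment variable on a nodeSet (cluster name and version are hypothetical):

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart                 # hypothetical
spec:
  version: 7.14.0                  # hypothetical
  nodeSets:
  - name: default
    count: 3
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          env:
          # Before this PR, "0" still left up to 20s of DNS polling; after
          # it, "0" means the Pod terminates with no pre-stop wait at all.
          - name: PRE_STOP_ADDITIONAL_WAIT_SECONDS
            value: "0"
```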

The docs diff under discussion:

- First, the PreStop lifecycle hook keeps querying DNS for `PRE_STOP_MAX_WAIT_SECONDS` (defaulting to 20) until the Pod IP is no longer referenced.
- Then, it waits for an additional `PRE_STOP_ADDITIONAL_WAIT_SECONDS` (defaulting to 30). Additional wait is used to:
+ To address this issue and minimize unavailability, ECK relies on a link:https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/[PreStop lifecycle hook].
+ It waits for an additional `PRE_STOP_ADDITIONAL_WAIT_SECONDS` (defaulting to 50). The additional wait time is used to:

  1. Give time to in-flight requests to be completed.
  2. Give clients time to use the terminating Pod IP resolved just before DNS record was updated.
A contributor left a comment:

Line 22 should say "environment variable" instead of "environment variableS":

> The exact behavior is configurable using environment variable, for example:

@pebrc added the >breaking label on Sep 1, 2021
@barkbay (Contributor) left a comment:

lgtm

Labels: >breaking, >bug (Something isn't working), v1.8.0

Linked issue: PreStop hook logic seems incorrect (#4698)

4 participants