
Remove non-functional pre-stop hook parts #4801

Merged · 4 commits · Sep 1, 2021

Conversation

pebrc
@pebrc (Collaborator) commented on Aug 31, 2021:

Fixes #4698

The pre-stop hook script intended to improve availability during rolling upgrades has not been doing what it advertises since ECK 1.3.0.

We are including non-ready Pods in the headless service, which we then try to use to determine when a terminating Pod has been removed from DNS (which, in this constellation, never happens). We have since added functionality in #3837 that relies on this setting for client-side node discovery (aka sniffing).
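For context, the setting in question is presumably the Service's `publishNotReadyAddresses` flag. A minimal sketch of a headless Service with it enabled, using hypothetical names rather than ECK's generated manifest:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: quickstart-es-default      # hypothetical; ECK derives the real name
spec:
  clusterIP: None                  # headless: DNS resolves straight to Pod IPs
  # With this flag set (added for sniffing in #3837), non-ready and
  # terminating Pods remain in the DNS records, so a pre-stop check that
  # waits for the Pod IP to disappear from DNS can never succeed.
  publishNotReadyAddresses: true
  selector:
    elasticsearch.k8s.elastic.co/cluster-name: quickstart
```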

This means that our pre-stop hook was always running for the full 20+30 seconds before terminating the Pod. I see two potential fixes for this problem:

  1. Go back to not publishing unready Pods and revert "Allow automatic Elasticsearch nodes discovery" (#3837).
  2. Simplify the pre-stop hook and turn it into a simple timeout, without trying to check DNS for the Pod IP (this PR; see the sketch below).
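As a sketch of what option 2 amounts to, the hook reduces to a plain wait driven by the existing `PRE_STOP_ADDITIONAL_WAIT_SECONDS` variable. This is illustrative only; ECK wires the real hook in as a script file mounted into the Pod rather than an inline command:

```yaml
lifecycle:
  preStop:
    exec:
      # No DNS polling any more: just give in-flight requests and clients
      # a fixed window to drain before the container receives SIGTERM.
      command: ["bash", "-c", "sleep ${PRE_STOP_ADDITIONAL_WAIT_SECONDS:-50}"]
```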

I have opted for sticking with 50 seconds of wait time (but I am open to cutting this down).

NOTE: I also kept the terminationGracePeriod at 180s to avoid a rolling restart on ECK upgrade.

@pebrc added the >bug (Something isn't working) and v1.8.0 labels on Aug 31, 2021
@sebgl (Contributor) left a comment:

LGTM
I actually like the simplicity of this more than our initial approximation that relies on the (unrelated) StatefulSet headless service DNS.

@david-kow (Contributor) left a comment:

I'm also +1 on this change. Should we mark it as breaking, though? For users that set PRE_STOP_ADDITIONAL_WAIT_SECONDS to 0, this change will mean going from a somewhat safe 20-second wait to no waiting at all.
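For reference, here is how such a user override typically looks, sketched as a podTemplate environment variable on a nodeSet (cluster name and version are hypothetical):

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart                 # hypothetical
spec:
  version: 7.14.0                  # hypothetical
  nodeSets:
  - name: default
    count: 3
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          env:
          # Before this PR, "0" still left up to 20s of DNS polling; after
          # it, "0" means the Pod terminates with no pre-stop wait at all.
          - name: PRE_STOP_ADDITIONAL_WAIT_SECONDS
            value: "0"
```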

The docs diff under discussion:

- First, the PreStop lifecycle hook keeps querying DNS for `PRE_STOP_MAX_WAIT_SECONDS` (defaulting to 20) until the Pod IP is no longer referenced.
- Then, it waits for an additional `PRE_STOP_ADDITIONAL_WAIT_SECONDS` (defaulting to 30). Additional wait is used to:
+ To address this issue and minimize unavailability, ECK relies on a link:https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/[PreStop lifecycle hook].
+ It waits for an additional `PRE_STOP_ADDITIONAL_WAIT_SECONDS` (defaulting to 50). The additional wait time is used to:

  1. Give time to in-flight requests to be completed.
  2. Give clients time to use the terminating Pod IP resolved just before DNS record was updated.
A contributor left a comment:

Line 22 should say "environment variable" instead of "environment variableS":

> The exact behavior is configurable using environment variable, for example:

@pebrc added the >breaking label on Sep 1, 2021
@barkbay (Contributor) left a comment:

lgtm

Labels: >breaking, >bug (Something isn't working), v1.8.0

Linked issue: PreStop hook logic seems incorrect (#4698)

4 participants