Fix the slow startup and termination of the vali-0 pod #7979
Conversation
The vali-0 pod takes more than a minute to start up and 30s to terminate eventually, "ungracefully". It turns out that this is the result of a few configuration issues.

<details>
<summary>Startup phase</summary>

When vali starts up in the default configuration, it waits for 1 minute until it reports a ready status via its `/ready` endpoint. This is reproducible directly with the docker image:

```text
docker run --rm -it -p 3100:3100 ghcr.io/credativ/vali:v2.2.5

while true; do curl localhost:3100/ready; sleep 1; done
Ingester not ready: waiting for 1m0s after startup
...
```

> This is to work around race conditions with ingesters exiting and updating the
> ring.

https://github.com/credativ/vali/blob/d8985fc135e85518a5829b255afa7757d4714349/vendor/github.com/cortexproject/cortex/pkg/ring/lifecycler.go#L89

The reasoning behind this default value is not applicable in the Gardener setup of vali. Vali is executed in "single binary mode", so there are no other ingester instances involved. Hence we can set this delay to a lower value.

```text
docker run --rm --entrypoint sh ghcr.io/credativ/vali:v2.2.5 -c "cat /etc/vali/local-config.yaml" | sed '/lifecycler:/a \ min_ready_duration: 1s' > local-config.yaml

docker run --rm -it -p 3100:3100 -v ./local-config.yaml:/etc/vali/local-config.yaml ghcr.io/credativ/vali:v2.2.5

while true; do curl localhost:3100/ready; sleep 1; done
Ingester not ready: waiting for 1s after startup
ready
...
```

Furthermore, the `initialDelaySeconds` setting of the readiness probe of vali is removed to let it default to 0s: even instances with a significant amount of logs actually start up in less than 1s. There is no need to delay the initial readiness probe.

</details>

<details>
<summary>Termination phase</summary>

The `final-sleep` configuration parameter is changed to 0s from its default value of 30s. Because of the equally long default pod termination grace period of 30s, the old value meant that whenever vali was stopped gracefully (received a SIGTERM signal), it would have needed 30(+epsilon)s to shut down cleanly. However, shortly before that (after 30s), the kubelet sends a SIGKILL to forcefully stop the process: an undesired unclean shutdown, each and every time.

The motivation of allowing for a final scrape of the exposed metrics before the shutdown is not relevant in the Gardener case, given that the Prometheus scrape interval is 1m anyway. So this commit changes the `final-sleep` configuration to 0s, which happens to be the default value in the "single binary mode".

https://github.com/credativ/vali/blob/d8985fc135e85518a5829b255afa7757d4714349/vendor/github.com/cortexproject/cortex/pkg/ring/lifecycler.go#L90
https://github.com/credativ/vali/blob/d8985fc135e85518a5829b255afa7757d4714349/cmd/vali/vali-local-config.yaml#L17

Furthermore, the telegraf container of the vali-0 pod in the control plane namespace was not properly handling the SIGTERM signal: it was always killed by the kubelet after 30s with a SIGKILL. This was the other reason for the slow termination of the vali-0 pod. By letting bash execute the telegraf process in the background and waiting for it indefinitely with `wait`, the bash process with PID 1 can handle the SIGTERM signal sent by the kubelet: upon receiving it, it sends a SIGTERM signal to its child bash process. This in turn handles the SIGTERM signal as well and sends it to its telegraf child process. Both bash processes wait for their child process to terminate.
This way all the processes of the telegraf container terminate gracefully, in less than 1s, when the SIGTERM signal is sent by the kubelet.

Note that testing the signal propagation in the "minimized" telegraf image is tricky. On a Linux-based "local setup", where the "nested" docker&containerd containers are just regular processes of the host, the following approach can help to understand the signal propagation.

Find the 3 processes of the telegraf container from the _host_:

```text
ps aux | grep telegraf
root     2483208  /bin/bash -c trap 'kill %1; wait' SIGTERM bash /etc/telegraf/start.sh & wait
root     2483255  bash /etc/telegraf/start.sh
root     2483260  /usr/bin/telegraf --config /etc/telegraf/telegraf.conf
root     2494748  grep telegraf
```

Attach strace to the 3 processes with `strace -p <pid>`. Send a SIGTERM signal to PID 1 with `k exec vali-0 -c telegraf -- bash -c "kill 1"` to check the signal propagation.

</details>

<details>
<summary>With this change, the vali-0 pod starts up and terminates in less than 10 seconds in the examples below.</summary>

```text
k get pods vali-0 -w | ts
17:25:09 vali-0   4/4   Terminating
17:25:11 vali-0   4/4   Terminating
17:25:11 vali-0   0/4   Terminating
17:25:11 vali-0   0/4   Terminating
17:25:11 vali-0   0/4   Terminating
17:25:11 vali-0   0/4   Pending
17:25:11 vali-0   0/4   Pending
17:25:11 vali-0   0/4   Init:0/2
17:25:12 vali-0   0/4   Init:0/2
17:25:13 vali-0   0/4   Init:1/2
17:25:14 vali-0   0/4   PodInitializing
17:25:15 vali-0   3/4   Running
17:25:16 vali-0   4/4   Running

17:27:38 vali-0   2/2   Terminating
17:27:41 vali-0   2/2   Terminating
17:27:42 vali-0   0/2   Terminating
17:27:42 vali-0   0/2   Terminating
17:27:42 vali-0   0/2   Terminating
17:27:42 vali-0   0/2   Pending
17:27:42 vali-0   0/2   Pending
17:27:42 vali-0   0/2   Init:0/2
17:27:42 vali-0   0/2   Init:0/2
17:27:43 vali-0   0/2   Init:1/2
17:27:44 vali-0   0/2   PodInitializing
17:27:45 vali-0   1/2   Running
17:27:46 vali-0   2/2   Running
```

</details>

Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
/cherry-pick release-v1.71
@shafeeqes: once the present PR merges, I will cherry-pick it on top of release-v1.71 in a new PR and assign it to you. In response to this:

> /cherry-pick release-v1.71
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test pull-gardener-e2e-kind-ha-single-zone-upgrade
/lgtm
LGTM label has been added. Git tree hash: b6db37070452457ac9c4a970168949cd0ed344cd
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shafeeqes

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@shafeeqes: new pull request created: #7982. In response to this:

> /cherry-pick release-v1.71
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
How to categorize this PR?
/area logging
/kind enhancement
What this PR does / why we need it:
The vali-0 pod takes more than a minute to start up and 30s to terminate eventually, "ungracefully". It turns out that this is the result of a few configuration issues.
Startup phase
When vali starts up in the default configuration, it waits for 1 minute until it reports a ready status via its `/ready` endpoint. The one-minute delay is the ingester lifecycler's `min_ready_duration` default:
https://github.com/credativ/vali/blob/d8985fc135e85518a5829b255afa7757d4714349/vendor/github.com/cortexproject/cortex/pkg/ring/lifecycler.go#L89

This is reproducible directly with the docker image:
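```text
docker run --rm -it -p 3100:3100 ghcr.io/credativ/vali:v2.2.5

while true; do curl localhost:3100/ready; sleep 1; done
Ingester not ready: waiting for 1m0s after startup
...
```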
The reasoning behind this default value is not applicable in the Gardener setup of vali. Vali is executed in "single binary mode", so there are no other ingester instances involved. Hence we can set this delay to a lower value.
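A minimal sketch of the corresponding configuration change, assuming the upstream lifecycler key layout (the exact config file used by the Gardener chart is not shown here):

```text
# illustrative excerpt of the vali configuration (key layout as in the upstream lifecycler config)
ingester:
  lifecycler:
    min_ready_duration: 1s  # default is 1m; safe to lower in "single binary mode"
```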
Furthermore, the `initialDelaySeconds` setting of the readiness probe of vali is removed to let it default to 0s: even instances with a significant amount of logs actually start up in less than 1s, so there is no need to delay the initial readiness probe.
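For illustration, the readiness probe then relies on the Kubernetes default of 0s; the probe shape below is an assumption based on the `/ready` endpoint and port 3100 used above, not a verbatim excerpt of the chart:

```text
readinessProbe:
  httpGet:
    path: /ready
    port: 3100
  # no initialDelaySeconds: it defaults to 0s, and vali reports ready in well under 1s
```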
Termination phase

The `final-sleep` configuration parameter is changed to 0s from its default value of 30s. Because of the equally long default pod termination grace period of 30s, the old value meant that whenever vali was stopped gracefully (received a SIGTERM signal), it would have needed 30(+epsilon)s to shut down cleanly. However, shortly before that (after 30s), the kubelet sends a SIGKILL to forcefully stop the process: an undesired unclean shutdown, each and every time.

The motivation of allowing for a final scrape of the exposed metrics before the shutdown is not relevant in the Gardener case, given that the Prometheus scrape interval is 1m anyway. So this commit changes the `final-sleep` configuration to 0s, which happens to be the default value in the "single binary mode":
https://github.com/credativ/vali/blob/d8985fc135e85518a5829b255afa7757d4714349/vendor/github.com/cortexproject/cortex/pkg/ring/lifecycler.go#L90
https://github.com/credativ/vali/blob/d8985fc135e85518a5829b255afa7757d4714349/cmd/vali/vali-local-config.yaml#L17
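A minimal sketch of the corresponding setting, again assuming the upstream lifecycler key layout (illustrative, not the exact chart file):

```text
ingester:
  lifecycler:
    final_sleep: 0s  # default is 30s; 0s lets vali finish shutting down before the kubelet's SIGKILL
```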
Furthermore, the telegraf container of the vali-0 pod in the control plane namespace was not properly handling the SIGTERM signal: it was always killed by the kubelet after 30s with a SIGKILL. This was the other reason for the slow termination of the vali-0 pod. By letting bash execute the telegraf process in the background and waiting for it indefinitely with `wait`, the bash process with PID 1 can handle the SIGTERM signal sent by the kubelet: upon receiving it, it sends a SIGTERM signal to its child bash process. This in turn handles the SIGTERM signal as well and sends it to its telegraf child process. Both bash processes wait for their child process to terminate. This way all the processes of the telegraf container terminate gracefully, in less than 1s, when the SIGTERM signal is sent by the kubelet.
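The resulting container command looks roughly like this (reconstructed from the process list below; the contents of `start.sh` are an illustrative assumption that follows the description above):

```text
# PID 1: bash traps SIGTERM, forwards it to the start.sh job and waits for it to exit
/bin/bash -c "
trap 'kill %1; wait' SIGTERM
bash /etc/telegraf/start.sh &
wait"

# /etc/telegraf/start.sh (illustrative): the same pattern one level down
trap 'kill %1; wait' SIGTERM
/usr/bin/telegraf --config /etc/telegraf/telegraf.conf &
wait
```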
Note that testing the signal propagation in the "minimized" telegraf image is tricky. On a Linux-based "local setup", where the "nested" docker&containerd containers are just regular processes of the host, the following approach can help to understand the signal propagation.
Find the 3 processes of the telegraf container from the host:
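```text
ps aux | grep telegraf
root     2483208  /bin/bash -c trap 'kill %1; wait' SIGTERM bash /etc/telegraf/start.sh & wait
root     2483255  bash /etc/telegraf/start.sh
root     2483260  /usr/bin/telegraf --config /etc/telegraf/telegraf.conf
root     2494748  grep telegraf
```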
Attach strace to the 3 processes with `strace -p <pid>`. Send a SIGTERM signal to PID 1 with `k exec vali-0 -c telegraf -- bash -c "kill 1"` to check the signal propagation.

With this change, the vali-0 pod terminates and starts up in less than 10 seconds in the examples below.
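```text
k get pods vali-0 -w | ts
17:25:09 vali-0   4/4   Terminating
17:25:11 vali-0   4/4   Terminating
17:25:11 vali-0   0/4   Terminating
17:25:11 vali-0   0/4   Terminating
17:25:11 vali-0   0/4   Terminating
17:25:11 vali-0   0/4   Pending
17:25:11 vali-0   0/4   Pending
17:25:11 vali-0   0/4   Init:0/2
17:25:12 vali-0   0/4   Init:0/2
17:25:13 vali-0   0/4   Init:1/2
17:25:14 vali-0   0/4   PodInitializing
17:25:15 vali-0   3/4   Running
17:25:16 vali-0   4/4   Running

17:27:38 vali-0   2/2   Terminating
17:27:41 vali-0   2/2   Terminating
17:27:42 vali-0   0/2   Terminating
17:27:42 vali-0   0/2   Terminating
17:27:42 vali-0   0/2   Terminating
17:27:42 vali-0   0/2   Pending
17:27:42 vali-0   0/2   Pending
17:27:42 vali-0   0/2   Init:0/2
17:27:42 vali-0   0/2   Init:0/2
17:27:43 vali-0   0/2   Init:1/2
17:27:44 vali-0   0/2   PodInitializing
17:27:45 vali-0   1/2   Running
17:27:46 vali-0   2/2   Running
```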
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
/cc @vlvasilev @nickytd
Release note: