Fix runners to do their best to gracefully stop on pod eviction #1759

mumoshu · 2022-08-27T08:25:57Z

Please see the updated README for more details.

~~Perhaps this may already work with RunnerDeployment-managed rootless-dind runners too. I'll give it a shot and update the PR title and README if it works.~~ Worked. This works for rootless-dind pods managed by either RunnerDeployment or RunnerSet.

UPDATE: Added support for rootful dind runners too.
UPDATE: Added support for runners with dind sidecars.

Since this change, the runner which received SIGTERM while still running a job will show log like the below:

√ Connected to GitHub

2022-08-28 05:18:11.927  NOTICE --- Executing actions-runner-controller's SIGTERM handler.
2022-08-28 05:18:11.928  NOTICE --- Note that if this takes more time than terminationGracePeriodSeconds, the runner will be forcefully terminated by Kubernetes, which may result in the in-progress workflow job, if any, to fail.
2022-08-28 05:18:11.930  NOTICE --- Ensuring dockerd is still running.
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

# Runner removal

Current runner version: '2.296.0'
2022-08-28 05:18:12Z: Listening for Jobs
Failed: Removing runner from the server
Runner "org-runnerdeploy-tnshq-2k5p2" is still running a job"
2022-08-28 05:18:14.788  NOTICE --- Waiting for RUNNER_GRACEFUL_STOP_TIMEOUT=100 seconds until the runner agent to stop by itself.
2022-08-28 05:18:15Z: Running job: test2
2022-08-28 05:18:55Z: Job test2 completed with result: Succeeded
√ Removed .credentials
√ Removed .runner
Runner listener exit with 0 return code, stop the service, no retry needed.
Exiting runner...
2022-08-28 05:18:57.502  NOTICE --- The runner agent stopped before RUNNER_GRACEFUL_STOP_TIMEOUT=100
2022-08-28 05:18:57.506  NOTICE --- The actions runner process exited.
2022-08-28 05:18:57.508  NOTICE --- Holding on until runner init (pid 16) exits, so that there will hopefully be no zombie processes remaining.
2022-08-28 05:18:57.510  NOTICE --- Graceful stop completed.
2022-08-28 05:18:57.512  NOTICE --- Runner init exited. Exiting this process with code 0 so that the container and the pod is GC'ed Kubernetes soon.

NOTICE messages are from our new graceful termination handler and other messages are from actions/runner's Runner.Listener(main agent) or config.sh remove.

As you could see, this change enables you to configure RUNNER_GRACEFUL_STOP_TIMEOUT.

Configure the timeout along with the existing terminationGracePeriodSeconds, so that the new graceful termination handler does two things for you:

It does its best not to disrupt the running workflow job, and
If it's impossible to finish the job before the termationGracePeriodSeconds elapses, do its best not to leave the job hanging for 10 minutes in the GitHub Actions Workflow Runs/Jobs UI

I had to change the entrypoint of the runner images from dumb-init to bash along the way.

If not done correctly, it might have resulted in leaving zombie processes in the container or the host (I think) in case either dockerd or actions/runner had some bugs. However, I've modified the entrypoint scripts to do internally use dumb-init for running dockerd and actions/runner.

The process tree after this change now looks like the ones below.

For runner with dind sidecar:

$ ps auxwf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
runner       216  0.0  0.0   2608   596 pts/0    Ss   00:52   0:00 sh
runner       222  0.0  0.0   5892  2904 pts/0    R+   00:52   0:00  \_ ps auxwf
runner         1  0.0  0.0   3976  3004 ?        Ss   00:38   0:00 /bin/bash /usr/bin/actions-runner.sh
runner        13  0.0  0.0    216     4 ?        S    00:38   0:00 dumb-init bash
runner        16  0.0  0.0   3976  2944 ?        Ss   00:38   0:00  \_ bash
runner        17  0.0  0.0   3976  3108 ?        S    00:38   0:00      \_ /bin/bash ./run.sh
runner       125  0.0  0.0   3976  3016 ?        S    00:38   0:00          \_ /bin/bash /runner/run-helper.sh
runner       129  0.2  0.2 3699168 87076 ?       Sl   00:38   0:01              \_ /runner/bin/Runner.Listener run

For rootful dind runner:

runner@org-runnerdeploy-6dlsh-8dd56:/$ ps auxwwf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
runner       307  0.0  0.0   4240  3532 pts/0    Ss   05:27   0:00 bash
runner       314  0.0  0.0   5892  2904 pts/0    R+   05:27   0:00  \_ ps auxwwf
runner         1  0.0  0.0   3976  3092 ?        Ss   05:24   0:00 /bin/bash /usr/bin/startup.sh
runner        15  0.0  0.0    216     4 ?        S    05:24   0:00 dumb-init bash
runner        17  0.0  0.0   3976  3084 ?        Ss   05:24   0:00  \_ bash
root          23  0.0  0.0   6276  4200 ?        S    05:24   0:00      \_ sudo /usr/bin/supervisord -n
root          27  0.0  0.0  28448 23444 ?        S    05:24   0:00      |   \_ /usr/bin/python3 /usr/bin/supervisord -n
root          40  0.1  0.1 2014364 54036 ?       Sl   05:24   0:00      |       \_ /usr/local/bin/dockerd
root          55  0.0  0.0 2066508 29460 ?       Ssl  05:24   0:00      |           \_ containerd --config /var/run/docker/containerd/containerd.toml --log-level info
runner       158  0.0  0.0   3976  2968 ?        S    05:24   0:00      \_ /bin/bash ./run.sh
runner       262  0.0  0.0   3976  3104 ?        S    05:24   0:00          \_ /bin/bash /runner/run-helper.sh
runner       266  0.8  0.2 3699352 84988 ?       Sl   05:24   0:01              \_ /runner/bin/Runner.Listener run

For rootless dind runner:

runner@org-runnerdeploy-gpn9b-tx7wl:/$ ps auxwf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
runner       314  0.3  0.0   4240  3504 pts/0    Ss   05:36   0:00 bash
runner       322  0.0  0.0   5892  2904 pts/0    R+   05:36   0:00  \_ ps auxwf
runner         1  0.0  0.0   3976  3180 ?        Ss   05:34   0:00 /bin/bash /usr/bin/rootless-startup.sh
runner        19  0.0  0.0    216     4 ?        S    05:34   0:00 dumb-init bash
runner        22  0.0  0.0   3976  2936 ?        Ss   05:34   0:00  \_ bash
runner        23  0.0  0.0 709732  7012 ?        Sl   05:34   0:00      \_ rootlesskit --net=vpnkit --mtu=1500 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host-loopba
runner        33  0.0  0.0 709732  7140 ?        Sl   05:34   0:00      |   \_ /proc/self/exe --net=vpnkit --mtu=1500 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host
runner        63  0.1  0.1 1493308 54168 ?       Sl   05:34   0:00      |   |   \_ dockerd --config-file /home/runner/.config/docker/daemon.json
runner        86  0.1  0.1 1482280 35032 ?       Ssl  05:34   0:00      |   |       \_ containerd --config /run/user/1000/docker/containerd/containerd.toml --log-level info
runner        46  0.0  0.0  66428 24372 ?        Sl   05:34   0:00      |   \_ vpnkit --ethernet /tmp/rootlesskit1191930478/vpnkit-ethernet.sock --mtu 1500 --host-ip 0.0.0.0
runner        24  0.0  0.0   3976  3064 ?        S    05:34   0:00      \_ /bin/bash ./run.sh
runner       280  0.0  0.0   3976  3056 ?        S    05:34   0:00          \_ /bin/bash /runner/run-helper.sh
runner       284  1.8  0.2 3699340 86704 ?       Sl   05:34   0:01              \_ /runner/bin/Runner.Listener run

As long as our entrypoint (The new actions-runner.sh, the modified startup.sh and rootless-startup.sh respectively) sends SIGTERM to dumb-init, it should reap any remaining zombie processes (if any). The updated entrypoint scripts do exactly that and therefore you should encounter no zombie processes issues even if our init (pid 1) is no longer dumb-init.

Note also that runner with dind sidecar can still fail to stop gracefully due to that the dind sidecar has no feature to wait until its client(=actions/runner agent) stops first. You'll see messages like the ones seen in the below log when this happens. The fix for this should be made in another pull request. I've already addressed it in this PR.

org-runnerdeploy-rw4pd-7vw24 runner √ Connected to GitHub
org-runnerdeploy-rw4pd-7vw24 runner 
org-runnerdeploy-rw4pd-7vw24 runner Current runner version: '2.296.2'
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:45:30Z: Listening for Jobs
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:45:42Z: Running job: test0
org-runnerdeploy-rw4pd-7vw24 docker time="2022-09-26T00:45:45.357093068Z" level=info msg="Processing signal 'terminated'"
org-runnerdeploy-rw4pd-7vw24 docker time="2022-09-26T00:45:45.358539782Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
org-runnerdeploy-rw4pd-7vw24 docker time="2022-09-26T00:45:45.359930049Z" level=info msg="Daemon shutdown complete"
org-runnerdeploy-rw4pd-7vw24 docker time="2022-09-26T00:45:45.359983931Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd
org-runnerdeploy-rw4pd-7vw24 docker time="2022-09-26T00:45:45.360236297Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:45:45.363  NOTICE --- Executing actions-runner-controller's SIGTERM handler.
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:45:45.367  NOTICE --- Note that if this takes more time than terminationGracePeriodSeconds, the runner will be forcefully terminated by Kubernetes, which may result in the in-progress workflow job, if any, to fail.
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:45:45.382  NOTICE --- Ensuring dockerd is still running.
org-runnerdeploy-rw4pd-7vw24 runner Cannot connect to the Docker daemon at tcp://localhost:2376. Is the docker daemon running?
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:45:45.439  WARNING --- Detected configuration error: dockerd should be running but is already nowhere. This is wrong. Ensure that your init system to NOT pass SIGTERM directly to dockerd!
org-runnerdeploy-rw4pd-7vw24 runner /runner /
org-runnerdeploy-rw4pd-7vw24 runner 
org-runnerdeploy-rw4pd-7vw24 runner # Runner removal
org-runnerdeploy-rw4pd-7vw24 runner 
org-runnerdeploy-rw4pd-7vw24 docker time="2022-09-26T00:45:46.375563741Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///var/run/docker/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix:///var/run/docker/containerd/containerd.sock: timeout\". Reconnecting..." module=grpc
org-runnerdeploy-rw4pd-7vw24 runner Failed: Removing runner from the server
org-runnerdeploy-rw4pd-7vw24 runner Runner "org-runnerdeploy-rw4pd-7vw24" is still running a job"
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:45:49.385  NOTICE --- Waiting for RUNNER_GRACEFUL_STOP_TIMEOUT=1780 seconds until the runner agent to stop by itself.
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:46:54Z: Job test0 completed with result: Failed
org-runnerdeploy-rw4pd-7vw24 runner √ Removed .credentials
org-runnerdeploy-rw4pd-7vw24 runner √ Removed .runner
org-runnerdeploy-rw4pd-7vw24 runner Runner listener exit with 0 return code, stop the service, no retry needed.
org-runnerdeploy-rw4pd-7vw24 runner Exiting runner...
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:46:55.547  NOTICE --- The runner agent stopped before RUNNER_GRACEFUL_STOP_TIMEOUT=1780
org-runnerdeploy-rw4pd-7vw24 runner /
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:46:55.575  NOTICE --- The actions runner process exited.
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:46:55.592  NOTICE --- Holding on until runner init (pid 13) exits, so that there will hopefully be no zombie processes remaining.
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:46:55.604  NOTICE --- Graceful stop completed.
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:46:55.617  NOTICE --- Runner init exited. Exiting this process with code 0 so that the container and the pod is GC'ed Kubernetes soon.

Ref #1535
Ref #1581

TingluoHuang · 2022-08-29T03:29:11Z

runner/graceful-stop.bash

+  #   Cannot connect to server, because config files are missing. Skipping removing runner from the server.
+  #   Does not exist. Skipping Removing .credentials
+  #   Does not exist. Skipping Removing .runner
+  cd /runner


do you want to use pushd /runner? so you can popd to your original pwd after the ./config remove

Absolutely! Addressed in f5610db

TingluoHuang · 2022-08-29T03:35:05Z

runner/actions-runner.sh

+#!/bin/bash
+source logger.bash
+source graceful-stop.bash
+


do we want to move trap graceful_stop TERM in here to make it more readable since it looks like actopns-runner.sh is the new main entrypoint script?
So, we can add trap - TERM at the end of the file to remove the trap?
You can ignore me, not a big deal. 😄

Sounds much cleaner! Let me try. Thanks for your feedback 😄

Addressed in a666a2f

toast-gear

Might be nice to add shellcheck linting into the pipeline as part of this change seen as we doing major surgery

runner/wait.bash

I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort to a major enhancement to the runner entrypoints being made in #1759. This commit adds the makefile target to run shellformat locally, so that any contributor can use it before submitting a pull request.

I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort to a major enhancement to the runner entrypoints being made in #1759. This change is a counterpart of #1852. #1852 enables you to easily run shellcheck locally. This enables you to automatically run shellcheck on every pull request. Currently, any shellcheck error does not result in failing the workflow job. Once we addressed all the shellcheck findings, we can flip the fail_on_error option to true and let jobs start failing on pull requests that introduces invalid or suspicious bash code.

@CASABECI

I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort for a major enhancement to the runner entrypoints being made in #1759. This change updates and reintroduces #1517 contributed by @CASABECI in a way it becomes applicable to today's code-base.

I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort for a major enhancement to the runner entrypoints being made in #1759. This commit adds the makefile target to run shellformat locally, so that any contributor can use it before submitting a pull request.

I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort to a major enhancement to the runner entrypoints being made in #1759. This change is a counterpart of #1852. #1852 enables you to easily run shellcheck locally. This enables you to automatically run shellcheck on every pull request. Currently, any shellcheck error does not result in failing the workflow job. Once we addressed all the shellcheck findings, we can flip the fail_on_error option to true and let jobs start failing on pull requests that introduces invalid or suspicious bash code.

…eriodSeconds and RUNNER_GRACEFUL_STOP_TIMEOUT when runner pod was terminated externally too early after its creation While I was running E2E tests for #1759, I discovered a potential issue that ARC can terminate runner pods without waiting for the registration timouet of 10 minutes. You won't be affected by this in normal circumstances, as this failure scenario can be triggered only when you or another K8s controller like cluster-autoscaler deleted the runner or the runner pod immediately after the runner or the runner pod has been created. But probably is it worth fixing it anyway because it's not impossible to trigger it?

…ockerd start

runner/entrypoint.sh

toast-gear

LGTM

mumoshu · 2022-11-01T11:29:11Z

Thanks for reviewing!

dongho-jung · 2022-11-02T02:28:32Z

Wow I was gonna open a discussion for this because I've been suffering from this since 2 weeks ago and wow it just merged 15 hours ago what a timing 👍

This fixes an issue discovered while I was testing #1759. Please see the new comment in code for more information.

While testing #1759, I found an issue in the rootless dind entrypoint that it was not respecting the configured MTU for dind docker due to a permission issue. This fixes that.

This fixes the said issue I have found while testing #1759.

While testing #1759, I found an issue in the rootless dind entrypoint that it was not respecting the configured MTU for dind docker due to a permission issue. This fixes that.

…nt (#1975) This fixes an issue discovered while I was testing #1759. Please see the new comment in code for more information.

* Fix permission issue when you use PV for rootless dind cache This fixes the said issue I have found while testing #1759. Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>

…decar prestop hook The runner graceful stop timeout has never been propagated to the dind sidecar due to configuration error in E2E. This fixes it, so that we can verify that the dind sidecar prestop can respect the graceful stop timeout. Related to #1759

…decar prestop hook (#2044) The runner graceful stop timeout has never been propagated to the dind sidecar due to configuration error in E2E. This fixes it, so that we can verify that the dind sidecar prestop can respect the graceful stop timeout. Related to #1759

mumoshu requested a review from toast-gear as a code owner August 27, 2022 08:25

mumoshu changed the title ~~Fix RunnerSet-managed rootless-dind runners to gracefully stop on pod eviction~~ Fix rootless-dind runners to gracefully stop on pod eviction Aug 28, 2022

mumoshu force-pushed the graceful-stop-on-rootless-dind-runner-eviction branch 2 times, most recently from dcfb4f2 to 94b6d89 Compare August 28, 2022 02:55

mumoshu changed the title ~~Fix rootless-dind runners to gracefully stop on pod eviction~~ Fix rootless/rootful dind runners to gracefully stop on pod eviction Aug 28, 2022

mumoshu force-pushed the graceful-stop-on-rootless-dind-runner-eviction branch 2 times, most recently from e2006d4 to 10cb56d Compare August 28, 2022 06:16

mumoshu changed the title ~~Fix rootless/rootful dind runners to gracefully stop on pod eviction~~ Fix runners to gracefully stop on pod eviction Aug 29, 2022

mumoshu force-pushed the graceful-stop-on-rootless-dind-runner-eviction branch from c82c99c to 214f5c6 Compare August 29, 2022 01:02

This was referenced Aug 29, 2022

Better behavior with Cluster Autoscaler #1581

Closed

Pods created by RunnerSet are terminated while still running jobs on a rolling update #1535

Open

TingluoHuang reviewed Aug 29, 2022

View reviewed changes

TingluoHuang previously approved these changes Aug 29, 2022

View reviewed changes

toast-gear requested changes Aug 30, 2022

View reviewed changes

toast-gear reviewed Aug 30, 2022

View reviewed changes

runner/wait.bash Outdated Show resolved Hide resolved

mumoshu dismissed TingluoHuang’s stale review via f5610db September 21, 2022 02:36

mumoshu force-pushed the graceful-stop-on-rootless-dind-runner-eviction branch from 214f5c6 to f5610db Compare September 21, 2022 02:36

mumoshu mentioned this pull request Sep 25, 2022

makefile: Add shellcheck installation and run targets #1852

Merged

mumoshu mentioned this pull request Sep 25, 2022

Add workflow for validating runner scripts with shellcheck #1853

Merged

mumoshu changed the title ~~Fix runners to gracefully stop on pod eviction~~ Fix runners to do their best to gracefully stop on pod eviction Sep 25, 2022

mumoshu mentioned this pull request Sep 25, 2022

runner: Address all shellcheck findings #1854

Merged

mumoshu mentioned this pull request Sep 25, 2022

Prevent runner controller from recreating runner pod when terminated externally #1851

Merged

mumoshu mentioned this pull request Sep 25, 2022

Handle missing runner ID more gracefully #1855

Merged

toast-gear previously approved these changes Oct 24, 2022

View reviewed changes

mumoshu added 2 commits October 25, 2022 02:10

Fix dind sidecar to not hang in terminating

bfc876e

Fix dind sidecar to not hang in terminating when pod deleted before d…

9c03004

…ockerd start

mumoshu dismissed toast-gear’s stale review via 9c03004 October 25, 2022 02:11

Fix unit test cases for NewRunnerPod

4b1556a

toast-gear reviewed Oct 31, 2022

View reviewed changes

runner/entrypoint.sh Outdated Show resolved Hide resolved

runner: Add -f to rm on /runner/.runner

b42d884

toast-gear approved these changes Nov 1, 2022

View reviewed changes

mumoshu merged commit c74ad61 into master Nov 1, 2022

mumoshu added a commit that referenced this pull request Nov 3, 2022

Fix dead-lock when runner unregistration triggered before PV attachment

c7cc4df

This fixes an issue discovered while I was testing #1759. Please see the new comment in code for more information.

mumoshu mentioned this pull request Nov 3, 2022

Fix dead-lock when runner unregistration triggered before PV attachment #1975

Merged

mumoshu mentioned this pull request Nov 3, 2022

runner: Fix rootless dind to respect specified MTU #1976

Merged

mumoshu added a commit that referenced this pull request Nov 3, 2022

Fix permission issue when you use PV for rootless dind cache

fc99439

This fixes the said issue I have found while testing #1759.

mumoshu mentioned this pull request Nov 3, 2022

Fix permission issue when you use PV for rootless dind cache #1977

Merged

mumoshu added a commit that referenced this pull request Nov 3, 2022

Fix dead-lock when runner unregistration triggered before PV attachme…

23c8fe4

…nt (#1975) This fixes an issue discovered while I was testing #1759. Please see the new comment in code for more information.

mumoshu mentioned this pull request Nov 24, 2022

0.26.0 Disable debug log statusUpdateHook #1939

Closed

7 tasks

mumoshu mentioned this pull request Nov 25, 2022

e2e: Do honor the runner graceful stop timeout also in the dockerd sidecar prestop hook #2044

Merged

Link- deleted the graceful-stop-on-rootless-dind-runner-eviction branch March 13, 2023 15:07

jkroepke mentioned this pull request Apr 26, 2024

Please implement graceful shutdown of runners myoung34/docker-github-actions-runner#358

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix runners to do their best to gracefully stop on pod eviction #1759

Fix runners to do their best to gracefully stop on pod eviction #1759

mumoshu commented Aug 27, 2022 •

edited

Loading

TingluoHuang Aug 29, 2022

mumoshu Sep 21, 2022

TingluoHuang Aug 29, 2022

mumoshu Sep 21, 2022

mumoshu Sep 21, 2022

toast-gear left a comment

toast-gear left a comment

mumoshu commented Nov 1, 2022

dongho-jung commented Nov 2, 2022

Fix runners to do their best to gracefully stop on pod eviction #1759

Fix runners to do their best to gracefully stop on pod eviction #1759

Conversation

mumoshu commented Aug 27, 2022 • edited Loading

TingluoHuang Aug 29, 2022

Choose a reason for hiding this comment

mumoshu Sep 21, 2022

Choose a reason for hiding this comment

TingluoHuang Aug 29, 2022

Choose a reason for hiding this comment

mumoshu Sep 21, 2022

Choose a reason for hiding this comment

mumoshu Sep 21, 2022

Choose a reason for hiding this comment

toast-gear left a comment

Choose a reason for hiding this comment

toast-gear left a comment

Choose a reason for hiding this comment

mumoshu commented Nov 1, 2022

dongho-jung commented Nov 2, 2022

mumoshu commented Aug 27, 2022 •

edited

Loading