Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix runners to do their best to gracefully stop on pod eviction #1759

Merged
merged 21 commits into from
Nov 1, 2022

Conversation

mumoshu
Copy link
Collaborator

@mumoshu mumoshu commented Aug 27, 2022

Please see the updated README for more details.

Perhaps this may already work with RunnerDeployment-managed rootless-dind runners too. I'll give it a shot and update the PR title and README if it works. Worked. This works for rootless-dind pods managed by either RunnerDeployment or RunnerSet.

UPDATE: Added support for rootful dind runners too.
UPDATE: Added support for runners with dind sidecars.

Since this change, the runner which received SIGTERM while still running a job will show log like the below:

√ Connected to GitHub

2022-08-28 05:18:11.927  NOTICE --- Executing actions-runner-controller's SIGTERM handler.
2022-08-28 05:18:11.928  NOTICE --- Note that if this takes more time than terminationGracePeriodSeconds, the runner will be forcefully terminated by Kubernetes, which may result in the in-progress workflow job, if any, to fail.
2022-08-28 05:18:11.930  NOTICE --- Ensuring dockerd is still running.
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

# Runner removal

Current runner version: '2.296.0'
2022-08-28 05:18:12Z: Listening for Jobs
Failed: Removing runner from the server
Runner "org-runnerdeploy-tnshq-2k5p2" is still running a job"
2022-08-28 05:18:14.788  NOTICE --- Waiting for RUNNER_GRACEFUL_STOP_TIMEOUT=100 seconds until the runner agent to stop by itself.
2022-08-28 05:18:15Z: Running job: test2
2022-08-28 05:18:55Z: Job test2 completed with result: Succeeded
√ Removed .credentials
√ Removed .runner
Runner listener exit with 0 return code, stop the service, no retry needed.
Exiting runner...
2022-08-28 05:18:57.502  NOTICE --- The runner agent stopped before RUNNER_GRACEFUL_STOP_TIMEOUT=100
2022-08-28 05:18:57.506  NOTICE --- The actions runner process exited.
2022-08-28 05:18:57.508  NOTICE --- Holding on until runner init (pid 16) exits, so that there will hopefully be no zombie processes remaining.
2022-08-28 05:18:57.510  NOTICE --- Graceful stop completed.
2022-08-28 05:18:57.512  NOTICE --- Runner init exited. Exiting this process with code 0 so that the container and the pod is GC'ed Kubernetes soon.

NOTICE messages are from our new graceful termination handler and other messages are from actions/runner's Runner.Listener(main agent) or config.sh remove.

As you could see, this change enables you to configure RUNNER_GRACEFUL_STOP_TIMEOUT.

Configure the timeout along with the existing terminationGracePeriodSeconds, so that the new graceful termination handler does two things for you:

  • It does its best not to disrupt the running workflow job, and
  • If it's impossible to finish the job before the termationGracePeriodSeconds elapses, do its best not to leave the job hanging for 10 minutes in the GitHub Actions Workflow Runs/Jobs UI

I had to change the entrypoint of the runner images from dumb-init to bash along the way.

If not done correctly, it might have resulted in leaving zombie processes in the container or the host (I think) in case either dockerd or actions/runner had some bugs. However, I've modified the entrypoint scripts to do internally use dumb-init for running dockerd and actions/runner.

The process tree after this change now looks like the ones below.

For runner with dind sidecar:

$ ps auxwf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
runner       216  0.0  0.0   2608   596 pts/0    Ss   00:52   0:00 sh
runner       222  0.0  0.0   5892  2904 pts/0    R+   00:52   0:00  \_ ps auxwf
runner         1  0.0  0.0   3976  3004 ?        Ss   00:38   0:00 /bin/bash /usr/bin/actions-runner.sh
runner        13  0.0  0.0    216     4 ?        S    00:38   0:00 dumb-init bash
runner        16  0.0  0.0   3976  2944 ?        Ss   00:38   0:00  \_ bash
runner        17  0.0  0.0   3976  3108 ?        S    00:38   0:00      \_ /bin/bash ./run.sh
runner       125  0.0  0.0   3976  3016 ?        S    00:38   0:00          \_ /bin/bash /runner/run-helper.sh
runner       129  0.2  0.2 3699168 87076 ?       Sl   00:38   0:01              \_ /runner/bin/Runner.Listener run

For rootful dind runner:

runner@org-runnerdeploy-6dlsh-8dd56:/$ ps auxwwf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
runner       307  0.0  0.0   4240  3532 pts/0    Ss   05:27   0:00 bash
runner       314  0.0  0.0   5892  2904 pts/0    R+   05:27   0:00  \_ ps auxwwf
runner         1  0.0  0.0   3976  3092 ?        Ss   05:24   0:00 /bin/bash /usr/bin/startup.sh
runner        15  0.0  0.0    216     4 ?        S    05:24   0:00 dumb-init bash
runner        17  0.0  0.0   3976  3084 ?        Ss   05:24   0:00  \_ bash
root          23  0.0  0.0   6276  4200 ?        S    05:24   0:00      \_ sudo /usr/bin/supervisord -n
root          27  0.0  0.0  28448 23444 ?        S    05:24   0:00      |   \_ /usr/bin/python3 /usr/bin/supervisord -n
root          40  0.1  0.1 2014364 54036 ?       Sl   05:24   0:00      |       \_ /usr/local/bin/dockerd
root          55  0.0  0.0 2066508 29460 ?       Ssl  05:24   0:00      |           \_ containerd --config /var/run/docker/containerd/containerd.toml --log-level info
runner       158  0.0  0.0   3976  2968 ?        S    05:24   0:00      \_ /bin/bash ./run.sh
runner       262  0.0  0.0   3976  3104 ?        S    05:24   0:00          \_ /bin/bash /runner/run-helper.sh
runner       266  0.8  0.2 3699352 84988 ?       Sl   05:24   0:01              \_ /runner/bin/Runner.Listener run

For rootless dind runner:

runner@org-runnerdeploy-gpn9b-tx7wl:/$ ps auxwf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
runner       314  0.3  0.0   4240  3504 pts/0    Ss   05:36   0:00 bash
runner       322  0.0  0.0   5892  2904 pts/0    R+   05:36   0:00  \_ ps auxwf
runner         1  0.0  0.0   3976  3180 ?        Ss   05:34   0:00 /bin/bash /usr/bin/rootless-startup.sh
runner        19  0.0  0.0    216     4 ?        S    05:34   0:00 dumb-init bash
runner        22  0.0  0.0   3976  2936 ?        Ss   05:34   0:00  \_ bash
runner        23  0.0  0.0 709732  7012 ?        Sl   05:34   0:00      \_ rootlesskit --net=vpnkit --mtu=1500 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host-loopba
runner        33  0.0  0.0 709732  7140 ?        Sl   05:34   0:00      |   \_ /proc/self/exe --net=vpnkit --mtu=1500 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host
runner        63  0.1  0.1 1493308 54168 ?       Sl   05:34   0:00      |   |   \_ dockerd --config-file /home/runner/.config/docker/daemon.json
runner        86  0.1  0.1 1482280 35032 ?       Ssl  05:34   0:00      |   |       \_ containerd --config /run/user/1000/docker/containerd/containerd.toml --log-level info
runner        46  0.0  0.0  66428 24372 ?        Sl   05:34   0:00      |   \_ vpnkit --ethernet /tmp/rootlesskit1191930478/vpnkit-ethernet.sock --mtu 1500 --host-ip 0.0.0.0
runner        24  0.0  0.0   3976  3064 ?        S    05:34   0:00      \_ /bin/bash ./run.sh
runner       280  0.0  0.0   3976  3056 ?        S    05:34   0:00          \_ /bin/bash /runner/run-helper.sh
runner       284  1.8  0.2 3699340 86704 ?       Sl   05:34   0:01              \_ /runner/bin/Runner.Listener run

As long as our entrypoint (The new actions-runner.sh, the modified startup.sh and rootless-startup.sh respectively) sends SIGTERM to dumb-init, it should reap any remaining zombie processes (if any). The updated entrypoint scripts do exactly that and therefore you should encounter no zombie processes issues even if our init (pid 1) is no longer dumb-init.

Note also that runner with dind sidecar can still fail to stop gracefully due to that the dind sidecar has no feature to wait until its client(=actions/runner agent) stops first. You'll see messages like the ones seen in the below log when this happens. The fix for this should be made in another pull request. I've already addressed it in this PR.

org-runnerdeploy-rw4pd-7vw24 runner √ Connected to GitHub
org-runnerdeploy-rw4pd-7vw24 runner 
org-runnerdeploy-rw4pd-7vw24 runner Current runner version: '2.296.2'
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:45:30Z: Listening for Jobs
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:45:42Z: Running job: test0
org-runnerdeploy-rw4pd-7vw24 docker time="2022-09-26T00:45:45.357093068Z" level=info msg="Processing signal 'terminated'"
org-runnerdeploy-rw4pd-7vw24 docker time="2022-09-26T00:45:45.358539782Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
org-runnerdeploy-rw4pd-7vw24 docker time="2022-09-26T00:45:45.359930049Z" level=info msg="Daemon shutdown complete"
org-runnerdeploy-rw4pd-7vw24 docker time="2022-09-26T00:45:45.359983931Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd
org-runnerdeploy-rw4pd-7vw24 docker time="2022-09-26T00:45:45.360236297Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:45:45.363  NOTICE --- Executing actions-runner-controller's SIGTERM handler.
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:45:45.367  NOTICE --- Note that if this takes more time than terminationGracePeriodSeconds, the runner will be forcefully terminated by Kubernetes, which may result in the in-progress workflow job, if any, to fail.
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:45:45.382  NOTICE --- Ensuring dockerd is still running.
org-runnerdeploy-rw4pd-7vw24 runner Cannot connect to the Docker daemon at tcp://localhost:2376. Is the docker daemon running?
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:45:45.439  WARNING --- Detected configuration error: dockerd should be running but is already nowhere. This is wrong. Ensure that your init system to NOT pass SIGTERM directly to dockerd!
org-runnerdeploy-rw4pd-7vw24 runner /runner /
org-runnerdeploy-rw4pd-7vw24 runner 
org-runnerdeploy-rw4pd-7vw24 runner # Runner removal
org-runnerdeploy-rw4pd-7vw24 runner 
org-runnerdeploy-rw4pd-7vw24 docker time="2022-09-26T00:45:46.375563741Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///var/run/docker/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix:///var/run/docker/containerd/containerd.sock: timeout\". Reconnecting..." module=grpc
org-runnerdeploy-rw4pd-7vw24 runner Failed: Removing runner from the server
org-runnerdeploy-rw4pd-7vw24 runner Runner "org-runnerdeploy-rw4pd-7vw24" is still running a job"
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:45:49.385  NOTICE --- Waiting for RUNNER_GRACEFUL_STOP_TIMEOUT=1780 seconds until the runner agent to stop by itself.
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:46:54Z: Job test0 completed with result: Failed
org-runnerdeploy-rw4pd-7vw24 runner √ Removed .credentials
org-runnerdeploy-rw4pd-7vw24 runner √ Removed .runner
org-runnerdeploy-rw4pd-7vw24 runner Runner listener exit with 0 return code, stop the service, no retry needed.
org-runnerdeploy-rw4pd-7vw24 runner Exiting runner...
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:46:55.547  NOTICE --- The runner agent stopped before RUNNER_GRACEFUL_STOP_TIMEOUT=1780
org-runnerdeploy-rw4pd-7vw24 runner /
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:46:55.575  NOTICE --- The actions runner process exited.
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:46:55.592  NOTICE --- Holding on until runner init (pid 13) exits, so that there will hopefully be no zombie processes remaining.
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:46:55.604  NOTICE --- Graceful stop completed.
org-runnerdeploy-rw4pd-7vw24 runner 2022-09-26 00:46:55.617  NOTICE --- Runner init exited. Exiting this process with code 0 so that the container and the pod is GC'ed Kubernetes soon.

Ref #1535
Ref #1581

@mumoshu mumoshu changed the title Fix RunnerSet-managed rootless-dind runners to gracefully stop on pod eviction Fix rootless-dind runners to gracefully stop on pod eviction Aug 28, 2022
@mumoshu mumoshu force-pushed the graceful-stop-on-rootless-dind-runner-eviction branch 2 times, most recently from dcfb4f2 to 94b6d89 Compare August 28, 2022 02:55
@mumoshu mumoshu changed the title Fix rootless-dind runners to gracefully stop on pod eviction Fix rootless/rootful dind runners to gracefully stop on pod eviction Aug 28, 2022
@mumoshu mumoshu force-pushed the graceful-stop-on-rootless-dind-runner-eviction branch 2 times, most recently from e2006d4 to 10cb56d Compare August 28, 2022 06:16
@mumoshu mumoshu changed the title Fix rootless/rootful dind runners to gracefully stop on pod eviction Fix runners to gracefully stop on pod eviction Aug 29, 2022
@mumoshu mumoshu force-pushed the graceful-stop-on-rootless-dind-runner-eviction branch from c82c99c to 214f5c6 Compare August 29, 2022 01:02
# Cannot connect to server, because config files are missing. Skipping removing runner from the server.
# Does not exist. Skipping Removing .credentials
# Does not exist. Skipping Removing .runner
cd /runner
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want to use pushd /runner? so you can popd to your original pwd after the ./config remove

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely! Addressed in f5610db

#!/bin/bash
source logger.bash
source graceful-stop.bash

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to move trap graceful_stop TERM in here to make it more readable since it looks like actopns-runner.sh is the new main entrypoint script?
So, we can add trap - TERM at the end of the file to remove the trap?
You can ignore me, not a big deal. 😄

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds much cleaner! Let me try. Thanks for your feedback 😄

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in a666a2f

TingluoHuang
TingluoHuang previously approved these changes Aug 29, 2022
Copy link
Collaborator

@toast-gear toast-gear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be nice to add shellcheck linting into the pipeline as part of this change seen as we doing major surgery

runner/wait.bash Outdated Show resolved Hide resolved
@mumoshu mumoshu force-pushed the graceful-stop-on-rootless-dind-runner-eviction branch from 214f5c6 to f5610db Compare September 21, 2022 02:36
mumoshu added a commit that referenced this pull request Sep 25, 2022
I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort to a major enhancement to the runner entrypoints being made in #1759.
This commit adds the makefile target to run shellformat locally, so that any contributor can use it before submitting a pull request.
mumoshu added a commit that referenced this pull request Sep 25, 2022
I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort to a major enhancement to the runner entrypoints being made in #1759.

This change is a counterpart of #1852. #1852 enables you to easily run shellcheck locally. This enables you to automatically run shellcheck on every pull request.

Currently, any shellcheck error does not result in failing the workflow job. Once we addressed all the shellcheck findings, we can flip the fail_on_error option to true and let jobs start failing on pull requests that introduces invalid or suspicious bash code.
@mumoshu mumoshu changed the title Fix runners to gracefully stop on pod eviction Fix runners to do their best to gracefully stop on pod eviction Sep 25, 2022
mumoshu added a commit that referenced this pull request Sep 25, 2022
I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort for a major enhancement to the runner entrypoints being made in #1759.

This change updates and reintroduces #1517 contributed by @CASABECI in a way it becomes applicable to today's code-base.
mumoshu added a commit that referenced this pull request Sep 25, 2022
I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort for a major enhancement to the runner entrypoints being made in #1759.
This commit adds the makefile target to run shellformat locally, so that any contributor can use it before submitting a pull request.
mumoshu added a commit that referenced this pull request Sep 25, 2022
I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort to a major enhancement to the runner entrypoints being made in #1759.

This change is a counterpart of #1852. #1852 enables you to easily run shellcheck locally. This enables you to automatically run shellcheck on every pull request.

Currently, any shellcheck error does not result in failing the workflow job. Once we addressed all the shellcheck findings, we can flip the fail_on_error option to true and let jobs start failing on pull requests that introduces invalid or suspicious bash code.
mumoshu added a commit that referenced this pull request Sep 25, 2022
…eriodSeconds and RUNNER_GRACEFUL_STOP_TIMEOUT when runner pod was terminated externally too early after its creation

While I was running E2E tests for #1759, I discovered a potential issue that ARC can terminate runner pods without waiting for the registration timouet of 10 minutes.

You won't be affected by this in normal circumstances, as this failure scenario can be triggered only when you or another K8s controller like cluster-autoscaler deleted the runner or the runner pod immediately after the runner or the runner pod has been created. But probably is it worth fixing it anyway because it's not impossible to trigger it?
toast-gear
toast-gear previously approved these changes Oct 24, 2022
runner/entrypoint.sh Outdated Show resolved Hide resolved
Copy link
Collaborator

@toast-gear toast-gear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mumoshu
Copy link
Collaborator Author

mumoshu commented Nov 1, 2022

Thanks for reviewing!

@mumoshu mumoshu merged commit c74ad61 into master Nov 1, 2022
@dongho-jung
Copy link
Contributor

Wow I was gonna open a discussion for this because I've been suffering from this since 2 weeks ago and wow it just merged 15 hours ago what a timing 👍

mumoshu added a commit that referenced this pull request Nov 3, 2022
This fixes an issue discovered while I was testing #1759. Please see the new comment in code for more information.
mumoshu added a commit that referenced this pull request Nov 3, 2022
While testing #1759, I found an issue in the rootless dind entrypoint that it was not respecting the configured MTU for dind docker due to a permission issue. This fixes that.
mumoshu added a commit that referenced this pull request Nov 3, 2022
This fixes the said issue I have found while testing #1759.
mumoshu added a commit that referenced this pull request Nov 3, 2022
While testing #1759, I found an issue in the rootless dind entrypoint that it was not respecting the configured MTU for dind docker due to a permission issue. This fixes that.
mumoshu added a commit that referenced this pull request Nov 3, 2022
…nt (#1975)

This fixes an issue discovered while I was testing #1759. Please see the new comment in code for more information.
mumoshu added a commit that referenced this pull request Nov 3, 2022
* Fix permission issue when you use PV for rootless dind cache

This fixes the said issue I have found while testing #1759.

Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
mumoshu added a commit that referenced this pull request Nov 25, 2022
…decar prestop hook

The runner graceful stop timeout has never been propagated to the dind sidecar due to configuration error in E2E. This fixes it, so that we can verify that the dind sidecar prestop can respect the graceful stop timeout.

Related to #1759
mumoshu added a commit that referenced this pull request Nov 27, 2022
…decar prestop hook (#2044)

The runner graceful stop timeout has never been propagated to the dind sidecar due to configuration error in E2E. This fixes it, so that we can verify that the dind sidecar prestop can respect the graceful stop timeout.

Related to #1759
@Link- Link- deleted the graceful-stop-on-rootless-dind-runner-eviction branch March 13, 2023 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants