
Prevent runners from being stuck in Terminating when pod disappeared without standard termination process #1318

Merged 1 commit into master from give-up-runner-unregistration-on-pod-failure on Apr 8, 2022

Conversation

@mumoshu (Collaborator) commented Apr 8, 2022

This fixes the issue by additionally treating any runner pod whose phase is Failed, or whose runner container exited with a non-zero code, as "complete", so that ARC gives up unregistering the runner from Actions and deletes the runner pod anyway.
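For illustration only, here is a minimal Go sketch of the kind of pod check described above. This is not ARC's actual implementation; the helper name and the assumption that the runner container is named `runner` are mine:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
)

// podConsideredComplete reports whether a runner pod should be treated as
// "complete" even though it never went through the normal termination path:
// either the whole pod reached a terminal phase, or the runner container
// exited with a non-zero code.
func podConsideredComplete(pod *corev1.Pod) bool {
	if pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed {
		return true
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name != "runner" { // assumption: the runner container is named "runner"
			continue
		}
		if cs.State.Terminated != nil && cs.State.Terminated.ExitCode != 0 {
			return true
		}
	}
	return false
}
```

When this check is true, the controller can skip the unregistration attempt and proceed to delete the pod instead of waiting indefinitely.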

Note that there are many possible causes for that. If you are deploying runner pods on AWS spot instances or GCE preemptible instances and a job assigned to a runner takes longer than the shutdown grace period provided by your cloud provider (2 minutes for AWS spot instances), the runner pod is terminated prematurely without letting actions/runner unregister itself from Actions. If your VM or hypervisor fails, runner pods that were running on that node become Failed without unregistering their runners from Actions.

Please be aware that ARC leaves dangling runners on Actions in such circumstances, with or without this patch. ARC can't unregister them because Actions doesn't automatically change a disappeared runner from busy to non-busy, at least not for several hours, and there is no API to forcefully mark a runner as non-busy or force-delete it (the DeleteRunner API returns 422 for busy runners). It is currently the user's responsibility to clean up any dangling runner resources on GitHub Actions.
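As a hedged sketch of what that manual cleanup could look like (not part of ARC, and only effective once Actions has stopped reporting the runner as busy), the following hypothetical Go program uses google/go-github to remove repository-level runners that are offline and not busy. The owner/repo names, the go-github major version, and the offline/non-busy criteria are assumptions:

```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/google/go-github/v45/github"
	"golang.org/x/oauth2"
)

func main() {
	ctx := context.Background()
	ts := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: os.Getenv("GITHUB_TOKEN")})
	client := github.NewClient(oauth2.NewClient(ctx, ts))

	owner, repo := "my-org", "my-repo" // assumption: replace with your repository

	runners, _, err := client.Actions.ListRunners(ctx, owner, repo, nil)
	if err != nil {
		panic(err)
	}
	for _, r := range runners.Runners {
		// Only remove runners that are offline and no longer reported busy;
		// deleting a busy runner fails with 422, as noted above.
		if r.GetStatus() == "offline" && !r.GetBusy() {
			if _, err := client.Actions.RemoveRunner(ctx, owner, repo, r.GetID()); err != nil {
				fmt.Printf("failed to remove runner %s: %v\n", r.GetName(), err)
				continue
			}
			fmt.Printf("removed dangling runner %s\n", r.GetName())
		}
	}
}
```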

Ref #1307
Might also relate to #1273

@mumoshu mumoshu changed the title from "Prevent runners from stuck in Terminating when pod disappeared without standard termination process" to "Prevent runners from being stuck in Terminating when pod disappeared without standard termination process" Apr 8, 2022
@mumoshu mumoshu added this to the v0.22.3 milestone Apr 8, 2022
@mumoshu mumoshu merged commit b09c540 into master Apr 8, 2022
@mumoshu mumoshu deleted the give-up-runner-unregistration-on-pod-failure branch April 8, 2022 01:17
billimek added a commit to homedepot/actions-runner-controller that referenced this pull request Apr 12, 2022
* master: (30 commits)
  chore(deps): update azure/setup-helm action to v2.1 (actions#1328) [skip ci]
  Add more usages of RUNNER_VERSION to Renovate config. (actions#1313)
  chore: fix typo (actions#1316) [skip ci]
  chore: bump chart to latest (actions#1319)
  Fix release workflow
  Prevent runners from stuck in Terminating when pod disappeared without standard termination process (actions#1318)
  ci: pin go version to the known working version (actions#1303)
  chore: bump chart to latest (actions#1300)
  Fix runner pod to be cleaned up earlier regardless of the sync period (actions#1299)
  Make the hard-coded runner startup timeout to avoid race on token expiration longer (actions#1296)
  docs: highlight why persistent are not ideal
  fix(deps): update module sigs.k8s.io/controller-runtime to v0.11.2
  chore(deps): update dependency actions/runner to v2.289.2
  chore(deps): update helm/chart-releaser-action action to v1.4.0 (actions#1287)
  refactor(runner/entrypoint): check for externalstmp (actions#1277)
  docs: redundant words
  docs: wording
  docs: add autoscaling also causes problems
  chore: new line for consistency
  docs: use the right font
  ...
@mumoshu mumoshu mentioned this pull request Apr 14, 2022