RunnerSet Reliability Improvements #1167

Merged: 10 commits merged into master from the runnerset-reliability branch on Mar 2, 2022

Conversation

@mumoshu (Collaborator) commented Mar 1, 2022

Based on my experience fixing RunnerDeployment reliability (as part of #1127), I'm adding several enhancements to the RunnerSet and RunnerPod controllers to make RunnerSets even more reliable.

  • RunnerSet now creates one StatefulSet per runner pod, so that we have more control over when and how to scale down runner pods safely, similar to what we've done for RunnerDeployment in Add GitHub API cache to avoid rate limit #1127
  • The RunnerSet spec gets a new EffectiveTime field, the same as RunnerDeployment's, which prevents unnecessarily recreating completed runner pods when the set is being scaled down quickly by webhook workflow_job completion events (see the sketch below).
  • A runner pod is never marked for deletion before it has been unregistered. This makes it possible for a runner pod running a long, time-consuming workflow job to avoid being terminated prematurely when the pod's terminationGracePeriodSeconds elapses.

Ref #920
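
For illustration, here is a minimal Go sketch of how an EffectiveTime check could gate recreation of completed pods. All names here (runnerSetSpec, shouldRecreate, the field types) are assumptions for illustration rather than the controller's actual API, and the real comparison may differ in detail.

```go
// Sketch only: every name here is an illustrative assumption, not the actual
// RunnerSet API (the real spec field is likely a metav1.Time).
package runnerset

import "time"

// runnerSetSpec sketch: EffectiveTime records when the desired replica count
// was last updated, e.g. by a workflow_job completion webhook event.
type runnerSetSpec struct {
	Replicas      int
	EffectiveTime *time.Time
}

// shouldRecreate decides whether a completed ephemeral runner pod should be
// recreated to keep the replica count. If the pod predates the latest
// EffectiveTime, a newer scale decision (typically the scale-down triggered by
// the job's completion) already accounts for it, so recreating it would only
// churn pods. The real controller's comparison may differ in detail.
func shouldRecreate(podCreatedAt time.Time, spec runnerSetSpec) bool {
	if spec.EffectiveTime == nil {
		return true
	}
	return !podCreatedAt.Before(*spec.EffectiveTime)
}
```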

@mumoshu (Collaborator, Author) commented Mar 2, 2022

Merging anyway. But I'll keep testing both RunnerDeployment and RunnerSet and perhaps submit a PR or two to tweak details and fix anything that could be considered a regression!

@mumoshu mumoshu merged commit 1917cf9 into master Mar 2, 2022
@mumoshu mumoshu deleted the runnerset-reliability branch March 2, 2022 10:03
@msessa commented Mar 2, 2022

Hi @mumoshu,

I'd be happy to test these improvements on our live cluster (which has been suffering from the issues you're trying to solve here).

Do you provide nightly-build Docker images of the controller, by any chance?

@mumoshu (Collaborator, Author) commented Mar 4, 2022

@msessa Hey! Thanks for your help. I thought we'd been building controller images tagged with canary on each master build, which should be used for testing.
Would you mind checking it?

mumoshu added a commit that referenced this pull request Mar 5, 2022
mumoshu added a commit that referenced this pull request Mar 5, 2022
mumoshu added a commit that referenced this pull request Mar 6, 2022
…er-and-runnerset

Refactor Runner and RunnerSet so that they use the same library code that powers RunnerSet.

RunnerSet is StatefulSet-based while Runner is Pod-based, so it had been hard to unify the implementations even though they look very similar in many aspects.

This change finally resolves that, by first introducing a library that implements the generic logic used to reconcile RunnerSet, and then adding an adapter that lets the generic logic manage runner pods via Runner instead of via StatefulSet.

Follow-up for #1127, #1167, and #1178
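
To illustrate the generic-library-plus-adapter idea described in this commit message, here is a minimal Go sketch. The interface and function names are assumptions for illustration, not the controller's actual types.

```go
// Sketch only: the interface and function names below are illustrative
// assumptions, not the controller's actual types.
package reconcile

import "context"

// runnerUnit abstracts "one unit of runner capacity" so the generic
// reconciliation logic does not care whether it is backed by a StatefulSet
// (RunnerSet) or by a Pod-based Runner resource.
type runnerUnit interface {
	Name() string
	IsCompleted() bool
	Delete(ctx context.Context) error
}

// unitOwner creates and lists units; one adapter wraps StatefulSets,
// another wraps Runner resources.
type unitOwner interface {
	List(ctx context.Context) ([]runnerUnit, error)
	Create(ctx context.Context) (runnerUnit, error)
}

// reconcileUnits is the shared logic: delete completed units and create new
// ones until the number of live units matches the desired count.
func reconcileUnits(ctx context.Context, owner unitOwner, desired int) error {
	units, err := owner.List(ctx)
	if err != nil {
		return err
	}
	live := 0
	for _, u := range units {
		if u.IsCompleted() {
			if err := u.Delete(ctx); err != nil {
				return err
			}
			continue
		}
		live++
	}
	for i := live; i < desired; i++ {
		if _, err := owner.Create(ctx); err != nil {
			return err
		}
	}
	return nil
}
```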
@msessa commented Mar 6, 2022

Hi @mumoshu,

I've tried running the canary images on my cluster (arm64 architecture), but unfortunately I stumbled on this issue when launching the webhook server.

@msessa commented Mar 10, 2022

@mumoshu

Just checking in on this one: after running the canary build for a few days in our live environment, I can confirm this change has had a tremendously positive impact on the reliability of our runners, especially with long-running jobs.

Even when running ARC with fairly aggressive scaling settings (scaleDownDelaySecondsAfterScaleOut set to 15 minutes) I have not encountered any premature pod termination or client disconnection so far.

The only thing I've encountered that seems a bit off: when a runner pod is executing a long job and reaches the scale-down delay, the controller spams the log with several messages like this:

{"level":"error","ts":1646927418.3731344,"logger":"actions-runner-controller.runnerpod","msg":"Failed to unregister runner before deleting the pod.","runnerpod":"runners/runner-amd64-nf79g-0","error":"failed to remove runner: DELETE https://api.github.com/orgs/service-victoria/actions/runners/2728: 422 Bad request - Runner \"runner-amd64-nf79g-0\" is still running a job\" []","stacktrace":"<omitted>"}

I understand that this is completely normal behavior at this point, so perhaps the error level should be lowered to info.

@mumoshu (Collaborator, Author) commented Mar 11, 2022

@msessa Hey! Thanks for the positive report. That really encourages me, as I was still afraid it might just be a case of "works on my (@mumoshu's) environment" 😄

Re the log spam, that's really great feedback!
Let me try fixing it. It probably doesn't need to retry that often in the first place, so I might make it retry only every, say, 30 seconds.
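
For illustration, throttling the unregistration retry could look like a delayed requeue in a controller-runtime reconciler. This is a minimal sketch: only ctrl.Result and RequeueAfter mirror the real controller-runtime API; the function names are assumptions.

```go
// Sketch only: reconcileUnregistration and unregisterRunner are illustrative
// assumptions; only ctrl.Result and RequeueAfter mirror the real
// controller-runtime API.
package reconcile

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

const unregistrationRetryDelay = 30 * time.Second

// reconcileUnregistration tries to unregister the runner once; if that fails
// (e.g. the runner is still busy), it schedules a delayed requeue instead of
// letting the controller retry immediately on the next reconciliation loop,
// so the API call and the log line occur at most once per 30 seconds.
func reconcileUnregistration(ctx context.Context, unregisterRunner func(context.Context) error) (ctrl.Result, error) {
	if err := unregisterRunner(ctx); err != nil {
		return ctrl.Result{RequeueAfter: unregistrationRetryDelay}, nil
	}
	return ctrl.Result{}, nil
}
```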

@msessa commented Mar 11, 2022

Looking forward to this new code making the next release :) Meanwhile I will continue running the canary and monitoring for issues.

Thanks for your amazing work mate!

@mumoshu (Collaborator, Author) commented Mar 11, 2022

@msessa Thanks for your support! FYI #1204 is my attempt to fix it.

mumoshu added a commit that referenced this pull request Mar 11, 2022
Since #1127 and #1167, we had been retrying the `RemoveRunner` API call on each graceful runner stop attempt while the runner was still busy.
There was no reliable way to throttle the retry attempts, so ARC ended up spamming RemoveRunner calls (one call per reconciliation loop, and the loop runs quite often due to how the controller works) once the call failed because the runner was in the middle of running a workflow job.

This fixes that by adding a few short-circuit conditions that work for ephemeral runners. An ephemeral runner can unregister itself on completion, so in most cases ARC can just wait for the runner to stop if it's already running a job. Since a RemoveRunner response with status 422 implies that the runner is running a job, we can use that as the trigger to start the runner-stop waiter.

The end result is that a 422 error is observed at most once during the whole graceful termination process of an ephemeral runner pod. RemoveRunner API calls are never retried for ephemeral runners, ARC consumes less of the GitHub API rate-limit budget, and the logs are much cleaner than before.

Ref #1167 (comment)
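
A minimal Go sketch of the 422 short-circuit described in this commit message. The function names and the stand-in error type are assumptions for illustration; the real controller inspects the GitHub client library's error type instead.

```go
// Sketch only: removeRunner, tryUnregister, and httpStatusError are
// illustrative stand-ins; the real controller inspects the GitHub client
// library's error type instead.
package reconcile

import (
	"context"
	"errors"
	"net/http"
)

// httpStatusError stands in for whatever error type the GitHub client returns.
type httpStatusError struct{ StatusCode int }

func (e *httpStatusError) Error() string { return http.StatusText(e.StatusCode) }

// tryUnregister attempts RemoveRunner once. On 422 ("still running a job") it
// reports that the caller should simply wait for the ephemeral runner to
// finish and unregister itself, rather than retrying the API call, so the 422
// error is seen at most once per pod termination.
func tryUnregister(ctx context.Context, removeRunner func(context.Context) error) (waitForStop bool, err error) {
	if err := removeRunner(ctx); err != nil {
		var statusErr *httpStatusError
		if errors.As(err, &statusErr) && statusErr.StatusCode == http.StatusUnprocessableEntity {
			return true, nil
		}
		return false, err
	}
	return false, nil
}
```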
@mumoshu (Collaborator, Author) commented Mar 11, 2022

@msessa #1204 has been merged. I'd appreciate it if you could report back on whether it fixes things and the log no longer gets spammed 😄 (The intended behavior is that the 422 error is emitted at most once per runner pod's whole lifecycle.)
