
Refactor Runner and RunnerSet #1179

Merged · mumoshu merged 2 commits into master from refactor-runner-and-runnerset on Mar 6, 2022
Conversation

@mumoshu (Collaborator) commented Mar 5, 2022

This refactors the Runner and RunnerSet controllers so that both use the same library code that currently powers RunnerSet.

RunnerSet is StatefulSet-based while Runner is Pod-based, so it had been hard to unify their implementations even though they look very similar in many aspects.

This change finally resolves that by first introducing a library that implements the generic logic used to reconcile RunnerSet, and then adding an adapter that lets that generic logic manage runner pods via Runner instead of via StatefulSet.

Follow-up for #1127, #1167, and #1178
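
The adapter approach can be illustrated with a minimal, hypothetical sketch (the interface and function names below are illustrative, not ARC's actual code): the generic reconcile logic depends only on a small abstraction, so the same scale/sync code can drive StatefulSets for RunnerSet or runner pods via Runner.

```go
// Hypothetical sketch, not ARC's actual types: a generic reconciler that only
// depends on an "owned resource" abstraction, so the same logic can back both
// StatefulSet-based RunnerSet and Pod-based Runner.
package runnersync

import "context"

// ownedResource abstracts whatever concrete object the controller owns.
type ownedResource interface {
	Name() string
	Ready() bool
}

// adapter hides the concrete kind (StatefulSet vs Runner/Pod) from the
// generic reconcile logic.
type adapter interface {
	List(ctx context.Context) ([]ownedResource, error)
	Create(ctx context.Context) error
	Delete(ctx context.Context, r ownedResource) error
}

// reconcileReplicas converges the number of owned resources toward desired,
// regardless of which kind the adapter manages.
func reconcileReplicas(ctx context.Context, a adapter, desired int) error {
	current, err := a.List(ctx)
	if err != nil {
		return err
	}
	for i := len(current); i < desired; i++ {
		if err := a.Create(ctx); err != nil {
			return err
		}
	}
	if len(current) > desired {
		for _, r := range current[desired:] {
			if err := a.Delete(ctx, r); err != nil {
				return err
			}
		}
	}
	return nil
}
```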

…ller

so that it can potentially be reused from the runnerreplicaset controller
@mumoshu force-pushed the refactor-runner-and-runnerset branch 2 times, most recently from e217ebb to 88d7208 on March 6, 2022 05:52
@mumoshu force-pushed the refactor-runner-and-runnerset branch from 88d7208 to 14a878b on March 6, 2022 05:53
@mumoshu changed the title from "wip: Refactor Runner and RunnerSet" to "Refactor Runner and RunnerSet" on Mar 6, 2022
@mumoshu merged commit bed9270 into master on Mar 6, 2022
@mumoshu deleted the refactor-runner-and-runnerset branch on March 6, 2022 06:56
mumoshu added a commit that referenced this pull request Mar 6, 2022
While testing #1179, I discovered that ARC sometimes stops resyncing a RunnerReplicaSet when the desired number of replicas is greater than the actual number of runner pods.
This seems to happen when ARC misses a workflow_job completion event, but ARC has no way to tell whether (1) something went wrong within ARC itself or (2) something other than ARC, such as a load balancer in the middle or GitHub, failed. It needs a criterion to decide between the two.

So I added a hard-coded 30-second timeout (it can be made customizable later) to prevent runner pod recreation.
mumoshu added a commit that referenced this pull request Mar 6, 2022
While testing #1179, I discovered that ARC sometimes stops resyncing a RunnerReplicaSet when the desired number of replicas is greater than the actual number of runner pods.
This seems to happen when ARC misses a workflow_job completion event, but ARC has no way to tell whether (1) something went wrong within ARC itself or (2) something other than ARC, such as a load balancer in the middle or GitHub, failed. It needs a criterion to decide between the two or, failing that, a way to deal with the ambiguity.

In this change, I added a hard-coded 10-minute timeout (it can be made customizable later) to prevent runner pod recreation.
Now, a RunnerReplicaSet/RunnerSet restarts runner pod recreation 10 minutes after the last scale-up. If the workflow completion event arrives after the timeout, it decreases the desired replica count, which results in the removal of a runner pod. The removed runner pod might be deleted without ever being used, but I think that's better than leaving the desired and actual replica counts diverged forever.
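
A minimal, hypothetical sketch of the timeout described above (names and signature are illustrative, not ARC's actual code): recreation of a missing ephemeral runner pod is held back for a window after the last scale-up, and resumes once the window has passed.

```go
// Hypothetical sketch of the behavior described in the commit message above.
package scaling

import "time"

// recreationDelay mirrors the hard-coded 10-minute window; it could be made
// configurable later.
const recreationDelay = 10 * time.Minute

// shouldRecreateMissingRunnerPod reports whether a missing ephemeral runner
// pod should be recreated. Within the window after the last scale-up we wait
// for a (possibly delayed) workflow_job completion event to lower the desired
// replicas; after the window, recreation resumes so the desired and actual
// replica counts can converge again.
func shouldRecreateMissingRunnerPod(lastScaleUp, now time.Time) bool {
	return now.Sub(lastScaleUp) >= recreationDelay
}
```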
mumoshu added a commit that referenced this pull request Mar 7, 2022
…1180)

While testing #1179, I discovered that ARC sometimes stops resyncing a RunnerReplicaSet when the desired number of replicas is greater than the actual number of runner pods.
This seems to happen when ARC misses a workflow_job completion event, but ARC has no way to tell whether (1) something went wrong within ARC itself or (2) something other than ARC, such as a load balancer in the middle or GitHub, failed. It needs a criterion to decide between the two or, failing that, a way to deal with the ambiguity.

In this change, I added a hard-coded 10-minute timeout (it can be made customizable later) to prevent runner pod recreation.
Now, a RunnerReplicaSet/RunnerSet restarts runner pod recreation 10 minutes after the last scale-up. If the workflow completion event arrives after the timeout, it decreases the desired replica count, which results in the removal of a runner pod. The removed runner pod might be deleted without ever being used, but I think that's better than leaving the desired and actual replica counts diverged forever.
mumoshu added a commit that referenced this pull request Mar 12, 2022
…UNNER_TOKEN injected twice

Since #1179, runner pods managed by RunnerDeployment had duplicate RUNNER_NAME and RUNNER_TOKEN environment variables. This fixes that.
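
A minimal, hypothetical sketch of that kind of fix (not the actual ARC code): deduplicate the container's environment variables by name so RUNNER_NAME and RUNNER_TOKEN are only injected once.

```go
// Hypothetical sketch: keep only the first occurrence of each env var name.
package podutil

import corev1 "k8s.io/api/core/v1"

func dedupeEnvVars(in []corev1.EnvVar) []corev1.EnvVar {
	seen := make(map[string]bool, len(in))
	out := make([]corev1.EnvVar, 0, len(in))
	for _, e := range in {
		if seen[e.Name] {
			continue // drop the duplicate, e.g. a second RUNNER_TOKEN
		}
		seen[e.Name] = true
		out = append(out, e)
	}
	return out
}
```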
mumoshu added a commit that referenced this pull request Mar 12, 2022
I found that #1179 was unable to finish the rollout of a RunnerDeployment update (like a runner env update). It was able to create a new RunnerReplicaSet with the desired spec, but unable to tear down the older ones. This fixes that.
mumoshu added a commit that referenced this pull request Mar 13, 2022
I found that #1179 was unable to finish the rollout of a RunnerDeployment update (like a runner env update). It was able to create a new RunnerReplicaSet with the desired spec, but unable to tear down the older ones. This fixes that.
mumoshu added a commit that referenced this pull request Mar 13, 2022
It turned out that #1179 broke static runners so that they could no longer scale up at all when the desired replica count was updated.
This fixes that by correcting a short-circuit intended only for ephemeral runners so that it is no longer mistakenly triggered for static runners.
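
The gist of such a guard can be sketched hypothetically (names are illustrative, not the actual ARC code): the short-circuit that skips pod recreation must be gated on the runner being ephemeral, so static runners always follow the normal scale-up path.

```go
// Hypothetical sketch: only ephemeral runners are subject to the recreation
// short-circuit; static runners never are.
package scaling

func skipPodRecreation(ephemeral, completedAfterLastScaleUp bool) bool {
	if !ephemeral {
		// Static runners: a missing pod is always recreated to satisfy
		// the desired replica count.
		return false
	}
	return completedAfterLastScaleUp
}
```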
mumoshu added a commit that referenced this pull request Mar 13, 2022
#1179 did not work well for scaling down static (and perhaps long-running ephemeral) runners: some runner pods were terminated before the requested unregistration process completed, which caused some in-progress workflow jobs to hang forever. This fixes the edge case in which a decreased desired replica count triggered that failure, so that every runner is unregistered and then terminated, as originally designed.
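
The intended unregister-then-terminate order can be sketched hypothetically (the interface and names below are illustrative, not ARC's actual API): the pod is only deleted once the runner has been successfully unregistered from GitHub; otherwise the reconciler retries later.

```go
// Hypothetical sketch of the unregister-and-then-delete termination order.
package scaling

import (
	"context"
	"fmt"
)

type runnerRegistry interface {
	// RemoveRunner unregisters the runner from GitHub; it is expected to
	// fail while the runner is still busy running a job.
	RemoveRunner(ctx context.Context, name string) error
}

func scaleDownRunner(ctx context.Context, reg runnerRegistry, name string, deletePod func(context.Context, string) error) error {
	if err := reg.RemoveRunner(ctx, name); err != nil {
		// Likely still busy or a transient error: requeue instead of
		// force-deleting the pod and hanging the in-progress job.
		return fmt.Errorf("runner %s not unregistered yet: %w", name, err)
	}
	return deletePod(ctx, name)
}
```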
mumoshu added a commit that referenced this pull request Mar 13, 2022
…nners

Fix runner{set,deployment} rollouts and static runner scaling

I was testing static runners in preparation for cutting the next release of ARC, v0.22.0, and found several problems that I thought were worth fixing.

In particular, this PR fixes static runner reliability issues in two ways.

c4b24f8 fixes the issue that ARC gave up retrying RemoveRunner calls too early, especially for static runners, which resulted in static runners often being terminated prematurely while running jobs.
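
A minimal, hypothetical sketch of retrying for longer (the helper below is illustrative, not ARC's actual code): unregistration is retried for a bounded period instead of giving up after the first few failures.

```go
// Hypothetical sketch: keep retrying unregistration for a bounded period so a
// busy static runner is not terminated while it still runs a job.
package scaling

import (
	"context"
	"time"
)

func removeRunnerWithRetry(ctx context.Context, remove func(context.Context) error, retryDelay, deadline time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, deadline)
	defer cancel()
	for {
		if err := remove(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // give up only after the full deadline
		case <-time.After(retryDelay):
			// The runner was likely still busy; try again.
		}
	}
}
```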

791634f fixes the issue that ARC was unable to scale up any static runners when the corresponding desired replica count in e.g. RunnerDeployment was updated. It was caused by a bug in the mechanism that is intended to prevent ephemeral runners from being recreated in unwanted circumstances but was mistakenly triggered for static runners as well.

Since #1179, RunnerDeployment was not able to gracefully terminate old RunnerReplicaSets on update. c612e87 fixes that by changing RunnerDeployment to first scale old RunnerReplicaSet(s) down to zero, wait for the sync, and only then set the deletion timestamp. That way, RunnerDeployment can ensure that every old RunnerReplicaSet being deleted has already been scaled to zero through the standard unregister-and-then-delete runner termination process.

This also revealed a hidden bug in #1179 where the scale-to-zero-before-RunnerReplicaSet-termination sometimes did not work as intended. 4551309 fixes that, so that RunnerDeployment can actually terminate old RunnerReplicaSets gracefully.
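
The scale-to-zero-then-delete rollout described above can be sketched hypothetically (the interface and function are illustrative, not ARC's actual code): each old RunnerReplicaSet is first scaled to zero, the controller waits until its runners are actually gone, and only then is the object deleted.

```go
// Hypothetical sketch of the graceful teardown order for old RunnerReplicaSets.
package rollout

import "context"

type replicaSet interface {
	DesiredReplicas() int
	ReadyReplicas() int
	SetDesiredReplicas(ctx context.Context, n int) error
	Delete(ctx context.Context) error
}

// teardownOld returns done=false while any old replica set still has to be
// scaled down or drained; the caller requeues and retries on the next
// reconcile until everything is gone.
func teardownOld(ctx context.Context, old []replicaSet) (done bool, err error) {
	done = true
	for _, rs := range old {
		if rs.DesiredReplicas() > 0 {
			if err := rs.SetDesiredReplicas(ctx, 0); err != nil {
				return false, err
			}
			done = false // wait for the scale-down to be observed
			continue
		}
		if rs.ReadyReplicas() > 0 {
			done = false // runners still unregistering and terminating
			continue
		}
		if err := rs.Delete(ctx); err != nil {
			return false, err
		}
	}
	return done, nil
}
```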