
Refactor Runner and RunnerSet #1179

Merged · mumoshu merged 2 commits into master from refactor-runner-and-runnerset on Mar 6, 2022
Conversation

@mumoshu (Collaborator) commented Mar 5, 2022

This refactors the Runner and RunnerSet controllers so that both use the same library code that currently powers RunnerSet.

RunnerSet is StatefulSet-based while Runner is Pod-based, so it had been hard to unify their implementations even though they look very similar in many aspects.

This change finally resolves that by first introducing a library that implements the generic logic used to reconcile RunnerSet, and then adding an adapter that lets that generic logic manage runner pods via Runner instead of via StatefulSet.

Follow-up for #1127, #1167, and #1178
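
The adapter approach can be illustrated with a minimal, hypothetical sketch (the interface and function names below are illustrative, not ARC's actual code): the generic reconcile logic depends only on a small abstraction, so the same scale/sync code can drive StatefulSets for RunnerSet or runner pods via Runner.

```go
// Hypothetical sketch, not ARC's actual types: a generic reconciler that only
// depends on an "owned resource" abstraction, so the same logic can back both
// StatefulSet-based RunnerSet and Pod-based Runner.
package runnersync

import "context"

// ownedResource abstracts whatever concrete object the controller owns.
type ownedResource interface {
	Name() string
	Ready() bool
}

// adapter hides the concrete kind (StatefulSet vs Runner/Pod) from the
// generic reconcile logic.
type adapter interface {
	List(ctx context.Context) ([]ownedResource, error)
	Create(ctx context.Context) error
	Delete(ctx context.Context, r ownedResource) error
}

// reconcileReplicas converges the number of owned resources toward desired,
// regardless of which kind the adapter manages.
func reconcileReplicas(ctx context.Context, a adapter, desired int) error {
	current, err := a.List(ctx)
	if err != nil {
		return err
	}
	for i := len(current); i < desired; i++ {
		if err := a.Create(ctx); err != nil {
			return err
		}
	}
	if len(current) > desired {
		for _, r := range current[desired:] {
			if err := a.Delete(ctx, r); err != nil {
				return err
			}
		}
	}
	return nil
}
```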

…ller

so that it can potentially be reused from the runnerreplicaset controller
@mumoshu force-pushed the refactor-runner-and-runnerset branch 2 times, most recently from e217ebb to 88d7208 on March 6, 2022 05:52
@mumoshu force-pushed the refactor-runner-and-runnerset branch from 88d7208 to 14a878b on March 6, 2022 05:53
@mumoshu changed the title from "wip: Refactor Runner and RunnerSet" to "Refactor Runner and RunnerSet" on Mar 6, 2022
@mumoshu merged commit bed9270 into master on Mar 6, 2022
@mumoshu deleted the refactor-runner-and-runnerset branch on March 6, 2022 06:56
mumoshu added a commit that referenced this pull request Mar 6, 2022
While testing #1179, I discovered that ARC sometimes stops resyncing a RunnerReplicaSet when the desired number of replicas is greater than the actual number of runner pods.
This seems to happen when ARC misses a workflow_job completion event, but ARC has no way to tell whether (1) something went wrong within ARC itself or (2) something other than ARC, such as a load balancer in the middle or GitHub, failed. It needs a criterion to decide between the two.

So I added a hard-coded 30-second timeout (it can be made customizable later) to prevent runner pod recreation.
mumoshu added a commit that referenced this pull request Mar 6, 2022
While testing #1179, I discovered that ARC sometimes stops resyncing a RunnerReplicaSet when the desired number of replicas is greater than the actual number of runner pods.
This seems to happen when ARC misses a workflow_job completion event, but ARC has no way to tell whether (1) something went wrong within ARC itself or (2) something other than ARC, such as a load balancer in the middle or GitHub, failed. It needs a criterion to decide between the two or, failing that, a way to deal with the ambiguity.

In this change, I added a hard-coded 10-minute timeout (it can be made customizable later) to prevent runner pod recreation.
Now, a RunnerReplicaSet/RunnerSet restarts runner pod recreation 10 minutes after the last scale-up. If the workflow completion event arrives after the timeout, it decreases the desired replica count, which results in the removal of a runner pod. The removed runner pod might be deleted without ever being used, but I think that's better than leaving the desired and actual replica counts diverged forever.
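
A minimal, hypothetical sketch of the timeout described above (names and signature are illustrative, not ARC's actual code): recreation of a missing ephemeral runner pod is held back for a window after the last scale-up, and resumes once the window has passed.

```go
// Hypothetical sketch of the behavior described in the commit message above.
package scaling

import "time"

// recreationDelay mirrors the hard-coded 10-minute window; it could be made
// configurable later.
const recreationDelay = 10 * time.Minute

// shouldRecreateMissingRunnerPod reports whether a missing ephemeral runner
// pod should be recreated. Within the window after the last scale-up we wait
// for a (possibly delayed) workflow_job completion event to lower the desired
// replicas; after the window, recreation resumes so the desired and actual
// replica counts can converge again.
func shouldRecreateMissingRunnerPod(lastScaleUp, now time.Time) bool {
	return now.Sub(lastScaleUp) >= recreationDelay
}
```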
mumoshu added a commit that referenced this pull request Mar 7, 2022
…1180)

While testing #1179, I discovered that ARC sometimes stops resyncing a RunnerReplicaSet when the desired number of replicas is greater than the actual number of runner pods.
This seems to happen when ARC misses a workflow_job completion event, but ARC has no way to tell whether (1) something went wrong within ARC itself or (2) something other than ARC, such as a load balancer in the middle or GitHub, failed. It needs a criterion to decide between the two or, failing that, a way to deal with the ambiguity.

In this change, I added a hard-coded 10-minute timeout (it can be made customizable later) to prevent runner pod recreation.
Now, a RunnerReplicaSet/RunnerSet restarts runner pod recreation 10 minutes after the last scale-up. If the workflow completion event arrives after the timeout, it decreases the desired replica count, which results in the removal of a runner pod. The removed runner pod might be deleted without ever being used, but I think that's better than leaving the desired and actual replica counts diverged forever.
mumoshu added a commit that referenced this pull request Mar 12, 2022
…UNNER_TOKEN injected twice

Since #1179, runner pods managed by RunnerDeployment had duplicate RUNNER_NAME and RUNNER_TOKEN environment variables. This fixes that.
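
A minimal, hypothetical sketch of that kind of fix (not the actual ARC code): deduplicate the container's environment variables by name so RUNNER_NAME and RUNNER_TOKEN are only injected once.

```go
// Hypothetical sketch: keep only the first occurrence of each env var name.
package podutil

import corev1 "k8s.io/api/core/v1"

func dedupeEnvVars(in []corev1.EnvVar) []corev1.EnvVar {
	seen := make(map[string]bool, len(in))
	out := make([]corev1.EnvVar, 0, len(in))
	for _, e := range in {
		if seen[e.Name] {
			continue // drop the duplicate, e.g. a second RUNNER_TOKEN
		}
		seen[e.Name] = true
		out = append(out, e)
	}
	return out
}
```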
mumoshu added a commit that referenced this pull request Mar 12, 2022
I found that #1179 was unable to finish the rollout of a RunnerDeployment update (like a runner env update). It was able to create a new RunnerReplicaSet with the desired spec, but unable to tear down the older ones. This fixes that.
mumoshu added a commit that referenced this pull request Mar 13, 2022
I found that #1179 was unable to finish the rollout of a RunnerDeployment update (like a runner env update). It was able to create a new RunnerReplicaSet with the desired spec, but unable to tear down the older ones. This fixes that.
mumoshu added a commit that referenced this pull request Mar 13, 2022
It turned out that #1179 broke static runners so that they could no longer scale up at all when the desired replica count was updated.
This fixes that by correcting a short-circuit intended only for ephemeral runners so that it is no longer mistakenly triggered for static runners.
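
The gist of such a guard can be sketched hypothetically (names are illustrative, not the actual ARC code): the short-circuit that skips pod recreation must be gated on the runner being ephemeral, so static runners always follow the normal scale-up path.

```go
// Hypothetical sketch: only ephemeral runners are subject to the recreation
// short-circuit; static runners never are.
package scaling

func skipPodRecreation(ephemeral, completedAfterLastScaleUp bool) bool {
	if !ephemeral {
		// Static runners: a missing pod is always recreated to satisfy
		// the desired replica count.
		return false
	}
	return completedAfterLastScaleUp
}
```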
mumoshu added a commit that referenced this pull request Mar 13, 2022
#1179 did not work well for scaling down static (and perhaps long-running ephemeral) runners: some runner pods were terminated before the requested unregistration process completed, which caused some in-progress workflow jobs to hang forever. This fixes the edge case in which a decreased desired replica count triggered that failure, so that every runner is unregistered and then terminated, as originally designed.
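
The intended unregister-then-terminate order can be sketched hypothetically (the interface and names below are illustrative, not ARC's actual API): the pod is only deleted once the runner has been successfully unregistered from GitHub; otherwise the reconciler retries later.

```go
// Hypothetical sketch of the unregister-and-then-delete termination order.
package scaling

import (
	"context"
	"fmt"
)

type runnerRegistry interface {
	// RemoveRunner unregisters the runner from GitHub; it is expected to
	// fail while the runner is still busy running a job.
	RemoveRunner(ctx context.Context, name string) error
}

func scaleDownRunner(ctx context.Context, reg runnerRegistry, name string, deletePod func(context.Context, string) error) error {
	if err := reg.RemoveRunner(ctx, name); err != nil {
		// Likely still busy or a transient error: requeue instead of
		// force-deleting the pod and hanging the in-progress job.
		return fmt.Errorf("runner %s not unregistered yet: %w", name, err)
	}
	return deletePod(ctx, name)
}
```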
mumoshu added a commit that referenced this pull request Mar 13, 2022
…nners

Fix runner{set,deployment} rollouts and static runner scaling

I was testing static runners in preparation for cutting the next release of ARC, v0.22.0, and found several problems that I thought were worth fixing.

In particular, this PR fixes static runner reliability issues in two ways.

c4b24f8 fixes the issue that ARC gave up retrying RemoveRunner calls too early, especially for static runners, which resulted in static runners often being terminated prematurely while running jobs.
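
A minimal, hypothetical sketch of retrying for longer (the helper below is illustrative, not ARC's actual code): unregistration is retried for a bounded period instead of giving up after the first few failures.

```go
// Hypothetical sketch: keep retrying unregistration for a bounded period so a
// busy static runner is not terminated while it still runs a job.
package scaling

import (
	"context"
	"time"
)

func removeRunnerWithRetry(ctx context.Context, remove func(context.Context) error, retryDelay, deadline time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, deadline)
	defer cancel()
	for {
		if err := remove(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // give up only after the full deadline
		case <-time.After(retryDelay):
			// The runner was likely still busy; try again.
		}
	}
}
```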

791634f fixes the issue that ARC was unable to scale up any static runners when the corresponding desired replica count in e.g. RunnerDeployment was updated. It was caused by a bug in the mechanism that is intended to prevent ephemeral runners from being recreated in unwanted circumstances but was mistakenly triggered for static runners as well.

Since #1179, RunnerDeployment was not able to gracefully terminate old RunnerReplicaSets on update. c612e87 fixes that by changing RunnerDeployment to first scale old RunnerReplicaSet(s) down to zero, wait for the sync, and only then set the deletion timestamp. That way, RunnerDeployment can ensure that every old RunnerReplicaSet being deleted has already been scaled to zero through the standard unregister-and-then-delete runner termination process.

This also revealed a hidden bug in #1179 where the scale-to-zero-before-RunnerReplicaSet-termination sometimes did not work as intended. 4551309 fixes that, so that RunnerDeployment can actually terminate old RunnerReplicaSets gracefully.
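
The scale-to-zero-then-delete rollout described above can be sketched hypothetically (the interface and function are illustrative, not ARC's actual code): each old RunnerReplicaSet is first scaled to zero, the controller waits until its runners are actually gone, and only then is the object deleted.

```go
// Hypothetical sketch of the graceful teardown order for old RunnerReplicaSets.
package rollout

import "context"

type replicaSet interface {
	DesiredReplicas() int
	ReadyReplicas() int
	SetDesiredReplicas(ctx context.Context, n int) error
	Delete(ctx context.Context) error
}

// teardownOld returns done=false while any old replica set still has to be
// scaled down or drained; the caller requeues and retries on the next
// reconcile until everything is gone.
func teardownOld(ctx context.Context, old []replicaSet) (done bool, err error) {
	done = true
	for _, rs := range old {
		if rs.DesiredReplicas() > 0 {
			if err := rs.SetDesiredReplicas(ctx, 0); err != nil {
				return false, err
			}
			done = false // wait for the scale-down to be observed
			continue
		}
		if rs.ReadyReplicas() > 0 {
			done = false // runners still unregistering and terminating
			continue
		}
		if err := rs.Delete(ctx); err != nil {
			return false, err
		}
	}
	return done, nil
}
```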