Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix runner{set,deployment} rollouts and static runner scaling #1214

Merged
merged 13 commits into from
Mar 13, 2022

Conversation

mumoshu
Copy link
Collaborator

@mumoshu mumoshu commented Mar 13, 2022

I was testing static runners as a preparation to cut the next release of ARC, v0.22.0, and found several problems that I thought worth being fixed.

In particular, this PR fixes static runners reliability issues in two means.

c4b24f8 fixes the issue that ARC gives up retrying RemoveRunner calls too early, especially on static runners, that resulted in static runners to often get terminated prematurely while running jobs.

791634f fixes the issue that ARC was unable to scale up any static runners when the corresponding desired replicas number in e.g. RunnerDeployment gets updated. It was caused by a bug in the mechanism that is intended to prevent ephemeral runners from being recreated in unwanted circumstances, mistakenly triggered also for static runners.

Since #1179, RunnerDeployment was not able to gracefully terminate old RunnerReplicaSet on update. c612e87 fixes that by changing RunnerDeployment to firstly scale old RunnerReplicaSet(s) down to zero and waits for sync, and set the deletion timestamp only after that. That way, RunnerDeployment can ensure that all the old RunnerReplicaSets that are being deleted are already scaled to zero passing the standard unregister-and-then-delete runner termination process.

It revealed a hidden bug in #1179 that sometimes the scale-to-zero-before-runnerreplicaset-termination does not work as intended. 4551309 fixes that, so that RunnerDeployment can actually terminate old RunnerReplicaSets gracefully.

The unregister timeout of 1 minute (no matter how long it is) can negatively impact availability of static runner constantly running workflow jobs, and ephemeral runner that runs a long-running job.
We deal with that by completely removing the unregistaration timeout, so that regarldess of the type of runner(static or ephemeral) it waits forever until it successfully to get unregistered before being terminated.
It turned out that #1179 broke static runners in a way it is no longer able to scale up at all when the desired replicas is updated.
This fixes that by correcting a certain short-circuit that is intended only for ephemeral runners to not mistakenly triggered for static runners.
…nating it

so that hopefully RunnerDeployment can gracefully termiante older RunnerReplicaSet on update.
#1179 was not working particularly for scale down of static (and perhaps long-running ephemeral) runners, which resulted in some runner pods are terminated before the requested unregistration processes complete, that triggered some in-progress workflow jobs to hang forever. This fixes an edge-case that resulted in a decreased desired replicas to trigger the failure, so that every runner is unregistered then terminated, as originally designed.
As GitHub App does not allow ARC to access enterprise runner related API endpoints, like the create-registration-token API.
@mumoshu mumoshu changed the title Fix static runner scaling Fix runner{set,deployment} rollouts and static runner scaling Mar 13, 2022
@mumoshu mumoshu merged commit e7200f2 into master Mar 13, 2022
@mumoshu mumoshu deleted the fix-static-runners branch March 13, 2022 13:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant