Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce "No runner matching the specified labels was found" errors while runner replacement #392

Merged

Conversation

mumoshu
Copy link
Collaborator

@mumoshu mumoshu commented Mar 15, 2021

We occasionally encountered those errors while the underlying RunnerReplicaSet is being recreated/replaced on RunnerDeployment.Spec.Template update. It turned out to be due to that the RunnerDeployment controller was waiting for the runner pod becomes Running, intead of the new replacement runner to have registered to GitHub. This fixes that, by trying to Runner.Status.Phase to Running only after the runner in the runner pod appears to be registered.

A side-effect of this change is that runner controller would call more "ListRunners" GitHub Actions API. I've reviewed and improved the runner controller code and Runner CRD to make make the number of calls minimum. In most cases, ListRunners should be called only twice for each runner creation.

…ion timeout check

Setting a short sync-period on actions-runner-controller had resulted in a series of paginated list-runners GitHub API calls for each sync on every idle runner. This was not necessarily. The original reason we wanted to call the GitHub API here was to determine if the runner needs to be recreated due to the registration timeout. A registration timeout happens only after 10 minutes after the pod creation so we do not need to check earlier than that. Since this change, the controller checks for the registration timeout only after approx 10 minutes regardless of how short the sync period is.
…le runner replacement

We occasionally encountered those errors while the underlying RunnerReplicaSet is being recreated/replaced on RunnerDeployment.Spec.Template update. It turned out to be due to that the RunnerDeployment controller was waiting for the runner pod becomes `Running`, intead of the new replacement runner to have registered to GitHub. This fixes that, by trying to Runner.Status.Phase to `Running` only after the runner in the runner pod appears to be registered.
…ve resync

Each runner resource is reconciled on any update in its spec or status, e.g. runner pod status update, and even runner status change triggered by the runner controller itself. As result, several ListRunners GitHub API have been called until single runner resource gets ready. Ideally it should be called only once or twice.

This updates the runner controller and runner status resource to reduce the number of calls to twice in most cases. RunnerStatus has been obtained the new `LastRegistrationCheckTime` field to enable the controller to do so.

I have also tweaked timings and contents of log messages emitted while a RunnerDeployment is being updated and RunnerReplicaSets and Runners are replaced. It should be more readable and less scary.
@mumoshu mumoshu merged commit 3f23501 into master Mar 16, 2021
@mumoshu mumoshu deleted the runnerreplicaset-replacement-perserving-available-runners branch March 16, 2021 01:52
mumoshu added a commit that referenced this pull request Mar 18, 2021
mumoshu added a commit that referenced this pull request Mar 19, 2021
Since #392, the runner controller could have taken unexpectedly long time until it finally notices that the runner has been registered to GitHub. This patch fixes the issue, so that the controller will notice the successful registration in approximately 1 minute(hard-coded).

More concretely, let's say you had configured a long sync-period of like 10m, the runner controller could have taken approx 10m to notice the successful registration. The original expectation was 1m, because it was intended to recheck every 1m as implemented in #392. It wasn't working as such due to my misunderstanding in how requeueing work.
mumoshu added a commit that referenced this pull request Mar 19, 2021
Since #392, the runner controller could have taken unexpectedly long time until it finally notices that the runner has been registered to GitHub. This patch fixes the issue, so that the controller will notice the successful registration in approximately 1 minute(hard-coded).

More concretely, let's say you had configured a long sync-period of like 10m, the runner controller could have taken approx 10m to notice the successful registration. The original expectation was 1m, because it was intended to recheck every 1m as implemented in #392. It wasn't working as such due to my misunderstanding in how requeueing work.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant