Try to unconfig runner before deleting the pod to recreate #1125

Merged
merged 2 commits into actions:master from users/tihuang/racefix on Feb 19, 2022

Conversation

TingluoHuang
Member

@TingluoHuang commented on Feb 16, 2022

There is a race condition between ARC and the GitHub service around deleting a runner pod:

  • ARC uses the REST API to find that a particular runner in a pod is not running any jobs, so it decides to delete the pod.
  • A job is queued on the GitHub service side, and the service sends the job to this idle runner right before ARC deletes the pod.
  • ARC deletes the runner pod, which causes the in-progress job to end up canceled.

To avoid this race condition, I am calling r.unregisterRunner() before deleting the pod (as sketched below):

  • If r.unregisterRunner() returns 204, the runner has been deleted from the GitHub service, so we are safe to delete the pod.
  • If r.unregisterRunner() returns 400, the runner is still running a job, so we leave the runner pod as it is.
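For illustration, here is a minimal sketch of this ordering. It is not ARC's actual code; unregisterRunner, deletePod, and maybeRecreateRunnerPod are hypothetical stand-ins for the controller's own helpers.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

// unregisterRunner stands in for the call to GitHub's "delete a self-hosted
// runner" REST endpoint; it returns the HTTP status code of the response.
func unregisterRunner(ctx context.Context, runnerID int64) (int, error) {
	return http.StatusNoContent, nil // placeholder
}

// deletePod stands in for deleting the runner pod through the Kubernetes API.
func deletePod(ctx context.Context, name string) error {
	return nil // placeholder
}

// maybeRecreateRunnerPod unregisters the runner first and only deletes the pod
// once the service confirms the runner is gone.
func maybeRecreateRunnerPod(ctx context.Context, runnerID int64, podName string) error {
	status, err := unregisterRunner(ctx, runnerID)
	if err != nil {
		return err
	}
	switch status {
	case http.StatusNoContent:
		// 204: the runner was removed on the GitHub side, so no new job can be
		// assigned to it; deleting the pod is safe.
		return deletePod(ctx, podName)
	case http.StatusBadRequest:
		// 400: the runner is still running a job; leave the pod as it is and
		// retry on a later reconciliation.
		return nil
	default:
		return fmt.Errorf("unexpected status %d while unregistering runner %d", status, runnerID)
	}
}

func main() {
	_ = maybeRecreateRunnerPod(context.Background(), 42, "example-runner-pod")
}
```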

TODO: I need to do some E2E tests to force the race condition to happen.

Ref #911

@mumoshu
Collaborator

mumoshu commented Feb 17, 2022

@TingluoHuang Hey! Thanks a lot for your contribution.

I completely forgot that the deletion of a runner on scale-down is usually handled by the processRunnerDeletion function in the same Reconcile process, rather than in the place you patched.

processRunnerDeletion works in concert with Kubernetes's built-in cascade-deletion of resources. It first unregisters the runner, then removes the K8s resource finalizer (a marker that prevents K8s from cascade-deleting the object), so that after the reconciliation succeeds, K8s detects that the runner has no finalizer and proceeds to cascade-delete the runner and the runner pod.
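As a rough, hypothetical sketch of that unregister-then-remove-finalizer ordering (the runner type, finalizer name, and helpers below are simplified stand-ins, not ARC's real API or the actual processRunnerDeletion implementation):

```go
package main

import (
	"context"
	"fmt"
)

const runnerFinalizer = "example.com/runner-finalizer" // hypothetical finalizer name

// runner is a simplified stand-in for ARC's Runner custom resource.
type runner struct {
	name       string
	finalizers []string
}

// unregisterRunner stands in for removing the runner from the GitHub service;
// updateResource stands in for persisting the updated object to Kubernetes.
func unregisterRunner(ctx context.Context, r *runner) error { return nil }
func updateResource(ctx context.Context, r *runner) error   { return nil }

// processRunnerDeletion unregisters the runner first, then drops the
// finalizer. Once the object has no finalizers left, Kubernetes proceeds to
// cascade-delete the runner and its pod.
func processRunnerDeletion(ctx context.Context, r *runner) error {
	if err := unregisterRunner(ctx, r); err != nil {
		return fmt.Errorf("unregister %s: %w", r.name, err)
	}
	kept := make([]string, 0, len(r.finalizers))
	for _, f := range r.finalizers {
		if f != runnerFinalizer {
			kept = append(kept, f)
		}
	}
	r.finalizers = kept
	return updateResource(ctx, r)
}

func main() {
	r := &runner{name: "example-runner", finalizers: []string{runnerFinalizer}}
	_ = processRunnerDeletion(context.Background(), r)
}
```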

Your patch would instead fix a race issue when the runner pod needs to be recreated (due to a token update, a possible registration timeout, or recreating the ephemeral runner's pod after it stopped by itself). It's still a nice thing to have, so I'm definitely going to merge this soon!

But note that this fix can be incomplete, as we apparently saw that the usual runner pod termination process (processRunnerDeletion), which already handles deletion the way you do in this PR, doesn't seem to be race-free.

That said, after this PR gets merged, we might need to implement a much longer grace period between the time ARC unregisters the runner and the time it deletes the runner pod, as we've been discussing at #1085 (comment).

@mumoshu changed the title from "Try to unconfig runner before delete the pod." to "Try to unconfig runner before deleting the pod to recreate" on Feb 17, 2022
@mumoshu
Collaborator

mumoshu commented Feb 19, 2022

> But note that this fix can be incomplete, as we apparently saw that the usual runner pod termination process (processRunnerDeletion), which already handles deletion the way you do in this PR, doesn't seem to be race-free.

I noticed that this may have been my mistake. If we can safely assume that GitHub Actions guarantees that no runner will be assigned a new job after a successful RemoveRunner call, this can be a sufficient fix for the race issue.

cc/ @jbkc85

@mumoshu merged commit 0b9bef2 into actions:master on Feb 19, 2022
@TingluoHuang deleted the users/tihuang/racefix branch on February 20, 2022 at 03:09