
fix: Do recreate runner pod earlier on registration token update #1087

Merged: mumoshu merged 1 commit into master from trigger-runner-recreation-on-token-update on Feb 19, 2022

Conversation

mumoshu (Collaborator) commented Feb 2, 2022

Apparently, we've been failing to take an updated registration token into account when generating the pod template hash, which is used to detect whether the runner pod needs to be recreated.

This shouldn't have been the end of the world, since the runner pod is recreated on the next reconciliation loop anyway, but this change makes the pod recreation happen one reconciliation loop earlier, so you're less likely to get runner pods with outdated registration tokens.

Ref #1085 (comment)
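
For illustration, here's a minimal sketch of the idea, using hypothetical helper names rather than ARC's actual hashing code: once the registration token is part of the input to the pod template hash, a token refresh changes the hash and gets detected as a template change.

```go
// Minimal sketch (hypothetical names, not ARC's actual code): fold the
// registration token into the pod template hash so a token refresh is
// detected as a template change.
package sketch

import (
	"crypto/sha256"
	"encoding/hex"
)

// podTemplateHash hashes whatever already feeds the template hash (here a
// serialized pod template) together with the current registration token.
func podTemplateHash(serializedTemplate, registrationToken string) string {
	h := sha256.New()
	h.Write([]byte(serializedTemplate))
	h.Write([]byte(registrationToken))
	return hex.EncodeToString(h.Sum(nil))[:10]
}
```

If the hash stored on the running pod differs from `podTemplateHash(template, latestToken)`, the controller recreates the pod on the current loop instead of waiting for the next one.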

@mumoshu changed the title from "fix: Do recreate runner pod on registration token update" to "fix: Do recreate runner pod earlier on registration token update" on Feb 2, 2022
jbkc85 commented Feb 7, 2022

One thing I noticed here, @mumoshu: with the updated registration token, is it going to trigger a restart in the middle of a pod running? If so, we just have another potential active runner deletion.

mumoshu (Collaborator, Author) commented Feb 8, 2022

is it going to trigger a restart in the middle of a pod running? If so, we just have another potential active runner deletion.

@jbkc85 Yes, you're absolutely right!

To be clear, as ARC has no way to atomically delete the runner if and only if the runner isn't busy, actions/runner needs to be able to gracefully stop at any time. (I.e. polling the "list runners" API to detect whether the runner is busy and deleting the runner only after that isn't a solution; it's still subject to a race condition.)

This fixes the issue that the runner takes too much time to restart even after the token has already expired.
This doesn't fix the issue that actions/runner is unable to stop gracefully at any time.
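
To make the race concrete, here is a rough sketch (hypothetical client interface, not ARC's code) of the check-then-delete pattern and where it breaks:

```go
// Sketch of the racy check-then-delete pattern described above.
package sketch

import "context"

// RunnerAPI is a hypothetical stand-in for the GitHub runner API calls.
type RunnerAPI interface {
	IsRunnerBusy(ctx context.Context, name string) (bool, error)
	DeleteRunner(ctx context.Context, name string) error
}

func deleteIfIdle(ctx context.Context, api RunnerAPI, name string) error {
	busy, err := api.IsRunnerBusy(ctx, name) // 1. observe "not busy"
	if err != nil || busy {
		return err
	}
	// Race window: GitHub can assign a job to the runner right here,
	// between the check and the delete.
	return api.DeleteRunner(ctx, name) // 2. delete a runner that may now be busy
}
```

That window is why actions/runner itself has to tolerate being stopped at any time.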

jbkc85 commented Feb 8, 2022

This fixes the issue that the runner takes too much time to restart even after the token has already expired.

Got it! That makes perfect sense.

jbkc85 commented Feb 9, 2022

@mumoshu - I am not sure this is related but....

https://github.com/actions-runner-controller/actions-runner-controller/blob/b1ea9099c198cf4ac2054993f22e3f8769b4d1bb/controllers/runner_controller.go#L162

With ephemeral runners, there is a possible race condition here, in which a Completed ephemeral runner is triggered to restart because it's marked as stopped. I think we can add the IsEphemeral flag here too to avoid potentially unnecessary restarts?
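
A hypothetical sketch of the suggested check, assuming an Ephemeral flag on the runner spec (names are illustrative, not the actual code at the linked line):

```go
// Sketch of the suggested condition: a Completed pod is the normal end state
// for an ephemeral runner, so only restart stopped pods of non-ephemeral runners.
package sketch

import corev1 "k8s.io/api/core/v1"

type runnerSpec struct {
	Ephemeral bool // assumed flag corresponding to IsEphemeral
}

func needsRestart(spec runnerSpec, pod corev1.Pod) bool {
	stopped := pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed
	return stopped && !spec.Ephemeral
}
```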

mumoshu (Collaborator, Author) commented Feb 16, 2022

@jbkc85 Good catch! Yeah, we'd definitely better take IsEphemeral into account to prevent potentially unnecessary restarts.

mumoshu (Collaborator, Author) commented Feb 17, 2022

@jbkc85 Ah sorry, I'm pretty confused.

From my observation, the "runner" container for an ephemeral runner does exit with code 0 (without restarting by itself) when it finishes running a job.

We should definitely recreate the pod in that case. Otherwise, the pod gets stuck forever and you'll end up with zero ephemeral runner pods actually running the runner containers, right?

Assuming that's right, the assumption behind the fix I made in #1085 turns out to be false.

We do need to recreate the ephemeral runner pod in the said condition, so the change made in #1085 is useful.

Instead, we'd introduce a grace period between runner unregistration and pod deletion, as we discussed, and that will fix it. Could you confirm?
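
A minimal sketch of that grace-period idea (hypothetical names and duration, assuming a controller-runtime reconciler; not ARC's implementation):

```go
// Sketch: after unregistering the runner, requeue until the grace period has
// elapsed before deleting the pod, instead of deleting it immediately.
package sketch

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

const unregistrationGracePeriod = 60 * time.Second // assumed value

// reconcileDeletion decides whether it's time to delete the runner pod.
// unregisteredAt is when the runner was unregistered from GitHub (e.g. read
// back from an annotation on the pod).
func reconcileDeletion(unregisteredAt time.Time) (ctrl.Result, bool) {
	elapsed := time.Since(unregisteredAt)
	if elapsed < unregistrationGracePeriod {
		// Not yet: requeue and check again once the grace period has passed.
		return ctrl.Result{RequeueAfter: unregistrationGracePeriod - elapsed}, false
	}
	// Grace period elapsed: safe(r) to delete the pod now.
	return ctrl.Result{}, true
}
```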

mumoshu (Collaborator, Author) commented Feb 19, 2022

ARC has no way to atomically delete the runner if and only if the runner isn't busy

To be clear, this turned out to be my mistake. RemoveRunner seems like the API to atomically delete the runner. See #1125 (comment)
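
Assuming this refers to the repository-level runner deletion endpoint, a sketch with google/go-github might look like the following; the relevant property, per the comment above, is that the call itself is the atomic operation rather than a separate busy check followed by a delete.

```go
// Sketch using google/go-github's repository-level API (assumed to be the
// RemoveRunner mentioned above). Per the comment, the deletion is atomic:
// it returns an error instead of removing a runner that is mid-job.
package sketch

import (
	"context"

	"github.com/google/go-github/v45/github"
)

func removeRunner(ctx context.Context, client *github.Client, owner, repo string, runnerID int64) error {
	_, err := client.Actions.RemoveRunner(ctx, owner, repo, runnerID)
	return err // non-nil if GitHub refuses to delete the runner
}
```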

@mumoshu merged commit 921f547 into master on Feb 19, 2022
@mumoshu deleted the trigger-runner-recreation-on-token-update branch on February 19, 2022 at 12:18