
Introduce pod deletion timeout and forcefully delete stuck pods preventing the reconciler from doing its job when a node crashes or becomes unresponsive #307

Merged
merged 1 commit into actions:master
Feb 15, 2021

Conversation

@jonico (Contributor) commented Feb 14, 2021

  • if a k8s node becomes unresponsive, the kube-controller-manager will soft-delete all of its pods after the pod eviction timeout (default 5 minutes)
  • as long as the node stays unresponsive, those pods never leave their last reported status, so the runner controller assumes the pods are fine and never creates replacements
  • this can leave the runner replica set believing those runners are still running and ready (not busy, but expected to come back soon), so the horizontal autoscaler schedules no further runners / pods, resulting in a broken runner deployment until the runnerreplicaset is deleted or the node comes back online
  • introduce a pod deletion timeout (1 minute) after which the runner controller restarts the runner and creates a replacement pod on a working node
  • forcefully delete and requeue pods that have been stuck in Terminating state for more than one minute
  • gracefully handle the race condition where the pod finally gets forcefully deleted within the (next) reconciliation loop

jonico added a commit to jonico/actions-runner-controller that referenced this pull request Feb 14, 2021
* ... otherwise it will take 40 seconds (until a node is detected as unreachable) + 5 minutes (until pods are evicted from unreachable/crashed nodes)
* pods stuck in "Terminating" status on unreachable nodes will only be freed once actions#307 gets merged
@jonico force-pushed the master branch 2 times, most recently from c19f5ef to 22474f9 on February 14, 2021 at 22:31
@mumoshu (Collaborator) left a comment

LGTM. Impressive job @jonico!

@mumoshu mumoshu merged commit 9c8d730 into actions:master Feb 15, 2021
mumoshu pushed a commit that referenced this pull request Feb 15, 2021