
Introduce pod deletion timeout and forcefully delete stuck pods preventing the reconciler from doing its job when a node crashes or becomes unresponsive #307

Merged
merged 1 commit into actions:master
Feb 15, 2021

Conversation

@jonico (Contributor) commented Feb 14, 2021

  • if a k8s node becomes unresponsive, the kube-controller-manager will soft-delete all of its pods after the pod eviction timeout (default 5 minutes)
  • as long as the node stays unresponsive, those pods never leave their last reported status, so the runner controller assumes the pods are fine and never creates replacements
  • this can leave the runner replica set believing those runners are still running and ready (not busy, but expected to come back soon), so the horizontal autoscaler schedules no further runners / pods, resulting in a broken runner deployment until the runnerreplicaset is deleted or the node comes back online
  • introduce a pod deletion timeout (1 minute) after which the runner controller restarts the runner and creates a replacement pod on a working node
  • forcefully delete and requeue pods that have been stuck in Terminating state for more than one minute
  • gracefully handle the race condition where the pod finally gets forcefully deleted within the (next) reconciliation loop

jonico added a commit to jonico/actions-runner-controller that referenced this pull request Feb 14, 2021
* ... otherwise it will take 40 seconds (until a node is detected as unreachable) + 5 minutes (until pods are evicted from unreachable/crashed nodes)
* pods stuck in "Terminating" status on unreachable nodes will only be freed once actions#307 gets merged
@jonico force-pushed the master branch 2 times, most recently from c19f5ef to 22474f9 on February 14, 2021 at 22:31
@mumoshu (Collaborator) left a comment

LGTM. Impressive job @jonico!

@mumoshu mumoshu merged commit 9c8d730 into actions:master Feb 15, 2021
mumoshu pushed a commit that referenced this pull request Feb 15, 2021