update-agent: Added reboot-wait parameter #192
This adds a reboot-wait parameter, which waits a fixed amount of time after the last pod has terminated, so that pending operations can finish before the reboot. This solves some problems with storage provisioners like rook.
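To make the intended behaviour concrete, here is a minimal sketch of the wait logic described above. It is not the code from this patch: `waitThenReboot`, `recordingClock`, and the injected `podsRemaining`, `sleep`, and `reboot` functions are hypothetical names chosen for illustration, and the real agent talks to the Kubernetes API instead of calling a counter.

```go
package main

import (
	"fmt"
	"time"
)

// recordingClock is a hypothetical stand-in for time.Sleep that only records
// the requested durations, so the example runs instantly.
type recordingClock struct {
	slept []time.Duration
}

// Sleep records d instead of actually blocking.
func (c *recordingClock) Sleep(d time.Duration) {
	c.slept = append(c.slept, d)
}

// waitThenReboot is a hypothetical helper, not CLUO's actual code.
// It polls podsRemaining until no pods are left, then waits an additional
// fixed rebootWait so downstream work (e.g. volume unmounts) can finish,
// and only then calls reboot.
func waitThenReboot(
	podsRemaining func() int,
	pollInterval, rebootWait time.Duration,
	sleep func(time.Duration),
	reboot func(),
) {
	for podsRemaining() > 0 {
		sleep(pollInterval)
	}
	if rebootWait > 0 {
		sleep(rebootWait)
	}
	reboot()
}

func main() {
	clock := &recordingClock{}
	pods := 2
	// Simulated drain: one pod terminates per poll.
	podsRemaining := func() int {
		n := pods
		if pods > 0 {
			pods--
		}
		return n
	}
	waitThenReboot(podsRemaining, time.Second, 30*time.Second, clock.Sleep, func() {
		fmt.Println("rebooting after sleeps:", clock.slept)
	})
}
```

In a real agent, `time.Sleep` would be passed as `sleep`. Note that a fixed wait only narrows the race window; it does not guarantee that the storage driver has finished.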
Can one of the admins verify this patch?
@johannwagner thanks for the PR! However, this lacks quite a bit of accompanying context, and on the surface it seems to just be trying to hide an existing race. What is the problem this is trying to solve related to rook? How do you determine what is a good reboot-wait time (and is there some upper-bound guarantee on the value)? Why isn't pod draining sufficient here?
Hi @lucab, I'm not the author, but I hope I can provide some context as well. This seems to be useful because of situations like #191 - beyond pod termination there are some downstream operations (e.g., unmounting a volume because it is going to be attached to another node, which can take a short while; I suppose that's why it's useful for rook) that CLUO does not fully account for at the moment. Implementing the
Yep, that's the reason. We want to be able to shut down our Ceph cluster properly, which only works if everything shuts down cleanly after the last pod has terminated. @embik has the context right.
If a drain isn't sufficient for your situation, CLUO supports before-reboot (and after-reboot) required annotations. A custom DaemonSet can run before/after a reboot to perform custom cleanup/setup actions (e.g. waiting a fixed period, implementing a Ceph shutdown, etc.) and then add the required annotation, which CLUO awaits.
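As a rough illustration of the annotation handshake described above, the following sketch models a node's annotations as a plain map. Everything here is an assumption for illustration: `runHook`, `readyToReboot`, and the annotation key `example.com/storage-quiesced` are hypothetical, and a real hook would set the annotation on the Node object through the Kubernetes API.

```go
package main

import "fmt"

// readyToReboot reports whether every required annotation is present with
// the value "true". This mirrors the idea of the operator waiting for all
// required before-reboot annotations; it is not CLUO's actual code.
func readyToReboot(required []string, annotations map[string]string) bool {
	for _, key := range required {
		if annotations[key] != "true" {
			return false
		}
	}
	return true
}

// runHook is a hypothetical before-reboot hook: it runs the custom cleanup
// and sets the annotation only if the cleanup succeeded.
func runHook(annotations map[string]string, key string, cleanup func() error) error {
	if err := cleanup(); err != nil {
		return fmt.Errorf("before-reboot cleanup failed: %w", err)
	}
	annotations[key] = "true"
	return nil
}

func main() {
	// Hypothetical annotation key, chosen only for this example.
	const key = "example.com/storage-quiesced"
	required := []string{key}
	annotations := map[string]string{}

	fmt.Println("ready before hook:", readyToReboot(required, annotations))

	// The cleanup could wait a fixed period, unmount volumes, and so on.
	err := runHook(annotations, key, func() error { return nil })
	if err != nil {
		fmt.Println(err)
		return
	}

	fmt.Println("ready after hook:", readyToReboot(required, annotations))
}
```

The point of the design is that the reboot is gated on an explicit signal from the hook instead of on a timer, so the wait lasts exactly as long as the cleanup needs.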
@dghubble We are already using custom wait parameters to cleanly unmount and reboot nodes, but at least the rbd driver does some internal network-related tasks which we can't wait for at the moment.
I think this would probably be better done as a systemd unit with an
If you follow these instructions https://github.com/rook/rook/blob/master/Documentation/container-linux.md on a CoreOS cluster, the nodes will hang during shutdown, because systemd brings the network down before the rbd module is able to finish unmounting and unlocking the objects/drive. That's why our initial implementation used a delay of approx. 30s to ensure this finishes. Basically, the before-reboot hook runs before the update operator drains the node, and maybe it's good to have the same hook mechanic being executed. WDYT?
@dghubble in that case the custom DaemonSet would be required to implement draining the node, because in the scenario described we need to Honestly, if I need to write custom logic and deploy a custom DaemonSet and all that, I'd probably just fork CLUO and apply this patch because it solves the problem well enough. I could do that, no problem, I just think you should be aware this patch could be helpful for production users.
/cc: @martin31821