This repository has been archived by the owner on Sep 18, 2020. It is now read-only.

update-agent: Added reboot-wait parameter #192

Open
wants to merge 1 commit into master

Conversation

@johannwagner commented Jun 19, 2019

This adds a reboot-wait parameter, which waits a fixed amount of time after the last pod
has terminated, so that operations can finalize before the reboot. This solves some problems
with storage provisioners like Rook.

/cc: @martin31821

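For context, here is a minimal sketch of where such a wait would sit in the agent's drain-then-reboot path. This is not the actual patch; `drainNode` and `reboot` are made-up stand-ins for the agent's existing logic:

```go
package main

import (
	"flag"
	"log"
	"time"
)

// drainNode and reboot stand in for the agent's existing drain and
// reboot logic; both names are hypothetical for this sketch.
func drainNode() { log.Println("draining node: evicting pods") }
func reboot()    { log.Println("rebooting node") }

func main() {
	rebootWait := flag.Duration("reboot-wait", 0,
		"extra time to wait after the last pod terminates, before rebooting")
	flag.Parse()

	drainNode()

	if *rebootWait > 0 {
		// Give storage provisioners such as Rook time to unmap and
		// unlock volumes once their pods are gone.
		time.Sleep(*rebootWait)
	}

	reboot()
}
```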
@coreosbot

Can one of the admins verify this patch?

@lucab (Contributor) commented Jun 26, 2019

@johannwagner thanks for the PR! However, this lacks quite a bit of accompanying context, and on the surface it seems to just be hiding an existing race.

What is the problem this is trying to solve related to rook? How do you determine what is a good reboot-wait time (and is there some upper-bound guarantee on the value)? Why isn't pod draining sufficient here?

@embik commented Jun 26, 2019

Hi @lucab, I'm not the author, but I hope I can provide some context as well.

This seems useful because of situations like #191: beyond pod termination there are downstream operations that CLUO does not fully account for at the moment, e.g. unmounting a volume so it can be attached to another node, which can take a short period of time. I suppose that's why it's useful for Rook.

Implementing the reboot-wait parameter would be a bit hacky, but it's a nice way around implementing checks for all possible downstream operations (which might even be proprietary).

@johannwagner (Author)

Yep, that's the reason. We want to be able to shut down our Ceph cluster properly, which only works if everything shuts down cleanly after the last pod has terminated. @embik has the right context.

@dghubble (Member)

If a drain isn't sufficient for your situation, CLUO supports before-reboot (and after-reboot) required annotations. A custom DaemonSet can run before/after a reboot to perform custom cleanup/setup actions (e.g. waiting a fixed period, implementing a Ceph shutdown, etc.) and then add the required annotation, which CLUO awaits.

https://github.com/coreos/container-linux-update-operator/blob/master/doc/before-after-reboot-checks.md
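For illustration, such a hook pod could run its cleanup and then annotate its own node so the operator (configured via its reboot-annotation flags) proceeds. A minimal sketch using modern client-go; the annotation key, the NODE env var, and the 30s wait are examples for this sketch, not CLUO defaults:

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// NODE is expected to be injected via the downward API
	// (valueFrom: fieldRef: spec.nodeName).
	node := os.Getenv("NODE")

	// Placeholder for site-specific cleanup, e.g. waiting for
	// Ceph/rbd to release volumes.
	time.Sleep(30 * time.Second)

	// Set the annotation the operator was configured to await;
	// the key below is made up for this example.
	patch := []byte(`{"metadata":{"annotations":{"example.com/ceph-cleanup-ok":"true"}}}`)
	if _, err := client.CoreV1().Nodes().Patch(context.TODO(), node,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Printf("annotated node %s", node)
}
```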

@martin31821

@dghubble We are already using custom wait parameters to cleanly unmount and reboot nodes, but at least the rbd driver does some internal network-related tasks which we can't wait for at the moment.

@cgwalters (Member)

I think this would probably be better done as a systemd unit with an ExecStop= that would run at shutdown time and call some API that waits for whatever Rook/Ceph needs to do.
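As a sketch of that idea (unit name, script path, and ordering are assumptions, not a tested Rook integration): because systemd stops units in reverse start order, a oneshot unit ordered after the network target gets its ExecStop= run while the network is still up.

```ini
# /etc/systemd/system/ceph-cleanup.service (hypothetical)
[Unit]
Description=Wait for Ceph/RBD to release volumes before shutdown
# Stopped in reverse start order, so ExecStop= runs before the
# network is torn down.
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/true
# Hypothetical script that blocks until rbd unmaps complete.
ExecStop=/opt/bin/wait-for-rbd-release.sh

[Install]
WantedBy=multi-user.target
```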

@martin31821

If you follow these instructions https://github.com/rook/rook/blob/master/Documentation/container-linux.md on a CoreOS cluster, the nodes hang during shutdown: systemd brings the network down before the rbd module can complete its unmount and unlock the objects/drive. That's why our initial implementation used a delay of approx. 30s to ensure this has finished.

Basically, the before-reboot hook runs before the update operator drains the node, and maybe it would be good to have the same hook mechanism executed after-drain-before-reboot.

WDYT?

@embik commented Jun 27, 2019

@dghubble in that case the custom DaemonSet would be required to implement draining the node itself, because in the scenario described we need drain -> wait -> reboot, not wait -> drain -> reboot. That means it would reimplement logic already present in CLUO: the workflow would be drain (custom logic) -> wait (custom logic) -> already cordoned, no drain (CLUO) -> reboot (CLUO).

Honestly, if I need to write custom logic and deploy a custom DaemonSet and all that, I'd probably just fork CLUO and apply this patch, because it solves the problem well enough. I could do that, no problem; I just think you should be aware that this patch could be helpful for production users.
