This repository has been archived by the owner on Sep 18, 2020. It is now read-only.

update-agent: Added reboot-wait parameter #192

Open
wants to merge 1 commit into master

Conversation

@johannwagner commented Jun 19, 2019

This adds a reboot-wait parameter, which waits a fixed amount of time after the last pod
has terminated, so that operations can finalize before the reboot. This solves some problems
with storage provisioners like Rook.

/cc: @martin31821

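For context, here is a minimal sketch of where such a wait would sit in the agent's drain-then-reboot path. This is not the actual patch; `drainNode` and `reboot` are made-up stand-ins for the agent's existing logic:

```go
package main

import (
	"flag"
	"log"
	"time"
)

// drainNode and reboot stand in for the agent's existing drain and
// reboot logic; both names are hypothetical for this sketch.
func drainNode() { log.Println("draining node: evicting pods") }
func reboot()    { log.Println("rebooting node") }

func main() {
	rebootWait := flag.Duration("reboot-wait", 0,
		"extra time to wait after the last pod terminates, before rebooting")
	flag.Parse()

	drainNode()

	if *rebootWait > 0 {
		// Give storage provisioners such as Rook time to unmap and
		// unlock volumes once their pods are gone.
		time.Sleep(*rebootWait)
	}

	reboot()
}
```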
@coreosbot

Can one of the admins verify this patch?

@lucab (Contributor) commented Jun 26, 2019

@johannwagner thanks for the PR! However, this lacks quite a bit of accompanying context, and on the surface it seems to just be hiding an existing race.

What is the problem this is trying to solve related to rook? How do you determine what is a good reboot-wait time (and is there some upper-bound guarantee on the value)? Why isn't pod draining sufficient here?

@embik commented Jun 26, 2019

Hi @lucab, I'm not the author, but I hope I can provide some context as well.

This seems useful because of situations like #191: beyond pod termination there are downstream operations that CLUO does not fully account for at the moment, e.g. unmounting a volume so it can be attached to another node, which can take a short period of time. I suppose that's why it's useful for Rook.

Implementing the reboot-wait parameter would be a bit hacky, but it's a nice way around implementing checks for all possible downstream operations (which might even be proprietary).

@johannwagner (Author)

Yep, that's the reason. We want to be able to shut down our Ceph cluster properly, which only works if everything shuts down cleanly after the last pod has terminated. @embik has the right context.

@dghubble (Member)

If a drain isn't sufficient for your situation, CLUO supports before-reboot (and after-reboot) required annotations. A custom DaemonSet can run before/after a reboot to perform custom cleanup/setup actions (e.g. waiting a fixed period, implementing a Ceph shutdown, etc.) and then add the required annotation, which CLUO awaits.

https://github.com/coreos/container-linux-update-operator/blob/master/doc/before-after-reboot-checks.md
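For illustration, such a hook pod could run its cleanup and then annotate its own node so the operator (configured via its reboot-annotation flags) proceeds. A minimal sketch using modern client-go; the annotation key, the NODE env var, and the 30s wait are examples for this sketch, not CLUO defaults:

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// NODE is expected to be injected via the downward API
	// (valueFrom: fieldRef: spec.nodeName).
	node := os.Getenv("NODE")

	// Placeholder for site-specific cleanup, e.g. waiting for
	// Ceph/rbd to release volumes.
	time.Sleep(30 * time.Second)

	// Set the annotation the operator was configured to await;
	// the key below is made up for this example.
	patch := []byte(`{"metadata":{"annotations":{"example.com/ceph-cleanup-ok":"true"}}}`)
	if _, err := client.CoreV1().Nodes().Patch(context.TODO(), node,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Printf("annotated node %s", node)
}
```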

@martin31821

@dghubble We are already using custom wait parameters to cleanly unmount and reboot nodes, but at least the rbd driver does some internal network-related tasks which we can't wait for at the moment.

@cgwalters (Member)

I think this would probably be better done as a systemd unit with an ExecStop= that would run at shutdown time and call some API that waits for whatever Rook/Ceph needs to do.
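As a sketch of that idea (unit name, script path, and ordering are assumptions, not a tested Rook integration): because systemd stops units in reverse start order, a oneshot unit ordered after the network target gets its ExecStop= run while the network is still up.

```ini
# /etc/systemd/system/ceph-cleanup.service (hypothetical)
[Unit]
Description=Wait for Ceph/RBD to release volumes before shutdown
# Stopped in reverse start order, so ExecStop= runs before the
# network is torn down.
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/true
# Hypothetical script that blocks until rbd unmaps complete.
ExecStop=/opt/bin/wait-for-rbd-release.sh

[Install]
WantedBy=multi-user.target
```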

@martin31821

If you follow these instructions https://github.com/rook/rook/blob/master/Documentation/container-linux.md on a CoreOS cluster, the nodes hang during shutdown: systemd brings the network down before the rbd module can complete its unmount and unlock the objects/drive. That's why our initial implementation used a delay of approx. 30s to ensure this has finished.

Basically, the before-reboot hook runs before the update operator drains the node, and maybe it would be good to have the same hook mechanism executed after-drain-before-reboot.

WDYT?

@embik commented Jun 27, 2019

@dghubble in that case the custom DaemonSet would be required to implement draining the node itself, because in the scenario described we need drain -> wait -> reboot, not wait -> drain -> reboot. That means it would reimplement logic already present in CLUO: the workflow would be drain (custom logic) -> wait (custom logic) -> already cordoned, no drain (CLUO) -> reboot (CLUO).

Honestly, if I need to write custom logic and deploy a custom DaemonSet and all that, I'd probably just fork CLUO and apply this patch, because it solves the problem well enough. I could do that, no problem; I just think you should be aware that this patch could be helpful for production users.
