
Ensure update-agent waits for all volumes to be detached before rebooting #30

Open
invidian opened this issue Oct 30, 2020 · 8 comments · May be fixed by #169
Labels: component/agent (Agent-related issue), enhancement (New feature or request)

Comments

@invidian (Member)

Original issue: coreos/container-linux-update-operator#191

Perhaps waiting for `kubectl get volumeattachments` (with the right selector) to return an empty list would be sufficient?

@Jasper-Ben commented Feb 11, 2021

Scenario: running a Ceph cluster using the Rook operator. During drain, the volumes are detached, but it can take some time for this to propagate to the kernel unmount. I have not looked into the details, but according to @martin31821 this is caused by the ceph kernel client doing some work during unmount, so changing this from userspace is not possible. #62 introduces a quick workaround by simply adding some wait time after draining the node.

@martin31821

Maybe we can solve this by introducing the ability to run one or more Kubernetes Jobs prior to rebooting, which could be used e.g. to change DNS records, wait a certain amount of time, or run host commands before the reboot.

@invidian (Member Author)

Not ideal, but I guess we could test against that on Lokomotive, as we have a pipeline there testing FLUO and Rook together. CC @surajssd

@invidian (Member Author) commented Apr 6, 2021

Note: the existing hook support runs before the node is drained, which currently makes it impossible to deploy a custom hook that could ensure this. Perhaps this could be addressed.

@invidian (Member Author) commented Apr 7, 2021

As part of #37, I'm analyzing in detail how FLUO works, as there is no documentation or tests. What comes to mind is that perhaps the hooks model could be extended so that a workflow can run between each significant action taken, which would be:

  • before draining the node (currently supported)
  • before rebooting, but after draining the node (currently missing, requested by this issue)
  • after rebooting, but before uncordoning the node (currently supported)
  • after uncordoning? (currently missing/acts as a regular DaemonSet?)

However, the existing state-tracking model is overly complex, and right now I don't feel comfortable adding another step to it. Perhaps we should try to simplify it first, then extend it with the extra hook.
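The four hook points listed above could be modeled roughly like this. This is a hypothetical sketch of the idea, not FLUO's actual API; all names (`HookPoint`, `runHooks`, the constants) are invented for illustration.

```go
package main

import "fmt"

// HookPoint enumerates moments in the reboot workflow at which a
// user-supplied hook could run. Names are hypothetical, not FLUO's API.
type HookPoint int

const (
	PreDrain              HookPoint = iota // currently supported
	PostDrainPreReboot                     // currently missing; requested by this issue
	PostRebootPreUncordon                  // currently supported
	PostUncordon                           // currently missing
)

// Hook is a single user-supplied action; it returns an error on failure.
type Hook func() error

// runHooks executes every hook registered for the given point,
// stopping at the first failure.
func runHooks(hooks map[HookPoint][]Hook, p HookPoint) error {
	for _, h := range hooks[p] {
		if err := h(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	hooks := map[HookPoint][]Hook{
		PostDrainPreReboot: {
			func() error {
				fmt.Println("waiting for volumes to detach")
				return nil
			},
		},
	}
	fmt.Println(runHooks(hooks, PostDrainPreReboot))
}
```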

@Nuckal777

We are affected by this as well. A few seconds of sleep after draining, as in #62, would help mitigate it.

@invidian (Member Author) commented Jun 3, 2022

Just realized I think I hit this issue on my cluster as well 😄

invidian added a commit that referenced this issue Jun 13, 2022
This commit provides a PoC version of the agent waiting for all volumes
attached to the node to be detached as a step after draining the node.
Shutting down a Pod does not mean its volume has been detached: usually a
CSI agent runs as a DaemonSet on the node and takes care of detaching the
volume from the node after the pod shuts down.

This commit improves the rebooting experience: right now, if there is not
enough time for the CSI agent to detach the volumes from the node, the
node gets rebooted and pods using the attached volumes cannot be attached
to other nodes, which effectively increases the downtime for stateful
workloads.

This commit still requires tests and better interface for the users.

If someone wants to try this feature on their own cluster, I've
published the following image I've been testing with:

quay.io/invidian/flatcar-linux-update-operator:97c0dee50c807dbba7d2debc59b369f84002797e

Closes #30

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>
@invidian (Member Author)

Created PoC/draft PR to play around with this and things seem to improve nicely: #169.

invidian added a commit that referenced this issue Jan 11, 2023