
Ensure update-agent waits for all volumes to be detached before rebooting #30

Open
invidian opened this issue Oct 30, 2020 · 8 comments · May be fixed by #169
Labels: component/agent (Agent-related issue), enhancement (New feature or request)

Comments

@invidian (Member)

Original issue: coreos/container-linux-update-operator#191

Perhaps waiting for `kubectl get volumeattachments` (with the right selector) to return an empty list would be sufficient?

@Jasper-Ben commented Feb 11, 2021

Scenario: running a Ceph cluster using the Rook operator. During drain, the volumes are detached, but it can take some time for this to propagate to the kernel unmount. I have not looked into the details, but according to @martin31821 this is caused by the ceph kernel client doing some work during unmount, so changing this from userspace is not possible. #62 introduces a quick workaround by simply adding some wait time after draining the node.

@martin31821

Maybe we can solve this by introducing the ability to run one or more Kubernetes Jobs prior to rebooting, which could be used e.g. to change DNS records, wait a certain amount of time, or run host commands before the reboot.

@invidian (Member Author)

Not ideal, but I guess we could test against that on Lokomotive, as we have a pipeline there testing FLUO and Rook together. CC @surajssd

@invidian (Member Author) commented Apr 6, 2021

Note: the existing hook support runs before the node is drained, which currently makes it impossible to deploy a custom hook that could ensure this. Perhaps this could be addressed.

@invidian (Member Author) commented Apr 7, 2021

As part of #37, I'm analyzing in detail how FLUO works, as there is no documentation or tests. What comes to mind is that perhaps the hooks model could be extended so that a workflow can run between each significant action taken, which would be:

  • before draining the node (currently supported)
  • before rebooting, but after draining the node (currently missing, requested by this issue)
  • after rebooting, but before uncordoning the node (currently supported)
  • after uncordoning? (currently missing/acts as a regular DaemonSet?)

However, the existing state-tracking model is overly complex, and right now I don't feel comfortable adding another step to it. Perhaps we should try to simplify it first, then extend it with the extra hook.
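The four hook points listed above could be modeled roughly like this. This is a hypothetical sketch of the idea, not FLUO's actual API; all names (`HookPoint`, `runHooks`, the constants) are invented for illustration.

```go
package main

import "fmt"

// HookPoint enumerates moments in the reboot workflow at which a
// user-supplied hook could run. Names are hypothetical, not FLUO's API.
type HookPoint int

const (
	PreDrain              HookPoint = iota // currently supported
	PostDrainPreReboot                     // currently missing; requested by this issue
	PostRebootPreUncordon                  // currently supported
	PostUncordon                           // currently missing
)

// Hook is a single user-supplied action; it returns an error on failure.
type Hook func() error

// runHooks executes every hook registered for the given point,
// stopping at the first failure.
func runHooks(hooks map[HookPoint][]Hook, p HookPoint) error {
	for _, h := range hooks[p] {
		if err := h(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	hooks := map[HookPoint][]Hook{
		PostDrainPreReboot: {
			func() error {
				fmt.Println("waiting for volumes to detach")
				return nil
			},
		},
	}
	fmt.Println(runHooks(hooks, PostDrainPreReboot))
}
```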

@Nuckal777

We are affected by this as well. A few seconds of sleep after draining, as in #62, would help mitigate it.

@invidian (Member Author) commented Jun 3, 2022

Just realized I think I hit this issue on my cluster as well 😄

invidian added a commit that referenced this issue Jun 13, 2022
This commit provides a PoC version of the agent waiting for all volumes
attached to the node to be detached as a step after draining the node.
Shutting down a Pod does not mean its volume has been detached: usually a
CSI agent runs as a DaemonSet on the node and takes care of detaching the
volume from the node after the pod shuts down.

This commit improves the rebooting experience: right now, if there is not
enough time for the CSI agent to detach the volumes from the node, the
node gets rebooted and pods using the attached volumes cannot be attached
to other nodes, which effectively increases the downtime for stateful
workloads.

This commit still requires tests and better interface for the users.

If someone wants to try this feature on their own cluster, I've
published the following image I've been testing with:

quay.io/invidian/flatcar-linux-update-operator:97c0dee50c807dbba7d2debc59b369f84002797e

Closes #30

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>
@invidian (Member Author)

Created PoC/draft PR to play around with this and things seem to improve nicely: #169.

invidian added a commit that referenced this issue Jan 11, 2023