This repository has been archived by the owner on Sep 18, 2020. It is now read-only.

Persistent volumes in stuck state after reboot #191

Open
dustinmm80 opened this issue Mar 28, 2019 · 5 comments

@dustinmm80

It appears there is a race condition when using persistent volumes: the pod is deleted and the node is rebooted while the attached volume is still in the process of detaching. Once this happens, the persistent volume is stuck and must be manually removed and recreated.

I'm seeing this on AWS with EBS volumes.
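A minimal diagnostic sketch for spotting this state, assuming the Python `kubernetes` client and a working kubeconfig (not part of CLUO, just an aid): it compares the volumes each node reports as in use with those it reports as attached, so a volume that never finished detaching before the reboot stands out.

```python
# Diagnostic sketch (assumes the "kubernetes" Python client and a kubeconfig):
# print the volumes each node reports as in use vs. attached, to spot volumes
# that never finished detaching before the reboot.
from kubernetes import client, config

def print_node_volumes():
    config.load_kube_config()  # or config.load_incluster_config()
    core = client.CoreV1Api()
    for node in core.list_node().items:
        in_use = node.status.volumes_in_use or []
        attached = [v.name for v in (node.status.volumes_attached or [])]
        print(f"{node.metadata.name}:")
        print(f"  in use:   {in_use}")
        print(f"  attached: {attached}")

if __name__ == "__main__":
    print_node_volumes()
```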

@metalmatze
Member

I'm not sure how that relates to CLUO. Are you running its Pods with volumes?

@dustinmm80
Author

dustinmm80 commented Apr 2, 2019

No, not running CLUO with volumes. Is it possible that when the operator terminates the pods, it reboots before the PVs are properly detached?

@embik

embik commented Jun 26, 2019

Hey @dustinmm80, are you by any chance using the CSI implementation of EBS volumes?

I see something similar with the Cinder CSI driver, and I suspect it's related to VolumeAttachment resources, or rather the fact that the CSI components might not be fast enough to detach volumes before CLUO reboots the machine.

Just wanted to check in before investigating this.
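For the CSI case, the detach state lives in the VolumeAttachment objects, so one way to check whether a detach was still pending when the node rebooted is to list them. A minimal sketch, assuming the Python `kubernetes` client (a diagnostic aid only, not something CLUO does):

```python
# Diagnostic sketch: list VolumeAttachment objects with their node, PV and
# attach status, to see whether a detach was still pending at reboot time.
# Assumes the "kubernetes" Python client and a working kubeconfig.
from kubernetes import client, config

def list_volume_attachments():
    config.load_kube_config()
    storage = client.StorageV1Api()
    for va in storage.list_volume_attachment().items:
        pv = va.spec.source.persistent_volume_name
        attached = bool(va.status and va.status.attached)
        print(f"{va.metadata.name}: node={va.spec.node_name} pv={pv} attached={attached}")

if __name__ == "__main__":
    list_volume_attachments()
```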

@embik

embik commented Jun 26, 2019

Thinking about it, I'm not sure it's only related to CSI. But it's probably a major issue with StatefulSets, because Kubernetes won't create a new statefulset-example-0 on another node before the old one has finished deleting.

Only after the new StatefulSet pod is scheduled to a new node do the CSI components start churning and update the VolumeAttachment, which unmounts the volume on the old node and mounts it on the new one. But that process takes a few seconds, and by then the old node is already being rebooted.
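If that is indeed the cause, one conceivable workaround (a sketch only, not something CLUO implements) would be to gate the reboot until no VolumeAttachment for the node still reports being attached, giving the CSI driver time to finish the detach. The node name and timeout below are illustrative:

```python
# Hypothetical pre-reboot gate: wait until no VolumeAttachment for this node
# still reports attached=True, so the CSI driver has time to detach volumes.
# Assumes the "kubernetes" Python client; node name and timeout are made up.
import time
from kubernetes import client, config

def wait_for_detach(node_name: str, timeout: int = 300) -> bool:
    config.load_kube_config()
    storage = client.StorageV1Api()
    deadline = time.time() + timeout
    while time.time() < deadline:
        pending = [
            va.metadata.name
            for va in storage.list_volume_attachment().items
            if va.spec.node_name == node_name and va.status and va.status.attached
        ]
        if not pending:
            return True  # no attachments left, safe to reboot
        print(f"still attached on {node_name}: {pending}")
        time.sleep(5)
    return False

if __name__ == "__main__":
    wait_for_detach("example-node")  # hypothetical node name
```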

@yannh

yannh commented Sep 9, 2019

We had the same issue a few weeks ago. It turned out to be a problem with the newer Nitro instances (c5, t3, ...); AWS investigated and claims to have solved it now. Were you seeing this on Nitro instances? Are you still seeing the problem?
