UnmountVolume failed #7308
Comments
Anyone??
Sorry for the delay, and thanks for opening the issue! Can you post a reproducer so we can try it out?
I'm using (on amd64 arch):
Additional info: the issue started when upgrading cri-o 1.27.x to 1.28.x. It also fails using cri-o from the suggested repo (https://github.com/cri-o/cri-o/blob/main/install.md#apt-based-operating-systems), so it is not related to alvistack's build. Let me know if you need additional info!
@msilcher, I am sorry you are having issues. To confirm: are you also seeing these issues when not using a custom CSI? Are there any problems when using local volumes? Can you set the log level to debug and collect logs from the failure case? Also, would you have deployment definitions (templates) handy that manifest this issue? I couldn't reproduce the issues you are seeing with the identical versions. However, I did not use the same CSI as you, so there is this difference.
I've not tested with another CSI or local volumes so far. I'll do some more testing based on your input (also setting the log level to debug) in a few days when I'm back from my business trip. Thank you!
@msilcher, thank you! For reference, I used a few different types of CSI to test: Running the following versions of Kubernetes and CRI-O:
@kwilczynski, thanks for your feedback! I'm looking forward to testing it when I'm back. Just as a comment: I used newer versions of crun/runc & conmon. I'm not sure if this makes a difference:
Those are available at alvistack's repo I mentioned before.
@msilcher, ah nice! I thought I used reasonably up-to-date versions, but I see that some, like runc, were a bit behind. Not to worry. I will repeat my tests. Perhaps this time, there will be a solid reproducer.
@msilcher, let me see about a package. That said, I can build you the necessary binaries; would that work?
@msilcher, you can download static builds from the following link:
Part of the artefacts produced for the following job: You would need to replace your current system binaries with these to test. We don't offer binary packages built from specific branches or Pull Requests - at least not yet. 😄
@kwilczynski: Thanks a lot, I'll give it a try and report back soon! |
I've been doing some testing and it seems to work fine! I had to delete all previously created PVCs & PVs because I observed some strange behaviour, but after starting with new volumes/claims everything looks to be working fine. Just as additional info, I'm using the provided static build (version 1.29.0?): I also switched to "crun" version 1.10 (latest) :) Let me know if you need any additional info or feedback.
@msilcher, thank you for testing! Much appreciated! I am glad to know that the fix from the Pull Request resolves the issue. 🎉 That said, what sort of strange issues have you seen with the pre-existing PVC/PVs? Errors? Something didn't work? Some other type of breakage? |
I just retested it quickly to give you accurate feedback: I rolled back crio to version 1.28.1 and created two deployments with persistent storage (influx + grafana). Once I delete a pod (grafana, for example), the old pod gets stuck in Terminating and I can observe "unmount failures". Then I switched crio to the patched version you sent me and the problem itself remains: the pod is still in Terminating state and "unmount failures" can be observed in kubelet's log file. I then deleted the mount path manually (example: "sudo rm -rf /var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io~csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount") and moments later the pod in Terminating state disappears. Creating new pods / volume mounts with the patched crio version works fine.
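The manual `rm -rf` workaround above can be sketched as a small script. This is an assumption-laden sketch, not part of cri-o or kubelet: the function name, the root-path parameter, and the `/proc/mounts` safety guard are mine. It only removes leftover CSI mount dirs that are no longer active mounts, so a still-mounted volume is never touched.

```shell
#!/bin/sh
# Hypothetical cleanup helper (not part of cri-o/kubelet): remove leftover
# CSI mount dirs so kubelet can finish terminating stuck pods.
cleanup_leftover_csi_mounts() {
    # Root defaults to the kubelet pods dir; pass another root for testing.
    root="${1:-/var/lib/kubelet/pods}"

    find "$root" -type d -path '*/volumes/kubernetes.io~csi/*/mount' 2>/dev/null |
    while read -r dir; do
        # Safety guard: skip any dir that is still listed as an active mount.
        if ! grep -qs "$dir" /proc/mounts; then
            echo "removing leftover mount dir: $dir"
            rm -rf "$dir"
        fi
    done
}
```

On a real node you would run `cleanup_leftover_csi_mounts` with no argument (as root), which is just a scripted version of the single `sudo rm -rf .../mount` from the comment above.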
I would guess it has to do with crio retaining containers across restarts, though that's odd because the container isn't actually created... |
I have no idea but maybe it is related to the way the mount is stored for a running container. |
/assign kwilczynski |
crio-NOT OK.txt
crio-OK.txt
kubelet-NOT OK.txt
kubelet-OK.txt
What happened?
Hi team!
I'm using k8s in a homelab, mainly for testing/educational purposes. Recently I upgraded cri-o and Kubernetes to 1.28.0 and I started noticing that pods that rely on volume provisioning via CSI get stuck in the "Terminating" state when removed. Ephemeral pods are not affected. Looking further into the logs, I see the following:
nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/csi.hpe.com^Data_K8s_pvc-4c72b78c-5333-49bb-8061-28f545a48f89 podName:9e70d42e-fead-4c03-900f-e0f0fe61e64e nodeName:}" failed. No retries permitted until 2023-09-14 18:15:02.834237518 -0300 -03 m=+1515.964317342 (durationBeforeRetry 500ms). Error: UnmountVolume.TearDown failed for volume "dnsmasq-pvc" (UniqueName: "kubernetes.io/csi/csi.hpe.com^Data_K8s_pvc-4c72b78c-5333-49bb-8061-28f545a48f89") pod "9e70d42e-fead-4c03-900f-e0f0fe61e64e" (UID: "9e70d42e-fead-4c03-900f-e0f0fe61e64e") : kubernetes.io/csi: Unmounter.TearDownAt failed to clean mount dir [/var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io-csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount]: kubernetes.io/csi: failed to remove dir [/var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io-csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount]: remove /var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io~csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount: directory not empty
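The key part of the error is "directory not empty": kubelet cannot remove the mount dir because something is still inside it, or it is still an active mount. A minimal sketch to tell the two cases apart on a Linux host; the `check_mount_dir` helper is hypothetical (not part of kubelet or cri-o):

```shell
#!/bin/sh
# Hypothetical diagnostic helper: classify a CSI mount dir as still-mounted,
# not-empty (leftover files block removal, as in this issue), or empty.
check_mount_dir() {
    dir="$1"
    if grep -qs "$dir" /proc/mounts; then
        echo "still-mounted"    # unmount failed or never happened
    elif [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
        echo "not-empty"        # leftover files: kubelet's rmdir will fail
    else
        echo "empty"            # kubelet should be able to remove it
    fi
}

# Example, using the path from the log above:
# check_mount_dir /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pvc>/mount
```

In the failure described here, the helper would report "not-empty" for the stuck pod's mount dir, matching the "directory not empty" error from kubelet.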
Narrowing down the problem, it looks to be related to cri-o (versions 1.28.0 and 1.28.1). Using k8s version 1.28.x with cri-o 1.27.x works fine!
Also tested k8s 1.28.x with containerd (1.6.22) and it works fine!
I tested cri-o with both crun and runc (both latest versions) and the issue arises with both of them.
What did you expect to happen?
The volume gets correctly unmounted and the pod terminates.
How can we reproduce it (as minimally and precisely as possible)?
Create a pod that uses a CSI-provisioned volume on k8s 1.28.x with cri-o 1.28.0 or 1.28.1, then delete the pod: it gets stuck in "Terminating". The same setup with cri-o 1.27.x works fine!
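A minimal reproducer sketch for the steps above. All names and the storage class are placeholders (the issue was originally seen with the HPE CSI driver, `csi.hpe.com`); any CSI-backed storage class on an affected cri-o version should do:

```yaml
# Hypothetical reproducer manifest: a CSI-backed PVC plus a pod that mounts it.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: repro-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: my-csi-class   # placeholder: substitute your CSI storage class
---
apiVersion: v1
kind: Pod
metadata:
  name: repro-pod
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: repro-pvc
```

Apply the manifest, wait for the pod to reach Running, then `kubectl delete pod repro-pod`: on cri-o 1.28.0/1.28.1 the pod should stay in Terminating with the UnmountVolume.TearDown errors shown above.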
Anything else we need to know?
No response
CRI-O and Kubernetes version
Additional environment details (AWS, VirtualBox, physical, etc.)