UnmountVolume failed #7308

Closed
msilcher opened this issue Sep 14, 2023 · 19 comments · Fixed by #7408

msilcher commented Sep 14, 2023

crio-NOT OK.txt
crio-OK.txt
kubelet-NOT OK.txt
kubelet-OK.txt

What happened?

Hi team!

I'm using k8s in a homelab, mainly for testing/educational purposes. Recently I upgraded cri-o and kubernetes to 1.28.0 and started noticing that pods that rely on volume provisioning via CSI get stuck in the "Terminating" state when deleted. Ephemeral pods are not affected. Looking further into the logs, I see the following:

```console
nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/csi.hpe.com^Data_K8s_pvc-4c72b78c-5333-49bb-8061-28f545a48f89 podName:9e70d42e-fead-4c03-900f-e0f0fe61e64e nodeName:}" failed. No retries permitted until 2023-09-14 18:15:02.834237518 -0300 -03 m=+1515.964317342 (durationBeforeRetry 500ms). Error: UnmountVolume.TearDown failed for volume "dnsmasq-pvc" (UniqueName: "kubernetes.io/csi/csi.hpe.com^Data_K8s_pvc-4c72b78c-5333-49bb-8061-28f545a48f89") pod "9e70d42e-fead-4c03-900f-e0f0fe61e64e" (UID: "9e70d42e-fead-4c03-900f-e0f0fe61e64e") : kubernetes.io/csi: Unmounter.TearDownAt failed to clean mount dir [/var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io-csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount]: kubernetes.io/csi: failed to remove dir [/var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io-csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount]: remove /var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io~csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount: directory not empty
```
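
For anyone hitting the same error, one way to inspect what is actually left behind in that directory might be the following (the path is the one from the log line above):

```console
# Inspect the per-pod CSI mount directory that kubelet refuses to remove,
# and check whether something is still mounted there.
$ sudo ls -la /var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io~csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount
$ findmnt /var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io~csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount
```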

Narrowing down the problem, it looks like it is related to cri-o (versions 1.28.0 and 1.28.1). Using k8s 1.28.x with cri-o 1.27.x works fine!
Also, k8s 1.28.x with containerd (1.6.22) works fine!

I tested cri-o with both crun and runc (latest versions of both) and the issue arises with either of them.

What did you expect to happen?

The volume gets correctly unmounted and the pod terminates.

How can we reproduce it (as minimally and precisely as possible)?

Narrowing down the problem, it looks like it is related to cri-o (versions 1.28.0 and 1.28.1). Using k8s 1.28.x with cri-o 1.27.x works fine!
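
A minimal reproducer sketch, assuming a CSI-backed StorageClass is available (the StorageClass name `csi-sc` and the busybox image are placeholders), might look like this:

```console
# Create a PVC and a pod that mounts it, then delete the pod and watch
# whether it stays in Terminating while kubelet reports unmount failures.
$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: repro-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: csi-sc        # placeholder: use the CSI StorageClass in your cluster
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: repro-pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: repro-pvc
EOF

$ kubectl delete pod repro-pod --wait=false
$ kubectl get pod repro-pod -w                  # stays in Terminating on cri-o 1.28.0/1.28.1
$ journalctl -u kubelet | grep UnmountVolume    # TearDown failures show up here
```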

Anything else we need to know?

No response

CRI-O and Kubernetes version

```console
$ crio --version
crio version 1.28.1
Version:        1.28.1
GitCommit:      unknown
GitCommitDate:  unknown
GitTreeState:   clean
GoVersion:      go1.21.1
Compiler:       gc
Platform:       linux/amd64
Linkmode:       dynamic
BuildTags:
  netgo
  osusergo
  exclude_graphdriver_devicemapper
  exclude_graphdriver_btrfs
  containers_image_openpgp
  seccomp
  apparmor
LDFlags:          -s -w
SeccompEnabled:   true
AppArmorEnabled:  false

$ kubectl version --output=json
{
  "clientVersion": {
    "major": "1",
    "minor": "28",
    "gitVersion": "v1.28.2",
    "gitCommit": "89a4ea3e1e4ddd7f7572286090359983e0387b2f",
    "gitTreeState": "clean",
    "buildDate": "2023-09-13T09:35:49Z",
    "goVersion": "go1.20.8",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "kustomizeVersion": "v5.0.4-0.20230601165947-6ce0bf390ce3",
  "serverVersion": {
    "major": "1",
    "minor": "28",
    "gitVersion": "v1.28.2",
    "gitCommit": "89a4ea3e1e4ddd7f7572286090359983e0387b2f",
    "gitTreeState": "clean",
    "buildDate": "2023-09-13T09:29:07Z",
    "goVersion": "go1.20.8",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}
```

OS version

```console
# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian

$ uname -a
Linux debian-test 6.4.0-0.deb12.2-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.4.4-3~bpo12+1 (2023-08-08) x86_64 GNU/Linux

```

Additional environment details (AWS, VirtualBox, physical, etc.)

Proxmox 8.0 VM

msilcher added the kind/bug label on Sep 14, 2023

msilcher commented Oct 7, 2023

Anyone??

haircommander (Member) commented:

sorry for the delay! thanks for opening the issue. Can you post a reproducer so we can try it out?


msilcher commented Oct 10, 2023

> sorry for the delay! thanks for opening the issue. Can you post a reproducer so we can try it out?

I'm using (on amd64 arch):

Additional info: the issue started when upgrading cri-o 1.27.x to 1.28.x. It also fails using cri-o from the suggested repo (https://github.com/cri-o/cri-o/blob/main/install.md#apt-based-operating-systems), so it is not related to alvistack's build.
The same applies to crun and runc: it fails with both of them. Also, kubernetes with cri-o 1.27.x works fine; switching to 1.28.x fails.
Based on the last comment I suspect it is not related to the CSI driver I'm using, although I reported it also to its maintainer.

Let me know if you need additional info!

kwilczynski (Member) commented:

@msilcher, I am sorry you are having issues.

To confirm: are you also seeing these issues when not using a custom CSI? Are there any problems when using local volumes?

Can you set the log level to debug and collect logs from the failure case? Also, would you have deployment definitions (templates) handy that manifest this issue?

I couldn't reproduce the issues you are seeing with the identical versions. However, I did not use the same CSI as you, so there is this difference.
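
For anyone else collecting logs, a sketch of how the verbosity could be raised, assuming a default install where CRI-O reads drop-ins from /etc/crio/crio.conf.d/ and both services run under systemd:

```console
# Raise CRI-O log verbosity via a drop-in config file, then restart the service.
$ cat <<'EOF' | sudo tee /etc/crio/crio.conf.d/99-debug.conf
[crio.runtime]
log_level = "debug"
EOF
$ sudo systemctl restart crio

# kubelet verbosity can be raised with --v=4 (e.g. via a systemd drop-in).
# After reproducing the failure, collect both logs together:
$ sudo journalctl -u crio -u kubelet --since "15 min ago" > debug-logs.txt
```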

msilcher commented:

> @msilcher, I am sorry you are having issues.
>
> To confirm: are you also seeing these issues when not using a custom CSI? Are there any problems when using local volumes?
>
> Can you set the log level to debug and collect logs from the failure case? Also, would you have deployment definitions (templates) handy that manifest this issue?
>
> I couldn't reproduce the issues you are seeing with the identical versions. However, I did not use the same CSI as you, so there is this difference.

I've not tested with another CSI or local volumes so far. I'll do some more testing based on your input (also setting the log level to debug) in a few days when I'm back from my business trip.

Thank you!

kwilczynski (Member) commented:

@msilcher, thank you!

For reference, I used a few different types of CSI to test:

Running the following versions of Kubernetes and CRI-O:

  • kubelet: 1.28.2
  • CRI-O: 1.28.1
  • runc: 1.0.1
  • conmon: 2.1.2

msilcher commented:

@kwilczynski, thanks for your feedback!

I'm looking forward to testing it when I'm back. Just as a comment: I used newer versions of crun/runc & conmon. I'm not sure if this makes a difference:

  • runc: 1.1.9
  • crun: 1.9.2
  • conmon: 2.1.8

Those are available at alvistack's repo I mentioned before.

kwilczynski (Member) commented:

@msilcher, ah nice! I thought I had used reasonably up-to-date versions, but I see that some, like runc, were a bit behind.

Not to worry. I will repeat my tests. Perhaps this time, there will be a solid reproducer.

haircommander (Member) commented:

ah hah! I think #7408 should fix your issue, are you able to test it @msilcher ?


msilcher commented Oct 20, 2023

> ah hah! I think #7408 should fix your issue, are you able to test it @msilcher?

Sure, but I'll need a package for Debian 12 (amd64). I'm not sure if I'm able to build the stuff on my own; I can give it a try though.

kwilczynski (Member) commented:

@msilcher, let me see about a package. That said, I can build you the necessary binaries, would that work?


kwilczynski commented Oct 23, 2023

@msilcher, you can download static builds from the following link:

Part of the artefacts produced for the following job:

You would need to replace your current system binaries with these to test. We don't offer binary packages built from specific branches or Pull Requests - at least not yet. 😄
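
In case it helps anyone following along, swapping in the static test binary might look roughly like this (the /usr/bin/crio path is an assumption; use whatever `which crio` reports on your system):

```console
# Replace the packaged crio binary with the downloaded static build (sketch).
$ sudo systemctl stop crio
$ which crio
/usr/bin/crio
$ sudo cp /usr/bin/crio /usr/bin/crio.bak      # keep a backup of the packaged binary
$ sudo install -m 0755 ./crio /usr/bin/crio    # ./crio = extracted static build
$ sudo systemctl start crio
$ crio --version                               # should now report the test build
```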

msilcher commented:

@kwilczynski: Thanks a lot, I'll give it a try and report back soon!


msilcher commented Oct 23, 2023

I've been doing some testing and it seems to work fine! I had to delete all previously created PVCs & PVs because I observed some strange behaviour, but after starting with new volumes/claims everything looks to be working fine.
In the attached log you can now observe the following: "UnmountVolume.TearDown succeeded for volume"

Just as additional info, I'm using the provided static build (version 1.29.0 ?):
```console
msilcher@debian-k8s:~$ crio -v
crio version 1.29.0
Version:        1.29.0
GitCommit:      1fcc19f
GitCommitDate:  2023-10-19T20:46:22Z
GitTreeState:   clean
BuildDate:      1970-01-01T00:00:00Z
GoVersion:      go1.21.1
Compiler:       gc
Platform:       linux/amd64
Linkmode:       static
BuildTags:
  static
  netgo
  osusergo
  exclude_graphdriver_btrfs
  exclude_graphdriver_devicemapper
  seccomp
  apparmor
  selinux
LDFlags:          unknown
SeccompEnabled:   true
AppArmorEnabled:  false
```

I also switched to "crun" version 1.10 (latest) :)

Let me know if you need any additional info or feedback.

kubelet-issue7308.txt

kwilczynski (Member) commented:

@msilcher, thank you for testing! Much appreciated!

I am glad to know that the fix from the Pull Request resolves the issue. 🎉

That said, what sort of strange issues have you seen with the pre-existing PVC/PVs? Errors? Something didn't work? Some other type of breakage?


msilcher commented Oct 24, 2023

I just retested it quickly to give you accurate feedback: I rolled back crio to version 1.28.1 and created two deployments with persistent storage (influx + grafana). Once I delete a pod (grafana, for example), the old pod gets stuck in "Terminating" and I can observe "unmount failures". Then I switched crio to the patched version you sent me and the problem remains: the pod is still in the "Terminating" state and "unmount failures" can be observed in kubelet's log file. I then deleted the mount path manually (for example "sudo rm -rf /var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io~csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount") and moments later the pod in the "Terminating" state disappears. Creating new pods / volume mounts with the patched crio version works fine.

So the strange behaviour I mentioned before is related to previously created mounts: it seems to me that the way the mount was done in version 1.28.x doesn't get fixed by the patched version. You really need to force a remount of the volume.
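
A slightly safer variant of that manual cleanup might be to check whether anything is still mounted under the stale path before deleting it (findmnt/umount are standard util-linux tools; the path is the example from the comment above):

```console
# Only remove the directory after making sure nothing is mounted on it anymore.
$ DIR=/var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io~csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount
$ findmnt "$DIR" && sudo umount "$DIR"    # unmount first if it is still a mountpoint
$ sudo rm -rf "$DIR"
```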

haircommander (Member) commented:

I would guess it has to do with crio retaining containers across restarts, though that's odd because the container isn't actually created...
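
One way to check whether stale containers or sandboxes are being retained across the CRI-O restart might be crictl (from cri-tools), e.g.:

```console
# List all containers and pod sandboxes CRI-O still knows about, including exited ones.
$ sudo crictl ps -a
$ sudo crictl pods
# With the default overlay storage driver, per-container state also lives here (assumption):
$ sudo ls /var/lib/containers/storage/overlay-containers/
```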


msilcher commented Oct 24, 2023

> I would guess it has to do with crio retaining containers across restarts, though that's odd because the container isn't actually created...

I have no idea, but maybe it is related to the way the mount is stored for a running container.
I can test again without triggering the unmount failure in v1.28.x: deploy pods with version 1.28.1 and then switch to the patched version of crio without deleting the pod in between. That should avoid triggering the unmount issue in v1.28.x, where things get nasty... I would then check the unmount behaviour directly on the patched crio. Does this make sense?

kwilczynski (Member) commented:

/assign kwilczynski
/assign haircommander
