UnmountVolume failed #7308

Closed
msilcher opened this issue Sep 14, 2023 · 19 comments · Fixed by #7408

msilcher commented Sep 14, 2023

crio-NOT OK.txt
crio-OK.txt
kubelet-NOT OK.txt
kubelet-OK.txt

What happened?

Hi team!

I'm using k8s in a homelab, mainly for testing/educational purposes. Recently I upgraded cri-o and kubernetes to 1.28.0 and started noticing that pods that rely on volume provisioning via CSI get stuck in the "Terminating" state when deleted. Ephemeral pods are not affected. Looking further into the logs, I see the following:

```console
nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/csi.hpe.com^Data_K8s_pvc-4c72b78c-5333-49bb-8061-28f545a48f89 podName:9e70d42e-fead-4c03-900f-e0f0fe61e64e nodeName:}" failed. No retries permitted until 2023-09-14 18:15:02.834237518 -0300 -03 m=+1515.964317342 (durationBeforeRetry 500ms). Error: UnmountVolume.TearDown failed for volume "dnsmasq-pvc" (UniqueName: "kubernetes.io/csi/csi.hpe.com^Data_K8s_pvc-4c72b78c-5333-49bb-8061-28f545a48f89") pod "9e70d42e-fead-4c03-900f-e0f0fe61e64e" (UID: "9e70d42e-fead-4c03-900f-e0f0fe61e64e") : kubernetes.io/csi: Unmounter.TearDownAt failed to clean mount dir [/var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io-csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount]: kubernetes.io/csi: failed to remove dir [/var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io-csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount]: remove /var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io~csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount: directory not empty
```
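
For anyone hitting the same error, one way to inspect what is actually left behind in that directory might be the following (the path is the one from the log line above):

```console
# Inspect the per-pod CSI mount directory that kubelet refuses to remove,
# and check whether something is still mounted there.
$ sudo ls -la /var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io~csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount
$ findmnt /var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io~csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount
```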

Narrowing down the problem, it looks like it is related to cri-o (versions 1.28.0 and 1.28.1). Using k8s 1.28.x with cri-o 1.27.x works fine!
Also, k8s 1.28.x with containerd (1.6.22) works fine!

I tested cri-o with both crun and runc (latest versions of both) and the issue arises with either of them.

What did you expect to happen?

The volume gets correctly unmounted and the pod terminates.

How can we reproduce it (as minimally and precisely as possible)?

Narrowing down the problem, it looks like it is related to cri-o (versions 1.28.0 and 1.28.1). Using k8s 1.28.x with cri-o 1.27.x works fine!
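
A minimal reproducer sketch, assuming a CSI-backed StorageClass is available (the StorageClass name `csi-sc` and the busybox image are placeholders), might look like this:

```console
# Create a PVC and a pod that mounts it, then delete the pod and watch
# whether it stays in Terminating while kubelet reports unmount failures.
$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: repro-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: csi-sc        # placeholder: use the CSI StorageClass in your cluster
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: repro-pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: repro-pvc
EOF

$ kubectl delete pod repro-pod --wait=false
$ kubectl get pod repro-pod -w                  # stays in Terminating on cri-o 1.28.0/1.28.1
$ journalctl -u kubelet | grep UnmountVolume    # TearDown failures show up here
```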

Anything else we need to know?

No response

CRI-O and Kubernetes version

```console
$ crio --version
crio version 1.28.1
Version:        1.28.1
GitCommit:      unknown
GitCommitDate:  unknown
GitTreeState:   clean
GoVersion:      go1.21.1
Compiler:       gc
Platform:       linux/amd64
Linkmode:       dynamic
BuildTags:
  netgo
  osusergo
  exclude_graphdriver_devicemapper
  exclude_graphdriver_btrfs
  containers_image_openpgp
  seccomp
  apparmor
LDFlags:          -s -w
SeccompEnabled:   true
AppArmorEnabled:  false

$ kubectl version --output=json
{
  "clientVersion": {
    "major": "1",
    "minor": "28",
    "gitVersion": "v1.28.2",
    "gitCommit": "89a4ea3e1e4ddd7f7572286090359983e0387b2f",
    "gitTreeState": "clean",
    "buildDate": "2023-09-13T09:35:49Z",
    "goVersion": "go1.20.8",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "kustomizeVersion": "v5.0.4-0.20230601165947-6ce0bf390ce3",
  "serverVersion": {
    "major": "1",
    "minor": "28",
    "gitVersion": "v1.28.2",
    "gitCommit": "89a4ea3e1e4ddd7f7572286090359983e0387b2f",
    "gitTreeState": "clean",
    "buildDate": "2023-09-13T09:29:07Z",
    "goVersion": "go1.20.8",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}
```

OS version

```console
# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian

$ uname -a
Linux debian-test 6.4.0-0.deb12.2-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.4.4-3~bpo12+1 (2023-08-08) x86_64 GNU/Linux

```

Additional environment details (AWS, VirtualBox, physical, etc.)

Proxmox 8.0 VM

msilcher added the kind/bug label on Sep 14, 2023

msilcher commented Oct 7, 2023

Anyone??

haircommander (Member) commented:

sorry for the delay! thanks for opening the issue. Can you post a reproducer so we can try it out?


msilcher commented Oct 10, 2023

> sorry for the delay! thanks for opening the issue. Can you post a reproducer so we can try it out?

I'm using (on amd64 arch):

Additional info: the issue started when upgrading cri-o 1.27.x to 1.28.x. It also fails using cri-o from the suggested repo (https://github.com/cri-o/cri-o/blob/main/install.md#apt-based-operating-systems), so it is not related to alvistack's build.
The same applies to crun and runc: it fails with both of them. Also, kubernetes with cri-o 1.27.x works fine; switching to 1.28.x fails.
Based on the last comment I suspect it is not related to the CSI driver I'm using, although I reported it also to its maintainer.

Let me know if you need additional info!

kwilczynski (Member) commented:

@msilcher, I am sorry you are having issues.

To confirm: are you also seeing these issues when not using a custom CSI? Are there any problems when using local volumes?

Can you set the log level to debug and collect logs from the failure case? Also, would you have deployment definitions (templates) handy that manifest this issue?

I couldn't reproduce the issues you are seeing with the identical versions. However, I did not use the same CSI as you, so there is this difference.
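
For anyone else collecting logs, a sketch of how the verbosity could be raised, assuming a default install where CRI-O reads drop-ins from /etc/crio/crio.conf.d/ and both services run under systemd:

```console
# Raise CRI-O log verbosity via a drop-in config file, then restart the service.
$ cat <<'EOF' | sudo tee /etc/crio/crio.conf.d/99-debug.conf
[crio.runtime]
log_level = "debug"
EOF
$ sudo systemctl restart crio

# kubelet verbosity can be raised with --v=4 (e.g. via a systemd drop-in).
# After reproducing the failure, collect both logs together:
$ sudo journalctl -u crio -u kubelet --since "15 min ago" > debug-logs.txt
```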

msilcher commented:

> @msilcher, I am sorry you are having issues.
>
> To confirm: are you also seeing these issues when not using a custom CSI? Are there any problems when using local volumes?
>
> Can you set the log level to debug and collect logs from the failure case? Also, would you have deployment definitions (templates) handy that manifest this issue?
>
> I couldn't reproduce the issues you are seeing with the identical versions. However, I did not use the same CSI as you, so there is this difference.

I've not tested with another CSI or local volumes so far. I'll do some more testing based on your input (also setting the log level to debug) in a few days when I'm back from my business trip.

Thank you!

kwilczynski (Member) commented:

@msilcher, thank you!

For reference, I used a few different types of CSI to test:

Running the following versions of Kubernetes and CRI-O:

  • kubelet: 1.28.2
  • CRI-O: 1.28.1
  • runc: 1.0.1
  • conmon: 2.1.2

msilcher commented:

@kwilczynski, thanks for your feedback!

I'm looking forward to testing it when I'm back. Just as a comment: I used newer versions of crun/runc & conmon. I'm not sure if this makes a difference:

  • runc: 1.1.9
  • crun: 1.9.2
  • conmon: 2.1.8

Those are available at alvistack's repo I mentioned before.

kwilczynski (Member) commented:

@msilcher, ah nice! I thought I had used reasonably up-to-date versions, but I see that some, like runc, were a bit behind.

Not to worry. I will repeat my tests. Perhaps this time, there will be a solid reproducer.

haircommander (Member) commented:

ah hah! I think #7408 should fix your issue, are you able to test it @msilcher ?


msilcher commented Oct 20, 2023

> ah hah! I think #7408 should fix your issue, are you able to test it @msilcher?

Sure, but I'll need a package for Debian 12 (amd64). I'm not sure if I'm able to build the stuff on my own; I can give it a try though.

kwilczynski (Member) commented:

@msilcher, let me see about a package. That said, I can build you the necessary binaries, would that work?


kwilczynski commented Oct 23, 2023

@msilcher, you can download static builds from the following link:

Part of the artefacts produced for the following job:

You would need to replace your current system binaries with these to test. We don't offer binary packages built from specific branches or Pull Requests - at least not yet. 😄
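
In case it helps anyone following along, swapping in the static test binary might look roughly like this (the /usr/bin/crio path is an assumption; use whatever `which crio` reports on your system):

```console
# Replace the packaged crio binary with the downloaded static build (sketch).
$ sudo systemctl stop crio
$ which crio
/usr/bin/crio
$ sudo cp /usr/bin/crio /usr/bin/crio.bak      # keep a backup of the packaged binary
$ sudo install -m 0755 ./crio /usr/bin/crio    # ./crio = extracted static build
$ sudo systemctl start crio
$ crio --version                               # should now report the test build
```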

msilcher commented:

@kwilczynski: Thanks a lot, I'll give it a try and report back soon!


msilcher commented Oct 23, 2023

I've been doing some testing and it seems to work fine! I had to delete all previously created PVCs & PVs because I observed some strange behaviour, but after starting with new volumes/claims everything looks to be working fine.
In the attached log you can now observe the following: "UnmountVolume.TearDown succeeded for volume"

Just as additional info, I'm using the provided static build (version 1.29.0 ?):
```console
msilcher@debian-k8s:~$ crio -v
crio version 1.29.0
Version:        1.29.0
GitCommit:      1fcc19f
GitCommitDate:  2023-10-19T20:46:22Z
GitTreeState:   clean
BuildDate:      1970-01-01T00:00:00Z
GoVersion:      go1.21.1
Compiler:       gc
Platform:       linux/amd64
Linkmode:       static
BuildTags:
  static
  netgo
  osusergo
  exclude_graphdriver_btrfs
  exclude_graphdriver_devicemapper
  seccomp
  apparmor
  selinux
LDFlags:          unknown
SeccompEnabled:   true
AppArmorEnabled:  false
```

I also switched to "crun" version 1.10 (latest) :)

Let me know if you need any additional info or feedback.

kubelet-issue7308.txt

kwilczynski (Member) commented:

@msilcher, thank you for testing! Much appreciated!

I am glad to know that the fix from the Pull Request resolves the issue. 🎉

That said, what sort of strange issues have you seen with the pre-existing PVC/PVs? Errors? Something didn't work? Some other type of breakage?


msilcher commented Oct 24, 2023

I just retested it quickly to give you accurate feedback: I rolled back crio to version 1.28.1 and created two deployments with persistent storage (influx + grafana). Once I delete a pod (grafana, for example), the old pod gets stuck in "Terminating" and I can observe "unmount failures". Then I switched crio to the patched version you sent me and the problem remains: the pod is still in the "Terminating" state and "unmount failures" can be observed in kubelet's log file. I then deleted the mount path manually (for example "sudo rm -rf /var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io~csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount") and moments later the pod in the "Terminating" state disappears. Creating new pods / volume mounts with the patched crio version works fine.

So the strange behaviour I mentioned before is related to previously created mounts: it seems to me that the way the mount was done in version 1.28.x doesn't get fixed by the patched version. You really need to force a remount of the volume.
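
A slightly safer variant of that manual cleanup might be to check whether anything is still mounted under the stale path before deleting it (findmnt/umount are standard util-linux tools; the path is the example from the comment above):

```console
# Only remove the directory after making sure nothing is mounted on it anymore.
$ DIR=/var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io~csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount
$ findmnt "$DIR" && sudo umount "$DIR"    # unmount first if it is still a mountpoint
$ sudo rm -rf "$DIR"
```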

haircommander (Member) commented:

I would guess it has to do with crio retaining containers across restarts, though that's odd because the container isn't actually created...
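
One way to check whether stale containers or sandboxes are being retained across the CRI-O restart might be crictl (from cri-tools), e.g.:

```console
# List all containers and pod sandboxes CRI-O still knows about, including exited ones.
$ sudo crictl ps -a
$ sudo crictl pods
# With the default overlay storage driver, per-container state also lives here (assumption):
$ sudo ls /var/lib/containers/storage/overlay-containers/
```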


msilcher commented Oct 24, 2023

> I would guess it has to do with crio retaining containers across restarts, though that's odd because the container isn't actually created...

I have no idea, but maybe it is related to the way the mount is stored for a running container.
I can test again without triggering the unmount failure in v1.28.x: deploy pods with version 1.28.1 and then switch to the patched version of crio without deleting the pod in between. That should avoid triggering the unmount issue in v1.28.x, where things get nasty... I would then check the unmount behaviour directly on the patched crio. Does this make sense?

kwilczynski (Member) commented:

/assign kwilczynski
/assign haircommander
