CephFS PVs cannot unmount when the same volume is mounted twice on the same node. #2238
Here's the bug report @humblec, thanks for helping me out last week!
@ikogan Thanks for reporting back here, much appreciated! I wanted to ask: is this cluster running in production? Recently I have been revisiting the locking operations we do in various code paths and working on some improvements in the same area, tracked under #2149. I have a feeling that fix could take care of this scenario, so I wanted to ask whether you could try the test image (quay.io/humble/cephcsi:publish) with the fix on your end. Meanwhile, I am taking some action items here to reproduce it on my end and root-cause it. Please let me know your thoughts.
I can give it a shot. This is in my homelab, so the definition of "production" varies depending on who's inconvenienced at the moment. I'm assuming that, to test this, I can just set
@ikogan That should do. If you already have a running cluster, you can directly edit the daemonset (and the deployment) and change the plugin/driver image path:

.....
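For example, something along these lines (a sketch only; the daemonset, deployment, and container names below are the Rook defaults and may differ in your cluster):

```sh
# Point the CephFS nodeplugin daemonset at the test image
kubectl -n rook-ceph set image daemonset/csi-cephfsplugin \
    csi-cephfsplugin=quay.io/humble/cephcsi:publish

# Do the same for the provisioner deployment
kubectl -n rook-ceph set image deployment/csi-cephfsplugin-provisioner \
    csi-cephfsplugin=quay.io/humble/cephcsi:publish
```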
Sorry for taking so long to get to this! Unfortunately, the only thing that seems to have changed after switching to the test image is how quickly the failure shows up. First, confirming which images are running:

```
# kubectl get pod -n rook-ceph -o jsonpath="{.items[*].spec.containers[*].image}" | tr -s '[[:space:]]' '\n' | sort | uniq
k8s.gcr.io/sig-storage/csi-attacher:v3.2.1
k8s.gcr.io/sig-storage/csi-node-driver-registrar:v2.2.0
k8s.gcr.io/sig-storage/csi-provisioner:v2.2.2
k8s.gcr.io/sig-storage/csi-resizer:v1.2.0
k8s.gcr.io/sig-storage/csi-snapshotter:v4.1.1
quay.io/humble/cephcsi:publish
```

So it looks like I'm correctly running the test image. Here are the PVs, PVCs, and pods I'm using to reproduce the issue:

```yaml
---
apiVersion: v1
kind: PersistentVolume
metadata:
finalizers:
- kubernetes.io/pv-protection
name: cephfs-samenode-test1
spec:
accessModes:
- ReadWriteMany
capacity:
storage: 50Ti
csi:
driver: rook-ceph.cephfs.csi.ceph.com
nodeStageSecretRef:
name: rook-csi-cephfs-node-static
namespace: rook-ceph
volumeAttributes:
clusterID: rook-ceph
fsName: clusterfs-capacity
rootPath: /
staticVolume: "true"
volumeHandle: cephfs-samenode-test1
persistentVolumeReclaimPolicy: Retain
volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolume
metadata:
finalizers:
- kubernetes.io/pv-protection
name: cephfs-samenode-test2
spec:
accessModes:
- ReadWriteMany
capacity:
storage: 50Ti
csi:
driver: rook-ceph.cephfs.csi.ceph.com
nodeStageSecretRef:
name: rook-csi-cephfs-node-static
namespace: rook-ceph
volumeAttributes:
clusterID: rook-ceph
fsName: clusterfs-capacity
rootPath: /
staticVolume: "true"
volumeHandle: cephfs-samenode-test2
persistentVolumeReclaimPolicy: Retain
volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
finalizers:
- kubernetes.io/pvc-protection
name: cephfs-samenode-test1
namespace: default
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 50Ti
volumeMode: Filesystem
volumeName: cephfs-samenode-test1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
finalizers:
- kubernetes.io/pvc-protection
name: cephfs-samenode-test2
namespace: default
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 50Ti
volumeMode: Filesystem
volumeName: cephfs-samenode-test2
---
apiVersion: v1
kind: Pod
metadata:
name: cephfs-samenode-test1
namespace: default
spec:
containers:
- image: registry.kubernetes.local/docker-diagnostics:2021-05-12
imagePullPolicy: IfNotPresent
name: cephfs-samenode-test1
stdin: True
tty: true
securityContext:
runAsNonRoot: false
volumeMounts:
- mountPath: /mnt/test
name: test
imagePullSecrets:
- name: gaea
nodeName: worker1.kubernetes.local
restartPolicy: Never
volumes:
- name: test
persistentVolumeClaim:
claimName: cephfs-samenode-test1
---
apiVersion: v1
kind: Pod
metadata:
name: cephfs-samenode-test2
namespace: default
spec:
containers:
- image: registry.kubernetes.local/docker-diagnostics:2021-05-12
imagePullPolicy: IfNotPresent
name: cephfs-samenode-test2
stdin: True
tty: true
securityContext:
runAsNonRoot: false
volumeMounts:
- mountPath: /mnt/test
name: test
imagePullSecrets:
- name: gaea
nodeName: worker1.kubernetes.local
restartPolicy: Never
volumes:
- name: test
persistentVolumeClaim:
claimName: cephfs-samenode-test2
```

Once those pods are running, I then delete them, their PVs, and their PVCs. Shortly thereafter (much quicker than before), the kubelet logs:

```json
{
"log": "E0715 04:04:23.623704 1470 nestedpendingoperations.go:301] Operation for \"{volumeName:kubernetes.io/csi/rook-ceph.cephfs.csi.ceph.com^cephfs-samenode-test2 podName: nodeName:}\" failed. No retries permitted until 2021-07-15 04:04:24.623686457 +0000 UTC m=+2548665.069510895 (durationBeforeRetry 1s). Error: \"GetDeviceMountRefs check failed for volume \\\"cephfs-samenode-test2\\\" (UniqueName: \\\"kubernetes.io/csi/rook-ceph.cephfs.csi.ceph.com^cephfs-samenode-test2\\\") on node \\\"worker1.kubernetes.local\\\" : The device mount path \\\"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/cephfs-samenode-test2/globalmount\\\" is still mounted by other references [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/cephfs-samenode-test1/globalmount]\"\n",
"stream": "stderr",
"time": "2021-07-15T04:04:23.623743031Z"
}
```

Also, it looks like my method of changing the image in the Rook Helm chart won't work. Those workloads are deployed by the Rook operator, which bombs because it can't parse the version.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

This is still an issue, and the only workaround I can see so far is to ensure CephFS workloads using the same volume don't start on the same node. I'm running out of nodes.

We've gone back to Gluster.
@ikogan I am planning to work on this one. Please provide the details below for now:
Hi, thanks for working on this!

```yaml
---
apiVersion: v1
kind: PersistentVolume
metadata:
finalizers:
- kubernetes.io/pv-protection
name: cephfs-samenode-test1
spec:
accessModes:
- ReadWriteMany
capacity:
storage: 50Ti
csi:
driver: rook-ceph.cephfs.csi.ceph.com
nodeStageSecretRef:
name: rook-csi-cephfs-node-static
namespace: rook-ceph
volumeAttributes:
clusterID: rook-ceph
fsName: clusterfs-capacity
rootPath: /
staticVolume: "true"
volumeHandle: cephfs-samenode-test1
persistentVolumeReclaimPolicy: Retain
volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolume
metadata:
finalizers:
- kubernetes.io/pv-protection
name: cephfs-samenode-test2
spec:
accessModes:
- ReadWriteMany
capacity:
storage: 50Ti
csi:
driver: rook-ceph.cephfs.csi.ceph.com
nodeStageSecretRef:
name: rook-csi-cephfs-node-static
namespace: rook-ceph
volumeAttributes:
clusterID: rook-ceph
fsName: clusterfs-capacity
rootPath: /
staticVolume: "true"
volumeHandle: cephfs-samenode-test2
persistentVolumeReclaimPolicy: Retain
volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
finalizers:
- kubernetes.io/pvc-protection
name: cephfs-samenode-test1
namespace: default
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 50Ti
volumeMode: Filesystem
volumeName: cephfs-samenode-test1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
finalizers:
- kubernetes.io/pvc-protection
name: cephfs-samenode-test2
namespace: default
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 50Ti
volumeMode: Filesystem
volumeName: cephfs-samenode-test2
---
apiVersion: v1
kind: Pod
metadata:
name: cephfs-samenode-test1
namespace: default
spec:
containers:
- image: alpine:latest
imagePullPolicy: IfNotPresent
name: cephfs-samenode-test1
stdin: True
tty: true
securityContext:
runAsNonRoot: false
volumeMounts:
- mountPath: /mnt/test
name: test
nodeName: kubernetes-lada.gaea.private
restartPolicy: Never
volumes:
- name: test
persistentVolumeClaim:
claimName: cephfs-samenode-test1
---
apiVersion: v1
kind: Pod
metadata:
name: cephfs-samenode-test2
namespace: default
spec:
containers:
- image: alpine:latest
imagePullPolicy: IfNotPresent
name: cephfs-samenode-test2
stdin: True
tty: true
securityContext:
runAsNonRoot: false
volumeMounts:
- mountPath: /mnt/test
name: test
nodeName: kubernetes-lada.gaea.private
restartPolicy: Never
volumes:
- name: test
persistentVolumeClaim:
claimName: cephfs-samenode-test2
```

If I then manually unmount one of the volumes from the node, eventually the other will get unmounted cleanly as well. Here's what my list of mounts looks like after shutting down the pods:
Then I manually unmount one of them.
Okay, one thing I am failing to understand is why you need multiple PVs with the same rootPath.
While I can reuse the same PVC if my workloads are in the same namespace, a PVC is a namespaced object, so I cannot use it across namespaces. I can also only have one PVC bound to a single PV. The only way to have two different namespaces use the same underlying filesystem path is to have two different PVs.
Got it. Let me see what can be done here.

Okay, I am seeing a different problem on this one 🗡️

Create static PVCs, PVs, and pods:

Delete static PVCs, PVs, and pods:

The strange thing here is that the kubelet is not sending any NodeUnstage request, and I can see the mounts still exist on the node.

Kubelet logs:
@ikogan this looks more like a Kubernetes issue: on the cephcsi side we do mount/umount operations as requested by the kubelet. The kubelet is not sending any request for cephcsi to unstage the volume, and because of that, the volumes are left stale on the node.
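One way to confirm that from the cephcsi side (a sketch; it assumes the default Rook daemonset and container names) is to check whether any NodeUnstageVolume gRPC call reaches the nodeplugin for the affected volume:

```sh
# Look for NodeUnstageVolume requests in the CephFS nodeplugin logs.
# Note: "logs ds/..." picks one pod from the daemonset; target the pod
# running on the affected node if the output comes up empty.
kubectl -n rook-ceph logs ds/csi-cephfsplugin -c csi-cephfsplugin | grep -i NodeUnstageVolume
```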
@Madhu-1 that's what I was thinking...possibly. Based on my very naive and inexperienced look at the kubelet source code, it's asking, I think, the CSI driver for the device backing the mount. So I'm really not sure whether the issue is that CephCSI is reporting the same device or that the kubelet is behaving this way. I wonder what the NFS driver does, since I didn't have this problem there...
@ikogan The NFS driver doesn't implement the NodeStage and NodeUnstage APIs; maybe because of that, things are working fine there. As per the CSI spec, cephcsi mounts the volume to a stagingPath, and in this case we get two stagingPaths (one stagingPath per PVC) because there are two PVCs but a common volume.
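To make that concrete: with two PVs sharing the same rootPath, the node ends up with two per-PV staging (globalmount) mounts of the same CephFS tree. The output below is illustrative only, with paths matching the error earlier in this thread:

```sh
# Illustrative only: one staging mount per PV, even though both PVs share rootPath "/"
findmnt -t ceph -o TARGET,SOURCE
# TARGET                                                                           SOURCE
# /var/lib/kubelet/plugins/kubernetes.io/csi/pv/cephfs-samenode-test1/globalmount  <mon-addresses>:/
# /var/lib/kubelet/plugins/kubernetes.io/csi/pv/cephfs-samenode-test2/globalmount  <mon-addresses>:/
```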
Looks like this is being discussed on the Kubernetes GitHub here: kubernetes/kubernetes#105323.

Yes, I opened an issue in Kubernetes to check whether this is expected or not, but it looks like there has been no response.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue is still relevant; it's just waiting on work on the kubelet side. If folks think we should close the one here, that's fine by me.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@ikogan I am also more inclined towards closing this one, as nothing can be done in cephcsi for it.
I think it makes sense to close this if we think it must be fixed in the kubelet. I was wondering why Ceph doesn't stage a CephFS mount once on a given node rather than mounting it multiple times. That seems a bit more logical given what it is. Wouldn't that also work around this issue?

As the stagingPaths are different, per the CSI standard we need to make sure the volume is mounted to the given stagingPath. If it were a single stagingPath, we wouldn't have any problem.

@gman0 can you please check whether we have this problem for shallow CephFS PVCs? Just making sure we are good.
@Madhu-1 sure, I'll give it a go tomorrow.

I tested this one with the shallow support PR and I don't see any issue. It would be good to get some confirmation on this one.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: pv-snapbacked-1
spec:
accessModes:
- ReadWriteMany
capacity:
storage: 2Gi
csi:
driver: cephfs.csi.ceph.com
nodeStageSecretRef:
name: csi-cephfs-microosd
namespace: default
volumeAttributes:
clusterID: microosd
fsName: cephfs
rootPath: /volumes/csi/csi-vol-d256e2c1-e0f5-11ec-82db-0242ac110003/65f292b3-8fdb-4822-aa13-61e8cea40fd6
backingSnapshotID: de550711-e0f5-11ec-82db-0242ac110003
staticVolume: "true"
volumeHandle: pv-snapbacked-1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: pvc-snapbacked-1
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 2Gi
volumeName: pv-snapbacked-1
storageClassName: ""
---
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: pv-snapbacked-2
spec:
accessModes:
- ReadWriteMany
capacity:
storage: 2Gi
csi:
driver: cephfs.csi.ceph.com
nodeStageSecretRef:
name: csi-cephfs-microosd
namespace: default
volumeAttributes:
clusterID: microosd
fsName: cephfs
rootPath: /volumes/csi/csi-vol-d256e2c1-e0f5-11ec-82db-0242ac110003/65f292b3-8fdb-4822-aa13-61e8cea40fd6
backingSnapshotID: de550711-e0f5-11ec-82db-0242ac110003
staticVolume: "true"
volumeHandle: pv-snapbacked-2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: pvc-snapbacked-2
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 2Gi
volumeName: pv-snapbacked-2
storageClassName: ""
---
apiVersion: v1
kind: Pod
metadata:
name: backingsnap-1
spec:
containers:
- name: web-server
image: docker.io/library/nginx:latest
volumeMounts:
- name: mypvc
mountPath: /var/lib/www
volumes:
- name: mypvc
persistentVolumeClaim:
claimName: pvc-snapbacked-1
readOnly: false
---
apiVersion: v1
kind: Pod
metadata:
name: backingsnap-2
spec:
containers:
- name: web-server
image: docker.io/library/nginx:latest
volumeMounts:
- name: mypvc
mountPath: /var/lib/www
volumes:
- name: mypvc
persistentVolumeClaim:
claimName: pvc-snapbacked-2
readOnly: false
```

Deleting both pods seems to work fine, and I couldn't find the error message from the original post in the kubelet logs.
@gman0 Thanks for confirming 🚀
Call MountDevice only when the volume was not device-mounted before.
So, just chiming in here: this will happen anytime you have pods mount the same CephFS volume on the same node, i.e. two pods on the same node both using the same sharedfs PVC. It also causes issues because the globalmount has a kernel client and can sometimes acquire an exclusive lock on files/folders, which it will then never let go of until manually unmounted.
@ADustyOldMuffin Hi, did you find any solution to this problem? I'm stuck with it right now too.
Our issue turned out to be something different, I believe. We bind-mounted the kubelet directory, and the kubelet had a bug (that someone has since fixed) where it thought the bind mounts were references held by a pod, so it never removed the staging mount. If all pods are off the machine, you can umount the staging mount to get rid of it.
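For reference, a sketch of that manual cleanup; the path follows the standard kubelet CSI staging layout, `<pv-name>` is a placeholder, and no pod on the node should still be using the volume:

```sh
# List CephFS staging mounts left behind under the kubelet's CSI plugin directory
mount | grep 'kubernetes.io/csi' | grep -i ceph

# Once every pod using the volume is gone from this node, drop the stale staging mount
umount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pv-name>/globalmount
```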
Describe the bug
When two (or possibly more) ReadWriteMany PVs are used to attach the same CephFS volume or subvolume on the same node, these volumes will be stuck unable to unmount due to:
In my case, here are the two mounts:
This also happens with subvolumes. So when I try to mount /volumes/shared twice on the same node, the same problem occurs when one of those workloads stops. This is most obvious with cronjobs, but replicas will do this too the moment the pod using the PV is stopped.

Environment details
- Image/version of Ceph CSI driver: quay.io/cephcsi/cephcsi:v3.3.1
- Helm chart version: rook-ceph:1.6.5
- Kernel version: Linux k8s-worker-5 5.4.0-42-generic #46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- Mounter used for mounting PVC (for cephfs its fuse or kernel, for rbd its krbd or rbd-nbd): kernel
- Kubernetes cluster version: Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:15:20Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
- Ceph cluster version: ceph version 15.2.13 (1f5c7871ec0e36ade641773b9b05b6211c308b9d) octopus (stable)
Steps to reproduce
Steps to reproduce the behavior:

1. Create two static ReadWriteMany PVs that point at the same CephFS rootPath (or subvolume), plus a PVC for each.
2. Start two pods on the same node, each using one of the PVCs.
3. Stop the pods, then delete them along with their PVCs and PVs.
Actual results
The CephFS mounts on the node remain forever, and the log message appears continually as the kubelet tries to unmount each volume repeatedly. Notably, this does not prevent starting these workloads again on the same node, as far as I've seen. The VolumeAttachment objects the PVs were using remain forever as well.

Expected behavior
The CephFS mounts should be properly unmounted from the node when they are no longer needed, and the VolumeAttachment objects should be properly removed. Errors should not continue to appear forever in the logs.

Logs
Unfortunately, I don't have the CSI logs at the moment, but I'll try to update with those when I have a chance to replicate this. Here are some relevant kubelet logs:
Additional Context
The Ceph cluster is deployed and managed by Proxmox and configured in k8s by Rook. Here's the cluster CRD's status:
While Rook is using an admin secret, I've added another secret for lower-privilege access for CephFS CSI mounting. Here's one of the volumes from the logs above; the other is nearly identical except for the name, uid, handle, and claimRef:
And here is the scrubbed node secret:
And finally, the cephx permissions:
For now, I've worked around this by using pod anti-affinity to ensure the jobs don't run on the same node, but I'll start running out of nodes eventually. A sketch of that workaround is shown below.
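As a rough illustration of that anti-affinity workaround (a sketch only; the `app: cephfs-shared-consumer` label is hypothetical and would need to be applied to every workload that mounts one of the PVs sharing the same rootPath):

```yaml
# Sketch: keep pods that mount the shared CephFS path off of each other's nodes.
apiVersion: v1
kind: Pod
metadata:
  name: cephfs-samenode-test1
  labels:
    app: cephfs-shared-consumer
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cephfs-shared-consumer
        topologyKey: kubernetes.io/hostname
  containers:
  - name: cephfs-samenode-test1
    image: alpine:latest
    volumeMounts:
    - mountPath: /mnt/test
      name: test
  volumes:
  - name: test
    persistentVolumeClaim:
      claimName: cephfs-samenode-test1
```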