It is possible that the NFS server goes down for some reason, e.g. a broken health check, a new deployment, a new IP/DNS, etc.
Unfortunately, NFS doesn't know how to unmount a directory once the server's IP address is no longer reachable. In that case the only way to unmount the folder is with the force and lazy options:
Assuming Linux:
umount -f -l /mnt/myfolder
will sort of fix the problem:
-f Force unmount (in case of an unreachable NFS system). (Requires kernel 2.1.116 or later.)
-l Lazy unmount. Detach the filesystem from the filesystem hierarchy now, and clean up all references to the filesystem as soon as it is not busy anymore. (Requires kernel 2.4.11 or later.)
-f also exists on Solaris and AIX.
Source: https://serverfault.com/a/56606/166830
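For illustration, a minimal recovery sequence on an affected node might look like this (the mount point and server address are placeholders, not values from this setup):

```
# force (-f) and lazy (-l) unmount the stale NFS mount, then remount it
# (placeholders: /mnt/myfolder and nfs-server.example.com are illustrative)
umount -f -l /mnt/myfolder
mount -t nfs nfs-server.example.com:/export /mnt/myfolder
```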
This is a known Kubernetes issue (kubernetes/kubernetes#101622). However, until it is fixed, I think it would be great to add this as a feature for non-compatible k8s versions, which will most likely keep using democratic-csi for a long time.
Also, while looking for the origin of this issue, and before I figured out that NFS stops working after the NFS server has been restarted/redeployed etc., I kept running into the following issues:
Pods stuck in ContainerCreating - this happened because the consuming pods were unable to mount the NFS volume, and kubelet just timed out over and over again.
Checking the actual node, I saw that running sudo findmnt --mountpoint /var/...... --output source,target,fstype,label,options,avail,size,used -b -J (checked on GKE) kept hanging.
GKE allows debugging, e.g. with strace, via its toolbox. So I ran the command there and saw that it was hanging at this point:
0x7ffcdbb465f0) = -1 EIO (Input/output error)
brk(0x559f1da8d000) = 0x559f1da8d000
getcwd("/root", 4096) = 6
lstat("/root/clearml-agent-caching.clearml.svc.cluster.local:", 0x7ffcdbb465b0) = -1 ENOENT (No such file or directory)
open("/etc/udev/udev.conf", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
getcwd("/root", 4096) = 6
lstat("/root/clearml-agent-caching.clearml.svc.cluster.local:", 0x7ffcdbb46640) = -1 ENOENT (No such file or directory)
access("/sys/subsystem/block/devices/clearml-agent-caching.clearml.svc.cluster.local:!", F_OK) = -1 ENOENT (No such file or directory)
access("/sys/bus/block/devices/clearml-agent-caching.clearml.svc.cluster.local:!", F_OK) = -1 ENOENT (No such file or directory)
access("/sys/class/block/clearml-agent-caching.clearml.svc.cluster.local:!", F_OK) = -1 ENOENT (No such file or directory)
open("clearml-agent-caching.clearml.svc.cluster.local:/", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
statfs("/media/root/var/lib/kubelet/plugins/kubernetes.io/csi/pv/clearml-agent-caching/globalmount", 0x7ffcdbb466c0) = -1 EIO (Input/output error)
statfs("/media/root/var/lib/kubelet/plugins/kubernetes.io/csi/pv/clearml-agent-caching/globalmount",
This is most likely because the driver itself faced a hanging fstat/stat. I was also able to see that the driver tried to unmount/UnPublish, but it just didn't actually do anything:
Basically, listing all of the NFS volumes that were set up by me (hence the caching keyword) and then just unmounting with force and lazy.
The proposed solution, as discussed on the k8s Slack channel, is to add a timeout for fstat; if it fails or times out, the driver will just unmount with force & lazy and then remount if required.
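A minimal sketch of that logic in shell terms (the 10-second timeout and the mount path are my assumptions, not the driver's actual values):

```
# probe the mount with a bounded stat; if it hangs or errors, detach it
MP="/var/lib/kubelet/plugins/kubernetes.io/csi/pv/clearml-agent-caching/globalmount"
if ! timeout 10 stat "$MP" >/dev/null 2>&1; then
  # force (-f) and lazy (-l) unmount the stale NFS mount
  umount -f -l "$MP"
  # the driver would then remount the volume on the next mount request
fi
```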
I have also found a few more GH issues which might be interesting to check:
Can you deploy the next branch images and try out the fix? Make sure to set the image pull policy to Always to ensure you are getting the most recent images, as the next image tag is mutable.
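For example, one way to flip the pull policy on an existing install (the namespace, daemonset name, and container index below are assumptions; adjust them to your deployment):

```
# hypothetical: force imagePullPolicy=Always so the mutable "next" tag
# is re-pulled on every pod start; names/index below are assumptions
kubectl -n democratic-csi patch daemonset democratic-csi-node --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/imagePullPolicy", "value": "Always"}]'
```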