NFS: make sure NFS is unmounted in case of a failure #100

Closed
Shaked opened this issue Jun 20, 2021 · 2 comments

Shaked commented Jun 20, 2021

It is possible that the NFS server goes down for some reason, e.g. a broken health check, a new deployment, a new IP/DNS, etc.

Unfortunately, NFS does not know how to unmount a directory once the server's IP address is no longer reachable. In that case the only way to unmount the folder is to use the force and lazy options:

Assuming Linux:
umount -f -l /mnt/myfolder
This will more or less fix the problem:
-f Force unmount (in case of an unreachable NFS system). (Requires kernel 2.1.116 or later.)
-l Lazy unmount. Detach the filesystem from the filesystem hierarchy now, and cleanup all references to the filesystem as soon as it is not busy anymore. (Requires kernel 2.4.11 or later.)
-f also exists on Solaris and AIX.

Source: https://serverfault.com/a/56606/166830
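
To confirm that a mount is actually dead before forcing it, a minimal check (just a sketch, reusing the example path from above):

# a stale NFS mount typically hangs or returns EIO on stat; bound the check with a timeout
timeout 5 stat /mnt/myfolder > /dev/null 2>&1 || echo "stale or unreachable NFS mount"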

This is a known Kubernetes issue: kubernetes/kubernetes#101622. However, until that is fixed, I think it would be great to add this as a feature for non-compatible Kubernetes versions, which will most likely keep using democratic-csi for a long time.

Also, while looking for the origin of this issue, and before I figured out that NFS stops working after the NFS server has been restarted/redeployed, I kept running into the following problems:

  • Pods stuck in ContainerCreating - this happened because the consuming pods were not able to mount the NFS volume and kubelet just timed out over and over again.
  • Checking the actual node, I saw that running sudo findmnt --mountpoint /var/...... --output source,target,fstype,label,options,avail,size,used -b -J (checked on GKE) kept hanging.
    • GKE allows debugging, e.g. with strace, via its toolbox (see the sketch after this list for roughly how I ran it). Running the command under strace, I saw that it was hanging at this point:

0x7ffcdbb465f0) = -1 EIO (Input/output error)
brk(0x559f1da8d000) = 0x559f1da8d000
getcwd("/root", 4096) = 6
lstat("/root/clearml-agent-caching.clearml.svc.cluster.local:", 0x7ffcdbb465b0) = -1 ENOENT (No such file or directory)
open("/etc/udev/udev.conf", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
getcwd("/root", 4096) = 6
lstat("/root/clearml-agent-caching.clearml.svc.cluster.local:", 0x7ffcdbb46640) = -1 ENOENT (No such file or directory)
access("/sys/subsystem/block/devices/clearml-agent-caching.clearml.svc.cluster.local:!", F_OK) = -1 ENOENT (No such file or directory)
access("/sys/bus/block/devices/clearml-agent-caching.clearml.svc.cluster.local:!", F_OK) = -1 ENOENT (No such file or directory)
access("/sys/class/block/clearml-agent-caching.clearml.svc.cluster.local:!", F_OK) = -1 ENOENT (No such file or directory)
open("clearml-agent-caching.clearml.svc.cluster.local:/", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
statfs("/media/root/var/lib/kubelet/plugins/kubernetes.io/csi/pv/clearml-agent-caching/globalmount", 0x7ffcdbb466c0) = -1 EIO (Input/output error)
statfs("/media/root/var/lib/kubelet/plugins/kubernetes.io/csi/pv/clearml-agent-caching/globalmount",

  • I was also able to see that the driver tried to unmount/NodeUnpublishVolume, but it just didn't actually do anything:

{"service":"democratic-csi","level":"info","message":"new response - driver: NodeManualDriver method: NodeGetCapabilities response: {"capabilities":[{"rpc":{"type":"STAGE_UNSTAGE_VOLUME"}}]}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: NodeManualDriver method: NodeGetCapabilities call: {"_events":{},"_eventsCount":1,"call":{},"cancelled":false,"metadata":{"_internal_repr":{"user-agent":["grpc-go/1.26.0"]},"flags":0},"request":{}}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: NodeManualDriver method: NodeGetCapabilities response: {"capabilities":[{"rpc":{"type":"STAGE_UNSTAGE_VOLUME"}}]}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: NodeManualDriver method: NodeGetCapabilities call: {"_events":{},"_eventsCount":1,"call":{},"cancelled":false,"metadata":{"_internal_repr":{"user-agent":["grpc-go/1.26.0"]},"flags":0},"request":{}}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: NodeManualDriver method: NodeGetCapabilities response: {"capabilities":[{"rpc":{"type":"STAGE_UNSTAGE_VOLUME"}}]}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: NodeManualDriver method: NodeUnpublishVolume call: {"_events":{},"_eventsCount":1,"call":{},"cancelled":false,"metadata":{"_internal_repr":{"user-agent":["grpc-go/1.26.0"]},"flags":0},"request":{"volume_id":"unique-volumeid-1","target_path":"/var/lib/kubelet/pods/c0126221-5254-4f75-b832-0aafe2aea5fb/volumes/kubernetes.iocsi/clearml-agent-caching/mount"}}"}
executing mount command: findmnt --mountpoint /var/lib/kubelet/pods/c0126221-5254-4f75-b832-0aafe2aea5fb/volumes/kubernetes.io
csi/clearml-agent-caching/mount --output source,target,fstype,label,options,avail,size,used -b -J
{"service":"democratic-csi","level":"info","message":"new request - driver: NodeManualDriver method: NodeUnpublishVolume call: {"_events":{},"_eventsCount":1,"call":{},"cancelled":false,"metadata":{"_internal_repr":{"user-agent":["grpc-go/1.26.0"]},"flags":0},"request":{"volume_id":"unique-volumeid-1","target_path":"/var/lib/kubelet/pods/f201ec10-c0d5-41eb-9e02-691c4fc9f1c3/volumes/kubernetes.io~csi/clearml-agent-caching/mount"}}"}

This is most likely because the driver itself was stuck on a hanging fstat/stat call.

  • The only way to fix this was to manually unmount all of the mounted but non-working volumes:
mount -l | grep caching | cut -f3 -d ' ' | sort -u | xargs sudo umount --force --lazy
  • Basically, this lists all of the NFS volumes that were set up by me (hence the caching keyword) and then unmounts them with force and lazy.
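
For reference, the strace output above was captured roughly like this. This is only a sketch, assuming a GKE Container-Optimized OS node; the PV name in the path is a placeholder:

# SSH to the affected node and start the GKE debug toolbox
# (the node's filesystem is available under /media/root inside it)
toolbox
# inside the toolbox: install strace and re-run the hanging findmnt under it
apt-get update && apt-get install -y strace
strace findmnt --mountpoint /media/root/var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pv-name>/globalmount --output source,target,fstype,label,options,avail,size,used -b -J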

The proposed solution, as discussed on the k8s Slack channel, is to add a timeout to the fstat call; if it fails, the driver would simply unmount with force & lazy and then remount if required.
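
To illustrate the idea in shell (this is only a sketch of the behaviour, not the driver's actual code; the stat command stands in for the driver's fstat call, and the mount path, timeout, and NFS share are placeholders):

MOUNT_PATH="/var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pv-name>/globalmount"

# probe the mount with a bounded stat; a stale NFS mount hangs or returns EIO
if ! timeout 10 stat -f "$MOUNT_PATH" > /dev/null 2>&1; then
    # the server is unreachable: detach forcefully and lazily instead of hanging forever
    umount --force --lazy "$MOUNT_PATH"
    # remount if the volume is still required
    mount -t nfs <nfs-server>:/<export> "$MOUNT_PATH"
fi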

I have also found a few more GH issues which might be interesting to check:


travisghansen (Member) commented Jun 23, 2021

Can you deploy the next branch images and try out the fix? Make sure to set the image pull policy to Always to ensure you are getting the most recent images, as the next image tag is mutable.

a26488c
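
In case it helps, one way to do that with plain kubectl; the namespace, DaemonSet name, container name, and container index below are assumptions and will differ depending on how the chart was deployed:

# point the node plugin at the mutable "next" tag
# NOTE: daemonset and container names are assumptions; adjust to your deployment
kubectl -n democratic-csi set image daemonset/<release>-democratic-csi-node csi-driver=democraticcsi/democratic-csi:next

# re-pull the image on every restart, since the "next" tag is mutable
# NOTE: the container index 0 is an assumption
kubectl -n democratic-csi patch daemonset/<release>-democratic-csi-node --type=json -p '[{"op":"replace","path":"/spec/template/spec/containers/0/imagePullPolicy","value":"Always"}]'

# restart the pods to pick up the latest image
kubectl -n democratic-csi rollout restart daemonset/<release>-democratic-csi-node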

travisghansen (Member) commented:
Released in v1.3.0.
