NFS: make sure NFS is unmounted in case of a failure #100

Closed
Shaked opened this issue Jun 20, 2021 · 2 comments

Shaked commented Jun 20, 2021

It is possible that the NFS server goes down for some reason, e.g. a broken health check, a new deployment, a new IP/DNS, etc.

Unfortunately, NFS does not know how to unmount a directory once the server's IP address is no longer reachable. In that case the only way to unmount the folder is to use the force and lazy options:

Assuming Linux:
umount -f -l /mnt/myfolder
This will more or less fix the problem:
-f Force unmount (in case of an unreachable NFS system). (Requires kernel 2.1.116 or later.)
-l Lazy unmount. Detach the filesystem from the filesystem hierarchy now, and cleanup all references to the filesystem as soon as it is not busy anymore. (Requires kernel 2.4.11 or later.)
-f also exists on Solaris and AIX.

Source: https://serverfault.com/a/56606/166830
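
To confirm that a mount is actually dead before forcing it, a minimal check (just a sketch, reusing the example path from above):

# a stale NFS mount typically hangs or returns EIO on stat; bound the check with a timeout
timeout 5 stat /mnt/myfolder > /dev/null 2>&1 || echo "stale or unreachable NFS mount"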

This is a known Kubernetes issue: kubernetes/kubernetes#101622. However, until that is fixed, I think it would be great to add this as a feature for non-compatible Kubernetes versions, which will most likely keep using democratic-csi for a long time.

Also, while looking for the origin of this issue, and before I figured out that NFS stops working after the NFS server has been restarted/redeployed, I kept running into the following problems:

  • Pods stuck in ContainerCreating - this happened because the consuming pods were not able to mount the NFS volume and kubelet just timed out over and over again.
  • Checking the actual node, I saw that running sudo findmnt --mountpoint /var/...... --output source,target,fstype,label,options,avail,size,used -b -J (checked on GKE) kept hanging.
    • GKE allows debugging, e.g. with strace, via its toolbox (see the sketch after this list for roughly how I ran it). Running the command under strace, I saw that it was hanging at this point:

0x7ffcdbb465f0) = -1 EIO (Input/output error)
brk(0x559f1da8d000) = 0x559f1da8d000
getcwd("/root", 4096) = 6
lstat("/root/clearml-agent-caching.clearml.svc.cluster.local:", 0x7ffcdbb465b0) = -1 ENOENT (No such file or directory)
open("/etc/udev/udev.conf", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
getcwd("/root", 4096) = 6
lstat("/root/clearml-agent-caching.clearml.svc.cluster.local:", 0x7ffcdbb46640) = -1 ENOENT (No such file or directory)
access("/sys/subsystem/block/devices/clearml-agent-caching.clearml.svc.cluster.local:!", F_OK) = -1 ENOENT (No such file or directory)
access("/sys/bus/block/devices/clearml-agent-caching.clearml.svc.cluster.local:!", F_OK) = -1 ENOENT (No such file or directory)
access("/sys/class/block/clearml-agent-caching.clearml.svc.cluster.local:!", F_OK) = -1 ENOENT (No such file or directory)
open("clearml-agent-caching.clearml.svc.cluster.local:/", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
statfs("/media/root/var/lib/kubelet/plugins/kubernetes.io/csi/pv/clearml-agent-caching/globalmount", 0x7ffcdbb466c0) = -1 EIO (Input/output error)
statfs("/media/root/var/lib/kubelet/plugins/kubernetes.io/csi/pv/clearml-agent-caching/globalmount",

  • I was also able to see that the driver tried to unmount/NodeUnpublishVolume, but it just didn't actually do anything:

{"service":"democratic-csi","level":"info","message":"new response - driver: NodeManualDriver method: NodeGetCapabilities response: {"capabilities":[{"rpc":{"type":"STAGE_UNSTAGE_VOLUME"}}]}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: NodeManualDriver method: NodeGetCapabilities call: {"_events":{},"_eventsCount":1,"call":{},"cancelled":false,"metadata":{"_internal_repr":{"user-agent":["grpc-go/1.26.0"]},"flags":0},"request":{}}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: NodeManualDriver method: NodeGetCapabilities response: {"capabilities":[{"rpc":{"type":"STAGE_UNSTAGE_VOLUME"}}]}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: NodeManualDriver method: NodeGetCapabilities call: {"_events":{},"_eventsCount":1,"call":{},"cancelled":false,"metadata":{"_internal_repr":{"user-agent":["grpc-go/1.26.0"]},"flags":0},"request":{}}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: NodeManualDriver method: NodeGetCapabilities response: {"capabilities":[{"rpc":{"type":"STAGE_UNSTAGE_VOLUME"}}]}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: NodeManualDriver method: NodeUnpublishVolume call: {"_events":{},"_eventsCount":1,"call":{},"cancelled":false,"metadata":{"_internal_repr":{"user-agent":["grpc-go/1.26.0"]},"flags":0},"request":{"volume_id":"unique-volumeid-1","target_path":"/var/lib/kubelet/pods/c0126221-5254-4f75-b832-0aafe2aea5fb/volumes/kubernetes.iocsi/clearml-agent-caching/mount"}}"}
executing mount command: findmnt --mountpoint /var/lib/kubelet/pods/c0126221-5254-4f75-b832-0aafe2aea5fb/volumes/kubernetes.io
csi/clearml-agent-caching/mount --output source,target,fstype,label,options,avail,size,used -b -J
{"service":"democratic-csi","level":"info","message":"new request - driver: NodeManualDriver method: NodeUnpublishVolume call: {"_events":{},"_eventsCount":1,"call":{},"cancelled":false,"metadata":{"_internal_repr":{"user-agent":["grpc-go/1.26.0"]},"flags":0},"request":{"volume_id":"unique-volumeid-1","target_path":"/var/lib/kubelet/pods/f201ec10-c0d5-41eb-9e02-691c4fc9f1c3/volumes/kubernetes.io~csi/clearml-agent-caching/mount"}}"}

This is most likely because the driver itself was stuck on a hanging fstat/stat call.

  • The only way to fix this was to manually unmount all of the mounted but non-working volumes:
mount -l | grep caching | cut -f3 -d ' ' | sort -u | xargs sudo umount --force --lazy
  • Basically, this lists all of the NFS volumes that were set up by me (hence the caching keyword) and then unmounts them with force and lazy.
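
For reference, the strace output above was captured roughly like this. This is only a sketch, assuming a GKE Container-Optimized OS node; the PV name in the path is a placeholder:

# SSH to the affected node and start the GKE debug toolbox
# (the node's filesystem is available under /media/root inside it)
toolbox
# inside the toolbox: install strace and re-run the hanging findmnt under it
apt-get update && apt-get install -y strace
strace findmnt --mountpoint /media/root/var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pv-name>/globalmount --output source,target,fstype,label,options,avail,size,used -b -J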

The proposed solution, as discussed on the k8s Slack channel, is to add a timeout to the fstat call; if it fails, the driver would simply unmount with force & lazy and then remount if required.
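
To illustrate the idea in shell (this is only a sketch of the behaviour, not the driver's actual code; the stat command stands in for the driver's fstat call, and the mount path, timeout, and NFS share are placeholders):

MOUNT_PATH="/var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pv-name>/globalmount"

# probe the mount with a bounded stat; a stale NFS mount hangs or returns EIO
if ! timeout 10 stat -f "$MOUNT_PATH" > /dev/null 2>&1; then
    # the server is unreachable: detach forcefully and lazily instead of hanging forever
    umount --force --lazy "$MOUNT_PATH"
    # remount if the volume is still required
    mount -t nfs <nfs-server>:/<export> "$MOUNT_PATH"
fi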

I have also found a few more GH issues which might be interesting to check:


travisghansen (Member) commented Jun 23, 2021

Can you deploy the next branch images and try out the fix? Make sure to set the image pull policy to Always to ensure you are getting the most recent images, as the next image tag is mutable.

a26488c
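
In case it helps, one way to do that with plain kubectl; the namespace, DaemonSet name, container name, and container index below are assumptions and will differ depending on how the chart was deployed:

# point the node plugin at the mutable "next" tag
# NOTE: daemonset and container names are assumptions; adjust to your deployment
kubectl -n democratic-csi set image daemonset/<release>-democratic-csi-node csi-driver=democraticcsi/democratic-csi:next

# re-pull the image on every restart, since the "next" tag is mutable
# NOTE: the container index 0 is an assumption
kubectl -n democratic-csi patch daemonset/<release>-democratic-csi-node --type=json -p '[{"op":"replace","path":"/spec/template/spec/containers/0/imagePullPolicy","value":"Always"}]'

# restart the pods to pick up the latest image
kubectl -n democratic-csi rollout restart daemonset/<release>-democratic-csi-node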

travisghansen (Member) commented:
Released in v1.3.0.
