
krbd: make sure the device node is accessible after the mapping #39606

Merged: idryomov merged 1 commit into ceph:master from wip-rbd-map-sanity-check on Feb 23, 2021

Conversation

idryomov (Contributor)

We have always assumed this to be the case and users' scripts and
orchestration tools have grown to depend on this. Let's add some
enforcement, prompted by [1]:

"I am running my Kubernetes worker node inside of an LXC container
which doesn't benefit from the device node created by the kernel, so
I'm using udev to create the /dev/rbd* device nodes inside of the LXC
container."

which, through the unfortunate interaction with ceph-csi rbd plugin,
results in data loss for "volumeMode: Filesystem" PVs because it ends
up recreating the filesystem every time the PV is attached to the pod:

"When deleting the pod and re-creating it, I can see that the RBD
image is indeed being reformatted. This seems to be because when
blkid is being run to check if the image is formatted, the /dev/rbd*
device has not yet been created by udev. By the time the code gets
down to running mkfs, the device is there and the damage is done."

[1] ceph/ceph-csi#1820

Fixes: https://tracker.ceph.com/issues/49410
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

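For context, here is a minimal sketch of the kind of post-map sanity check described above. This is not the actual krbd.cc change from this PR; the helper name, the major/minor plumbing and the exact error text are assumptions, though the message keeps the "mapping succeeded" wording referred to later in this conversation so callers can tell the map itself did not fail.

```cpp
// Minimal illustrative sketch (not the code merged in this PR): after the
// kernel reports a new mapping, verify that the corresponding device node
// exists and refers to the expected block device before exiting 0.
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <cerrno>
#include <cstdio>
#include <string>

// Hypothetical helper: expected_major/expected_minor would come from the
// kernel (e.g. sysfs) once the mapping is established.
static int verify_device_node(const std::string& devnode,
                              unsigned expected_major,
                              unsigned expected_minor)
{
  struct stat sb;
  if (stat(devnode.c_str(), &sb) < 0) {
    int r = -errno;
    // Error text is illustrative; it deliberately says "mapping succeeded"
    // so callers know the device is mapped even though the command failed.
    fprintf(stderr, "rbd: mapping succeeded, but %s is not accessible\n",
            devnode.c_str());
    return r;
  }
  if (!S_ISBLK(sb.st_mode) ||
      major(sb.st_rdev) != expected_major ||
      minor(sb.st_rdev) != expected_minor) {
    fprintf(stderr, "rbd: mapping succeeded, but %s is not the mapped device\n",
            devnode.c_str());
    return -EINVAL;
  }
  return 0;  // device node is present and matches the new mapping
}
```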
@idryomov (Contributor, Author)

jenkins test make check

@idryomov (Contributor, Author)

jenkins test api

@dillaman left a comment

lgtm -- but should it attempt to unmap the device in this case? I am just thinking of ceph-csi in its retry loop continuing to map the same device over and over again because it's getting -EINVAL failure codes due to the odd /dev setup.

@idryomov (Contributor, Author)

I'm not sure, but I'd rather not. I wanted it to be a warning at first, but given that actual data loss was observed I decided to make it an error. The error message says "mapping succeeded", clearly indicating that the device has been mapped. Outside of ceph-csi, I can imagine some *notify-based automation picking up the new device node and kicking off work. Even if that work is just logging for audit purposes, we probably don't want to interfere and attempt to unmap from a process called rbd map ....

We used to have a similar failure mode with a udev-related timeout: rbd map exited with ETIMEDOUT, leaving the device mapped, and ceph-csi handles it by parsing the error message and calling detachRBDImageOrDeviceSpec(). Overall, given the CSI spec, I think ceph-csi has to be prepared to deal with stale mappings, images and other artefacts, because IIRC the CO (i.e. Kubernetes) is allowed to give up and move the workload to a different node at any time, without attempting to unstage or, in fact, sending any notification, leaving all kinds of state behind.
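To illustrate the caller-side pattern described above: ceph-csi itself is written in Go, and everything below is a hypothetical sketch rather than ceph-csi code; the error strings it matches are illustrative only.

```cpp
// Hypothetical caller-side handling: if "rbd map" fails but its output
// shows that the mapping itself was established (udev timeout, device
// node not accessible), unmap before retrying so that no stale mapping
// is left behind.
#include <cstdio>
#include <string>

// Run a shell command, capturing combined stdout/stderr and the exit
// status. Purely illustrative plumbing.
static std::string run_command(const std::string& cmd, int* status)
{
  std::string out;
  FILE* p = popen((cmd + " 2>&1").c_str(), "r");
  if (!p) {
    *status = -1;
    return out;
  }
  char buf[4096];
  size_t n;
  while ((n = fread(buf, 1, sizeof(buf), p)) > 0)
    out.append(buf, n);
  *status = pclose(p);
  return out;
}

static bool map_with_cleanup(const std::string& spec, std::string* devnode)
{
  int status;
  std::string out = run_command("rbd map " + spec, &status);
  if (status == 0) {
    // On success, rbd map prints the device path (e.g. /dev/rbd0).
    while (!out.empty() && (out.back() == '\n' || out.back() == '\r'))
      out.pop_back();
    *devnode = out;
    return true;
  }
  // The command failed, but the mapping may still exist: parse the
  // output and clean up, mirroring the approach described above.
  if (out.find("mapping succeeded") != std::string::npos ||
      out.find("timed out") != std::string::npos) {
    run_command("rbd unmap " + spec, &status);  // best-effort cleanup
  }
  return false;
}
```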

idryomov merged commit b5c7d17 into ceph:master on Feb 23, 2021
idryomov deleted the wip-rbd-map-sanity-check branch on February 23, 2021 at 16:37