
krbd: make sure the device node is accessible after the mapping #39606

Merged: idryomov merged 1 commit into ceph:master from wip-rbd-map-sanity-check on Feb 23, 2021

Conversation

idryomov (Contributor)

We have always assumed this to be the case and users' scripts and
orchestration tools have grown to depend on this. Let's add some
enforcement, prompted by [1]:

"I am running my Kubernetes worker node inside of an LXC container
which doesn't benefit from the device node created by the kernel, so
I'm using udev to create the /dev/rbd* device nodes inside of the LXC
container."

which, through the unfortunate interaction with ceph-csi rbd plugin,
results in data loss for "volumeMode: Filesystem" PVs because it ends
up recreating the filesystem every time the PV is attached to the pod:

"When deleting the pod and re-creating it, I can see that the RBD
image is indeed being reformatted. This seems to be because when
blkid is being run to check if the image is formatted, the /dev/rbd*
device has not yet been created by udev. By the time the code gets
down to running mkfs, the device is there and the damage is done."

[1] ceph/ceph-csi#1820

Fixes: https://tracker.ceph.com/issues/49410
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

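For context, here is a minimal sketch of the kind of post-map sanity check described above. This is not the actual krbd.cc change from this PR; the helper name, the major/minor plumbing and the exact error text are assumptions, though the message keeps the "mapping succeeded" wording referred to later in this conversation so callers can tell the map itself did not fail.

```cpp
// Minimal illustrative sketch (not the code merged in this PR): after the
// kernel reports a new mapping, verify that the corresponding device node
// exists and refers to the expected block device before exiting 0.
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <cerrno>
#include <cstdio>
#include <string>

// Hypothetical helper: expected_major/expected_minor would come from the
// kernel (e.g. sysfs) once the mapping is established.
static int verify_device_node(const std::string& devnode,
                              unsigned expected_major,
                              unsigned expected_minor)
{
  struct stat sb;
  if (stat(devnode.c_str(), &sb) < 0) {
    int r = -errno;
    // Error text is illustrative; it deliberately says "mapping succeeded"
    // so callers know the device is mapped even though the command failed.
    fprintf(stderr, "rbd: mapping succeeded, but %s is not accessible\n",
            devnode.c_str());
    return r;
  }
  if (!S_ISBLK(sb.st_mode) ||
      major(sb.st_rdev) != expected_major ||
      minor(sb.st_rdev) != expected_minor) {
    fprintf(stderr, "rbd: mapping succeeded, but %s is not the mapped device\n",
            devnode.c_str());
    return -EINVAL;
  }
  return 0;  // device node is present and matches the new mapping
}
```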
@idryomov (Contributor, Author)

jenkins test make check

@idryomov (Contributor, Author)

jenkins test api

@dillaman left a comment

lgtm -- but should it attempt to unmap the device in this case? I am just thinking of ceph-csi in its retry loop continuing to map the same device over and over again because it's getting -EINVAL failure codes due to the odd /dev setup.

@idryomov (Contributor, Author)

I'm not sure, but I'd rather not. I wanted it to be a warning at first, but given that actual data loss was observed I decided to make it an error. The error message says "mapping succeeded", clearly indicating that the device has been mapped. Outside of ceph-csi, I can imagine some *notify-based automation picking up the new device node and kicking off work. Even if that work is just logging for audit purposes, we probably don't want to interfere and attempt to unmap from a process called rbd map ....

We used to have a similar failure mode with a udev-related timeout: rbd map exited with ETIMEDOUT, leaving the device mapped, and ceph-csi handles it by parsing the error message and calling detachRBDImageOrDeviceSpec(). Overall, given the CSI spec, I think ceph-csi has to be prepared to deal with stale mappings, images and other artefacts, because IIRC the CO (i.e. Kubernetes) is allowed to give up and move the workload to a different node at any time, without attempting to unstage or, in fact, sending any notification, leaving all kinds of state behind.
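To illustrate the caller-side pattern described above: ceph-csi itself is written in Go, and everything below is a hypothetical sketch rather than ceph-csi code; the error strings it matches are illustrative only.

```cpp
// Hypothetical caller-side handling: if "rbd map" fails but its output
// shows that the mapping itself was established (udev timeout, device
// node not accessible), unmap before retrying so that no stale mapping
// is left behind.
#include <cstdio>
#include <string>

// Run a shell command, capturing combined stdout/stderr and the exit
// status. Purely illustrative plumbing.
static std::string run_command(const std::string& cmd, int* status)
{
  std::string out;
  FILE* p = popen((cmd + " 2>&1").c_str(), "r");
  if (!p) {
    *status = -1;
    return out;
  }
  char buf[4096];
  size_t n;
  while ((n = fread(buf, 1, sizeof(buf), p)) > 0)
    out.append(buf, n);
  *status = pclose(p);
  return out;
}

static bool map_with_cleanup(const std::string& spec, std::string* devnode)
{
  int status;
  std::string out = run_command("rbd map " + spec, &status);
  if (status == 0) {
    // On success, rbd map prints the device path (e.g. /dev/rbd0).
    while (!out.empty() && (out.back() == '\n' || out.back() == '\r'))
      out.pop_back();
    *devnode = out;
    return true;
  }
  // The command failed, but the mapping may still exist: parse the
  // output and clean up, mirroring the approach described above.
  if (out.find("mapping succeeded") != std::string::npos ||
      out.find("timed out") != std::string::npos) {
    run_command("rbd unmap " + spec, &status);  // best-effort cleanup
  }
  return false;
}
```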

idryomov merged commit b5c7d17 into ceph:master on Feb 23, 2021
idryomov deleted the wip-rbd-map-sanity-check branch on February 23, 2021 at 16:37