os/bluestore/KernelDevice: use flock(2) for block device lock #26245
Conversation
The fcntl locks fail due to the classic posix lock gotcha: if you close *any* fd to the same inode from the process, the lock(s) go away. Use flock(2) instead.

We have to be careful because we open the main bluestore device via two KernelDevice instances: one for bluestore and one for bluefs. Add a no-lock flag so that the bluefs instance does not try to lock and does not conflict with bluestore's.

Fixes: http://tracker.ceph.com/issues/38150
Signed-off-by: Sage Weil <sage@redhat.com>
Looks good.
discard_cb[id], static_cast<void*>(this));
if (shared_with_bluestore) {
  b->set_no_exclusive_lock();
}
A little convoluted... if shared_with_bluestore is true then lock_exclusive becomes false, but this does look like it'll do the right thing.
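For readers following along, here is a hedged sketch (illustrative only, not the actual Ceph implementation; names like KernelDeviceSketch and open_dev are made up) of how such a no-exclusive-lock flag can gate flock(2), so that two handles to the same device don't fight over the lock:

```cpp
// Sketch: a "lock_exclusive" flag gates flock(2) at open time, so the
// second (bluefs-style) instance can skip locking entirely.
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

struct KernelDeviceSketch {
  int fd = -1;
  bool lock_exclusive = true;   // cleared via set_no_exclusive_lock()

  void set_no_exclusive_lock() { lock_exclusive = false; }

  int open_dev(const char *path) {
    fd = ::open(path, O_RDWR);
    if (fd < 0) return -errno;
    if (lock_exclusive) {
      // flock(2) locks belong to the open file description, so closing
      // some other fd to the same inode does not drop them.
      if (::flock(fd, LOCK_EX | LOCK_NB) < 0) {
        int r = -errno;         // -EWOULDBLOCK if someone else holds it
        ::close(fd);
        fd = -1;
        return r;
      }
    }
    return 0;
  }

  ~KernelDeviceSketch() { if (fd >= 0) ::close(fd); }
};
```

The instance that shares the device would clear the flag before opening, which is what the shared_with_bluestore branch in the diff accomplishes.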
It makes sense to me, but I remain curious whether containers will cause issues. Do these locks still work if systemd/docker uses mknod to set up the OSD /dev namespace? Based on my testing, mknod always produces a new inode.
That's correct -- you'll get a new inode in that case that just happens to be hooked up to the same device as the first one. I don't see a way to do meaningful locking in that situation. Maybe we should fix that in the kernel?
I suspect the use of … Perhaps there needs to be an ioctl that locks the backing bdev?
Yeah, that could work. Open the bdev, use the ioctl to switch the fd over to the bd_inode's locking context rather than the one for the device, and then set locks on it instead of the inode attached to the filp. We could also consider doing this universally, without the ioctl, but we'd need a pretty long testing cycle to see if anything would break, given that this is a subtle behavior change.
Starting with Nautilus, Ceph implements flock(2) in the ceph-osd daemon directly. See https://tracker.ceph.com/issues/38150 and ceph/ceph#26245. There are backporting tickets for Mimic and Luminous, but they haven't been acted on since February 2019.
This may have introduced a regression; see https://tracker.ceph.com/issues/46124. Also, OFD locks might be even better; see https://tracker.ceph.com/issues/46124#note-3.