os/bluestore/KernelDevice: use flock(2) for block device lock #26245
Conversation
The fcntl locks fail due to the classic posix lock gotcha: if you close *any* fd to the same inode from the process, the lock(s) go away. Use flock(2) instead.

We have to be careful because we open the main bluestore device via two KernelDevice instances: one for bluestore and one for bluefs. Add a no-lock flag so that the bluefs instance does not try to lock and does not conflict with bluestore's.

Fixes: http://tracker.ceph.com/issues/38150
Signed-off-by: Sage Weil <sage@redhat.com>
Looks good.
discard_cb[id], static_cast<void*>(this));
if (shared_with_bluestore) {
  b->set_no_exclusive_lock();
}
A little convoluted... if shared_with_bluestore is true then lock_exclusive becomes false, but this does look like it'll do the right thing.
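For readers following along, here is a hedged sketch (illustrative only, not the actual Ceph implementation; names like KernelDeviceSketch and open_dev are made up) of how such a no-exclusive-lock flag can gate flock(2), so that two handles to the same device don't fight over the lock:

```cpp
// Sketch: a "lock_exclusive" flag gates flock(2) at open time, so the
// second (bluefs-style) instance can skip locking entirely.
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

struct KernelDeviceSketch {
  int fd = -1;
  bool lock_exclusive = true;   // cleared via set_no_exclusive_lock()

  void set_no_exclusive_lock() { lock_exclusive = false; }

  int open_dev(const char *path) {
    fd = ::open(path, O_RDWR);
    if (fd < 0) return -errno;
    if (lock_exclusive) {
      // flock(2) locks belong to the open file description, so closing
      // some other fd to the same inode does not drop them.
      if (::flock(fd, LOCK_EX | LOCK_NB) < 0) {
        int r = -errno;         // -EWOULDBLOCK if someone else holds it
        ::close(fd);
        fd = -1;
        return r;
      }
    }
    return 0;
  }

  ~KernelDeviceSketch() { if (fd >= 0) ::close(fd); }
};
```

The instance that shares the device would clear the flag before opening, which is what the shared_with_bluestore branch in the diff accomplishes.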
It makes sense to me, but I remain curious whether containers will cause issues. Do these locks still work if systemd/docker uses mknod to set up the OSD /dev namespace? Based on my testing, mknod always produces a new inode.
That's correct -- you'll get a new inode in that case that just happens to be hooked up to the same device as the first one. I don't see a way to do meaningful locking in that situation. Maybe we should fix that in the kernel?
I suspect the use of … Perhaps there needs to be an ioctl that locks the backing bdev?
Yeah, that could work. Open the bdev, use the ioctl to switch the fd over to the bd_inode's locking context rather than the one for the device, and then set locks on it instead of the inode attached to the filp. We could also consider doing this universally, without the ioctl, but we'd need a pretty long testing cycle to see if anything would break, given that this is a subtle behavior change.
Starting with Nautilus, Ceph implements flock(2) in the ceph-osd daemon directly. See https://tracker.ceph.com/issues/38150 and ceph/ceph#26245. There are backporting tickets for Mimic and Luminous, but they haven't been acted on since February 2019.
This may have introduced a regression; see https://tracker.ceph.com/issues/46124. Also, OFD locks might be even better; see https://tracker.ceph.com/issues/46124#note-3.