
os/bluestore/KernelDevice: use flock(2) for block device lock #26245

Merged
merged 1 commit into ceph:master on Feb 3, 2019

Conversation

@liewegas liewegas (Member) commented Feb 1, 2019


The fcntl locks fail due to the classic posix lock gotcha: if you close
*any* fd to the same inode from the process, the lock(s) go away.

Use flock(2) instead.  We have to be careful because we open the main
bluestore device via two KernelDevice instances: one for bluestore and
one for bluefs.  Add a no-lock flag so that the bluefs instance does not
try to lock and does not conflict with bluestore's.

Fixes: http://tracker.ceph.com/issues/38150
Signed-off-by: Sage Weil <sage@redhat.com>
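
To illustrate the gotcha described above, here is a minimal standalone sketch (not part of the patch; a scratch file under /tmp stands in for the block device):

```cpp
// Sketch only: demonstrates why the fcntl(2)/POSIX lock was unreliable and
// why flock(2) is not affected.
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

int main() {
  const char *path = "/tmp/lock-demo";          // stand-in for the block device
  int fd1 = open(path, O_RDWR | O_CREAT, 0600);

  // POSIX record lock: owned by the process and attached to the inode.
  struct flock fl{};
  fl.l_type = F_WRLCK;
  fl.l_whence = SEEK_SET;                       // l_start = l_len = 0: whole file
  fcntl(fd1, F_SETLK, &fl);

  // Open and close *any* other fd to the same inode from this process ...
  int fd2 = open(path, O_RDWR);
  close(fd2);                                   // ... and the lock taken via fd1 is gone.

  // flock(2) locks belong to the open file description, so closing an
  // unrelated fd to the same inode does not release them.
  flock(fd1, LOCK_EX | LOCK_NB);
  return 0;
}
```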
@jtlayton jtlayton (Contributor) left a comment

Looks good.

discard_cb[id], static_cast<void*>(this));
if (shared_with_bluestore) {
b->set_no_exclusive_lock();
}
Contributor

A little convoluted...if shared_with_bluestore is true then lock_exclusive becomes false, but this does look like it'll do the right thing.
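
As a rough sketch of the flow this comment describes (names and error handling are simplified, not verbatim Ceph code): bluefs marks its KernelDevice as shared, so only bluestore's instance takes the exclusive flock(2) on the device they both open.

```cpp
// Approximate sketch only.
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>
#include <cerrno>

class KernelDevice {
  int fd = -1;
  bool lock_exclusive = true;          // default: lock the device on open
public:
  void set_no_exclusive_lock() { lock_exclusive = false; }

  int open_device(const char *path) {
    fd = ::open(path, O_RDWR);
    if (fd < 0)
      return -errno;
    // flock(2) attaches to this open file description, so closing some other
    // fd to the same inode elsewhere in the process cannot drop it.
    if (lock_exclusive && ::flock(fd, LOCK_EX | LOCK_NB) < 0) {
      int r = -errno;
      ::close(fd);
      fd = -1;
      return r;                        // another daemon already owns the device
    }
    return 0;
  }
};
```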

@batrick batrick (Member) left a comment

It makes sense to me, but I remain curious whether containers will cause issues. Do these locks still work if systemd/docker uses mknod to set up the OSD /dev namespace? Based on my testing, mknod always produces a new inode.

@jtlayton jtlayton (Contributor) commented Feb 1, 2019

It makes sense to me, but I remain curious whether containers will cause issues. Do these locks still work if systemd/docker uses mknod to set up the OSD /dev namespace? Based on my testing, mknod always produces a new inode.

That's correct -- you'll get a new inode in that case that just happens to be hooked up to the same device as the first one. I don't see a way to do meaningful locking in that situation. Maybe we should fix that in the kernel?
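
A small sketch of that limitation, assuming root and a hypothetical device numbered 8:16: because flock(2) attaches to the device node's inode, two nodes created with mknod(2) for the same device can both be locked exclusively at once.

```cpp
// Illustration only: both exclusive locks succeed because each mknod(2) node
// is a distinct inode, even though both refer to the same underlying device.
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/file.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
  dev_t dev = makedev(8, 16);                    // hypothetical device numbers
  mknod("/tmp/bdev-a", S_IFBLK | 0600, dev);     // two separate inodes ...
  mknod("/tmp/bdev-b", S_IFBLK | 0600, dev);     // ... for the same device

  int a = open("/tmp/bdev-a", O_RDWR);
  int b = open("/tmp/bdev-b", O_RDWR);

  printf("lock a: %d\n", flock(a, LOCK_EX | LOCK_NB));   // 0
  printf("lock b: %d\n", flock(b, LOCK_EX | LOCK_NB));   // also 0: no conflict
  return 0;
}
```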

@batrick batrick (Member) commented Feb 1, 2019

I don't see a way to do meaningful locking in that situation. Maybe we should fix that in the kernel?

I suspect the use of mknod by systemd et al. is to prevent DoS between containers. They would consider this a feature and not a bug.

Perhaps there needs to be an ioctl that locks the backing bdev?

@jtlayton jtlayton (Contributor) commented Feb 1, 2019

Yeah, that could work.

Open the bdev, use the ioctl to switch the fd over to the bd_inode's locking context rather than the one for the device node, and then set locks on it instead of on the inode attached to the filp.

We could also consider just doing this universally without the ioctl, but we'd need a pretty long testing cycle to see if anything would break, given that this is a subtle behavior change.

@tchaikov tchaikov merged commit 8805a28 into ceph:master Feb 3, 2019
elemental-lf added a commit to elemental-lf/ceph-with-helm that referenced this pull request Aug 27, 2019
Starting with Nautilus, Ceph implements flock in the ceph-osd daemon
directly. See: https://tracker.ceph.com/issues/38150,
ceph/ceph#26245. There are backporting tickets
for Mimic and Luminous, but they haven't been acted on since February
2019.
@nh2 nh2 (Contributor) commented Jun 21, 2020

This may have introduced a regression, see https://tracker.ceph.com/issues/46124

Also, OFD locks might be even better, see https://tracker.ceph.com/issues/46124#note-3
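
For reference, a minimal sketch of the OFD-lock alternative mentioned in that note (Linux-specific F_OFD_SETLK): like flock(2), these locks belong to the open file description rather than the process, so the close-any-fd gotcha does not apply, and they can also cover byte ranges.

```cpp
// Sketch only: take an exclusive open file description (OFD) lock on a path.
// The lock is held until the returned fd (and any duplicates) are closed.
// g++ defines _GNU_SOURCE by default, which F_OFD_SETLK requires with glibc.
#include <fcntl.h>
#include <unistd.h>

int lock_exclusive_ofd(const char *path) {
  int fd = open(path, O_RDWR);
  if (fd < 0)
    return -1;
  struct flock fl{};
  fl.l_type = F_WRLCK;                    // exclusive (write) lock
  fl.l_whence = SEEK_SET;
  fl.l_start = 0;
  fl.l_len = 0;                           // 0 = lock the whole file/device
  if (fcntl(fd, F_OFD_SETLK, &fl) < 0) {  // non-blocking; F_OFD_SETLKW waits
    close(fd);
    return -1;
  }
  return fd;
}
```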
