
jewel: librbd: reacquire lock should update lock owner client id #17385

Merged
merged 1 commit into ceph:jewel from dillaman:wip-19957 on Sep 11, 2017

Conversation

@dillaman
Contributor

dillaman commented Aug 31, 2017

librbd: reacquire lock should update lock owner client id
(derived from commit 21ce5e1)

Fixes: http://tracker.ceph.com/issues/19957
Signed-off-by: Jason Dillaman <dillaman@redhat.com>

@dillaman dillaman added this to the jewel milestone Aug 31, 2017

@trociny

LGTM

@smithfarm
Contributor

smithfarm commented Sep 10, 2017

@dillaman This passed an rbd suite at http://tracker.ceph.com/issues/20613#note-59 with one failure that passed on rerun.

OK to merge?

@dillaman
Contributor

dillaman commented Sep 11, 2017

@smithfarm lgtm -- I need to investigate that failure since it appears like an OSD crashed and then data became corrupt (unrelated to this PR).

@smithfarm smithfarm merged commit 9afc62a into ceph:jewel Sep 11, 2017

4 checks passed

Docs: build check: OK - docs built
Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
make check: make check succeeded
@ralt

ralt commented Dec 12, 2017

Hi,

Should this commit be ported to the kernel client driver too?

@dillaman dillaman deleted the dillaman:wip-19957 branch Dec 12, 2017

@dillaman
Contributor

dillaman commented Dec 12, 2017

@ralt This commit is in librbd, so it's unrelated to krbd. However, you are correct that krbd should probably re-acquire the lock if the watch is lost.

@ralt

ralt commented Dec 12, 2017

@dillaman sorry, I was indeed asking if the bugfix was relevant to krbd, and if it should be ported.

If that is the case, do you know how I should move forward with this?

@dillaman
Contributor

dillaman commented Dec 12, 2017

I would double-check to see if it's an issue first. I know there are some deltas in how krbd vs librbd handle object watches, so it might just work. If you can, mount a krbd image w/ exclusive lock enabled, disable networking / block comms to the OSDs from the krbd host for about a minute, restore connectivity, and see whether running "rbd snap create --debug-rbd=20" shows that it attempted to blacklist a dead client.
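
For anyone trying this later, here is a rough sketch of the procedure described above; the pool/image/snapshot names, the sleep duration, and the iptables rule are assumptions (6800-7300 is the default OSD port range):

  # On the krbd host: map an image that has the exclusive-lock feature enabled
  # (a write to the device may be needed before krbd actually takes the lock).
  rbd create --size 1G --image-feature layering,exclusive-lock rbd/krbd-test
  rbd map rbd/krbd-test

  # Block communication to the OSDs for about a minute, then restore it.
  iptables -A OUTPUT -p tcp --dport 6800:7300 -j DROP
  sleep 75
  iptables -D OUTPUT -p tcp --dport 6800:7300 -j DROP

  # From a librbd client: check the debug log for an attempt to blacklist a
  # supposedly dead lock owner.
  rbd snap create --debug-rbd=20 rbd/krbd-test@snap1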

@ralt

ralt commented Dec 12, 2017

@dillaman thanks, will do!

@idryomov
Contributor

idryomov commented Dec 12, 2017

From a quick look at the code, the kernel client needs to be fixed too. I'll put it on my TODO list.

@dillaman I'm rusty on the details -- if the remote peer has the lock and simply acks the "request lock" notify w/o actually queuing the release, wouldn't the requesting side just loop indefinitely? The lock owner is alive after all...

@dillaman
Contributor

dillaman commented Dec 12, 2017

@idryomov Correct -- the remote peer will keep requesting it until it can acquire the lock. Only if the remote peer notices that the lock owner is not watching the image will it blacklist the client and break the lock.
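
(One way to check whether the current lock owner is still watching the image header is "rbd status", which lists the image's watchers; the pool/image name here is a placeholder.)

  rbd status rbd/krbd-test    # the "Watchers:" section should list the lock owner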

@idryomov
Contributor

idryomov commented Dec 12, 2017

@dillaman ... and this bug is about lock owners that are watching, right? I'm picking on your "attempted to blacklist a dead client" above -- I haven't tried it yet, but I would expect "rbd snap create" to loop indefinitely here. Just want to make sure I'd be fixing the right bug ;)

@ralt

ralt commented Dec 12, 2017

@idryomov I confirm that the requesting side is indefinitely looping on the lockget command :)

@dillaman
Contributor

dillaman commented Dec 12, 2017

@idryomov Correct -- if the lock owner wasn't watching, then librbd would just blacklist, break the lock, and execute the action directly. librbd will first try to acquire the lock and, if it's held by an alive client, it will forward the op request to the lock holder. Perhaps krbd returned -EOPNOTSUPP, so now librbd is trying to cooperatively request the lock from krbd and it's not releasing it (and/or not returning an error code to indicate that it is refusing to release the lock if mounted w/ the "exclusive" option).

@ralt

ralt commented Dec 12, 2017

@dillaman @idryomov if confirmation is needed, I can confirm that this testing scenario reproduced the bug correctly.

@ralt

ralt commented Dec 13, 2017

@idryomov this patch in the kernel fixes the issue, but I'm unsure if there is any unintended consequence:

modified   drivers/block/rbd.c
@@ -3884,6 +3884,7 @@ static void rbd_reacquire_lock(struct rbd_device *rbd_dev)
 					   &rbd_dev->lock_dwork, 0);
 	} else {
 		strcpy(rbd_dev->lock_cookie, cookie);
+		queue_work(rbd_dev->task_wq, &rbd_dev->acquired_lock_work);
 	}
 }

@idryomov
Contributor

idryomov commented Dec 13, 2017

@ralt That's probably right, mind sending a proper patch to ceph-devel (plain text + sign-off)? I'll take a deeper look when I get a moment.

@ralt

ralt commented Dec 13, 2017

Sure, will do.

@ralt

ralt commented Dec 13, 2017

@idryomov on top of which branch should I make the patch?

@idryomov
Contributor

idryomov commented Dec 13, 2017

ceph-client/master, or just 4.14 tag -- it doesn't really matter in this case.

@ralt

ralt commented Dec 20, 2017

@idryomov hi, when can I expect some progress on the patch? A rough idea would be nice :)

Cheers

@idryomov
Contributor

idryomov commented Dec 21, 2017

@ralt It'll be merged into 4.15-rc and backported to appropriate kernels, I just didn't get a chance to try it out with different ceph versions yet.

@ralt

ralt commented Dec 21, 2017

@idryomov alright, thanks!
