reef: librbd: kick ExclusiveLock state machine stalled waiting for lock from reacquire_lock() #53919

Merged
merged 1 commit into from Oct 19, 2023

@ajarr ajarr commented Oct 10, 2023

backport tracker: https://tracker.ceph.com/issues/63156


backport of #53829
parent tracker: https://tracker.ceph.com/issues/63009

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

... that is stalled waiting for lock. Do this when trying to reacquire
lock in the ImageWatcher's rewatch mechanism. This would enable the
ExclusiveLock state machine to propagate the blocklist error to the
caller trying to perform an image operation requiring an exclusive
lock.

The previous attempt, e66db76, to fix the hang caused by exclusive lock
acquisition (stuck waiting for lock) racing with client blocklisting
did not always work. e66db76 kickstarted the ExclusiveLock state
machine when the ImageWatcher tried to schedule an exclusive lock
request and the blocklisting was detected. However, there is a short
window between a watch getting deregistered and client blocklisting
getting detected as part of rewatching. If that window was hit while
scheduling a lock request, the ExclusiveLock state machine wasn't
kickstarted, the blocklist error wasn't propagated, and the hang
resurfaced.

A more robust approach is taken to resume the ExclusiveLock state
machine stuck waiting for lock during client blocklisting. Whenever
a client's ImageWatcher loses its connection to the cluster, as happens
during blocklisting, the ImageWatcher initiates a mechanism to rewatch
the image and tries to reacquire the lock. Piggyback on this rewatch
mechanism, which gets triggered during client blocklisting: when
trying to reacquire the lock, kickstart the ExclusiveLock state
machine stalled waiting for lock (STATE_WAITING_FOR_LOCK).
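
The idea above can be illustrated with a small toy model. This is only a
sketch in Python, not the librbd C++ code: the class ToyExclusiveLock,
the State enum, the "busy" and "blocklisted" result strings, and the
try_acquire callable are all hypothetical names invented for
illustration. It shows a waiter stalled in a waiting-for-lock state
being resumed from a reacquire path so that the error reaches the caller.

```python
from enum import Enum, auto


class State(Enum):
    UNLOCKED = auto()
    WAITING_FOR_LOCK = auto()  # loosely mirrors STATE_WAITING_FOR_LOCK
    LOCKED = auto()


class ToyExclusiveLock:
    """Toy stand-in for a lock state machine (hypothetical, not librbd)."""

    def __init__(self, try_acquire):
        # try_acquire is an assumed callable returning None on success,
        # "busy" if another client holds the lock, or an error string.
        self.state = State.UNLOCKED
        self._try_acquire = try_acquire
        self._waiters = []

    def acquire_lock(self, on_finish):
        # Queue the caller's completion callback and attempt the lock.
        self._waiters.append(on_finish)
        self._attempt()

    def _attempt(self):
        result = self._try_acquire()
        if result == "busy":
            # Stall: nothing completes the waiters until someone kicks us.
            self.state = State.WAITING_FOR_LOCK
            return
        self.state = State.LOCKED if result is None else State.UNLOCKED
        waiters, self._waiters = self._waiters, []
        for on_finish in waiters:
            on_finish(result)  # propagate success (None) or the error

    def reacquire_lock(self):
        # Called from the (toy) rewatch path. If the state machine is
        # stalled waiting for the lock, kick it so it re-runs the attempt.
        if self.state is State.WAITING_FOR_LOCK:
            self._attempt()


def demo():
    cluster = {"result": "busy"}
    lock = ToyExclusiveLock(lambda: cluster["result"])
    outcomes = []
    lock.acquire_lock(outcomes.append)   # stalls in WAITING_FOR_LOCK
    cluster["result"] = "blocklisted"    # client gets blocklisted
    lock.reacquire_lock()                # rewatch path kicks the machine
    return lock.state, outcomes
```

In this toy, without the check in reacquire_lock() the callback passed to
acquire_lock() would never run, which corresponds to the hang described
above. With the kick, the caller receives the "blocklisted" result.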

Fixes: https://tracker.ceph.com/issues/63009
Signed-off-by: Ramana Raja <rraja@redhat.com>
(cherry picked from commit 18b0185)
@ajarr ajarr added this to the reef milestone Oct 10, 2023
@ajarr ajarr requested a review from a team as a code owner October 10, 2023 12:44
@ajarr ajarr added the rbd label Oct 10, 2023
@idryomov

jenkins test windows

@yuriw yuriw merged commit 756bac2 into ceph:reef Oct 19, 2023
11 checks passed
3 participants