Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

librbd: possible deadlock with synchronous maintenance operations #18909

Merged
merged 2 commits into from Nov 16, 2017

Conversation

Projects
None yet
2 participants
@dillaman
Copy link
Contributor

commented Nov 13, 2017

Fixes: http://tracker.ceph.com/issues/22120
Signed-off-by: Jason Dillaman dillaman@redhat.com

return r;
} else if (m_image_ctx.exclusive_lock != nullptr &&
!m_image_ctx.exclusive_lock->is_lock_owner()) {
return -EROFS;

This comment has been minimized.

Copy link
@trociny

trociny Nov 14, 2017

Contributor

@dillaman I saw this many times and used myself but was alway wondering if this is entirely correct. I suppose there is a chance that in a time window between we acquired the exclusive lock and got the owner lock the acquired lock was stolen again by other client request. We will return -EROFS in this case though might handle this situation a better way (e.g. block acquire lock requests from other clients until we get the owner lock)?

Also I am not sure I understand the accept_requests() check above. It looks like the purpose is to block local operations until it is true, but I don't see why accept_requests() is expected to be always true after try_acquire_lock() is completed?

This comment has been minimized.

Copy link
@dillaman

dillaman Nov 14, 2017

Author Contributor

It could use a loop to ensure it keeps the lock. With regard to accept_request, it looks like it's a carry over from this tracker ticket [1] for ensuring that the journal has been fully initialized before allowing the operation to proceed.

[1] http://tracker.ceph.com/issues/14510

This comment has been minimized.

Copy link
@trociny

trociny Nov 14, 2017

Contributor

Ah, so accept_requests() is used here for is_state_locked() test and not for block_requests as I thought initially.

As for using loop I suppose it is a bit troublesome to distinguish between the case when the lock was stolen and the case when another client did not want to release it? We want to loop only in the first case. Why not just increase m_request_blocked_count before calling try_acquire_lock() (putting the owner lock) and decreasing it when getting the owner lock back?

This comment has been minimized.

Copy link
@dillaman

dillaman Nov 14, 2017

Author Contributor

Sure -- tweak pushed

@dillaman dillaman force-pushed the dillaman:wip-22120 branch from b82f548 to 90b7ecd Nov 14, 2017

lderr(cct) << "No such snapshot found." << dendl;
return -ENOENT;
}
execute_snap_rollback(snap_namespace, snap_name, prog_ctx, &cond_ctx);
}

This comment has been minimized.

Copy link
@trociny

trociny Nov 14, 2017

Contributor

@dillaman

zhuzha:~/ceph/ceph.ci/build% PYTHONPATH=../src/test/pybind:../src/pybind:${PYTHONPATH} nosetests -v test_rbd:TestImage.test_rollback_with_resize
test_rbd.TestImage.test_rollback_with_resize ... /home/mgolub/ceph/ceph.ci/src/librbd/operation/ResizeRequest.cc: In function 'void librbd::operation::ResizeRequest<ImageCtxT>::send() [with ImageCtxT = librbd::ImageCtx]' thread 7fe17b7fe700 time 2017-11-14 18:32:11.241069
/home/mgolub/ceph/ceph.ci/src/librbd/operation/ResizeRequest.cc: 58: FAILED assert(image_ctx.owner_lock.is_locked())

I assume your change reveals the bug in SnapshotRollbackRequest<I>::send_resize_image(), which sends ResizeRequest without holding the owner lock, while ResizeRequest::send() expects this.

This comment has been minimized.

Copy link
@dillaman

dillaman Nov 14, 2017

Author Contributor

Hmm -- not immediately sure how that's failing since Operation::snap_rollback should be holding the lock when it invokes Operation::execute_snap_rollback

This comment has been minimized.

Copy link
@trociny

trociny Nov 14, 2017

Contributor

I suppose the main difference that previously we were holding the owner lock until SnapshotRollbackRequest is complete (waiting while holding the lock) and although ResizeRequest::send() was checking for owner_lock.is_locked() in a different thread it still succeeded?

This comment has been minimized.

Copy link
@dillaman

dillaman Nov 14, 2017

Author Contributor

Ah -- yeah, that's the issue. Thanks! I pushed a new commit to address it.

{
RWLock::RLocker owner_lock(m_image_ctx.owner_lock);
execute_rename(dstname, &cond_ctx);
}

This comment has been minimized.

Copy link
@trociny

trociny Nov 15, 2017

Contributor

@dillaman in execute_rename(), m_image_ctx.old_format case, RenameRequest is sent in a different contex and needs the owner lock. From qa/workunits/rbd/cli_generic.sh run:

rbd create --image-format 1 -s 1 foo
rbd rename foo foo2

2017-11-15 12:31:05.183 7f847cff9700 -1 /home/mgolub/ceph/ceph.ci/src/librbd/operation/Request.cc: In function 'void librbd::operation::Request<ImageCtxT>::send() [with ImageCtxT = librbd::ImageCtx]' thread 7f847cff9700 time 2017-11-15 12:31:05.189232
/home/mgolub/ceph/ceph.ci/src/librbd/operation/Request.cc: 25: FAILED assert(image_ctx.owner_lock.is_locked())

This comment has been minimized.

Copy link
@dillaman

dillaman Nov 15, 2017

Author Contributor

Thanks -- update pushed

This comment has been minimized.

Copy link
@trociny

trociny Nov 15, 2017

Contributor

nit: the log message now looks outdated

This comment has been minimized.

Copy link
@trociny

trociny Nov 15, 2017

Contributor

@dillaman I mean the commit log message

This comment has been minimized.

Copy link
@dillaman

dillaman Nov 15, 2017

Author Contributor

@trociny ... tweaked

@dillaman dillaman force-pushed the dillaman:wip-22120 branch from b2bd139 to bf44e05 Nov 15, 2017

librbd: added missing locks for snap rollback and rename
Signed-off-by: Jason Dillaman <dillaman@redhat.com>

@dillaman dillaman force-pushed the dillaman:wip-22120 branch from bf44e05 to 0ccd26f Nov 15, 2017

@trociny
Copy link
Contributor

left a comment

LGTM

@trociny trociny merged commit 1337855 into ceph:master Nov 16, 2017

5 checks passed

Docs: build check OK - docs built
Details
Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
make check make check succeeded
Details
make check (arm64) make check succeeded
Details

@dillaman dillaman deleted the dillaman:wip-22120 branch Nov 16, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.