librbd: new API method to force break a peer's exclusive lock #12639

Merged
merged 7 commits into ceph:master from dillaman:wip-18327 on Jan 8, 2017

Conversation

dillaman commented Dec 23, 2016

No description provided.

trociny commented Dec 23, 2016

@dillaman I assume you want to merge this before the exclusive lock refactoring (managed lock) from the rbd-mirror-ha branch?

trociny commented Dec 23, 2016

I have a similar implementation for "get lock owner" and "break lock" requests for the rbd-mirror-ha branch [1]. A difference is that I thought it would be more useful to check whether the watcher is alive in the "get locker" request rather than in the "break locker" request (returning -ENOTCONN if it is dead), so a user could use the "get locker" request to get the full locker status. Actually, I don't think there is much difference, though.

[1] trociny@0a6fbc8
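
A minimal sketch of that alternative (a simplified free function, not the real GetLockerRequest state machine; it matches the owner only by address, while the actual code also checks the lock cookie): the "get locker" step itself reports whether the owner's watch on the image header is still alive and returns -ENOTCONN when it is not.

#include <rados/librados.hpp>
#include <cerrno>
#include <list>
#include <string>

// Return 0 if the lock owner still has a live watch on the image header,
// -ENOTCONN if the owner looks dead, or a negative error code.
int get_locker_status(librados::IoCtx &ioctx, const std::string &header_oid,
                      const std::string &locker_addr) {
  std::list<obj_watch_t> watchers;
  int r = ioctx.list_watchers(header_oid, &watchers);
  if (r < 0) {
    return r;
  }
  for (auto &watcher : watchers) {
    if (locker_addr == watcher.addr) {
      return 0;        // owner is alive
    }
  }
  return -ENOTCONN;    // owner's watch is gone -- the caller may break the lock
}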

  Context *ctx = create_context_callback<
    AcquireRequest<I>, &AcquireRequest<I>::handle_get_lockers>(this);
  auto req = GetLockerRequest<I>::create(m_image_ctx, &m_locker, ctx);
  req->send();
}

trociny Dec 23, 2016

nit: After this change I would rename the GET_LOCKERS state to GET_LOCKER (and the method to send_get_locker).

dillaman commented Dec 23, 2016

@trociny Yeah, I think the changes in the wip-rbd-mirror-ha branch are too high-risk for a backport all the way back to jewel.

trociny commented Dec 23, 2016

@dillaman I am thinking about how useful this is in the general case. I expect it works only if you have two clients, one of them misbehaves, and you want to force the other one to take over. If you have three clients and the lock owner misbehaves, there is a chance that the two other clients run a forced lock acquire simultaneously, and one of them blacklists the other one that has just acquired the lock. For the rbd-mirror-HA leader lock, I am (trying to) solve this by providing "get owner" and "break lock" methods to rbd-mirror, so it calls "break lock" only for an owner it already knows about and believes to be misbehaving.

dillaman commented Jan 3, 2017

@trociny I am going to tweak this to expose a "get_lock_owner" / "break_lock_owner" API. I really wanted to avoid exposing the details of locking to make integration simpler, but I think you are correct that in order to cover all possible corner cases the client will potentially need better control.
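
A minimal sketch of how a failover manager could use such an API, assuming the C++ method names from the header diff later in this PR (lock_get_owners(), lock_break(), lock_acquire()); error handling is illustrative only:

#include <rbd/librbd.hpp>
#include <list>
#include <string>

// Discover the current owner(s), fence only the owner(s) the caller has
// already decided are dead, then take the lock ourselves.
int force_take_over(librbd::Image &image) {
  rbd_lock_mode_t lock_mode;
  std::list<std::string> lock_owners;

  int r = image.lock_get_owners(&lock_mode, &lock_owners);
  if (r < 0) {
    return r;
  }

  for (auto &owner : lock_owners) {
    // blacklists the owner and breaks its lock
    r = image.lock_break(lock_mode, owner);
    if (r < 0) {
      return r;
    }
  }

  return image.lock_acquire(RBD_LOCK_MODE_EXCLUSIVE);
}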

dillaman commented Jan 3, 2017

@trociny New approach is now posted

trociny self-assigned this Jan 4, 2017

template <typename I>
void BreakRequest<I>::send_blacklist() {
  if (!m_image_ctx.blacklist_on_break_lock) {

trociny Jan 4, 2017

minor: In the future we will need blacklist_on_break_lock as a parameter for BreakRequest::create, so it could be added from the start (otherwise it will have to be added when this is merged into the generic lock implementation).
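
Something along these lines, purely as an illustration of the suggestion (the parameter list and member names are hypothetical, not the merged BreakRequest interface):

#include <string>

struct Context;  // librbd-style completion callback, opaque here

// The blacklist knob is passed in at creation time instead of being read
// from m_image_ctx inside the state machine, so a generic lock
// implementation can supply it without an ImageCtx.
template <typename ImageCtxT>
class BreakRequest {
public:
  static BreakRequest *create(ImageCtxT &image_ctx, const std::string &cookie,
                              bool blacklist_on_break_lock,
                              Context *on_finish) {
    return new BreakRequest(image_ctx, cookie, blacklist_on_break_lock,
                            on_finish);
  }

private:
  BreakRequest(ImageCtxT &image_ctx, const std::string &cookie,
               bool blacklist_on_break_lock, Context *on_finish)
    : m_image_ctx(image_ctx), m_cookie(cookie),
      m_blacklist_on_break_lock(blacklist_on_break_lock),
      m_on_finish(on_finish) {
  }

  ImageCtxT &m_image_ctx;
  std::string m_cookie;
  bool m_blacklist_on_break_lock;
  Context *m_on_finish;
};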

image.lock_acquire(RBD_LOCK_MODE_EXCLUSIVE)
assert_raises(ConnectionShutdown,
              blacklist_image.lock_release())

trociny Jan 4, 2017

@dillaman The parentheses after blacklist_image.lock_release look wrong. The function itself should be passed to assert_raises, not the result of calling it.

And fixing this reveals that ConnectionShutdown is actually not raised for me, because ReleaseRequest::handle_unlock clears the return value:

2017-01-04 17:53:54.013369 7f28467fc700 10 librbd::exclusive_lock::ReleaseRequest: send_close_object_map
2017-01-04 17:53:54.013371 7f28467fc700 10 librbd::object_map::UnlockRequest: 0x7f2854000cf0 send_unlock: oid=rbd_object_map.17e02ae8944a
2017-01-04 17:53:54.013814 7f28597fa700 10 librbd::object_map::UnlockRequest: 0x7f2854000cf0 handle_unlock: r=-108
2017-01-04 17:53:54.013820 7f28597fa700 -1 librbd::object_map::UnlockRequest: failed to release object map lock: (108) Cannot send after transport endpoint shutdown
2017-01-04 17:53:54.013825 7f28597fa700 10 librbd::exclusive_lock::ReleaseRequest: handle_close_object_map: r=0
2017-01-04 17:53:54.013828 7f28597fa700 10 librbd::exclusive_lock::ReleaseRequest: send_unlock
2017-01-04 17:53:54.013829 7f28597fa700 10 librbd::ExclusiveLock: 0x7f28500187b0 handle_releasing_lock
2017-01-04 17:53:54.014153 7f28597fa700 10 librbd::exclusive_lock::ReleaseRequest: handle_unlock: r=-108
2017-01-04 17:53:54.014157 7f28597fa700 -1 librbd::exclusive_lock::ReleaseRequest: failed to unlock: (108) Cannot send after transport endpoint shutdown
2017-01-04 17:53:54.014163 7f28597fa700 10 librbd::ImageState: 0x2104a00 handle_prepare_lock_complete
2017-01-04 17:53:54.014168 7f28467fc700 10 librbd::ExclusiveLock: 0x7f28500187b0 handle_release_lock: r=0

dillaman Jan 4, 2017

Good catch -- I thought it looked wrong as well but adding the parens made it pass, so I just assumed it was some oddity and didn't look into it further.

dillaman commented Jan 4, 2017

@trociny update pushed

data = rand_data(256)
assert_raises(ConnectionShutdown,
              blacklist_image.write, data, 0)

trociny Jan 4, 2017

It would be nice to ensure that the blacklist_on_break_lock config option is set for the image.

dillaman Jan 4, 2017

Sure -- we don't have any integration tests in teuthology that disable it, though, so it would be hard to get into that condition.

dillaman Jan 4, 2017

updated

trociny commented Jan 5, 2017

@dillaman I am observing this failure (almost always):

nosetests -v test_rbd:TestExclusiveLock.test_break_lock
2017-01-05 11:40:42.735371 7f4b286a9700 -1 WARNING: all dangerous and experimental features are enabled.
test_rbd.TestExclusiveLock.test_break_lock ... /home/mgolub/ceph/ceph.upstream/src/librbd/Journal.cc: In function 'librbd::Journal<ImageCtxT>::Future librbd::Journal<ImageCtxT>::wait_event(Mutex&, uint64_t, Context*) [with ImageCtxT = librbd::ImageCtx; librbd::Journal<ImageCtxT>::Future = journal::Future; uint64_t = long unsigned int]' thread 7f4a7effd700 time 2017-01-05 11:40:50.855664
/home/mgolub/ceph/ceph.upstream/src/librbd/Journal.cc: 1068: FAILED assert(it != m_events.end())

#2  0x00007fc7f58dc70f in ceph::__ceph_assert_fail (assertion=assertion@entry=0x7fc7f59d15f3 "it != m_events.end()", 
    file=file@entry=0x7fc7f59d0cd8 "/home/mgolub/ceph/ceph.upstream/src/librbd/Journal.cc", line=line@entry=1068, 
    func=func@entry=0x7fc7f59d2cc0 <librbd::Journal<librbd::ImageCtx>::wait_event(Mutex&, unsigned long, Context*)::__PRETTY_FUNCTION__> "librbd::Journal<ImageCtxT>::Future librbd::Journal<ImageCtxT>::wait_event(Mutex&, uint64_t, Context*) [with ImageCtxT = librbd::ImageCtx; librbd::Journal<ImageCtxT>::Future = journal::Future; uint64_t"...) at /home/mgolub/ceph/ceph.upstream/src/common/assert.cc:78
#3  0x00007fc7f5759dc3 in librbd::Journal<librbd::ImageCtx>::wait_event (this=this@entry=0x7fc770025010, lock=..., tid=tid@entry=1, 
    on_safe=on_safe@entry=0x7fc778010f90) at /home/mgolub/ceph/ceph.upstream/src/librbd/Journal.cc:1068
#4  0x00007fc7f5759f00 in librbd::Journal<librbd::ImageCtx>::flush_event (this=0x7fc770025010, tid=tid@entry=1, on_safe=on_safe@entry=0x7fc778010f90)
    at /home/mgolub/ceph/ceph.upstream/src/librbd/Journal.cc:1043
#5  0x00007fc7f5766ac7 in librbd::LibrbdWriteback::write (this=0x7fc770009680, oid=..., oloc=..., off=off@entry=0, len=len@entry=256, snapc=..., bl=..., 
    mtime=..., trunc_size=0, trunc_seq=0, journal_tid=1, oncommit=0x7fc778010b30) at /home/mgolub/ceph/ceph.upstream/src/librbd/LibrbdWriteback.cc:267
#6  0x00007fc7f58789a2 in ObjectCacher::bh_write (this=this@entry=0x7fc77000f1f0, bh=bh@entry=0x7fc778010a10)
    at /home/mgolub/ceph/ceph.upstream/src/osdc/ObjectCacher.cc:1069
#7  0x00007fc7f5887330 in ObjectCacher::flush_set (this=0x7fc77000f1f0, oset=0x7fc7700188b0, onfinish=onfinish@entry=0x7fc778000a20)
    at /home/mgolub/ceph/ceph.upstream/src/osdc/ObjectCacher.cc:2047
#8  0x00007fc7f5708c21 in librbd::ImageCtx::flush_cache (this=0x2aeaf00, onfinish=0x7fc778000a20)
    at /home/mgolub/ceph/ceph.upstream/src/librbd/ImageCtx.cc:757
#9  0x00007fc7f56ec709 in Context::complete (this=0x7fc778006070, r=<optimized out>) at /home/mgolub/ceph/ceph.upstream/src/include/Context.h:70
#10 0x00007fc7f5709454 in librbd::ImageCtx::flush_async_operations (this=0x2aeaf00, on_finish=0x7fc778006070)
    at /home/mgolub/ceph/ceph.upstream/src/librbd/ImageCtx.cc:856
#11 0x00007fc7f570992c in librbd::ImageCtx::flush (this=<optimized out>, on_safe=<optimized out>, on_safe@entry=0x7fc77800e6c0)
    at /home/mgolub/ceph/ceph.upstream/src/librbd/ImageCtx.cc:873
#12 0x00007fc7f56ef7f9 in librbd::AioImageRequestWQ::shut_down (this=0x2aeb760, on_shutdown=0x7fc77800e6c0)
    at /home/mgolub/ceph/ceph.upstream/src/librbd/AioImageRequestWQ.cc:227
#13 0x00007fc7f5791a06 in librbd::image::CloseRequest<librbd::ImageCtx>::send_shut_down_aio_queue (this=this@entry=0x2aec200)
    at /home/mgolub/ceph/ceph.upstream/src/librbd/image/CloseRequest.cc:96
#14 0x00007fc7f57922e0 in librbd::image::CloseRequest<librbd::ImageCtx>::handle_unregister_image_watcher (this=0x2aec200, r=0)
    at /home/mgolub/ceph/ceph.upstream/src/librbd/image/CloseRequest.cc:87
#15 0x00007fc7f56ec709 in Context::complete (this=0x7fc778009c90, r=<optimized out>) at /home/mgolub/ceph/ceph.upstream/src/include/Context.h:70
#16 0x00007fc7f58e387f in ThreadPool::worker (this=0x2ae52d0, wt=0x2ae5540) at /home/mgolub/ceph/ceph.upstream/src/common/WorkQueue.cc:132

trociny commented Jan 5, 2017

@dillaman I also observed this failure on teuthology:

2017-01-05T09:26:28.412 INFO:tasks.workunit.client.0.smithi106.stdout:[ RUN      ] TestLibRBD.RenameViaLockOwner
2017-01-05T09:26:28.415 INFO:tasks.workunit.client.0.smithi106.stdout:using new format!
2017-01-05T09:26:28.537 INFO:tasks.workunit.client.0.smithi106.stdout:/tmp/buildd/ceph-11.1.0-6449-g1cf191e/src/test/librbd/test_librbd.cc:3505: Failure
2017-01-05T09:26:28.540 INFO:tasks.workunit.client.0.smithi106.stdout:      Expected: 0
2017-01-05T09:26:28.542 INFO:tasks.workunit.client.0.smithi106.stdout:To be equal to: rbd.rename(ioctx, name.c_str(), new_name.c_str())
2017-01-05T09:26:28.545 INFO:tasks.workunit.client.0.smithi106.stdout:      Which is: -2
2017-01-05T09:26:28.551 INFO:tasks.workunit.client.0.smithi106.stdout:[  FAILED  ] TestLibRBD.RenameViaLockOwner (145 ms)

http://qa-proxy.ceph.com/teuthology/trociny-2017-01-04_21:20:49-rbd-wip-mgolub-testing---basic-smithi/689831/teuthology.log

It might not be related and might be a known issue -- I recall we observed rename failures after socket failure injections, though I thought that was fixed.

dillaman commented Jan 5, 2017

@trociny I think that RenameViaLockOwner is definitely unrelated. I couldn't find an old tracker ticket against it, though. There is an existing issue with the LockingPP test that I traced back to a RADOS bug in the cache tier [1]

[1] http://tracker.ceph.com/issues/11502

dillaman commented Jan 5, 2017

@trociny Appended a new commit that should address the journal assertion failure due to being blacklisted

dillaman commented Jan 5, 2017

@trociny Saw that one test run [1] failed due to a race where closing the image bubbled up the ConnectionShutdown exception. I tweaked the test to ignore that exception just in case the image watcher hasn't discovered that it's been blacklisted yet.

[1] http://qa-proxy.ceph.com/teuthology/trociny-2017-01-05_20:55:31-rbd-wip-mgolub-testing---basic-smithi/692252/teuthology.log

dillaman commented Jan 5, 2017

@trociny Opened a ticket [1] against the failed RenameViaLockOwner test -- the duplicate rename requests (due to injecting socket errors) weren't being filtered out correctly.

[1] http://tracker.ceph.com/issues/18435

@trociny

@dillaman It looks like the python test is still racy: I observe sporadic failures due to "AssertionError: ConnectionShutdown not raised" [1].

Also, I am experimenting with a C version of this test [2], though the cluster is overloaded today and I have been able to run only a limited number of tests so far.

[1] http://pulpito.ceph.com/trociny-2017-01-06_10:08:01-rbd-wip-mgolub-testing---basic-smithi/
[2] ceph/ceph-ci@686e62e

                                     char **lock_owners,
                                     size_t *max_lock_owners);
CEPH_RBD_API int rbd_lock_get_owners_cleanup(char **lock_owners,
                                             size_t lock_owner_count);

trociny Jan 6, 2017

@dillaman All our other cleanup functions return void. Do you see any use in returning int instead?

max_lock_owners = 2;
ASSERT_EQ(0, rbd_lock_get_owners(image1, &lock_mode, lock_owners,
                                 &max_lock_owners));

trociny Jan 6, 2017

needs rbd_lock_get_owners_cleanup
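
For reference, the pattern being asked for: every successful rbd_lock_get_owners() call is paired with rbd_lock_get_owners_cleanup() to free the returned strings. A small sketch, assuming the call signature used in the test snippet above and an already-opened rbd_image_t:

#include <rbd/librbd.h>
#include <cstdio>

int print_lock_owners(rbd_image_t image) {
  rbd_lock_mode_t lock_mode;
  char *lock_owners[2];
  size_t max_lock_owners = 2;

  int r = rbd_lock_get_owners(image, &lock_mode, lock_owners,
                              &max_lock_owners);
  if (r < 0) {
    return r;
  }
  for (size_t i = 0; i < max_lock_owners; ++i) {
    printf("lock owner: %s\n", lock_owners[i]);
  }
  // free the owner strings allocated by librbd
  rbd_lock_get_owners_cleanup(lock_owners, max_lock_owners);
  return 0;
}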

dillaman commented Jan 6, 2017

@trociny The problem with the C version is that you would need to exclude it from unittest_librbd -- since librados_test_stub doesn't support blacklisting.

trociny commented Jan 6, 2017

@dillaman Yes, and it is excluded in my implementation.

dillaman commented Jan 6, 2017

@trociny This failure scares me: it suggests blacklisting doesn't actually work the way we expect it to (i.e. the blacklist is applied lazily and the OSDs will eventually pick up the change, but it's not guaranteed to be complete by the time the blacklist op call completes). That means that when breaking the lock away from an active client, we can still get stale IO written by the blacklisted client.

dillaman added some commits Dec 22, 2016

librbd: separate break lock logic into standalone state machine
The current lockers are now queried before the lock is attempted to
prevent any possible race conditions when one or more clients attempt
to break the lock of a dead client.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
librbd: separate locker query into standalone state machine
Signed-off-by: Jason Dillaman <dillaman@redhat.com>

dillaman commented Jan 6, 2017

@trociny I addressed your comments and hopefully addressed the blacklist race condition by having the test wait for the latest OSD map and having the blacklist librados call do the same.
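
A sketch of the ordering idea described here, assuming the librados C++ calls Rados::blacklist_add() and Rados::wait_for_latest_osdmap(); the PR also makes blacklist_add() itself wait, so the explicit wait below is the same fix expressed from the caller's side:

#include <rados/librados.hpp>
#include <string>

int blacklist_and_sync(librados::Rados &cluster, std::string client_addr) {
  // ask the monitors to blacklist the client (0 = default expiration)
  int r = cluster.blacklist_add(client_addr, 0);
  if (r < 0) {
    return r;
  }
  // the OSDs pick up blacklist entries lazily; don't proceed until this
  // client has observed an OSD map that includes the new entry
  return cluster.wait_for_latest_osdmap();
}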

dillaman added some commits Jan 3, 2017

librbd: add new lock_get_owners / lock_break_lock API methods
If the client application supports failover, let the application
force break the current lock and blacklist the owner. This is
required in case the current lock owner is alive from the point-of-view
of librbd but failover was required due to a higher level reason.

Fixes: http://tracker.ceph.com/issues/18327
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
librbd: ignore blacklist error when releasing exclusive lock
This ensures the journal and object map are properly closed so that the
image can be properly closed w/o failing any assertions.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
librados: blacklist_add should wait for latest OSD map
This ensures that future operations against the OSDs force an OSD map
update to notice the blacklisted client.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
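
A rough illustration of the "ignore blacklist error when releasing exclusive lock" commit above (a hypothetical helper, not the actual ReleaseRequest change): once the client is blacklisted, the unlock steps fail with -108 (ESHUTDOWN, "Cannot send after transport endpoint shutdown", as in the log earlier in this thread); treating that error as "the lock is already gone" lets the journal and object map close cleanly.

#include <cerrno>

// Filter out the blacklist error so the release state machine can finish
// and the image can still be closed without tripping any assertions.
static inline int ignore_blacklist_error(int r) {
  return (r == -ESHUTDOWN) ? 0 : r;
}
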
@@ -241,6 +241,9 @@ class CEPH_RBD_API Image
  int is_exclusive_lock_owner(bool *is_owner);
  int lock_acquire(rbd_lock_mode_t lock_mode);
  int lock_release();
  int lock_get_owners(rbd_lock_mode_t *lock_mode,
                      std::list<std::string> *lock_owners);
  int lock_break(rbd_lock_mode_t lock_mode, const std::string &lock_owner);

trociny Jan 6, 2017

@dillaman I can't think of a use case for shared lock mode, but if we plan to have this eventually, won't we want to pass several (all) owners to blacklist? Not sure how useful it is -- if the list only partially matches the current owners, should we just fail?

dillaman Jan 6, 2017

I am keeping it generic so the end-user app can break a single shared locker if need be.

trociny Jan 6, 2017

or don't blacklist (ignore lock_owner param) for shared mode?

dillaman Jan 6, 2017

I can build a clustering-aware application with multiple instances, each using librbd shared locks. If my management layer decides to STONITH one of the shared lock owners, it should be possible.

trociny Jan 6, 2017

ah, never mind, I thought rados_break_lock would break the lock for all clients
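
A tiny sketch of the STONITH scenario described above, using the lock_break() method from the header diff: fence exactly one misbehaving shared-lock holder (identified by the management layer's own health checks) without touching the other owners.

#include <rbd/librbd.hpp>
#include <string>

int fence_shared_owner(librbd::Image &image, const std::string &bad_owner) {
  // blacklist and break only this owner; other shared lockers keep running
  return image.lock_break(RBD_LOCK_MODE_SHARED, bad_owner);
}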

trociny commented Jan 8, 2017

@ceph-jenkins retest this please

trociny approved these changes Jan 8, 2017

LGTM

trociny merged commit f1eb435 into ceph:master on Jan 8, 2017

3 checks passed

Signed-off-by: all commits in this PR are signed
Unmodifed Submodules: submodules for project are unmodified
default: Build finished.

dillaman deleted the dillaman:wip-18327 branch on Jan 8, 2017
