rbd-mirror: replace remote pool polling with add/remove notifications #12364

Merged
merged 4 commits into ceph:master from dillaman:wip-rbd-mirror-notifications Mar 17, 2017

Conversation

@dillaman
Contributor

dillaman commented Dec 7, 2016

No description provided.

trociny pushed a commit that referenced this pull request Dec 8, 2016

Mykola Golub
Merge branch 'wip-rbd-mirror-notifications' into wip-mgolub-testing
[DNM] rbd-mirror: replace remote pool polling with add/remove notifications #12364
@trociny

Contributor

trociny commented Dec 9, 2016

@dillaman observing crashes when running rbd_mirror(_stress).sh on teuthology:

http://pulpito.ceph.com/trociny-2016-12-09_06:43:36-rbd-wip-mgolub-testing---basic-mira/

and locally:

2016-12-09 11:09:33.119818 7f783754cc40 -1 rbd::mirror::Mirror: 0x7f7840a47140 update_replayers: removing blacklisted replayer for uuid: 91c12cc7-d872-4cef-a12e-8b03c00b5aad cluster: cluster2 client: client.admin
2016-12-09 11:09:33.119843 7f783754cc40  5 rbd::mirror::PoolWatcher: 0x7f7840c6e830 shut_down: 
2016-12-09 11:09:33.119845 7f783754cc40  5 rbd::mirror::PoolWatcher: 0x7f7840c6e830 unregister_watcher: 
2016-12-09 11:09:33.119849 7f783754cc40  5 rbd::mirror::PoolWatcher: 0x7f7840c6e830 operator(): unregister_watcher: r=0
2016-12-09 11:09:33.120949 7f783754cc40 -1 /home/mgolub/ceph/ceph.upstream/src/librbd/Watcher.cc: In function 'virtual librbd::Watcher::~Watcher()' thread 7f783754cc40 time 2016-12-09 11:09:33.119857
/home/mgolub/ceph/ceph.upstream/src/librbd/Watcher.cc: 81: FAILED assert(m_watch_state != WATCH_STATE_REGISTERED)

 ceph version 11.0.2-2357-g9e1803f (9e1803fbaf6c82b47ac9ccf56a65a07fb4bcd8b6)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x7f7837173a32]
 2: (librbd::Watcher::~Watcher()+0x337) [0x7f7837010937]
 3: (rbd::mirror::MirrorStatusWatchCtx::Watcher::~Watcher()+0x17) [0x7f7836f15947]
 4: (rbd::mirror::Replayer::~Replayer()+0x349) [0x7f7836f11b69]
 5: (std::_Rb_tree<std::pair<long, rbd::mirror::peer_t>, std::pair<std::pair<long, rbd::mirror::peer_t> const, std::unique_ptr<rbd::mirror::Replayer, std::default_delete<rbd::mirror::Replayer> > >, std::_Select1st<std::pair<std::pair<long, rbd::mirror::peer_t> const, std::unique_ptr<rbd::mirror::Replayer, std::default_delete<rbd::mirror::Replayer> > > >, std::less<std::pair<long, rbd::mirror::peer_t> >, std::allocator<std::pair<std::pair<long, rbd::mirror::peer_t> const, std::unique_ptr<rbd::mirror::Replayer, std::default_delete<rbd::mirror::Replayer> > > > >::_M_erase_aux(std::_Rb_tree_const_iterator<std::pair<std::pair<long, rbd::mirror::peer_t> const, std::unique_ptr<rbd::mirror::Replayer, std::default_delete<rbd::mirror::Replayer> > > >)+0x3b) [0x7f7836f0c1ab]
 6: (rbd::mirror::Mirror::update_replayers(std::map<long, std::set<rbd::mirror::peer_t, std::less<rbd::mirror::peer_t>, std::allocator<rbd::mirror::peer_t> >, std::less<long>, std::allocator<std::pair<long const, std::set<rbd::mirror::peer_t, std::less<rbd::mirror::peer_t>, std::allocator<rbd::mirror::peer_t> > > > > const&)+0x763) [0x7f7836f06253]
 7: (rbd::mirror::Mirror::run()+0xde) [0x7f7836f06dee]
 8: (main()+0x20e) [0x7f7836efbfde]
 9: (__libc_start_main()+0xf5) [0x7f782b946b45]
 10: (()+0x21e246) [0x7f7836f04246]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2016-12-09 11:32:35.302769 7f4f80ff9700 -1 /home/mgolub/ceph/ceph.upstream/src/common/RWLock.h: In function 'void RWLock::get_read() const' thread 7f4f80ff9700 time 2016-12-09 11:32:35.301393
/home/mgolub/ceph/ceph.upstream/src/common/RWLock.h: 105: FAILED assert(r == 0)

 ceph version 11.0.2-2357-g9e1803f (9e1803fbaf6c82b47ac9ccf56a65a07fb4bcd8b6)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x7f4fc1eb7a32]
 2: (()+0x2128c2) [0x7f4fc1c3c8c2]
 3: (librbd::ExclusiveLock<librbd::ImageCtx>::send_reacquire_lock()+0x4c8) [0x7f4fc1cd4738]
 4: (librbd::ExclusiveLock<librbd::ImageCtx>::reacquire_lock(Context*)+0x121) [0x7f4fc1cd56c1]
 5: (librbd::ImageWatcher<librbd::ImageCtx>::handle_rewatch_complete(int)+0x144) [0x7f4fc1cf6754]
 6: (librbd::Watcher::handle_rewatch(int)+0x437) [0x7f4fc1d55c97]
 7: (librbd::watcher::RewatchRequest::finish(int)+0x8d) [0x7f4fc1d57aed]
 8: (librbd::watcher::RewatchRequest::handle_rewatch(int)+0x105) [0x7f4fc1d58405]
 9: (librados::C_AioSafe::finish(int)+0x1d) [0x7f4fb8fa377d]
 10: (Context::complete(int)+0x9) [0x7f4fc1c597f9]
 11: (Finisher::finisher_thread_entry()+0x1f4) [0x7f4fc1eb6cd4]
 12: (()+0x80a4) [0x7f4fb8cf00a4]
 13: (clone()+0x6d) [0x7f4fb674f04d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
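
(For context: the first assert encodes the invariant that a watcher must be unregistered before it is destroyed. A minimal sketch of that pattern, with simplified names rather than the actual librbd::Watcher code:)

#include <cassert>

// Simplified illustration of the invariant behind the failed assert:
// the owner must wait for unregister to complete before destroying
// the watcher object.
class Watcher {
public:
  enum WatchState { WATCH_STATE_UNREGISTERED, WATCH_STATE_REGISTERED };

  ~Watcher() {
    // Destroying a still-registered watcher is a caller bug.
    assert(m_watch_state != WATCH_STATE_REGISTERED);
  }

  void register_watch()   { m_watch_state = WATCH_STATE_REGISTERED; }
  void unregister_watch() { m_watch_state = WATCH_STATE_UNREGISTERED; }

private:
  WatchState m_watch_state = WATCH_STATE_UNREGISTERED;
};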

@dillaman

Contributor

dillaman commented Dec 13, 2016

@trociny Pushed a fix for that crash

@trociny

Contributor

trociny commented Dec 14, 2016

@dillaman this looks related:

[ RUN      ] TestLibRBD.FlushCacheWithCopyupOnExternalSnapshot
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:4853: Failure
      Expected: 0
To be equal to: rbd.clone(ioctx, name.c_str(), "one", ioctx, clone_name.c_str(), (1ULL<<0), &order)
      Which is: -2
[  FAILED  ] TestLibRBD.FlushCacheWithCopyupOnExternalSnapshot (10 ms)
}
void expect_mirroring_watcher_is_unregister(MockMirroringWatcher &mock_mirroring_watcher,
bool unregistered) {

@trociny

trociny Jan 3, 2017

Contributor

@dillaman Maybe the function name was supposed to be expect_mirroring_watcher_is_unregistered?

@dillaman

Contributor

dillaman commented Jan 7, 2017

Note: there is a race condition with the ImageDeleter that cannot be fixed until #10896 is merged

@trociny

Contributor

trociny commented Jan 11, 2017

Outdated

@trociny trociny closed this Jan 11, 2017

@trociny

Contributor

trociny commented Jan 11, 2017

Sorry, wrong.

@trociny trociny reopened this Jan 11, 2017

@trociny

Contributor

trociny commented Feb 13, 2017

@dillaman Maybe a subset of this PR (namely, "utilize global image id as internal unique key" and "preliminary support to track multiple remote peer image sources") could be merged as a separate PR, without waiting for #10896?

I am just starting work on InstanceReplayerInterface [1], and it looks to me like it would be good if your patches were merged first.

[1] http://tracker.ceph.com/issues/18785

if (r < 0) {
derr << "error resolving remote pool " << m_remote_pool_id
derr << "error resolving remote pool " << m_local_pool_id

@trociny

trociny Feb 13, 2017

Contributor

@dillaman I think this error message is confusing now -- I suppose users would think the ID in the message is from the remote cluster.

@dillaman

dillaman Feb 13, 2017

Contributor

Agreed -- probably should just pass in the pool name string since there really isn't a need to look it up.
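
(A rough sketch of that suggestion, using a hypothetical helper name and simplified to standard C++ output; the real code would use derr/dendl:)

#include <cstring>
#include <iostream>
#include <string>

// Hypothetical helper: report the pool by name, which the caller already
// knows, rather than by an ID that users might read as a remote-cluster
// pool ID.
void report_resolve_error(const std::string &local_pool_name, int r) {
  if (r < 0) {
    std::cerr << "error resolving local pool " << local_pool_name
              << ": " << std::strerror(-r) << std::endl;
  }
}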

@dillaman

Contributor

dillaman commented Feb 13, 2017

@trociny Sure -- I'll pull out the parts I can into a cleanup PR

@dillaman dillaman changed the title from [DNM] rbd-mirror: replace remote pool polling with add/remove notifications to rbd-mirror: replace remote pool polling with add/remove notifications Mar 15, 2017

@trociny trociny self-assigned this Mar 15, 2017

stop_image_replayers(on_finish);
});
ctx = create_async_context_callback(m_threads->work_queue, ctx);
m_threads->timer->add_event_after(1, ctx);

@trociny

trociny Mar 16, 2017

Contributor

@dillaman Observing teuthology failures [1]

It looks like all cases are due to the timer lock is not held here:

/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.0-1436-gb9f5564/rpm/el7/BUILD/ceph-12.0.0-1436-gb9f5564/src/common/Timer.cc: 127: FAILED assert(lock.is_locked())
 ceph version 12.0.0-1436-gb9f5564 (b9f556438d3c7612e9c0a042c8d8f0959cabd3a0)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f351bd9c1a0]
 2: (()+0x2b840d) [0x7f351bd9440d]
 3: (rbd::mirror::Replayer::stop_image_replayers(Context*)+0x179) [0x7f3524d19119]
 4: (rbd::mirror::Replayer::handle_shut_down_pool_watcher(int, Context*)+0xb6) [0x7f3524d194a6]
 5: (FunctionContext::finish(int)+0x2a) [0x7f3524d1dc0a]
 6: (Context::complete(int)+0x9) [0x7f3524d1cc29]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb59) [0x7f351bda4649]
 8: (ThreadPool::WorkThread::entry()+0x10) [0x7f351bda5660]
 9: (()+0x7dc5) [0x7f351a686dc5]
 10: (clone()+0x6d) [0x7f351914573d]

[1] http://pulpito.ceph.com/trociny-2017-03-16_07:53:00-rbd-wip-mgolub-testing---basic-smithi/
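
(For reference, SafeTimer asserts that its guarding lock is held when scheduling events, so the call site would need roughly the following -- a sketch assuming the Threads helper exposes a timer_lock Mutex, not the exact fix:)

// Sketch only: take the timer's lock before scheduling, so the
// assert(lock.is_locked()) inside SafeTimer::add_event_after() holds.
{
  Mutex::Locker timer_locker(m_threads->timer_lock);
  m_threads->timer->add_event_after(1, ctx);
}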

}
virtual void handle_rewatch_complete(int r) override {
m_pool_watcher->handle_rewatch_complete(r);

@trociny

trociny Mar 16, 2017

Contributor

Is using both virtual and override intentional?

@dillaman

dillaman Mar 16, 2017

Contributor

It's just very old code from before the switch-over -- I'll correct it
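
(A minimal sketch of the cleanup, using illustrative class names: override already implies the function is virtual, so the extra keyword can be dropped in derived classes:)

struct Listener {
  virtual ~Listener() = default;
  virtual void handle_rewatch_complete(int r) = 0;
};

struct PoolWatcherListener : public Listener {
  // Pre-C++11 style would spell this 'virtual void ... override';
  // 'override' alone is sufficient and documents the intent.
  void handle_rewatch_complete(int r) override {}
};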

m_pool_watcher->handle_rewatch_complete(r);
}
virtual void handle_mode_updated(cls::rbd::MirrorMode mirror_mode) {

@trociny

trociny Mar 16, 2017

Contributor

override

virtual void handle_image_updated(cls::rbd::MirrorImageState state,
const std::string &remote_image_id,
const std::string &global_image_id) {

@trociny

trociny Mar 16, 2017

Contributor

override

}
for (auto &updated_image : m_updated_images) {
updated_image.invalid = true;
}

@trociny

trociny Mar 16, 2017

Contributor

Do we need this for loop, given that after the code above we should have a single (invalid) in-flight request?

@dillaman

dillaman Mar 16, 2017

Contributor

Agreed -- missed this during a refactor

}
virtual void handle_update(const ImageIds &added_image_ids,
const ImageIds &removed_image_ids) override {

@trociny

trociny Mar 16, 2017

Contributor

Both virtual and override -- is it intentional?

}
virtual void handle_update(const ImageIds &added_image_ids,
const ImageIds &removed_image_ids) {

@trociny

trociny Mar 16, 2017

Contributor

override

dillaman added some commits Nov 28, 2016

rbd-mirror: templatize Threads helper class for mock tests
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
rbd-mirror: refresh local images after acquiring leader role
The local image id set should be up-to-date when attempting to
determine which images need to be deleted.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
rbd-mirror: move replayer admin socket hook to anonymous namespace
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
@trociny

Contributor

trociny commented Mar 17, 2017

@ceph-jenkins retest this please

@trociny

LGTM

@trociny

Contributor

trociny commented Mar 17, 2017

@ceph-jenkins try again: retest this please

@trociny trociny merged commit 3cea3ac into ceph:master Mar 17, 2017

3 checks passed

Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
default: Build finished.

@dillaman dillaman deleted the dillaman:wip-rbd-mirror-notifications branch Mar 17, 2017
