rbd-mirror: add support for active/passive daemon instances #12948

Merged
merged 14 commits into from Feb 1, 2017

Conversation

3 participants
@trociny
Contributor

trociny commented Jan 16, 2017

No description provided.

@trociny trociny changed the title from rbd-mirror: add support for active/passive daemon instances to [DNM] rbd-mirror: add support for active/passive daemon instances Jan 16, 2017

@trociny


Contributor

trociny commented Jan 16, 2017

@dillaman updated according to your comments.

Still, I am not sure about my approach for the LeaderWatcher state machine. I use the "acquiring" and "releasing" variables to make sure we don't start a new state transition while we are still executing the previous one. So if we are in a transition state, leader watcher messages (that may trigger a transition) are ignored, which I suppose is wrong.

Not sure what the correct thing to do is in this case. Maybe I could use ManagedLock hooks to serialize events properly?
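
For reference, a minimal sketch of the guard pattern described above, modeled on the handle_heartbeat hunk reviewed below; the member names (m_acquiring, m_releasing) and the handler body are illustrative, not the exact code in this PR:

  void LeaderWatcher::handle_heartbeat(Context *on_ack) {
    Mutex::Locker locker(m_lock);
    if (m_stopping || m_acquiring || m_releasing) {
      // a state transition is still in flight -- the message is simply
      // dropped, which is the behavior being questioned above
      on_ack->complete(0);
      return;
    }
    if (m_leader_lock->is_lock_owner()) {
      m_releasing = true;       // block further transitions until done
      release_leader_lock();    // must clear m_releasing on completion
    }
    on_ack->complete(0);
  }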

void notify_lock_acquired(Context *on_finish);
void notify_lock_released(Context *on_finish);
virtual void handle_heartbeat(Context *on_ack) = 0;


@dillaman

dillaman Jan 17, 2017

Contributor

Nit: these methods are unused in this commit and refactored into protected, non-virtual methods in another commit

namespace {
static const uint64_t NOTIFY_TIMEOUT_MS = 5000;


@dillaman

dillaman Jan 17, 2017

Contributor

Nit: unused variable

Context *on_notify_ack) {
dout(20) << "heartbeat" << dendl;
handle_heartbeat(on_notify_ack);


@dillaman

dillaman Jan 17, 2017

Contributor

Question: why do these overloaded methods just invoke other methods instead of combining the two?

namespace {
static const uint64_t NOTIFY_TIMEOUT_MS = 5000;


@dillaman

dillaman Jan 17, 2017

Contributor

Nit: unused variable

#include "librbd/Watcher.h"
namespace librbd {
class ImageCtx;


@dillaman

dillaman Jan 17, 2017

Contributor

Nit: not used in this header

Mutex::Locker locker(m_lock);
if (r < 0) {
derr << "error acquiring leader lock: " << cpp_strerror(r)


@dillaman

dillaman Jan 17, 2017

Contributor

Nit: if -EAGAIN, no need to output an error message -- just a debug message

Mutex::Locker locker(m_lock);
if (r < 0) {
derr << "error retrieving leader lock owner: " << cpp_strerror(r) << dendl;


@dillaman

dillaman Jan 17, 2017

Contributor

Nit: if -ENOENT, no need to output error message. Should you immediately retry to acquire the lock in that case?
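
A hedged sketch of the handling being suggested; the retry uses the acquire_leader_lock(bool reset_attempt_counter) overload that shows up later in this review, and the flag value is an assumption:

  void LeaderWatcher::handle_get_locker(int r, librbd::managed_lock::Locker& locker) {
    Mutex::Locker mutex_locker(m_lock);
    if (r == -ENOENT) {
      // nobody owns the leader lock -- no error message needed,
      // just try to acquire it right away
      dout(20) << "no leader lock owner" << dendl;
      acquire_leader_lock(true);
      return;
    } else if (r < 0) {
      derr << "error retrieving leader lock owner: " << cpp_strerror(r) << dendl;
      return;
    }
    // ... remember the current locker and schedule the break-lock timer task ...
  }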

}
void check_leader_alive(utime_t &now, int heartbeat_interval);
void notify_heartbeat(Context *on_finish);
void notify_lock_acquired(Context *on_finish);


@dillaman

dillaman Jan 17, 2017

Contributor

Do these two methods (notify_lock_acquired and notify_lock_released) need to be public? You have overloaded versions that interact w/ the internal state machine and these stand-alone ones.

Mutex::Locker locker(m_lock);
if (!m_stopping && !m_releasing && !m_acquiring) {
if (m_leader_lock->is_lock_owner()) {
dout(0) << "got another leader heartbeat, releasing lock" << dendl;


@dillaman

dillaman Jan 17, 2017

Contributor

I don't think you want to do that here (or below). You can have a transient network hiccup and the Objecter could replay old messages from before this instance acquired the lock. I'd just ignore it since you should have been blacklisted if the lock was broken.

* REGISTER_WATCH * lock_released received |
* | /-------------------------\
* v v |
* ACQUIRE_LEADER_LOCK * *> GET_LOCKER -> <peon>


@dillaman

dillaman Jan 17, 2017

Contributor

Nit: peon -> passive or secondary

@trociny


Contributor

trociny commented Jan 23, 2017

@dillaman significantly modified:

  1. All possible logic is moved from Replayer to the LeaderWatcher class. The Replayer only uses LeaderWatcher's is_leader method to check the current status, and the post_acquire/pre_release hooks to start/stop image replayers on leader lock state changes (see the sketch after this list).

  2. A timer is used to send heartbeat messages and to initiate breaking the lock.

  3. The ManagedLock post/pre acquire/release hooks are used to change the replayer state (stop image replayers, init/shut down the status watcher) on acquiring/releasing the leader lock. This way I hope to avoid issues when a leader state change is requested while we are still in a transition state.

  4. A "release leader" asok command is added (useful for testing, but users might find it useful too, e.g. to release leadership for some pools so another instance can take care of them).

  5. I need to track in-flight break_lock and get_locker operations for a proper shut down. My implementation does not look very nice though. Do you have any suggestions? I suppose I could use AsyncOpTracker, but actually I think maybe ManagedLock should be taught to track these requests and wait for them on shut down?
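
A rough sketch of the interaction described in item 1, using the is_leader / post_acquire / pre_release hooks mentioned above (the handler and member names are illustrative, not the exact PR code):

  // Replayer reacts to leader lock transitions reported by the LeaderWatcher
  void Replayer::handle_post_acquire_leader(Context *on_finish) {
    // this instance just became the active (leader) one: init the mirror
    // status watcher and start the image replayers for the pool
    init_mirror_status_watcher(on_finish);
  }

  void Replayer::handle_pre_release_leader(Context *on_finish) {
    // about to become passive: stop the image replayers and shut down the
    // status watcher before the leader lock is actually released
    stop_image_replayers(on_finish);
  }

  bool Replayer::is_leader() {
    Mutex::Locker locker(m_lock);
    return m_leader_watcher && m_leader_watcher->is_leader();
  }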

@dillaman


Contributor

dillaman commented Jan 23, 2017

@trociny For (5), I think it would be good for ManagedLock to track these async requests and delay shutdown until they are completed.
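
Sketched with an AsyncOpTracker-style helper (start_op/finish_op/wait_for_ops); placing it inside ManagedLock, and the send_shutdown helper, are assumptions for illustration:

  // inside ManagedLock: track async requests that bypass the state machine
  AsyncOpTracker m_async_op_tracker;

  void ManagedLock::break_lock(const managed_lock::Locker &locker, bool force,
                               Context *on_finish) {
    m_async_op_tracker.start_op();
    // notify the tracker once the user callback has run
    Context *ctx = new FunctionContext([this, on_finish](int r) {
        on_finish->complete(r);
        m_async_op_tracker.finish_op();
      });
    // ... issue the break request as before, completing ctx instead of on_finish ...
  }

  void ManagedLock::shut_down(Context *on_finish) {
    // delay the real shutdown until all in-flight break_lock/get_locker
    // requests have completed
    m_async_op_tracker.wait_for_ops(new FunctionContext(
        [this, on_finish](int r) {
          send_shutdown(on_finish);
        }));
  }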

@@ -0,0 +1,270 @@
// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-


@dillaman

dillaman Jan 26, 2017

Contributor

Can you create a basic mock test case for the leader watcher? It would require templatizing the LeaderWatcher class.


@trociny

trociny Jan 26, 2017

Contributor

Yes, this is in progress.

return 0;
}
void LeaderWatcher::shut_down() {


@dillaman

dillaman Jan 26, 2017

Contributor

Might be better to make this an async method w/ helper methods, now that you have to unlock/lock over and over to get around the fact that it is internally calling async methods.

release_leader_lock();
}
Context *LeaderWatcher::async(Context * ctx) {


@dillaman

dillaman Jan 26, 2017

Contributor

Nit: rename to create_async_context? or update the existing librbd::util::create_async_context to be able to handle either an ImageCtx-like param or a ContextWQ-like param.

m_locker = locker;
}
schedule_timer_task("break leader", 2, false,


@dillaman

dillaman Jan 26, 2017

Contributor

Two missed heartbeats result in a break lock? This should probably be configurable and more aligned with 30 seconds, like the watch timeout.
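
A hedged illustration of the suggestion; the option names below are hypothetical placeholders for whatever configurables get introduced, the callback name is illustrative, and the delay computation follows the existing schedule_timer_task call above:

  // derive the break-lock delay from configuration instead of hard-coding
  // two heartbeat periods, similar in spirit to the 30s watch timeout
  int interval = m_cct->_conf->rbd_mirror_leader_heartbeat_interval;      // hypothetical option
  int max_missed = m_cct->_conf->rbd_mirror_leader_max_missed_heartbeats; // hypothetical option
  schedule_timer_task("break leader", interval * max_missed, false,
                      &LeaderWatcher::break_leader_lock);                 // illustrative callback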

assert(m_lock.is_locked());
std::shared_ptr<librbd::managed_lock::Locker> locker(


@dillaman

dillaman Jan 26, 2017

Contributor

Why not just use a member variable for this in-flight request storage? You shouldn't have concurrent get locker requests in-flight, correct?


@trociny

trociny Jan 26, 2017

Contributor

Actually, I expect concurrent get locker requests. They are generated by these events:

  • lock acquire failed,
  • lock break failed,
  • lock released,
  • "heartbeat" message received,
  • "lock acquired" message received.

Since these are both internal and external events, some serialization is needed if we want to avoid concurrent requests. In my previous version I had a check in get_locker to skip the request if one was already in flight. That requires two additional member variables (the locker and an "in-flight" boolean). Using a shared pointer looked a little nicer to me, but I don't have a strong opinion here.


@dillaman

dillaman Jan 26, 2017

Contributor

You could also just create a custom struct context wrapper that has its own locker member variable. That would combine the FunctionContext and Locker allocations into a single wrapper.
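
A minimal sketch of such a wrapper (the struct and handler names are illustrative; the get_locker call at the end assumes a Locker*/Context* style signature):

  struct C_GetLocker : public Context {
    LeaderWatcher *leader_watcher;
    librbd::managed_lock::Locker locker;   // result storage owned by the request

    C_GetLocker(LeaderWatcher *leader_watcher)
      : leader_watcher(leader_watcher) {
    }

    void finish(int r) override {
      // one allocation carries both the callback and the locker result
      leader_watcher->handle_get_locker(r, locker);
    }
  };

  // usage:
  //   C_GetLocker *get_locker_ctx = new C_GetLocker(this);
  //   m_leader_lock->get_locker(&get_locker_ctx->locker, get_locker_ctx);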

@@ -147,6 +160,13 @@ class MirrorAdminSocketHook : public AdminSocketHook {
if (r == 0) {
commands[command] = new FlushCommand(mirror);
}
command = "rbd mirror release leader";


@dillaman

dillaman Jan 26, 2017

Contributor

Nit: perhaps "rbd mirror leader release" in case new leader asok commands are added in the future

@@ -152,6 +166,13 @@ class ReplayerAdminSocketHook : public AdminSocketHook {
if (r == 0) {
commands[command] = new FlushCommand(replayer);
}
command = "rbd mirror release leader " + name;


@dillaman

dillaman Jan 26, 2017

Contributor

Nit: same comment here

set_sources(ImageIds());
if (!m_image_replayers.empty()) {
Mutex::Locker timer_locker(m_threads->timer_lock);
Context *task = new FunctionContext(


@dillaman

dillaman Jan 26, 2017

Contributor

Can this be cleaned up? The nested FunctionContexts are hard to read.

bool is_leader();
void release_leader();
virtual void post_acquire_leader_handler(Context *on_finish) = 0;


@dillaman

dillaman Jan 26, 2017

Contributor

Perhaps just create a listener interface to avoid having to inherit from this class just to receive callbacks.
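
Something along these lines, grounded in the pure-virtual handler shown in the hunk above; the exact shape of the Listener is an assumption:

  struct Listener {
    virtual ~Listener() {
    }
    virtual void post_acquire_leader_handler(Context *on_finish) = 0;
    virtual void pre_release_leader_handler(Context *on_finish) = 0;
  };

  // the consumer (e.g. the Replayer) implements Listener and registers
  // itself instead of deriving from LeaderWatcher:
  //   m_leader_watcher->add_listener(&m_leader_listener);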

@trociny


Contributor

trociny commented Jan 30, 2017

@dillaman It looks ready for review now.

@trociny trociny changed the title from [DNM] rbd-mirror: add support for active/passive daemon instances to rbd-mirror: add support for active/passive daemon instances Jan 30, 2017

@dillaman


Contributor

dillaman commented Jan 30, 2017

@trociny Thanks -- I'll start a deep review today. Did you see the build failure?

LeaderWatcher(Threads *threads, librados::IoCtx &io_ctx);
void add_listener(Listener *listener);


@dillaman

dillaman Jan 31, 2017

Contributor

Is there a reason to need the add/remove listener methods and not just pass in the listener in the constructor? Is there a one-to-many relationship somewhere?


@trociny

trociny Jan 31, 2017

Contributor

There is no particular reason so far. It just looked more generic and could be useful for testing (to add test hooks). I can also imagine some useful applications, like a CLI watcher, though it is unlikely we will have one and this can always be changed. So if you prefer that the listener be passed in the constructor, I will update this.


@dillaman

dillaman Jan 31, 2017

Contributor

My only concern was the added complexity w/ handling multiple observers and how to properly handle adding/removing them while a callback is in flight. The code becomes cleaner w/o support for multiple listeners.

assert(m_lock.is_locked());
Context *ctx = create_async_context_callback(


@dillaman

dillaman Jan 31, 2017

Contributor

Nit: librbd::Watcher::register_watch is guaranteed to be an async callback already. Did you hit some oddity here that required an extra async wrapper?


@trociny

trociny Jan 31, 2017

Contributor

No, I didn't. I did hit it in unregister_watch, when the client was blacklisted, and changed this one for safety too (even if it is guaranteed to be an async callback now, I can't see why someone wouldn't change this in the future and forget to update LeaderWatcher).

If you don't like unnecessary async calls I will update this. Though I would like us to have some clear convention (markers in the header file?) to know whether a function is guaranteed to invoke its callback asynchronously without having to look at the code, or to update functions like unregister_watch to provide this guarantee.


@dillaman

dillaman Jan 31, 2017

Contributor

I'm fine w/ leaving it -- no harm and this isn't a high performance code path. It was more of a curiosity.
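
For context, the kind of wrapper being discussed: a sketch of a create_async_context_callback-style helper that requeues the completion on a work queue so the callback is always asynchronous (the actual librbd helper may differ in details):

  struct C_AsyncCallback : public Context {
    ContextWQ *op_work_queue;
    Context *on_finish;

    C_AsyncCallback(ContextWQ *op_work_queue, Context *on_finish)
      : op_work_queue(op_work_queue), on_finish(on_finish) {
    }

    void finish(int r) override {
      // complete the wrapped context via the work queue rather than inline
      op_work_queue->queue(on_finish, r);
    }
  };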

void handle_get_locker(int r, librbd::managed_lock::Locker& locker);
void acquire_leader_lock(bool reset_attempt_counter);
void acquire_leader_lock();


@dillaman

dillaman Jan 31, 2017

Contributor

Nit: I'd just update the call sites of this overload to use acquire_leader_lock(false) directly


@trociny

trociny Jan 31, 2017

Contributor

I need to pass it to schedule_timer_task, which expects void(*callback)().


@dillaman

dillaman Jan 31, 2017

Contributor

Ack

if (is_leader(m_lock)) {
dout(5) << "got another leader lock_acquired, ignoring" << dendl;
} else {
cancel_timer_task();


@dillaman

dillaman Jan 31, 2017

Contributor

Should "acquire" and "release" notifications reset m_acquire_attempts?


@trociny

trociny Jan 31, 2017

Contributor

For the "release" case it is already reset, but it is missed for "acquire". Thanks, I will update.
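
i.e. roughly this in the notification handler (a sketch; the handler name is illustrative and the surrounding code is abridged):

  void LeaderWatcher::handle_notify_lock_acquired(Context *on_finish) {
    dout(20) << "lock_acquired" << dendl;
    {
      Mutex::Locker locker(m_lock);
      if (is_leader(m_lock)) {
        dout(5) << "got another leader lock_acquired, ignoring" << dendl;
      } else {
        cancel_timer_task();
        m_acquire_attempts = 0;   // another instance became leader -- reset per review
        // ... schedule get_locker / break-lock handling as before ...
      }
    }
    on_finish->complete(0);
  }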

namespace {
static const uint64_t NOTIFY_TIMEOUT_MS = 5000;


@dillaman

dillaman Jan 31, 2017

Contributor

Nit: comment still open

void Replayer::set_sources(const ImageIds &image_ids)
{
dout(20) << "enter" << dendl;
assert(m_lock.is_locked());
if (!m_init_images.empty()) {
if (!m_init_images.empty() && !m_stopping.read()


@dillaman

dillaman Jan 31, 2017

Contributor

Nit: move && to end of this line

@trociny


Contributor

trociny commented Jan 31, 2017

@dillaman updated

@@ -0,0 +1,207 @@
#!/bin/sh


@dillaman

dillaman Jan 31, 2017

Contributor

@trociny Just realized while running the tests that this needs an associated suite test case to execute it under teuthology


@trociny

trociny Jan 31, 2017

Contributor

@dillaman The test starts/stops the rbd-mirror daemons itself, while under teuthology that is outside the script's control.

I suppose I can modify the script so that, when it is run under teuthology, it executes only a subset of the tests (maybe using "leader release" instead of stopping the daemon in some cases). I will investigate this tomorrow. Maybe you already have some hints or suggestions?


@dillaman

dillaman Jan 31, 2017

Contributor

@trociny Hmm -- perhaps what we really just need (long term) is a "thrasher" task like the ones for OSDs/MDSs. I'll create a separate tracker ticket for that.

@dillaman


Contributor

dillaman commented Feb 1, 2017

@trociny Running the rbd_mirror_ha.sh script locally has failed for me 3 out of 4 times. I will try to investigate more tomorrow.

Mykola Golub added some commits Oct 10, 2016

Mykola Golub
qa/workunits/rbd: allow to start several daemons for the same cluster
by optionally specifying the daemon instance after the cluster name and
a colon, like:

  start_mirror ${cluster}:${instance}

Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Mykola Golub
qa/workunits/rbd: tests for rbd mirror HA
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Mykola Golub
librbd: allow watcher derived classes access ioctx, oid and work queue
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Mykola Golub
librbd: pass notifier_id to Watcher handle_notify
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Mykola Golub
qa/workunits/rbd: tweaks to use RBD_MIRROR_USE_EXISTING_CLUSTER locally
This var is mostly used when running rbd_mirror test scripts on
teuthology.  It can be used locally though to speed up re-running the
tests:

Set a test temp directory:

  export RBD_MIRROR_TEMDIR=/tmp/tmp.rbd_mirror

Run the tests the first time with NOCLEANUP flag (the cluster and
daemons are not stopped on finish):

  RBD_MIRROR_NOCLEANUP=1 ../qa/workunits/rbd/rbd_mirror.sh

Now, to re-run the test without restarting the cluster, run cleanup
with USE_EXISTING_CLUSTER flag:

  RBD_MIRROR_USE_EXISTING_CLUSTER=1 \
    ../qa/workunits/rbd/rbd_mirror_ha.sh cleanup

and then run the tests:

  RBD_MIRROR_USE_EXISTING_CLUSTER=1 \
    ../qa/workunits/rbd/rbd_mirror_ha.sh

Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Mykola Golub
rbd-mirror: add class for mirror status watcher
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Mykola Golub
rbd-mirror HA: pool replayer should be started/stopped when lock acquired/released

Fixes: http://tracker.ceph.com/issues/17020
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Mykola Golub
rbd-mirror HA: create pool locker / leader class
Fixes: http://tracker.ceph.com/issues/17019
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Mykola Golub
librbd: track in-fly break_lock and get_locker requests in managed lock
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Mykola Golub
librbd: missing const qualifiers
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Mykola Golub
librbd: add create_async_context_callback helper
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Mykola Golub
librbd: equality operator for managed_lock::Locker
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
@trociny


Contributor

trociny commented Feb 1, 2017

@dillaman I was able to reproduce the rbd_mirror_ha.sh test failure after several runs (not sure it was your case though). In my case it failed on the "crash leader" test due to a too-short timeout in rbd_mirror_ha.sh's wait_for_leader (I forgot to update the test after I added acquire attempts to the LeaderWatcher). I updated the PR.

Also, I have added a suite test and am trying to run it on teuthology...

Mykola Golub
qa/suites/rbd: add rbd-mirror-ha test
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
@trociny


Contributor

trociny commented Feb 1, 2017

@dillaman

Also, I have added a suite test, trying to run this on teuthology

But I like your idea with a "thrasher" task too.

@dillaman


Contributor

dillaman commented Feb 1, 2017

@trociny Thanks -- works for me now

@dillaman dillaman merged commit 8b8023c into ceph:master Feb 1, 2017

2 of 3 checks passed

default Build finished.
Signed-off-by all commits in this PR are signed
Unmodifed Submodules submodules for project are unmodified

@trociny trociny deleted the trociny:wip-17020 branch Feb 2, 2017

@runsisi


Contributor

runsisi commented Mar 9, 2017

@dillaman after this PR, can we no longer add more than one peer for a local pool?

@dillaman


Contributor

dillaman commented Mar 9, 2017

@runsisi That has never been supported (yet) -- not sure why you think this PR removed that capability.

@runsisi


Contributor

runsisi commented Mar 10, 2017

@dillaman I did some tests on the latest Jewel release, i.e. 10.2.6. The results are shown below: the images in cluster cs4 and cluster cs5 were synced to cluster cs6 successfully.

So I think the "N remote pools -> 1 local pool" feature works as expected.

Do you mean that bidirectional mirroring, i.e. the "N remote pools <-> 1 local pool" feature, has not been supported, and that "N remote pools -> 1 local pool" is not something RBD mirroring is supposed to support?

[root@cs6 ~]# rbd info cs4i1 --cluster cs4
rbd image 'cs4i1':
        ...
        mirroring global id: 2b1be645-c753-40da-ac1f-103711cae407
        mirroring primary: true

[root@cs6 ~]# rbd info cs5i1 --cluster cs5 
rbd image 'cs5i1':
        ...
        mirroring global id: e59d1217-e29a-4d51-ae52-c6a573085d7f
        mirroring primary: true
[root@cs6 ~]# rbd mirror pool info rbd
Mode: image
Peers: 
  UUID                                 NAME CLIENT       
  6ccc4f09-ea5c-4a12-8f7f-ebc976b09131 cs5  client.admin 
  a5e58893-a4a8-4413-917c-53602ad4c411 cs4  client.admin

[root@cs6 ~]# rbd ls
cs4i1
cs5i1

[root@cs6 ~]# rbd info cs4i1
rbd image 'cs4i1':
        ...
        mirroring global id: 2b1be645-c753-40da-ac1f-103711cae407
        mirroring primary: false

[root@cs6 ~]# rbd info cs5i1
rbd image 'cs5i1':
        ...
        mirroring global id: e59d1217-e29a-4d51-ae52-c6a573085d7f
        mirroring primary: false
@dillaman


Contributor

dillaman commented Mar 10, 2017

@runsisi N remote peers syncing to 1 local cluster is definitely a scenario we have never tested nor designed against. Feel free to submit a PR to fix whatever got broken but N clusters -> 1 cluster is a very odd scenario that could run into other issues in the future (e.g. name collisions).

@runsisi


Contributor

runsisi commented Mar 10, 2017

@dillaman yeah, thank you for your explanation.

But then I am wondering: why does the rbd CLI support adding more than one peer for a local pool?

@dillaman


Contributor

dillaman commented Mar 10, 2017

@runsisi It's definitely a long-term goal to support N-to-N replication. We would like to have a better understanding of required use-cases in the real-world before we dive into such an endeavor. It probably would be a good idea to prevent the rbd CLI from adding more than one peer to a pool in the meantime.

@runsisi


Contributor

runsisi commented Mar 10, 2017

@dillaman I see. Thank you very much, Jason!

@dillaman


Contributor

dillaman commented Mar 10, 2017

@runsisi In the meantime, if you have cluster cs4 and cluster cs5 use different pool names, they could replicate to the associated two pools in cluster cs6.

@runsisi


Contributor

runsisi commented Mar 10, 2017

@dillaman Yes, I suppose this is the designed behavior :)
