rbd-mirror A/A: coordinate image syncs with leader #14745

Merged: 4 commits merged into ceph:master on Jun 7, 2017

@trociny
Contributor

trociny commented Apr 24, 2017

Fixes: http://tracker.ceph.com/issues/18789
Signed-off-by: Mykola Golub mgolub@mirantis.com

@trociny requested a review from @dillaman on Apr 24, 2017

@trociny


Contributor

trociny commented Apr 24, 2017

Implementation details:

  • The throttler is changed from per-cluster to per-pool, because the instance leader control is per pool.
  • The LeaderWatcher is used for sync messages.
  • To handle situations like a lost message, an instance crash, or a leader change, the following strategy is used: the proxy instance keeps resending the sync_request message at a 10 sec interval until sync_request_ack is received from the leader, and the leader keeps resending sync_request_ack at a 10 sec interval until sync_complete is received (a minimal standalone simulation of this resend loop follows after this list). Also, an instance remove hook is added to clear the in-progress syncs of a removed instance.
  • The sync state is not persisted in case of leader failover (as was proposed in the ticket [1]). I am not sure it is necessary. Due to the resent sync_request messages, the new leader will handle the syncs that have not started yet. Those that have already started will send a sync_complete notification to the new leader, which will just be ignored. The only drawback I see is that the number of concurrent syncs may temporarily double when the leader changes.
  • sync_started messages do not look necessary for this implementation.
  • The instance removal hook is added as a separate commit for now, because I am not sure it is the best approach.

[1] http://tracker.ceph.com/issues/18789
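
For illustration, a minimal standalone simulation of the resend loop described above (plain C++, not the actual rbd-mirror code; message names follow the list, and one lost ack is simulated to show the retry):

#include <iostream>

int main() {
  const int kResendSec = 10;   // resend interval from the description above
  bool ack_received = false;
  bool drop_first_ack = true;  // simulate one lost sync_request_ack
  int t = 0;

  while (!ack_received) {
    std::cout << "t=" << t << "s  proxy -> leader: sync_request\n";
    if (drop_first_ack) {
      drop_first_ack = false;  // the ack is lost; the timer fires again
    } else {
      std::cout << "t=" << t << "s  leader -> proxy: sync_request_ack\n";
      ack_received = true;
    }
    t += kResendSec;           // wait out the resend timer
  }
  std::cout << "proxy: slot granted, image sync starts\n";
  // On completion the proxy sends sync_complete; until then the leader
  // applies the same resend rule to sync_request_ack.
  return 0;
}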

@dillaman


Contributor

dillaman commented Apr 24, 2017

Nit: typo of "separate" in "rbd-mirror: refactor ImageSyncThrottler" commit message

@dillaman

My original thinking was that these messages would be part of InstanceWatcher since that would allow direct RPC between the instance requesting the sync and the leader that can grant it. Sending the RPC via the LeaderWatcher would result in a broadcast message.

The InstanceWatcher already has the notion of repeating messages until acked, which seems like it could be re-used here. Also, I think you would be able to eliminate the listener pattern by instantiating the InstanceSyncThrottler when the leader role is acquired and by passing the pointer to InstanceWatcher (and clearing it when the leader role is lost).
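
A minimal standalone sketch of that ownership pattern (the class names here are placeholders, not the real Ceph types): the throttler exists only while the leader role is held, and InstanceWatcher borrows a raw pointer instead of registering a listener:

#include <memory>

struct SyncThrottler {};          // stands in for the real sync throttler

struct InstanceWatcher {          // stands in for rbd::mirror::InstanceWatcher
  SyncThrottler *m_throttler = nullptr;
  void set_sync_throttler(SyncThrottler *t) { m_throttler = t; }
};

struct Replayer {                 // owner of both objects
  InstanceWatcher m_instance_watcher;
  std::unique_ptr<SyncThrottler> m_throttler;

  void handle_leader_acquired() {
    // the throttler only exists while this instance is the leader
    m_throttler = std::make_unique<SyncThrottler>();
    m_instance_watcher.set_sync_throttler(m_throttler.get());
  }

  void handle_leader_released() {
    // clear the borrowed pointer before destroying the throttler
    m_instance_watcher.set_sync_throttler(nullptr);
    m_throttler.reset();
  }
};

int main() {
  Replayer replayer;
  replayer.handle_leader_acquired();   // leader role won: throttler created
  replayer.handle_leader_released();   // role lost: throttler torn down
  return 0;
}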

}
sync_holder->m_on_finish->complete(r);
m_throttler->finish_op(sync_holder->m_local_image_id);
Mutex::Locker locker(m_lock);


@dillaman

dillaman Apr 24, 2017

Contributor

Nit: I would prefer to clean up the local state before firing the completion down to the instance throttler

@@ -100,6 +100,7 @@ std::vector<MockImageSync *> MockImageSync::instances;
// template definitions
#include "tools/rbd_mirror/InstanceSyncThrottler.cc"


@dillaman

dillaman Apr 24, 2017

Contributor

A template specialization of InstanceSyncThrottler should be defined above to eliminate the need to pull in the CC file here. That would also allow the test cases below to specifically test ImageSyncThrottler instead of its interactions w/ InstanceSyncThrottler.
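
A minimal standalone illustration of that suggestion (simplified stand-ins, not the actual gmock-based test code): specializing the template for the mock context means the real .cc file never has to be compiled into the test:

#include <cassert>
#include <string>

struct MockImageCtx {};        // stands in for librbd::MockImageCtx

template <typename ImageCtxT>
struct InstanceSyncThrottler;  // primary template: defined in the real .cc

template <>
struct InstanceSyncThrottler<MockImageCtx> {
  // test-only specialization: record the calls instead of throttling
  int start_ops = 0;
  int finish_ops = 0;
  void start_op(const std::string &, void *on_start) { ++start_ops; (void)on_start; }
  void finish_op(const std::string &) { ++finish_ops; }
};

int main() {
  InstanceSyncThrottler<MockImageCtx> throttler;
  throttler.start_op("local_image_id", nullptr);
  throttler.finish_op("local_image_id");
  assert(throttler.start_ops == 1 && throttler.finish_ops == 1);
  return 0;
}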

@@ -68,6 +136,10 @@ struct UnknownPayload {
typedef boost::variant<HeartbeatPayload,
LockAcquiredPayload,
LockReleasedPayload,
SyncRequestPayload,
SyncRequestAckPayload,


@dillaman

dillaman Apr 24, 2017

Contributor

Nit: perhaps this message could be called SyncStartPayload (from leader -> instance to grant permission to start the sync) and we could eliminate the previous use of SyncStartPayload below. The Ack suffix just seems odd to me since it's not an acknowledgement of the request but rather a command to start.

@trociny


Contributor

trociny commented Apr 24, 2017

@dillaman Thank you for the review. Using LeaderWatcher for sync messages looked appealing to me because there is no need to track who the current leader is when sending a sync notification. The broadcast message also made things easier. Using InstanceWatcher looked like it would complicate the implementation considerably, but I will try this way. Hope I was wrong.

@trociny


Contributor

trociny commented May 9, 2017

@dillaman updated

std::list<C_SyncHolder *> m_sync_queue;
std::map<PoolImageId, C_SyncHolder *> m_inflight_syncs;
std::map<std::string, C_SyncHolder *> m_waiting_syncs;


@dillaman

dillaman May 16, 2017

Contributor

Nit: seems like you can remove this collection since once you schedule the sync, you hand the management over to InstanceWatcher for starting / canceling. This collection just seems to be used for an assert.

class ProgressContext;
/**
* Manage concurrent image-syncs
*/
template <typename ImageCtxT = librbd::ImageCtx>
class ImageSyncThrottler : public md_config_obs_t {
class ImageSyncThrottler {


@dillaman

dillaman May 16, 2017

Contributor

Nit: should this class be the InstanceSyncThrottler since it's associated w/ the instance and the original version (now named InstanceSyncThrottler) should remain the ImageSyncThrottler (or LeaderSyncThrottler)? Right now, the InstanceXYZ classes are tied to a specific instance of rbd-mirror.

Alternatively, if you pass the InstanceWatcher reference down to the bootstrap state machine, it could instantly create an ImageSync state machine whose first state is just to invoke InstanceWatcher::notify_sync_start and wait. That would eliminate this pass-through class entirely.
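
A minimal standalone sketch of that alternative (illustrative names and a simplified Context type, not the PR's code): ImageSync's first state just requests a slot from InstanceWatcher and continues once it is granted:

#include <functional>
#include <iostream>
#include <string>

using Context = std::function<void(int)>;  // stands in for ceph's Context

struct InstanceWatcher {  // stands in for rbd::mirror::InstanceWatcher
  // forwards the request to the leader; calls on_start once granted
  void notify_sync_request(const std::string &sync_id, Context on_start) {
    std::cout << "leader granted a sync slot for " << sync_id << "\n";
    on_start(0);
  }
};

struct ImageSync {  // stands in for the ImageSync state machine
  InstanceWatcher *m_instance_watcher;
  std::string m_local_image_id;

  void send() { send_notify_sync_request(); }  // first state of the machine

  void send_notify_sync_request() {
    m_instance_watcher->notify_sync_request(
        m_local_image_id, [this](int r) { handle_notify_sync_request(r); });
  }

  void handle_notify_sync_request(int r) {
    if (r < 0) {
      std::cout << "sync request failed: " << r << "\n";
      return;
    }
    std::cout << "proceeding to the actual sync states\n";
  }
};

int main() {
  InstanceWatcher instance_watcher;
  ImageSync image_sync{&instance_watcher, "local_image_id"};
  image_sync.send();
  return 0;
}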


@trociny

trociny May 16, 2017

Contributor

The InstanceSyncThrottler name was chosen when I was working on the previous version, when this class handled sync notifications via the leader object. Now this name really looks wrong. I wouldn't like to change it to ImageSyncThrottler because it knows nothing about 'images', or even that the operations it throttles are syncs. I am going to change it to just Throttler.

I will evaluate your idea to remove ImageSyncThrottler and use the InstanceWatcher::notify_sync_xyz methods directly in the ImageSync state machine. Thanks.

<< ": canceling request to local throttler" << dendl;
if (on_start != nullptr) {
if (instance_id == m_instance_id) {
send_sync_start(sync_id, on_start);


@dillaman

dillaman May 16, 2017

Contributor

Nit: can you just bubble up -ESTALE for the local instance as well to avoid the special case?

return;
}
// This should only be possible when notify_finish_sync for the


@dillaman

dillaman May 16, 2017

Contributor

How does this situation occur?


@trociny

trociny May 16, 2017

Contributor

A user completes a sync by sending notify_finish_sync (which returns immediately, after queuing the sync notification to the leader), and after this (for some reason) calls notify_start_sync for the same image again. At this moment the previous notification for this sync_id may still be in progress.

This could be avoided if the caller used some unique value for sync_id instead of image_id, but that would complicate things on the caller side.


@dillaman

dillaman May 16, 2017

Contributor

ACK. Would it be possible to remove the special on_complete callback and just wrap the on_start context in that case with this special handling?

@@ -72,8 +74,19 @@ class InstanceWatcher : protected librbd::Watcher {
const std::string &peer_image_id,
bool schedule_delete, Context *on_notify_ack);
void notify_start_sync(const std::string &request_id,


@dillaman

dillaman May 16, 2017

Contributor

Nit: perhaps swap to "noun_verb" form like the methods above? notify_sync_start, notify_sync_cancel, etc

@dillaman


Contributor

dillaman commented May 16, 2017

@trociny I think the optimization to directly pass through image sync requests for the local leader's images might be adding undue complications. The code complexity would most likely be reduced if we just eliminated all those special cases, always sent the message, and allowed it to be received back by the watcher.

assert(sync_ctx->on_complete == nullptr);
sync_ctx->on_complete = new FunctionContext(
[this, sync_id, on_start] (int r) {
if (r == -ECANCELED) {


@dillaman

dillaman May 16, 2017

Contributor

Is this state possible? If chaining off a previous sync request that was canceled, and we somehow raced to restart the sync request, should the previous -ECANCELED be applied to the new sync request?


@trociny

trociny May 16, 2017

Contributor

When the notification for the previous request is completed, on_complete is always called with r == 0 (see handle_notify_complete_sync). This case occurs when the caller cancels a sync that has not started yet (handled in notify_cancel_sync, if on_complete is not null).


@dillaman

dillaman May 16, 2017

Contributor

Would that be a race condition back to the throttler? Since the cancel method doesn't take an "on_finish" callback, the caller could just assume it's all properly cleaned up. However, eventually the "on_start" will get invoked with an -ECANCELED parameter.

If contexts were passed to the finish and cancel methods, perhaps these boundary cases would be clearer (see the sketch below)?
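
A standalone sketch of what passing a context to cancel might look like (hypothetical names and a simplified Context type; not the PR's code). The caller learns exactly when cleanup finished instead of inferring it from the -ECANCELED delivery:

#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <utility>

using Context = std::function<void(int)>;  // stands in for ceph's Context

struct SyncRequests {
  std::map<std::string, Context> m_pending;  // sync_id -> on_start

  void request(const std::string &sync_id, Context on_start) {
    m_pending[sync_id] = std::move(on_start);
  }

  // cancel completes on_start with -ECANCELED *and* tells the caller when
  // the request is fully cleaned up, closing the race window
  void cancel(const std::string &sync_id, Context on_finish) {
    auto it = m_pending.find(sync_id);
    if (it == m_pending.end()) {
      on_finish(-2 /* -ENOENT: nothing pending for this sync_id */);
      return;
    }
    Context on_start = std::move(it->second);
    m_pending.erase(it);
    on_start(-125 /* -ECANCELED */);
    on_finish(0);
  }
};

int main() {
  SyncRequests requests;
  requests.request("img0", [](int r) { std::cout << "on_start r=" << r << "\n"; });
  requests.cancel("img0", [](int r) { std::cout << "cancel done r=" << r << "\n"; });
  return 0;
}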


@trociny

trociny May 16, 2017

Contributor

@dillaman I think I see some problems with my current code here. I am not sure we are talking about the same thing, but let's discuss this when I update the code, fixing the problems I see and addressing your other comments.

@trociny


Contributor

trociny commented May 25, 2017

@dillaman Here is a new version. The main changes:

  • ImageSyncThrottler is removed (the sync request logic moved to the ImageSync state machine).
  • I tried to simplify sync message tracking in InstanceWatcher by allowing duplicates.
  • I returned to the SyncRequest/SyncStart scheme to make sure the throttler's finish_op will not be lost (e.g. due to a lost ack or an instance crash).

Some details on the last item. Now, when requesting a sync slot, the instance sends SyncRequest messages and waits for an ack (resending after a timeout). When the leader throttler grants the sync slot, the SyncRequest is acked and the leader sends a SyncStart message (resending after a timeout until it is acked). The SyncStart is acked by the instance when the sync is complete.

A situation is possible where the leader receives a duplicate 'SyncRequest' (resent after a timeout) after it has already granted the sync slot and sent 'SyncStart'. In this case a duplicate 'SyncStart' is sent, which is handled on the receiver side by completing the previous 'SyncStart' with -ESTALE (a simplified model of this handling follows below).

In this scheme it looks like 'SyncComplete' notifications are not necessary and just lead to a duplicate sync throttler finish_op. It looks like I can remove them, so notify_sync_complete would only complete the ack context for the received 'SyncStart' notification. I have not removed them yet though -- I want to know your opinion first, about the approach in general and about the 'SyncComplete' notifications in particular.
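
A simplified standalone model of the duplicate-SyncStart handling described above (illustrative names and a simplified Context type; the errno value is assumed from Linux):

#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <utility>

using Context = std::function<void(int)>;  // stands in for ceph's Context

struct Instance {
  std::map<std::string, Context> m_inflight;  // sync_id -> ack for SyncStart

  void handle_sync_start(const std::string &sync_id, Context ack) {
    auto it = m_inflight.find(sync_id);
    if (it != m_inflight.end()) {
      // Duplicate SyncStart (e.g. a re-sent SyncRequest raced with the
      // grant): complete the previous one with -ESTALE, keep the new ack.
      Context prev = std::move(it->second);
      it->second = std::move(ack);
      prev(-116 /* -ESTALE on Linux */);
      return;
    }
    m_inflight[sync_id] = std::move(ack);
    std::cout << "sync " << sync_id << " started\n";
  }

  void complete_sync(const std::string &sync_id) {
    auto it = m_inflight.find(sync_id);
    if (it == m_inflight.end()) {
      return;
    }
    Context ack = std::move(it->second);
    m_inflight.erase(it);
    ack(0);  // acking SyncStart releases the leader's throttler slot
  }
};

int main() {
  Instance instance;
  instance.handle_sync_start("img0", [](int r) { std::cout << "ack1 r=" << r << "\n"; });
  instance.handle_sync_start("img0", [](int r) { std::cout << "ack2 r=" << r << "\n"; });  // duplicate
  instance.complete_sync("img0");
  return 0;
}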

@dillaman


Contributor

dillaman commented May 26, 2017

@trociny In general it sounds good to me.

@trociny


Contributor

trociny commented May 29, 2017

@dillaman updated (unnecessary 'SyncComplete' messages are removed)

@@ -785,8 +785,14 @@ void InstanceWatcher<I>::handle_image_acquire(
const std::string &peer_image_id, Context *on_finish) {
dout(20) << "global_image_id=" << global_image_id << dendl;
m_instance_replayer->acquire_image(global_image_id, peer_mirror_uuid,
peer_image_id, on_finish);
auto ctx = new FunctionContext(


@dillaman

dillaman May 30, 2017

Contributor

It would probably be a good idea to track these queued callbacks via an AsyncOpTracker to prevent lock release / shutdown races.
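
For reference, a simplified single-threaded model of the AsyncOpTracker pattern being suggested (Ceph's real tracker in common/AsyncOpTracker.h is thread-safe; this sketch only shows the counting idea):

#include <functional>
#include <iostream>

struct AsyncOpTracker {
  int m_pending = 0;
  std::function<void()> m_on_drain;

  void start_op() { ++m_pending; }

  void finish_op() {
    if (--m_pending == 0 && m_on_drain) {
      auto cb = std::move(m_on_drain);
      m_on_drain = nullptr;
      cb();
    }
  }

  void wait_for_ops(std::function<void()> on_drain) {
    if (m_pending == 0) {
      on_drain();
      return;
    }
    m_on_drain = std::move(on_drain);
  }
};

int main() {
  AsyncOpTracker tracker;
  tracker.start_op();                       // a queued image_acquire callback
  tracker.wait_for_ops([] { std::cout << "safe to shut down\n"; });
  tracker.finish_op();                      // callback ran; tracker drains
  return 0;
}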

// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-
// vim: ts=8 sw=2 smarttab
#include "Throttler.h"


@dillaman

dillaman May 30, 2017

Contributor

Nit: why the name change to something so generic? This seems to be heavily tied to sync throttling.


@trociny

trociny May 30, 2017

Contributor

Now it can throttle any ops; it has nothing specific to image sync (or even sync in general). That is why the name is so generic. If you think it is a bad idea, I will change it. ImageSyncThrottler? SyncThrottler?
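
A minimal standalone sketch of the generic throttling core under discussion (illustrative, not the final Ceph class): a fixed cap on concurrent ops, with waiters queued until a slot frees up:

#include <deque>
#include <functional>
#include <iostream>
#include <set>
#include <string>
#include <utility>

using Context = std::function<void(int)>;  // stands in for ceph's Context

struct Throttler {
  size_t m_max_concurrent;
  std::set<std::string> m_inflight;
  std::deque<std::pair<std::string, Context>> m_queue;

  explicit Throttler(size_t max_concurrent) : m_max_concurrent(max_concurrent) {}

  void start_op(const std::string &id, Context on_start) {
    if (m_inflight.size() < m_max_concurrent) {
      m_inflight.insert(id);
      on_start(0);
    } else {
      m_queue.emplace_back(id, std::move(on_start));  // wait for a free slot
    }
  }

  void finish_op(const std::string &id) {
    m_inflight.erase(id);
    if (!m_queue.empty()) {
      auto [next_id, on_start] = std::move(m_queue.front());
      m_queue.pop_front();
      m_inflight.insert(next_id);
      on_start(0);  // hand the freed slot to the next waiter
    }
  }
};

int main() {
  Throttler throttler(1);  // cap of 1 concurrent op, for illustration
  throttler.start_op("img0", [](int) { std::cout << "img0 running\n"; });
  throttler.start_op("img1", [](int) { std::cout << "img1 running\n"; });
  throttler.finish_op("img0");  // img1 now gets the slot
  return 0;
}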


@dillaman

dillaman May 30, 2017

Contributor

I figured that was the idea, but it is still tied to a specific config option for the max number of syncs, and it dumps status for the number of syncs.


@trociny

trociny May 30, 2017

Contributor

Ah, good point. I forgot about this. I will change it back to ImageSyncThrottler.

@@ -85,10 +85,14 @@ template <typename I>
void BootstrapRequest<I>::cancel() {
dout(20) << dendl;
Mutex::Locker locker(m_lock);
m_canceled = true;
{


@dillaman

dillaman May 30, 2017

Contributor

Nit: no need for new block

m_client_meta, m_work_queue, ctx,
m_progress_ctx);
if (m_canceled) {
m_ret_val = -ECANCELED;


@dillaman

dillaman May 30, 2017

Contributor

Nit: was this change needed? If the image sync wasn't started, no need to worry about a race w/ the result code.


@trociny

trociny May 30, 2017

Contributor

There was no special need; this just looked more readable to me. And even if there is no need to worry about a race, I prefer to keep the lock in cases like this if it does not cause trouble, because when I look at the code later it always looks alarming to see variables like these accessed without a lock, and I need to go through the code to ensure it is safe.

OK, I will revert it to make the changes less intrusive.


@dillaman

dillaman May 30, 2017

Contributor

No worries -- just double checking

@@ -72,8 +74,18 @@ class InstanceWatcher : protected librbd::Watcher {
const std::string &peer_image_id,
bool schedule_delete, Context *on_notify_ack);
void notify_sync_request(const std::string &sync_id, Context *on_sync_start);
bool cancel_notify_sync_request(const std::string &sync_id);


@dillaman

dillaman May 30, 2017

Contributor

Nit: perhaps just cancel_sync_request?

@trociny changed the title from "[DNM] rbd-mirror A/A: coordinate image syncs with leader" to "rbd-mirror A/A: coordinate image syncs with leader" on May 31, 2017

@trociny


Contributor

trociny commented May 31, 2017

@dillaman updated

@dillaman


Contributor

dillaman commented Jun 2, 2017

retest this please

Mykola Golub added some commits Apr 16, 2017

Mykola Golub
rbd-mirror: make sync throttler per pool
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Mykola Golub
test/rbd_mirror: TestMockImageReplayer cleanup
Remove unnecessary ImageSyncThrottler dependency.

Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Mykola Golub
rbd-mirror: resolve potential recursive lock
This would have been possible when release_image led to canceling a sync (i.e. cancel_sync_request).

Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Mykola Golub
rbd-mirror A/A: coordinate image syncs with leader
Fixes: http://tracker.ceph.com/issues/18789
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
@trociny


Contributor

trociny commented Jun 6, 2017

rebased to resolve conflicts with master

@dillaman

lgtm -- test failures seem unrelated to this change

@dillaman merged commit 7bda34e into ceph:master on Jun 7, 2017

3 checks passed

Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
default: Build finished.