rbd-mirror A/A: introduce basic image mapping policy #15691

vshankar · 2017-06-14T14:43:50Z

NOTE: this PR was carved out of a branch (https://github.com/vshankar/ceph/commits/rbd-mirror-image-distribution) that was originally intended to be the PR itself; that branch distributes images amongst mirror instances based on a policy. The actual logic to map/remap images is still part of the branch, which currently conflicts with master.

Introduce a simple image mapping policy (with lookup/map/shuffle semantics), plus state machines to save/load the image-to-instance mapping state to/from disk and to notify mirror instances to acquire/release a given image.

Part of tracker: http://tracker.ceph.com/issues/18786

vshankar · 2017-06-15T16:04:52Z

@trociny @dillaman I'm folding the map/remap state machine into one state machine as most of it is common (I just had it that way initially).

Also, I think the separate UNMAPPING, MAPPING and MAPPED states might not all be required. We can just have a single MAPPED state, since we stop the replayer (on one instance) and start it again (on another). If a replayer wasn't running, that would be a no-op.

trociny

I am very interested in seeing @dillaman's comments, but I would like to see a better abstraction for the Policy class.

I think the Policy doesn't need to know about InstanceMap::InMap and ImageSpec. The map that is passed to the Policy could be an abstract structure like the BitVector we use when operating on the image object map.

Also note that, in the general case, depending on the policy, adding an item (image) to the map may cause a rebalance of other items. The same goes for removal. So any method that changes the map should return a new map (and probably a map diff).

I think we could have just one method like this:

remap(changes, current_map, *new_map, *diff)

Or several methods, depending on how it will be used, like these:

update_instances(instances_to_add, instances_to_remove, current_map, *new_map, *diff)
update_images(images_to_add, images_to_remove, current_map, *new_map, *diff)
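
A minimal sketch of the single-method shape proposed above. All type and function names here (`Mapping`, `Changes`, `Diff`, `remap`) are hypothetical, and the least-loaded placement is only a stand-in for whatever a real policy would do; none of it is taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical types for illustration; none of these names come from the PR.
using InstanceId = std::string;
using GlobalImageId = std::string;
using Mapping = std::map<GlobalImageId, InstanceId>;

struct Changes {
  std::set<GlobalImageId> images_to_add;
  std::set<GlobalImageId> images_to_remove;
};

struct Diff {
  Mapping mapped;    // image -> instance that should acquire it
  Mapping unmapped;  // image -> instance that should release it
};

// Computes a new mapping from the current one without mutating it.
// Placement is "least loaded instance"; a real policy may also rebalance
// existing images, which is why the diff is reported separately.
inline void remap(const Changes &changes, const std::set<InstanceId> &instances,
                  const Mapping &current_map, Mapping *new_map, Diff *diff) {
  assert(new_map != nullptr && diff != nullptr);
  assert(!instances.empty());

  *new_map = current_map;
  for (const auto &image : changes.images_to_remove) {
    auto it = new_map->find(image);
    if (it != new_map->end()) {
      diff->unmapped[image] = it->second;
      new_map->erase(it);
    }
  }

  // count images per instance so additions go to the least loaded one
  std::map<InstanceId, std::size_t> load;
  for (const auto &instance : instances) load[instance] = 0;
  for (const auto &entry : *new_map) ++load[entry.second];

  for (const auto &image : changes.images_to_add) {
    if (new_map->count(image)) continue;  // already mapped
    auto target = load.begin();
    for (auto it = load.begin(); it != load.end(); ++it) {
      if (it->second < target->second) target = it;
    }
    (*new_map)[image] = target->first;
    diff->mapped[image] = target->first;
    ++target->second;
  }
}
```

Returning the diff alongside the new map lets the caller notify only the instances whose assignments actually changed.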

trociny · 2017-06-16T07:54:26Z

@vshankar I also had some comments on your code, but it looks like they were lost (not added) after I submitted the review. Anyway, I don't think there is much value in them until we discuss the interface.

dillaman · 2017-06-16T14:11:29Z

In general, the policy interface should support a batch update interface. The mapping should be fully encapsulated by the policy. The PoolWatcher::handle_update method should pass the data it receives to the policy so that the policy can batch update. Since policy updates cannot occur immediately (i.e. the policy needs to commit its intentions to disk before communicating them back to the PoolWatcher), the policy should have a simple listener interface with a collection of images to unmap and map (i.e. collections of instance id and global image id).

The policy should be loaded from disk after the instance is promoted to leader. The policy should know which peers exist so that when it receives the initial image update for a peer, it can properly prune images when zero peers (and the local pool) contain the image. That per-image peer state doesn't need to be persisted to disk since it can be rebuilt dynamically upon startup. The policy should wait for the initial image list from all registered peers (and the local pool) before making any changes.

Eventually when we add support for IO stats to be communicated from rbd-mirror instances back to the leader, those stats can be forwarded to a more advanced policy and again the listener callback would be used to update the assignments.
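
One way to picture the "wait for the initial image list from all registered peers (and local pool)" rule is a small readiness tracker. This is only a sketch: the class name `InitialListTracker` is hypothetical, and using an empty mirror uuid to stand for the local pool is an assumption based on the "non empty mirror uuids" remark later in this thread.

```cpp
#include <set>
#include <string>

// Hypothetical helper; the empty mirror uuid is assumed to denote the
// local pool.
class InitialListTracker {
public:
  explicit InitialListTracker(const std::set<std::string> &peer_uuids)
    : m_pending(peer_uuids) {
    m_pending.insert("");  // the local pool must report as well
  }

  // Record that the first image list for a peer (or the local pool) arrived.
  void handle_initial_list(const std::string &mirror_uuid) {
    m_pending.erase(mirror_uuid);
  }

  // The policy would only start mapping or pruning once this returns true.
  bool ready() const { return m_pending.empty(); }

private:
  std::set<std::string> m_pending;  // sources that have not reported yet
};
```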

dillaman · 2017-06-16T13:42:30Z

src/cls/rbd/cls_rbd_types.h

@@ -364,6 +364,35 @@ struct TrashImageSpec {
 };
 WRITE_CLASS_ENCODER(TrashImageSpec);

+enum ImageMapState {


Nit: Mirror prefix on all this stuff

dillaman · 2017-06-16T13:50:48Z

src/cls/rbd/cls_rbd_types.h

+
+  std::string instance_id;
+  ImageMapState state;
+


We should add MirrorPolicyData struct here (see librbd::journal::ClientData) so that future policies can store extra metadata (i.e. future policy might track IOPS / throughput for distribution).

Will add this in the next update.

dillaman · 2017-06-16T14:12:35Z

src/cls/rbd/cls_rbd.cc

@@ -3097,6 +3097,7 @@ static const std::string IMAGE_KEY_PREFIX("image_");
 static const std::string GLOBAL_KEY_PREFIX("global_");
 static const std::string STATUS_GLOBAL_KEY_PREFIX("status_global_");
 static const std::string INSTANCE_KEY_PREFIX("instance_");
+static const std::string IMAGE_MAP_KEY_PREFIX("map_image_");


Nit: need "mirror_" prefix on all these new things

dillaman · 2017-06-16T14:13:09Z

src/cls/rbd/cls_rbd.cc

@@ -4393,6 +4514,118 @@ int mirror_instances_remove(cls_method_context_t hctx, bufferlist *in,
 }

 /**
+ * Input:
+ * none


Nit: it does have input

dillaman · 2017-06-16T14:13:34Z

src/cls/rbd/cls_rbd.cc

+
+/**
+ * Input:
+ * @param cls::rbd::ImageMap: image map


Nit: also has global image id

dillaman · 2017-06-16T14:13:56Z

src/cls/rbd/cls_rbd.cc

+ * @param std::map<std::string, cls::rbd::ImageMap>: image map
+ * @returns 0 on success, negative error code on failure
+ */
+int image_map_get(cls_method_context_t hctx, bufferlist *in,


I don't think we would need an individual "getter"

dillaman · 2017-06-29T13:43:23Z

src/tools/rbd_mirror/CMakeLists.txt

@@ -33,7 +33,14 @@ set(rbd_mirror_internal
  image_sync/SnapshotCreateRequest.cc
  image_sync/SyncPointCreateRequest.cc
  image_sync/SyncPointPruneRequest.cc
-  pool_watcher/RefreshImagesRequest.cc)
+  pool_watcher/RefreshImagesRequest.cc
+  image_map/Types.cc


Nit: alphabetical order in the list

dillaman · 2017-06-29T13:48:05Z

src/tools/rbd_mirror/image_map/LoadRequest.cc

+void LoadRequest<I>::send() {
+  dout(20) << dendl;
+
+  // there needs to be atleast one instance (ourself)


Nit: spelling

dillaman · 2017-06-29T13:50:32Z

src/tools/rbd_mirror/image_map/LoadRequest.h

+  static LoadRequest *create(librados::IoCtx &ioctx,
+                             InMap *inmap, Context *on_finish) {
+    return new LoadRequest(
+      ioctx, inmap, on_finish);


Nit: looks like it will fit on previous line

dillaman · 2017-06-29T13:54:49Z

src/tools/rbd_mirror/image_map/Policy.cc

+void Policy<I>::update_images(const std::string &mirror_uuid,
+                              ImageIds &&added_image_ids,
+                              ImageIds &&removed_image_ids,
+                              MapSpecRef mapspec, Context *on_finish) {


Since you will have a listener callback, the PoolReplayer really shouldn't pass in nor care about a MapSpecRef nor on_finish. This should just chug along in the background. The listener callback from the policy could provide std::map<instance id, std::set<ImageId>> params for add / remove events and therefore no worry about inter-class locking and we hide image_map types from the layer above (plus we need to have a timestamp for each image so that we don't ping-pong images back-and-forth between instances during rebalance).

I haven't added the shuffle timestamp yet in the update -- will have that along with additional tests in the next update.
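
A rough sketch of what such a listener could look like. Only the `std::map<instance id, std::set<...>>` parameter shape comes from the comment above; the names `PolicyListener`, `handle_images_mapped` and `handle_images_unmapped` are hypothetical, and the shuffle timestamp is left out.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical names; only the map-of-sets shape comes from the review.
using InstanceId = std::string;
using GlobalImageIds = std::set<std::string>;
using InstanceToImages = std::map<InstanceId, GlobalImageIds>;

struct PolicyListener {
  virtual ~PolicyListener() = default;
  // Invoked by the policy once it has persisted its decision.
  virtual void handle_images_mapped(const InstanceToImages &added) = 0;
  virtual void handle_images_unmapped(const InstanceToImages &removed) = 0;
};

// Example listener that just records what it was told to acquire/release.
struct RecordingListener : public PolicyListener {
  InstanceToImages acquired;
  InstanceToImages released;

  void handle_images_mapped(const InstanceToImages &added) override {
    for (const auto &entry : added) {
      acquired[entry.first].insert(entry.second.begin(), entry.second.end());
    }
  }

  void handle_images_unmapped(const InstanceToImages &removed) override {
    for (const auto &entry : removed) {
      released[entry.first].insert(entry.second.begin(), entry.second.end());
    }
  }
};
```

With this shape the layer above never sees image_map types or a completion context; it only reacts to the two callbacks.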

dillaman · 2017-06-29T14:05:08Z

src/tools/rbd_mirror/image_map/Policy.cc

+      update_images_added(mirror_uuid, std::move(added_image_ids),
+                          &mapspec->added, gather_ctx->new_sub());
+    }
+  }


You might need to prune the in-memory image mapping if, after adding/removing, a given image has zero registered peers. You cannot rely on the "remove" event since, at start-up, the first event will contain only "add" events even though it is a complete state dump. It might be easier to track if the PoolWatcher had two different event callbacks for handle_refresh and handle_update, so that when you get a refresh, you just treat that as the current state of the world. For this PR, I would just ensure that you have two public API methods (update_images and set_images) and we will fix the watcher later.

This was intended to be called via PoolWatcher, so it was safe, I guess, since upon refreshing the local and remote pools we would know which images in the local pool need to be removed. This would not be true if it's invoked via PoolReplayer.
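
To make the update_images / set_images distinction concrete, here is a small in-memory sketch. The two method names come from the review; the class `PeerTracker`, its return values, and the way a full dump is turned into an implicit removal list are assumptions for illustration.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical in-memory tracker; only the method names update_images and
// set_images come from the review.
class PeerTracker {
public:
  using GlobalImageIds = std::set<std::string>;

  // Incremental change for one peer. Returns images left with zero peers.
  GlobalImageIds update_images(const std::string &mirror_uuid,
                               const GlobalImageIds &added,
                               const GlobalImageIds &removed) {
    for (const auto &id : added) m_peers[id].insert(mirror_uuid);
    GlobalImageIds pruned;
    for (const auto &id : removed) {
      auto it = m_peers.find(id);
      if (it == m_peers.end()) continue;  // tolerate unknown images
      it->second.erase(mirror_uuid);
      if (it->second.empty()) {
        m_peers.erase(it);
        pruned.insert(id);
      }
    }
    return pruned;
  }

  // Full state dump for one peer: anything the peer no longer lists is
  // treated as removed for that peer.
  GlobalImageIds set_images(const std::string &mirror_uuid,
                            const GlobalImageIds &image_ids) {
    GlobalImageIds removed;
    for (const auto &entry : m_peers) {
      if (entry.second.count(mirror_uuid) && !image_ids.count(entry.first)) {
        removed.insert(entry.first);
      }
    }
    return update_images(mirror_uuid, image_ids, removed);
  }

  bool tracked(const std::string &id) const { return m_peers.count(id) > 0; }

private:
  // global image id -> peers (mirror uuids) that currently have the image
  std::map<std::string, std::set<std::string>> m_peers;
};
```

The sketch skips unknown images instead of asserting on the erase count, so a stale or duplicate removal does not abort the daemon.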

dillaman · 2017-06-29T14:07:03Z

src/tools/rbd_mirror/image_map/Policy.cc

+    images->emplace(image_id.id, image_id.global_id, instance_id);
+
+    auto rm = m_info[image_id.global_id].peer_uuid.erase(mirror_uuid);
+    assert(rm == 1);


Nit: not sure that is safe

This was intended to be called from PoolWatcher.

dillaman · 2017-06-29T15:05:57Z

src/tools/rbd_mirror/image_map/RemoveRequest.cc

@@ -0,0 +1,133 @@
+// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-


Combine RemoveRequest with an UpdateRequest that can batch multiple remove/update operations w/ a single op to the OSD.

dillaman · 2017-06-29T15:09:03Z

src/tools/rbd_mirror/image_map/Types.h

+  // in place to_instance_id would point to the
+  // instance id of the mirror daemon where the
+  // image needs to be reassigned (from currently
+  // instance_id).


Nit: I would add a state to this so you can track (in-memory) your current and next actions in one place instead of having multiple lists of things to map / unmap / waiting for instance response / etc.

Umm ok. That would need to be done for each image (current action and a set of next actions to be performed). This would need to be part of the ImageInfo structure and not this one, I guess. Thinking more about this (Image) structure, it seems that we may not need it at all -- we can maintain a list of image ids (ImageIds) which the timer thread would process by looking at the current action to be performed (per image).

dillaman · 2017-06-29T15:13:45Z

src/tools/rbd_mirror/image_map/Types.h

+std::ostream &operator<<(std::ostream &, const ImageInfo &info);
+
+typedef std::set<std::string> GlobalImageIds;
+typedef std::map<std::string, GlobalImageIds> InMap;


I have a real irrational hatred of this name. I think this essentially should be a private typedef of Policy since the LoadRequest only really needs to pass a map of global image ids -> instance ids (just like the combined UpdateRequest would just take a map of global image ids -> instance ids for update / removal; an empty instance id could reflect removal).
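
As an illustration of the "empty instance id could reflect removal" convention, here is a short sketch. The alias `ImageMapping` and the function `apply_updates` are hypothetical names, not part of the PR.

```cpp
#include <map>
#include <string>

// Hypothetical aliases: global image id -> instance id.
using GlobalImageId = std::string;
using InstanceId = std::string;
using ImageMapping = std::map<GlobalImageId, InstanceId>;

// Applies a batch in which an empty instance id means "remove this image".
inline void apply_updates(const ImageMapping &updates, ImageMapping *mapping) {
  for (const auto &entry : updates) {
    if (entry.second.empty()) {
      mapping->erase(entry.first);              // removal
    } else {
      (*mapping)[entry.first] = entry.second;   // insert or reassign
    }
  }
}
```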

dillaman · 2017-06-29T15:23:34Z

src/tools/rbd_mirror/image_map/Policy.cc

+
+    std::string instance_id = lookup(image_id.global_id);
+    if (instance_id == UNMAPPED_INSTANCE_ID) {
+      instance_id = map(image_id.global_id);


This policy should really be asynchronous, based on a periodic timer callback. When the policy sees a bunch of changes, it can update the in-memory state and set a timer for the map/unmap in bulk. The timer period can be kept short (i.e. 1 second), but when we come online and have 3000 images, we don't want to do one-off mappings.
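
The batching idea can be sketched without any threading by letting the caller pass in the current time. `BatchedMapper`, `queue` and `tick` are hypothetical names; a real implementation would presumably be driven by the daemon's timer thread rather than explicit `tick` calls.

```cpp
#include <chrono>
#include <functional>
#include <set>
#include <string>
#include <utility>

// Hypothetical batching helper. The caller supplies the current time so the
// logic can be exercised without a timer thread.
class BatchedMapper {
public:
  using Clock = std::chrono::steady_clock;
  using Flush = std::function<void(const std::set<std::string> &)>;

  BatchedMapper(Clock::duration period, Flush flush)
    : m_period(period), m_flush(std::move(flush)) {}

  // Record a change; the first change of a batch arms the deadline.
  void queue(const std::string &global_image_id, Clock::time_point now) {
    if (m_pending.empty()) m_deadline = now + m_period;
    m_pending.insert(global_image_id);
  }

  // Called periodically; hands everything queued so far to the flush
  // callback as one batch once the deadline has passed.
  void tick(Clock::time_point now) {
    if (m_pending.empty() || now < m_deadline) return;
    std::set<std::string> batch;
    batch.swap(m_pending);
    m_flush(batch);
  }

private:
  Clock::duration m_period;
  Flush m_flush;
  std::set<std::string> m_pending;
  Clock::time_point m_deadline;
};
```

With a one-second period, a start-up burst of thousands of image notifications collapses into a handful of bulk map/unmap passes instead of one pass per image.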

dillaman · 2017-06-29T15:25:40Z

Perhaps for the short term Policy and SimplePolicy can just be combined if that would reduce the development timeline (since we are out of time)? We can refactor it later.

vshankar · 2017-07-11T15:00:14Z

@dillaman Wanted to check one thing before I update the PR: would it make sense to have the local image notifications map to just the local instance instead of getting assigned to peer instances? This, I think, would make some things easier -- on-disk state (and the purge logic) would need to be maintained only for remote images (non-empty mirror uuids). Thoughts?

dillaman · 2017-07-19T02:30:55Z

src/cls/rbd/cls_rbd.cc

+  int max_read = MIN(RBD_MAX_KEYS_READ, max_return);
+  std::string last_read = mirror_image_map_key(start_after);
+
+  while (max_read) {


Nit: max_read > 0

dillaman · 2017-07-19T02:31:37Z

src/cls/rbd/cls_rbd.cc

+
+    max_read = MIN(RBD_MAX_KEYS_READ, max_return - image_map->size());
+    if (!vals.empty()) {
+      last_read = mirror_image_map_key(image_map->rbegin()->first);


last_read = vals.rbegin()->first
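
For context, the pagination pattern under review can be sketched against a plain `std::map` standing in for the object's omap. Everything here is hypothetical (`get_vals`, `list_image_map`, the `image_map_` prefix, and the tiny read size); it is not the cls_rbd code, just the resume-from-last-raw-key loop with the `max_read > 0` condition.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>

static const std::string KEY_PREFIX("image_map_");  // hypothetical prefix
static const uint64_t MAX_KEYS_READ = 2;            // small on purpose

// Stand-in for an omap read: up to max entries strictly after start_after.
inline std::map<std::string, std::string> get_vals(
    const std::map<std::string, std::string> &omap,
    const std::string &start_after, uint64_t max) {
  std::map<std::string, std::string> vals;
  for (auto it = omap.upper_bound(start_after);
       it != omap.end() && vals.size() < max; ++it) {
    vals.insert(*it);
  }
  return vals;
}

// Lists up to max_return mappings whose id sorts after start_after.
inline std::map<std::string, std::string> list_image_map(
    const std::map<std::string, std::string> &omap,
    const std::string &start_after, uint64_t max_return) {
  std::map<std::string, std::string> image_mapping;
  std::string last_read = KEY_PREFIX + start_after;
  uint64_t max_read = std::min(MAX_KEYS_READ, max_return);

  while (max_read > 0) {
    auto vals = get_vals(omap, last_read, max_read);
    for (const auto &kv : vals) {
      if (kv.first.compare(0, KEY_PREFIX.size(), KEY_PREFIX) != 0) {
        return image_mapping;  // walked past the prefixed key range
      }
      image_mapping[kv.first.substr(KEY_PREFIX.size())] = kv.second;
    }
    if (vals.size() < max_read) break;  // short read: nothing left
    // resume from the last raw key returned by the read
    last_read = vals.rbegin()->first;
    max_read = std::min<uint64_t>(MAX_KEYS_READ,
                                  max_return - image_mapping.size());
  }
  return image_mapping;
}
```

Taking the resume key straight from the raw read avoids rebuilding it from the decoded result map.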

dillaman · 2017-07-19T02:32:22Z

src/cls/rbd/cls_rbd.cc

+int mirror_image_map_list(cls_method_context_t hctx,
+                          const std::string &start_after,
+                          uint64_t max_return,
+                          std::map<std::string, cls::rbd::MirrorImageMap> *image_map) {


Nit: image_map -> image_maps or image_mapping throughout

dillaman · 2017-07-19T02:32:38Z

src/cls/rbd/cls_rbd.cc

+      const std::string& global_image_id =
+        it->first.substr(MIRROR_IMAGE_MAP_KEY_PREFIX.size());
+
+      cls::rbd::MirrorImageMap imagemap;


Nit: image_map

dillaman · 2017-07-19T02:34:35Z

src/cls/rbd/cls_rbd.cc

+
+/**
+ * Input:
+ * @param global_id: image global id


Nit: out-of-date

dillaman · 2017-07-19T02:36:18Z

src/cls/rbd/cls_rbd.cc

+ */
+int mirror_image_map_update(cls_method_context_t hctx, bufferlist *in,
+                            bufferlist *out) {
+  std::map<std::string, std::string> map_update;


Revert to the original version of this method (and mirror_image_map_remove) -- multiple updates can be packed into a single librados operation. Definitely do not just pass a string for the data since that means we could never improve it (i.e. add stats for load balancing)

Yeh, I have the client data included for the next push.

@dillaman : I've kept the client data as boost::variant for flexibility in case we want to add/remove metadata fields that are used for image distribution (or have an entirely different set of variables which a policy would use).

vshankar · 2017-08-09T15:22:41Z

@dillaman

fix comments (from #15788)

class methods (list/update/remove) now accept <std::string, cls::rbd::MirrorImageMap>
do away with RemoveRequest (now folded in UpdateRequest)
UpdateRequest updates on-disk image map in batches (update+remove in single rados call)
minor fix in mirror_image_map_list (as per comment) and naming nits.

dillaman · 2017-08-09T16:16:18Z

src/cls/rbd/cls_rbd.cc

+
+/**
+ * Input:
+ * @param std::map<std::string, cls::rbd::MirrorImageMap>: image mapping


I am still thinking that these two methods (set and remove) should just take two parameters (global image id and cls::rbd::MirrorImageMap) instead of a map. The caller of these methods can just execute the method multiple times within the same operation. In terms of OSD complexity, you have a little extra overhead for multiple exec calls in the operation, but I think it will simplify the update state machine and it mimics the behavior of all other RBD cls methods.

I'm actually ok with both. If the overhead of back-to-back exec calls (1024 max at a time) is not much, then let's choose that just to keep things in sync w/ the rest of the cls rbd codebase?

I can update this by tomorrow if we finalize this interface.

I'd say switch the interface (and issue 256 updates at a time)
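
A sketch of the chunking this implies: one exec call per entry, at most 256 entries per operation. The names `Updates` and `split_into_ops` are hypothetical, and the mapping value is simplified to an instance id string rather than a cls::rbd::MirrorImageMap; the actual librados operation is not shown.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One entry per cls call: global image id -> instance id (a hypothetical
// simplification of cls::rbd::MirrorImageMap).
using Updates = std::map<std::string, std::string>;

static const std::size_t MAX_UPDATES_PER_OP = 256;  // figure from the thread

// Splits pending updates into groups; each group would be sent as a single
// operation containing one exec call per entry.
inline std::vector<Updates> split_into_ops(
    const Updates &updates, std::size_t max_per_op = MAX_UPDATES_PER_OP) {
  std::vector<Updates> ops;
  for (const auto &entry : updates) {
    if (ops.empty() || ops.back().size() >= max_per_op) {
      ops.emplace_back();  // start a new operation
    }
    ops.back().insert(entry);
  }
  return ops;
}
```

For 600 pending updates this yields three operations, which the update state machine could send one after another.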

dillaman · 2017-08-17T17:01:43Z

src/cls/rbd/cls_rbd.cc

+/**
+ * Input:
+ * @param global_image_id: global image id
+ * @param image_mapping: mirror image map (empty)


Why does the remove method take a (empty) cls::rbd::MirrorImageMap?

I inferred that based on your earlier comments (maybe for future use while purging on-disk map?) on keeping the class method APIs consistent w.r.t. params (#15788 (comment)).

Maybe I read between the lines too much -- I'll make remove accept just a global image id.

Sorry -- yes, I was just trying to say pass one at a time instead of a collection within a map.

dillaman · 2017-08-17T17:04:21Z

src/tools/rbd_mirror/image_map/UpdateRequest.h

+  // global image ids to purge.
+  static UpdateRequest *create(librados::IoCtx &ioctx,
+                               std::map<std::string, cls::rbd::MirrorImageMap> &&update_mapping,
+                               std::set<std::string> &&global_image_ids, Context *on_finish) {


Nit: can we rename global_image_ids to something like remove_global_image_ids throughout?

dillaman · 2017-08-17T17:06:13Z

src/cls/rbd/cls_rbd.cc

+/**
+ * Input:
+ * @param global_image_id: global image id
+ * @param image_mapping: mirror image map


Nit: image_map when it's just the MirrorImageMap structure, image_mapping when it's a map of global image ids -> MirrorImageMap structures (for consistency).

Signed-off-by: Venky Shankar <vshankar@redhat.com>

vshankar · 2017-08-18T17:05:17Z

updated and rebased

dillaman

lgtm

vshankar added the rbd label Jun 14, 2017

vshankar requested review from trociny and dillaman June 15, 2017 03:57

vshankar force-pushed the mirror-ha-policy branch from 298424d to bb16751 Compare June 15, 2017 06:41

trociny added the feature label Jun 15, 2017

trociny reviewed Jun 16, 2017

dillaman reviewed Jun 16, 2017

vshankar force-pushed the mirror-ha-policy branch from bb16751 to a4767fa Compare June 20, 2017 18:12

vshankar mentioned this pull request Jun 20, 2017

rbd-mirror A/A: track images in policy map #15788

Merged

vshankar force-pushed the mirror-ha-policy branch 3 times, most recently from 18b7cc6 to 6789fa4 Compare June 29, 2017 15:10

dillaman reviewed Jun 29, 2017

vshankar force-pushed the mirror-ha-policy branch 2 times, most recently from a27f150 to a71242e Compare July 18, 2017 15:20

vshankar changed the title ~~rbd-mirror A/A: introduce basic image mapping policy~~ [WIP] rbd-mirror A/A: introduce basic image mapping policy Jul 18, 2017

dillaman reviewed Jul 19, 2017

vshankar force-pushed the mirror-ha-policy branch from a71242e to 5ca58ed Compare July 21, 2017 14:57

vshankar force-pushed the mirror-ha-policy branch from 5ca58ed to 2badd7a Compare July 31, 2017 09:51

vshankar force-pushed the mirror-ha-policy branch from 2badd7a to b5e988e Compare August 9, 2017 14:39

dillaman reviewed Aug 9, 2017

vshankar force-pushed the mirror-ha-policy branch from b5e988e to e42a487 Compare August 10, 2017 07:15

dillaman reviewed Aug 17, 2017

dillaman changed the title ~~[WIP] rbd-mirror A/A: introduce basic image mapping policy~~ rbd-mirror A/A: introduce basic image mapping policy Aug 17, 2017

vshankar added 2 commits August 18, 2017 08:49

rbd-mirror: track on-disk image to instance map

64fa2c1

Signed-off-by: Venky Shankar <vshankar@redhat.com>

rbd-mirror: load/update image map state machine

8647b37

Signed-off-by: Venky Shankar <vshankar@redhat.com>

vshankar force-pushed the mirror-ha-policy branch from e42a487 to 8647b37 Compare August 18, 2017 13:45

dillaman approved these changes Aug 18, 2017

dillaman merged commit 9d7ff92 into ceph:master Aug 18, 2017

vshankar deleted the mirror-ha-policy branch August 21, 2017 03:29
