Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rbd-mirror A/A: introduce basic image mapping policy #15691

Merged
merged 2 commits into from Aug 18, 2017

Conversation

vshankar
Copy link
Contributor

NOTE: this pr was carved out a branch (https://github.com/vshankar/ceph/commits/rbd-mirror-image-distribution) which was intended to be pr itself which distributes images amongst mirror instances based on a policy. The actual logic to map/remap images is still a part of the branch which as of now conflicts master.

Introduce a simple image mapping policy (with lookup/map/shuffle semantics), state machines to save/load image to instance mapping states to/from on-disk and notify mirror instances to acquire/release a given image.

Part of tracker: http://tracker.ceph.com/issues/18786

@vshankar
Copy link
Contributor Author

@trociny @dillaman I'm folding the map/remap state machine into one state machine as most of it is common (I just had it that way initially).

Also, I think all of the UNAMPPING, MAPPING and MAPPED might not be required. We can just have a single MAPPED state as we stop the replayer (on one instance) and start again (on another). If a replayer wasn't running, that would be a no-op.

Copy link
Contributor

@trociny trociny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very interested in seeing @dillaman's comments, but I would like to see better abstraction for Policy class.

I think the Policy don't need to know about InstanceMap::InMap, and ImageSpec. The map that is passed to the Policy could be an abstract structure like BitVector we use when operate with image object map.

Also, note, in general case, depending on policy, adding an item (image) to map may cause rebalance of other items. The same is for removal. So any method that changes the map should return a new map (and probablyj map diff).

I think we could have just one method like this:

remap(changes, current_map, *new_map, *diff)

Or several methods, depending how it willl be used, like these:

update_instances(instances_to_add, instances_to_remove, current_map, *new_map, *diff)
update_images(images_to_add, images_to_remove, current_map, *new_map, *diff)

@trociny
Copy link
Contributor

trociny commented Jun 16, 2017

@vshankar I also had some comments to your code but it looks they were lost (did not added) after I submitted review. Anyway I don't think there is much value in them until we discuss the interface.

@dillaman
Copy link

dillaman commented Jun 16, 2017

In general, the policy interface should support a batch update interface. The mapping should be fully encapsulated by the policy. The PoolWatcher::handle_update method should pass the data it receives to the policy so that the policy can batch update. Since policy updates cannot occur immediately (i.e. the policy needs to commit its intentions to disk before communicating them back to the PoolWatcher), the policy should have a simple listener interface with a collection of images to unmap and map (i.e. collections of instance id and global image id).

The policy should be loaded from disk after the instance is promoted to leader. The policy should know which peers exists so that when it receives the initial image update for a peer, it can properly prune images when zero peers (and local pool) contain the image. That per-image peer state doesn't need to be persisted to disk since it can be rebuilt dynamically upon startup. The policy should wait for the initial image list from all registered peers (and local pool) before making any changes.

Eventually when we add support for IO stats to be communicated from rbd-mirror instances back to the leader, those stats can be forwarded to a more advanced policy and again the listener callback would be used to update the assignments.

@@ -364,6 +364,35 @@ struct TrashImageSpec {
};
WRITE_CLASS_ENCODER(TrashImageSpec);

enum ImageMapState {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Mirror prefix on all this stuff


std::string instance_id;
ImageMapState state;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add MirrorPolicyData struct here (see librbd::journal::ClientData) so that future policies can store extra metadata (i.e. future policy might track IOPS / throughput for distribution).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add this in the next update.

@@ -3097,6 +3097,7 @@ static const std::string IMAGE_KEY_PREFIX("image_");
static const std::string GLOBAL_KEY_PREFIX("global_");
static const std::string STATUS_GLOBAL_KEY_PREFIX("status_global_");
static const std::string INSTANCE_KEY_PREFIX("instance_");
static const std::string IMAGE_MAP_KEY_PREFIX("map_image_");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: need "mirror_" prefix on all these new things

@@ -4393,6 +4514,118 @@ int mirror_instances_remove(cls_method_context_t hctx, bufferlist *in,
}

/**
* Input:
* none

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: it does have input


/**
* Input:
* @param cls::rbd::ImageMap: image map

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: also has global image id

* @param std::map<std::string, cls::rbd::ImageMap>: image map
* @returns 0 on success, negative error code on failure
*/
int image_map_get(cls_method_context_t hctx, bufferlist *in,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we would need an individual "getter"

@@ -33,7 +33,14 @@ set(rbd_mirror_internal
image_sync/SnapshotCreateRequest.cc
image_sync/SyncPointCreateRequest.cc
image_sync/SyncPointPruneRequest.cc
pool_watcher/RefreshImagesRequest.cc)
pool_watcher/RefreshImagesRequest.cc
image_map/Types.cc

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: alphabetical order in the list

void LoadRequest<I>::send() {
dout(20) << dendl;

// there needs to be atleast one instance (ourself)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: spelling

static LoadRequest *create(librados::IoCtx &ioctx,
InMap *inmap, Context *on_finish) {
return new LoadRequest(
ioctx, inmap, on_finish);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: looks like it will fit on previous line

void Policy<I>::update_images(const std::string &mirror_uuid,
ImageIds &&added_image_ids,
ImageIds &&removed_image_ids,
MapSpecRef mapspec, Context *on_finish) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you will have a listener callback, the PoolReplayer really shouldn't pass in nor care about a MapSpecRef nor on_finish. This should just chug along in the background. The listener callback from the policy could provide std::map<instance id, std::set<ImageId>> params for add / remove events and therefore no worry about inter-class locking and we hide image_map types from the layer above (plus we need to have a timestamp for each image so that we don't ping-pong images back-and-forth between instances during rebalance).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't added the shuffle timestamp yet in the update -- will have that along with additional tests in the next update.

update_images_added(mirror_uuid, std::move(added_image_ids),
&mapspec->added, gather_ctx->new_sub());
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might need to prune the in-memory image mapping if after adding/removing a given image has zero registered peers. You cannot rely on the "remove" event since at start-up, the first event will only be "add" events but it would be a complete state dump. Might be easier to track if the PoolWatcher had two different event callbacks for handle_refresh and handle_update so that when you get a refresh, you just treat that as the current state of the world. For this PR, I would just ensure that you have two public API methods (update_images and set_images) and we will fix the watcher later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was intended to be called via PoolWatcher so it was safe I guess since upon refreshing local and remote pool we would know what images in the local pool needs to be removed. This would not be true if its invoked via PoolReplayer.

images->emplace(image_id.id, image_id.global_id, instance_id);

auto rm = m_info[image_id.global_id].peer_uuid.erase(mirror_uuid);
assert(rm == 1);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: not sure that is safe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was intended to be called from PoolWatcher.

@@ -0,0 +1,133 @@
// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combine RemoveRequest with UpdateRequest that can batch multiple remote/update operations w/ a single op to the OSD.

// in place to_instance_id would point to the
// instance id of the mirror daemon where the
// image needs to be reassigned (from currently
// instance_id).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I would add a state to this so you can track (in-memory) your current and next actions in one place instead of having multiple lists of things to map / unmap / waiting for instance response / etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm ok. That would need to be done for each image (current action and a set of next actions to be performed). This would need to be a part of the ImageInfo structure and not this one I guess. Thinking more about this (Image) structure, it seems that we may not need at all -- we can maintain a list of Image ids (ImageIds) which the timer thread would process by looking the current action to be performed (per image).

std::ostream &operator<<(std::ostream &, const ImageInfo &info);

typedef std::set<std::string> GlobalImageIds;
typedef std::map<std::string, GlobalImageIds> InMap;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a real irrational hatred of this name. I think this essentially should be a private typedef of Policy since the LoadRequest only really needs to pass a map of global image ids -> instance ids (just like the combined UpdateRequest would just take a map of global image ids -> instance ids for update / removal (empty instance id could reflect removal).


std::string instance_id = lookup(image_id.global_id);
if (instance_id == UNMAPPED_INSTANCE_ID) {
instance_id = map(image_id.global_id);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This policy should really be asynchronous based on a periodic timer callback. When the policy sees a bunch of changes, it can update the in-memory state and set a timer for the map/unmap in-bulk. The timer periodic can be kept short (i.e. 1 second), but when we come online and have 3000 images, we don't want to do one-off mappings.

@dillaman
Copy link

Perhaps for the short term Policy and SimplePolicy can just be combined if that would reduce the development timeline (since we are out of time)? We can refactor it later.

@vshankar
Copy link
Contributor Author

@dillaman Wanted to check of one thing before I update the PR: would it make sense to have the local image notifications map to just the local instance instead of getting assigned to peer instances? This I think would make some things easier -- on-disk state needs to be maintained (and the purge logic) only for remote images (non empty mirror uuids). Thoughts?

@vshankar vshankar force-pushed the mirror-ha-policy branch 2 times, most recently from a27f150 to a71242e Compare July 18, 2017 15:20
@vshankar vshankar changed the title rbd-mirror A/A: introduce basic image mapping policy [WIP] rbd-mirror A/A: introduce basic image mapping policy Jul 18, 2017
int max_read = MIN(RBD_MAX_KEYS_READ, max_return);
std::string last_read = mirror_image_map_key(start_after);

while (max_read) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: max_read > 0


max_read = MIN(RBD_MAX_KEYS_READ, max_return - image_map->size());
if (!vals.empty()) {
last_read = mirror_image_map_key(image_map->rbegin()->first);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last_read = vals.rbegin()->first

int mirror_image_map_list(cls_method_context_t hctx,
const std::string &start_after,
uint64_t max_return,
std::map<std::string, cls::rbd::MirrorImageMap> *image_map) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: image_map -> image_maps or image_mapping throughout

const std::string& global_image_id =
it->first.substr(MIRROR_IMAGE_MAP_KEY_PREFIX.size());

cls::rbd::MirrorImageMap imagemap;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: image_map


/**
* Input:
* @param global_id: image global id

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: out-of-date

*/
int mirror_image_map_update(cls_method_context_t hctx, bufferlist *in,
bufferlist *out) {
std::map<std::string, std::string> map_update;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert to the original version of this method (and mirror_image_map_remove) -- multiple updates can be packed into a single librados operation. Definitely do not just pass a string for the data since that means we could never improve it (i.e. add stats for load balancing)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeh, I have the client data included for the next push.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dillaman : I've kept the client data as boost::variant for flexibility in case we want to add/remove metadata fields that are used for image distribution (or have an entirely different set of variables which a policy would use).

@vshankar
Copy link
Contributor Author

vshankar commented Aug 9, 2017

@dillaman

fix comments (from #15788)

  • class methods (list/update/remove) now accept <std::string, cls::rbd::MirrorImageMap>
  • do away with RemoveRequest (now folded in UpdateRequest)
  • UpdateRequest updates on-disk image map in batches (update+remove in single rados call)
  • minor fix in mirror_image_map_list (as per comment) and naming nits.


/**
* Input:
* @param std::map<std::string, cls::rbd::MirrorImageMap>: image mapping
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still thinking that these two methods (set and remove) should just take two parameters (global image id and cls::rbd::MirrorImageMap) instead of a map. The caller of these methods can just execute the method multiple times within the same operation. In terms of OSD complexity, you have a little extra overhead for multiple exec calls in the operation, but I think it will simplify the update state machine and it mimics the behavior of all other RBD cls methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually ok with both. If the overhead of back-to-back exec (1024 max at a time) calls is not much, then lets choose that just to keep things in sync w/ rest of the cls rbd codebase?

I can update this by tomorrow if we finalize this interface.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say switch the interface (and issue 256 updates at a time)

/**
* Input:
* @param global_image_id: global image id
* @param image_mapping: mirror image map (empty)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the remove method take a (empty) cls::rbd::MirrorImageMap?

Copy link
Contributor Author

@vshankar vshankar Aug 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I inferred that based on your earlier comments (maybe for future use while purging on-disk map?) on keeping the class method APIs consistent w.r.t. params (#15788 (comment)).

Maybe I read between the lines too much -- I'll make remove just accept just a global image id.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry -- yes, I was just trying to say pass one at a time instead of a collection within a map.

// global image ids to purge.
static UpdateRequest *create(librados::IoCtx &ioctx,
std::map<std::string, cls::rbd::MirrorImageMap> &&update_mapping,
std::set<std::string> &&global_image_ids, Context *on_finish) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we rename global_image_ids to something like remove_global_image_ids throughout?

/**
* Input:
* @param global_image_id: global image id
* @param image_mapping: mirror image map

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: image_map when it's just the MirrorImageMap structure, image_mapping when it's a map of global image ids -> MirrorImageMap structures (for consistency).

@dillaman dillaman changed the title [WIP] rbd-mirror A/A: introduce basic image mapping policy rbd-mirror A/A: introduce basic image mapping policy Aug 17, 2017
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
@vshankar
Copy link
Contributor Author

updated and rebased

Copy link

@dillaman dillaman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@dillaman dillaman merged commit 9d7ff92 into ceph:master Aug 18, 2017
@vshankar vshankar deleted the mirror-ha-policy branch August 21, 2017 03:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants