New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rbd-mirror A/A: introduce basic image mapping policy #15691
Conversation
298424d
to
bb16751
Compare
@trociny @dillaman I'm folding the map/remap state machine into one state machine as most of it is common (I just had it that way initially). Also, I think all of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am very interested in seeing @dillaman's comments, but I would like to see better abstraction for Policy class.
I think the Policy don't need to know about InstanceMap::InMap
, and ImageSpec
. The map that is passed to the Policy could be an abstract structure like BitVector we use when operate with image object map.
Also, note, in general case, depending on policy, adding an item (image) to map may cause rebalance of other items. The same is for removal. So any method that changes the map should return a new map (and probablyj map diff).
I think we could have just one method like this:
remap(changes, current_map, *new_map, *diff)
Or several methods, depending how it willl be used, like these:
update_instances(instances_to_add, instances_to_remove, current_map, *new_map, *diff)
update_images(images_to_add, images_to_remove, current_map, *new_map, *diff)
@vshankar I also had some comments to your code but it looks they were lost (did not added) after I submitted review. Anyway I don't think there is much value in them until we discuss the interface. |
In general, the policy interface should support a batch update interface. The mapping should be fully encapsulated by the policy. The The policy should be loaded from disk after the instance is promoted to leader. The policy should know which peers exists so that when it receives the initial image update for a peer, it can properly prune images when zero peers (and local pool) contain the image. That per-image peer state doesn't need to be persisted to disk since it can be rebuilt dynamically upon startup. The policy should wait for the initial image list from all registered peers (and local pool) before making any changes. Eventually when we add support for IO stats to be communicated from rbd-mirror instances back to the leader, those stats can be forwarded to a more advanced policy and again the listener callback would be used to update the assignments. |
src/cls/rbd/cls_rbd_types.h
Outdated
@@ -364,6 +364,35 @@ struct TrashImageSpec { | |||
}; | |||
WRITE_CLASS_ENCODER(TrashImageSpec); | |||
|
|||
enum ImageMapState { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Mirror
prefix on all this stuff
src/cls/rbd/cls_rbd_types.h
Outdated
|
||
std::string instance_id; | ||
ImageMapState state; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add MirrorPolicyData
struct here (see librbd::journal::ClientData
) so that future policies can store extra metadata (i.e. future policy might track IOPS / throughput for distribution).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add this in the next update.
src/cls/rbd/cls_rbd.cc
Outdated
@@ -3097,6 +3097,7 @@ static const std::string IMAGE_KEY_PREFIX("image_"); | |||
static const std::string GLOBAL_KEY_PREFIX("global_"); | |||
static const std::string STATUS_GLOBAL_KEY_PREFIX("status_global_"); | |||
static const std::string INSTANCE_KEY_PREFIX("instance_"); | |||
static const std::string IMAGE_MAP_KEY_PREFIX("map_image_"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: need "mirror_" prefix on all these new things
src/cls/rbd/cls_rbd.cc
Outdated
@@ -4393,6 +4514,118 @@ int mirror_instances_remove(cls_method_context_t hctx, bufferlist *in, | |||
} | |||
|
|||
/** | |||
* Input: | |||
* none |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: it does have input
src/cls/rbd/cls_rbd.cc
Outdated
|
||
/** | ||
* Input: | ||
* @param cls::rbd::ImageMap: image map |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: also has global image id
src/cls/rbd/cls_rbd.cc
Outdated
* @param std::map<std::string, cls::rbd::ImageMap>: image map | ||
* @returns 0 on success, negative error code on failure | ||
*/ | ||
int image_map_get(cls_method_context_t hctx, bufferlist *in, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we would need an individual "getter"
bb16751
to
a4767fa
Compare
18b7cc6
to
6789fa4
Compare
src/tools/rbd_mirror/CMakeLists.txt
Outdated
@@ -33,7 +33,14 @@ set(rbd_mirror_internal | |||
image_sync/SnapshotCreateRequest.cc | |||
image_sync/SyncPointCreateRequest.cc | |||
image_sync/SyncPointPruneRequest.cc | |||
pool_watcher/RefreshImagesRequest.cc) | |||
pool_watcher/RefreshImagesRequest.cc | |||
image_map/Types.cc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: alphabetical order in the list
void LoadRequest<I>::send() { | ||
dout(20) << dendl; | ||
|
||
// there needs to be atleast one instance (ourself) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: spelling
static LoadRequest *create(librados::IoCtx &ioctx, | ||
InMap *inmap, Context *on_finish) { | ||
return new LoadRequest( | ||
ioctx, inmap, on_finish); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: looks like it will fit on previous line
void Policy<I>::update_images(const std::string &mirror_uuid, | ||
ImageIds &&added_image_ids, | ||
ImageIds &&removed_image_ids, | ||
MapSpecRef mapspec, Context *on_finish) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you will have a listener callback, the PoolReplayer
really shouldn't pass in nor care about a MapSpecRef
nor on_finish
. This should just chug along in the background. The listener callback from the policy could provide std::map<instance id, std::set<ImageId>>
params for add / remove events and therefore no worry about inter-class locking and we hide image_map
types from the layer above (plus we need to have a timestamp for each image so that we don't ping-pong images back-and-forth between instances during rebalance).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't added the shuffle timestamp yet in the update -- will have that along with additional tests in the next update.
update_images_added(mirror_uuid, std::move(added_image_ids), | ||
&mapspec->added, gather_ctx->new_sub()); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might need to prune the in-memory image mapping if after adding/removing a given image has zero registered peers. You cannot rely on the "remove" event since at start-up, the first event will only be "add" events but it would be a complete state dump. Might be easier to track if the PoolWatcher
had two different event callbacks for handle_refresh
and handle_update
so that when you get a refresh, you just treat that as the current state of the world. For this PR, I would just ensure that you have two public API methods (update_images
and set_images
) and we will fix the watcher later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was intended to be called via PoolWatcher
so it was safe I guess since upon refreshing local and remote pool we would know what images in the local pool needs to be removed. This would not be true if its invoked via PoolReplayer
.
images->emplace(image_id.id, image_id.global_id, instance_id); | ||
|
||
auto rm = m_info[image_id.global_id].peer_uuid.erase(mirror_uuid); | ||
assert(rm == 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: not sure that is safe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was intended to be called from PoolWatcher
.
@@ -0,0 +1,133 @@ | |||
// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Combine RemoveRequest
with UpdateRequest
that can batch multiple remote/update operations w/ a single op to the OSD.
// in place to_instance_id would point to the | ||
// instance id of the mirror daemon where the | ||
// image needs to be reassigned (from currently | ||
// instance_id). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I would add a state to this so you can track (in-memory) your current and next actions in one place instead of having multiple lists of things to map / unmap / waiting for instance response / etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Umm ok. That would need to be done for each image (current action and a set of next actions to be performed). This would need to be a part of the ImageInfo
structure and not this one I guess. Thinking more about this (Image
) structure, it seems that we may not need at all -- we can maintain a list of Image ids (ImageIds
) which the timer thread would process by looking the current action to be performed (per image).
std::ostream &operator<<(std::ostream &, const ImageInfo &info); | ||
|
||
typedef std::set<std::string> GlobalImageIds; | ||
typedef std::map<std::string, GlobalImageIds> InMap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a real irrational hatred of this name. I think this essentially should be a private typedef of Policy
since the LoadRequest
only really needs to pass a map of global image ids -> instance ids (just like the combined UpdateRequest
would just take a map of global image ids -> instance ids for update / removal (empty instance id could reflect removal).
|
||
std::string instance_id = lookup(image_id.global_id); | ||
if (instance_id == UNMAPPED_INSTANCE_ID) { | ||
instance_id = map(image_id.global_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This policy should really be asynchronous based on a periodic timer callback. When the policy sees a bunch of changes, it can update the in-memory state and set a timer for the map/unmap in-bulk. The timer periodic can be kept short (i.e. 1 second), but when we come online and have 3000 images, we don't want to do one-off mappings.
Perhaps for the short term |
@dillaman Wanted to check of one thing before I update the PR: would it make sense to have the local image notifications map to just the local instance instead of getting assigned to peer instances? This I think would make some things easier -- on-disk state needs to be maintained (and the purge logic) only for remote images (non empty mirror uuids). Thoughts? |
a27f150
to
a71242e
Compare
src/cls/rbd/cls_rbd.cc
Outdated
int max_read = MIN(RBD_MAX_KEYS_READ, max_return); | ||
std::string last_read = mirror_image_map_key(start_after); | ||
|
||
while (max_read) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: max_read > 0
src/cls/rbd/cls_rbd.cc
Outdated
|
||
max_read = MIN(RBD_MAX_KEYS_READ, max_return - image_map->size()); | ||
if (!vals.empty()) { | ||
last_read = mirror_image_map_key(image_map->rbegin()->first); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
last_read = vals.rbegin()->first
src/cls/rbd/cls_rbd.cc
Outdated
int mirror_image_map_list(cls_method_context_t hctx, | ||
const std::string &start_after, | ||
uint64_t max_return, | ||
std::map<std::string, cls::rbd::MirrorImageMap> *image_map) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: image_map
-> image_maps
or image_mapping
throughout
src/cls/rbd/cls_rbd.cc
Outdated
const std::string& global_image_id = | ||
it->first.substr(MIRROR_IMAGE_MAP_KEY_PREFIX.size()); | ||
|
||
cls::rbd::MirrorImageMap imagemap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: image_map
src/cls/rbd/cls_rbd.cc
Outdated
|
||
/** | ||
* Input: | ||
* @param global_id: image global id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: out-of-date
src/cls/rbd/cls_rbd.cc
Outdated
*/ | ||
int mirror_image_map_update(cls_method_context_t hctx, bufferlist *in, | ||
bufferlist *out) { | ||
std::map<std::string, std::string> map_update; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert to the original version of this method (and mirror_image_map_remove
) -- multiple updates can be packed into a single librados operation. Definitely do not just pass a string for the data since that means we could never improve it (i.e. add stats for load balancing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeh, I have the client data
included for the next push.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dillaman : I've kept the client data
as boost::variant
for flexibility in case we want to add/remove metadata fields that are used for image distribution (or have an entirely different set of variables which a policy would use).
a71242e
to
5ca58ed
Compare
5ca58ed
to
2badd7a
Compare
2badd7a
to
b5e988e
Compare
fix comments (from #15788)
|
src/cls/rbd/cls_rbd.cc
Outdated
|
||
/** | ||
* Input: | ||
* @param std::map<std::string, cls::rbd::MirrorImageMap>: image mapping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still thinking that these two methods (set and remove) should just take two parameters (global image id and cls::rbd::MirrorImageMap) instead of a map. The caller of these methods can just execute the method multiple times within the same operation. In terms of OSD complexity, you have a little extra overhead for multiple exec calls in the operation, but I think it will simplify the update state machine and it mimics the behavior of all other RBD cls methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually ok with both. If the overhead of back-to-back exec (1024 max at a time) calls is not much, then lets choose that just to keep things in sync w/ rest of the cls rbd codebase?
I can update this by tomorrow if we finalize this interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say switch the interface (and issue 256 updates at a time)
b5e988e
to
e42a487
Compare
src/cls/rbd/cls_rbd.cc
Outdated
/** | ||
* Input: | ||
* @param global_image_id: global image id | ||
* @param image_mapping: mirror image map (empty) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does the remove method take a (empty) cls::rbd::MirrorImageMap
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I inferred that based on your earlier comments (maybe for future use while purging on-disk map?) on keeping the class method APIs consistent w.r.t. params (#15788 (comment)).
Maybe I read between the lines too much -- I'll make remove just accept just a global image id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry -- yes, I was just trying to say pass one at a time instead of a collection within a map.
// global image ids to purge. | ||
static UpdateRequest *create(librados::IoCtx &ioctx, | ||
std::map<std::string, cls::rbd::MirrorImageMap> &&update_mapping, | ||
std::set<std::string> &&global_image_ids, Context *on_finish) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can we rename global_image_ids
to something like remove_global_image_ids
throughout?
src/cls/rbd/cls_rbd.cc
Outdated
/** | ||
* Input: | ||
* @param global_image_id: global image id | ||
* @param image_mapping: mirror image map |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: image_map
when it's just the MirrorImageMap
structure, image_mapping
when it's a map of global image ids -> MirrorImageMap
structures (for consistency).
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
e42a487
to
8647b37
Compare
updated and rebased |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
NOTE: this pr was carved out a branch (https://github.com/vshankar/ceph/commits/rbd-mirror-image-distribution) which was intended to be pr itself which distributes images amongst mirror instances based on a policy. The actual logic to map/remap images is still a part of the branch which as of now conflicts master.
Introduce a simple image mapping policy (with lookup/map/shuffle semantics), state machines to save/load image to instance mapping states to/from on-disk and notify mirror instances to acquire/release a given image.
Part of tracker: http://tracker.ceph.com/issues/18786