New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
crush: weight_set and id remapping #14486
Conversation
@liewegas here is a proposal for a minimalist integration of the weight_set in Ceph. For buckets where the weight_set is not empty, the functions CrushWrapper::{insert_item,move_bucket,update_item,remove_item,remove_item_under} set the weight set and the target weight of new items to zero, and they do not adjust the weights of their parents. This allows the user to maintain the weights via the crushmap using the "choose_args" keyword::
Whenever a choose_args is added to CrushWrapper (by compiling or decoding the crushmap), the buckets for which there exists at least one non empty weight_set are marked as such so that CrushWrapper::{insert_item,move_bucket,update_item,remove_item,remove_item_under} know to treat them differently. The CrushWrapper::do_rule function takes an additional argument which is used as an index to find the I'm moderately happy about this proposal but that's what I have so far ;-) |
I like this. It's simple. Basically weight changes just don't have any
effect until whatever created the weight_set in the first place goes and
runs again.
Where do you want to store the choose_args/weight_set? As part of the map
proper at the end of CrushWrapper::{encode,decode}? Or as a separate type
that the OSDMap manages/stores separately?
Will it be decompiled/compiled as part of the crush map (with a syntax
like that below)?
|
yes
yes What bothers me is that I can't quite imagine how it will evolve beyond this minimalist proposal. But it's probably a problem for later :-) |
Hmm, what happens if you change the structure of the tree? E.g., move a
host to a different rack? The weight_set isn't expressive enough to
reflect a different size of the bucket, so it has to remove it form
the old location, but it'll be 0'ed out in the new location.
|
when adding/removing items in a bucket, each choose_args[-1-bucket_id] must also be modified to remove the corresponding item or add an item with weight zero. This ensure the items count for each choose_args[-1-bucket_id] matches the items count of the map->bucket[-1-bucket_id] in the crushmap. In other words they must be kept in sync at all times. This should not be too much of a problem since adding / removing items/buckets is done via the CrushWrapper API. Unless there is something I'm not seeing ? |
yeah, that makes sense. it just means that if you move a bucket between 2
buckets (or an osd to another bucket) that it will be effectively removed
from the old location but not added to the new one (until someone
recalculates the weight_set values).
|
Yes. From a practical point of view this is not a useful behavior. Maybe it would be better to just forbid adding/removing items/buckets via the CrushWrapper API when there are weight_sets. The user has to add/remove items/buckets by decompiling/modifying/compiling the crushmap and is responsible for maintaining the weight_set accordingly. |
The compiler part is implemented and the syntax is slightly simpler than what was originally proposed 8438556#diff-6c59ef5804a9ab71015e059395d84feeR107 . Please let me know if you see something odd ;-) |
@liewegas updated the implementation with compile/decompile/encode/decode and a crushtool based test, moving on to disabling crush map updates via the CrushWrapper api when weight sets are present. |
@liewegas I think it is ready for review :-) |
teuthology-suite -k distro --priority 1000 --suite rados --subset $(expr $RANDOM % 200)/200 --email loic@dachary.org --ceph wip-multiweight --machine-type smithi
Re-running the failed tests They consistently fail but are unrelated to this pull request |
src/crush/CrushWrapper.cc
Outdated
@@ -1540,6 +1572,38 @@ void CrushWrapper::decode(bufferlist::iterator& blp) | |||
class_rname[c.second] = c.first; | |||
::decode(class_bucket, blp); | |||
cleanup_classes(); | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a } if (!blp.end()) { here? i have the class code deployed on a test cluster.
[ 1.000 2.000 5.000 ] | ||
[ 3.000 2.000 5.000 ] | ||
] | ||
ids -20 -30 -25 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weight_set uses [ ] for the lists.. maybe ids should too?
ids [ -20 -30 -25 ]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit on syntax, but otherwise looks good!
It is bugous and unsupported. Signed-off-by: Loic Dachary <loic@dachary.org>
Signed-off-by: Loic Dachary <loic@dachary.org>
bucket_straw2_choose needs to use weights that may be different from weight_items. For instance to compensate for an uneven distribution caused by a low number of values. Or to fix the probability biais introduced by conditional probabilities (see http://tracker.ceph.com/issues/15653 for more information). We introduce a weight_set for each straw2 bucket to set the desired weight for a given item at a given position. The weight of a given item when picking the first replica (first position) may be different from the weight the second replica (second position). For instance the weight matrix for a given bucket containing items 3, 7 and 13 could be as follows: position 0 position 1 item 3 0x10000 0x100000 item 7 0x40000 0x10000 item 13 0x40000 0x10000 When crush_do_rule picks the first of two replicas (position 0), item 7, 3 are four times more likely to be choosen by bucket_straw2_choose than item 13. When choosing the second replica (position 1), item 3 is ten times more likely to be choosen than item 7, 13. By default the weight_set of each bucket exactly matches the content of item_weights for each position to ensure backward compatibility. bucket_straw2_choose compares items by using their id. The same ids are also used to index buckets and they must be unique. For each item in a bucket an array of ids can be provided for placement purposes and they are used instead of the ids. If no replacement ids are provided, the legacy behavior is preserved. Signed-off-by: Loic Dachary <loic@dachary.org>
If there is no crush_choose_arg_map for a given pool (the default) a NULL pointer is given instead and crush_do_rule behavior remains unchanged. Signed-off-by: Loic Dachary <loic@dachary.org>
A map of crush_choose_arg_map is added to the crushmap text syntax. The key is an integer matching a pool number. Signed-off-by: Loic Dachary <loic@dachary.org>
Adding, removing or move items / buckets via the CrushWrapper API when choose_args is not empty is unlikely to produce the desired outcome. The caller should instead add, remove or move items / buckets in a decompiled crushmap, update the associated choose_arg and upload the new crushmap. Signed-off-by: Loic Dachary <loic@dachary.org>
Signed-off-by: Loic Dachary <loic@dachary.org>
Looks good! Maybe we should deprecate the tree buckets more explicitly somehow so they can be removed.. maybe a bool CrushWrapper::has_tree_buckets(), and a health warning in luminous if they're in use to warn that support will be removed post-luminous? |
good idea created http://tracker.ceph.com/issues/19654 to not forget about it. |
teuthology-suite -k distro --priority 1000 --suite rados --subset $(expr $RANDOM % 200)/200 --email loic@dachary.org --ceph wip-multiweight --machine-type smithi
Re-running the failed tests |
No description provided.