osd/OSDMap: improve upmap calculation #14902
Conversation
This eventually needs to get fixed. Opened http://tracker.ceph.com/issues/19818 Signed-off-by: Sage Weil <sage@redhat.com>
Previously we relied on identifying target osds of interest by seeing which ones were touched by at least one PG. We also assumed that their target weight was their crush weight, which might not be the case if multiple pools using different rules were in play. Address this by using get_rule_weight_osd_map. This isn't perfect as the PGs might be different "sizes," so one should still only calculate upmap for multiple pools when the pools have "equivlanet" PGs. Signed-off-by: Sage Weil <sage@redhat.com>
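A rough sketch of the idea (not the actual patch; the surrounding variables pools, osdmap, and crush are assumed context, while get_rule_weight_osd_map is the CrushWrapper helper the commit names): sum the per-rule weight maps over the pools of interest instead of reading each OSD's raw CRUSH weight.
// Hypothetical sketch: build per-OSD target weights by summing the
// weight map of each pool's crush rule, rather than assuming each
// OSD's target weight equals its crush weight.
std::map<int, float> target_weight;   // osd id -> aggregate target weight
for (int64_t pool_id : pools) {       // 'pools' is an assumed input
  const pg_pool_t *pool = osdmap.get_pg_pool(pool_id);
  std::map<int, float> rule_weights;
  crush->get_rule_weight_osd_map(pool->get_crush_rule(), &rule_weights);
  for (auto& p : rule_weights)
    target_weight[p.first] += p.second;
}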
Use deviation for this item, not the max deviation. Signed-off-by: Sage Weil <sage@redhat.com>
If we are less than a full PG over the target, we are not overfull. Signed-off-by: Sage Weil <sage@redhat.com>
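In code terms the fix amounts to something like this (a minimal sketch; pg_count and target are hypothetical names for the per-OSD PG count and target):
// Minimal sketch: an OSD only counts as overfull once it exceeds its
// target by at least one whole PG.
float deviation = pg_count[osd] - target[osd];
bool overfull = deviation >= 1.0;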
The argument is a ratio, not in units of PGs (like the other deviation values). Signed-off-by: Sage Weil <sage@redhat.com>
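A sketch of the distinction (hypothetical names; the real argument is the max-deviation parameter of calc_pg_upmaps): a ratio has to be scaled by the per-OSD target before it can be compared against deviations measured in PGs.
// Hypothetical sketch: scale the ratio into PG units for each OSD
// before comparing it against the PG-count deviation.
float max_dev_pgs = max_deviation_ratio * target[osd];
if (deviation > max_dev_pgs) {
  // this osd is a candidate for remapping
}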
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
The previous code could only swap overfull devices with underfull devices if they were in the same bucket. With this change, we can remap across buckets. For example, if host A has more PGs than host B, then we'll remap some PGs from devices in host A to devices in host B. Signed-off-by: Sage Weil <sage@redhat.com>
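Conceptually (a sketch, not the actual diff; underfull_items_of_type and try_swap are hypothetical helpers), the candidate set for a swap widens from siblings in the same bucket to any underfull item of the same CRUSH type:
// Before: candidates for replacing an overfull item were drawn only from
// the same parent bucket. After: any underfull item of the matching type,
// anywhere in the hierarchy, may be considered.
for (int alt : underfull_items_of_type(type)) {  // cluster-wide scan
  if (try_swap(pg, overfull_item, alt))
    break;
}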
@liewegas It looks like this could end up with a PG being mapped twice to the same host, via the remapping. Or am I missing part of the logic?
@@ -3407,15 +3407,25 @@ int OSDMap::calc_pg_upmaps(
  CephContext *cct,
s/"equivlanet"/"equivalentt"/ in the commit message
// see if alt has the same parent
if (j == 0 ||
    get_parent_of_type(o[pos], stack[j-1].first) ==
    get_parent_of_type(alt, stack[j-1].first)) {
border case if both get_parent_of_type calls return 0
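The concern, sketched as a possible guard (hypothetical; CRUSH bucket ids are negative and get_parent_of_type yields 0 when no parent of the requested type exists, so two unrelated items could both return 0 and pass the equality test):
int p1 = get_parent_of_type(o[pos], stack[j-1].first);
int p2 = get_parent_of_type(alt, stack[j-1].first);
if (j == 0 || (p1 == p2 && p1 != 0)) {
  // only treat the items as siblings when a real common parent was found
}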
@liewegas It looks like this could end up with a PG being mapped twice to the same host, via the remapping. Or am I missing part of the logic? (Maybe you replied already, as I posted the same question before; I don't remember reading the response though.)
For each type in the "stack" we are building a vector of distinct items (just like CRUSH would). Those items still have to be unique. In _choose_type_stack we try to swap out overfull items (leaves as before, now also buckets) for underfull items, but those swaps still check to make sure the items in the vector are unique. So if the stack is something like [(host,3),(device,1)] we'll still get three unique hosts; it just might be a different host (and device).
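A sketch of that uniqueness invariant (hypothetical helper, not the actual _choose_type_stack code): a candidate is rejected if it already occupies another slot of the vector being built for that type, so a stack of [(host,3),(device,1)] still produces three distinct hosts.
#include <vector>
// Hypothetical sketch: reject a swap that would duplicate an item in the
// vector of items chosen for this type.
bool can_swap(const std::vector<int>& chosen, size_t slot, int candidate) {
  for (size_t i = 0; i < chosen.size(); ++i)
    if (i != slot && chosen[i] == candidate)
      return false;  // candidate already used elsewhere in this vector
  return true;
}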
I missed something indeed, thanks for explaining.
// FIXME: if there are multiple takes that place a different number of
// objects we do not take that into account. (Also, note that doing this
// right is also a function of the pool, since the crush rule
// might choose 2 + choose 2 but pool size may only be 3.)
side note: the border case of weights {5, 1, 1, 1, 1} should be handled in this function by reducing the weight 5 to 4
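One way to express that suggestion (a hypothetical helper for the 2-replica case, which is where 5 caps to 4): a device can hold at most one of a PG's two replicas, so its share of placements cannot exceed one half; capping each weight at the sum of the remaining weights turns {5, 1, 1, 1, 1} into {4, 1, 1, 1, 1} and keeps the computed targets achievable.
#include <map>
// Hypothetical sketch: cap any weight that exceeds the sum of the others.
void cap_weights(std::map<int, float>& weight) {
  float total = 0;
  for (auto& p : weight)
    total += p.second;
  for (auto& p : weight) {
    float others = total - p.second;
    if (p.second > others)
      p.second = others;  // e.g. 5 -> 4 when the other weights sum to 4
  }
}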
The logic looks good. It would benefit from more unit testing, and from splitting the bigger functions into smaller equivalents that are easier to test and understand.
The current version does not remap across hosts or racks (or other intervening CRUSH bucket types); it only moves PGs within the same lowest-level bucket referenced by the rule. This change fixes that.
On a test map (lab cluster) I'm able to get all devices to +/- 1 PG.
Also fix a bug or two.