osd/OSDMap: improve upmap calculation #14902

Merged
liewegas merged 8 commits into ceph:master from liewegas:wip-upmap on May 17, 2017

Conversation

liewegas (Member) commented May 2, 2017

The current version does not remap across hosts or racks (or other intervening
CRUSH bucket types)--it only moves PGs within the same lowest-level bucket referenced
by the rule. This change fixes that.

On a test map (lab cluster) I'm able to get all devices to +/- 1 PG.

Also fix a bug or two.
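
For context, each upmap exception produced by this calculation is a list of (from, to) OSD pairs that get substituted into the PG's CRUSH-computed mapping. A minimal standalone sketch of that substitution, assuming a hypothetical apply_upmap_items helper (not the OSDMap code itself):

#include <algorithm>
#include <utility>
#include <vector>

// Minimal sketch: apply a PG's (from, to) upmap pairs to its raw mapping.
// Hypothetical helper for illustration; the real logic lives in OSDMap.
static void apply_upmap_items(std::vector<int>& up,
                              const std::vector<std::pair<int,int>>& items)
{
  for (const auto& [from, to] : items) {
    auto it = std::find(up.begin(), up.end(), from);
    // only substitute when 'from' is present and 'to' would not duplicate
    // an OSD already in the mapping
    if (it != up.end() &&
        std::find(up.begin(), up.end(), to) == up.end()) {
      *it = to;
    }
  }
}

For example, applying the single pair (4, 11) to a raw mapping [4, 7, 9] would yield [11, 7, 9].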

liewegas added some commits May 1, 2017

crush/CrushWrapper: note about get_rule_weight_osd_map limitation
This eventually needs to get fixed.  Opened
http://tracker.ceph.com/issues/19818

Signed-off-by: Sage Weil <sage@redhat.com>
osd/OSDMap: upmap: behave when current rule does not touch all osds
Previously we relied on identifying target osds of interest by seeing
which ones were touched by at least one PG.  We also assumed that their
target weight was their crush weight, which might not be the case if
multiple pools using different rules were in play.

Address this by using the get_rule_weight_osd_map.  This isn't perfect
as the PGs might be different "sizes," so one should still only calculate
upmap for multiple pools when the pools have "equivlanet" PGs.

Signed-off-by: Sage Weil <sage@redhat.com>
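
As a rough illustration of the normalization this implies, here is a hypothetical target_shares helper operating on the (osd, weight) map that get_rule_weight_osd_map fills in; a sketch only, not the code in this PR:

#include <map>

// Turn the per-OSD weights reported for one CRUSH rule into each OSD's
// target fraction of the PG instances placed by that rule.
static std::map<int, float> target_shares(const std::map<int, float>& weights)
{
  float total = 0.0f;
  for (const auto& p : weights)
    total += p.second;
  std::map<int, float> shares;
  for (const auto& [osd, w] : weights)
    shares[osd] = (total > 0.0f) ? w / total : 0.0f;
  return shares;
}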
osd/OSDMap: upmap: fix bug
Use deviation for this item, not the max deviation.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/OSDMap: upmap: only overfull when at least one PG over
If we are less than a full PG over the target, we are not overfull.

Signed-off-by: Sage Weil <sage@redhat.com>
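
In other words, an OSD only counts as overfull once it carries at least one whole PG more than its target. A tiny sketch of that test with hypothetical names (not the PR's exact code):

// Treat an OSD as overfull only when it holds at least one whole PG more
// than its target; a fractional overshoot is not actionable.
static bool is_overfull(int actual_pgs, float target_pgs)
{
  return (actual_pgs - target_pgs) >= 1.0f;
}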
osd/OSDMap: upmap: clarify variable name
The argument is a ratio, not in units of PGs (like the other deviation
values).

Signed-off-by: Sage Weil <sage@redhat.com>
crush/CrushWrapper: add get_parent_of_type
Signed-off-by: Sage Weil <sage@redhat.com>
crush/CrushWrapper: use get_parent_of_type
Signed-off-by: Sage Weil <sage@redhat.com>
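
The idea behind get_parent_of_type is to walk up the CRUSH hierarchy from an item until a bucket of the requested type is found, returning 0 if there is none. A standalone sketch over a toy parent map (illustration only, not the CrushWrapper implementation):

#include <map>

struct Node { int parent; int type; };   // parent == 0 means "no parent"

// Walk up from 'id' and return the first ancestor whose type matches,
// or 0 if no ancestor of that type exists.
static int parent_of_type(const std::map<int, Node>& tree, int id, int type)
{
  auto it = tree.find(id);
  while (it != tree.end() && it->second.parent != 0) {
    auto pit = tree.find(it->second.parent);
    if (pit == tree.end())
      break;
    if (pit->second.type == type)
      return it->second.parent;
    it = pit;
  }
  return 0;
}

The 0-on-not-found convention is what the border-case review comment further down refers to.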
crush/CrushWrapper: remap across intervening bucket types
The previous code could only swap overfull devices with underfull devices
if they were in the same bucket.  With this change, we can map across
buckets.  For example, if host A has more PGs than host B, we'll remap
some PGs from devices in host A onto devices in host B.

Signed-off-by: Sage Weil <sage@redhat.com>

liewegas added the core label May 2, 2017

liewegas requested a review May 2, 2017

liewegas changed the title from osdmap: improve upmap calculation to osd/OSDMap: improve upmap calculation May 2, 2017

@ghost

@liewegas It looks like this could end up with a PG being mapped twice to the same host, via the remapping. Or am I missing part of the logic?

@@ -3407,15 +3407,25 @@ int OSDMap::calc_pg_upmaps(
CephContext *cct,


ghost May 2, 2017

s/"equivlanet"/"equivalentt"/ in the commit message

// see if alt has the same parent
if (j == 0 ||
    get_parent_of_type(o[pos], stack[j-1].first) ==
    get_parent_of_type(alt, stack[j-1].first)) {


ghost May 2, 2017

border case: if both get_parent_of_type calls return 0 (no parent of that type found), the equality holds trivially


ghost commented May 10, 2017

@liewegas It looks like this could end up with a PG being mapped twice to the same host, via the remapping. Or am I missing part of the logic? (Maybe you replied already, as I posted the same question before; I don't remember reading the response though.)

liewegas (Member) commented May 10, 2017

For each type in the "stack" we are building a vector of distinct items (just like crush would). Those items still have to be unique. In _choose_type_stack we try to swap out overfull items (leaves as before, now also buckets) for underfull items, but those swaps still check to make sure the items in the vector are unique. So if the stack is something like [(host,3),(device,1)] we'll still get three unique hosts... it just might be a different host (and device).
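
A rough sketch of the uniqueness check described above, using a hypothetical try_swap over a plain vector (not the _choose_type_stack code):

#include <algorithm>
#include <vector>

// Swap an overfull item for an underfull candidate in the chosen vector,
// but only if the candidate is not already present, so the vector keeps
// holding distinct items (e.g. three distinct hosts for [(host,3),(device,1)]).
static bool try_swap(std::vector<int>& chosen, int overfull, int underfull)
{
  auto it = std::find(chosen.begin(), chosen.end(), overfull);
  if (it == chosen.end())
    return false;                 // the overfull item is not in this vector
  if (std::find(chosen.begin(), chosen.end(), underfull) != chosen.end())
    return false;                 // swapping would duplicate an item
  *it = underfull;
  return true;
}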


ghost commented May 10, 2017

I missed something indeed, thanks for explaining.

// FIXME: if there are multiple takes that place a different number of
// objects we do not take that into account. (Also, note that doing this
// right is also a function of the pool, since the crush rule
// might choose 2 + choose 2 but pool size may only be 3.)


ghost May 10, 2017

side note: the border case of weights 5 1 1 1 1 should be handled in this function by reducing the weight 5 to 4
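
One way to read this: for replica size R, a device cannot usefully take more than 1/R of all PG instances, so its effective weight cannot exceed the combined weight of the other devices divided by (R - 1); with weights 5 1 1 1 1 and R = 2 that would cap the 5 at 4. A hedged sketch of such a clamp (hypothetical clamp_weights helper, not part of this PR):

#include <numeric>
#include <vector>

// Cap each weight so no device is asked to hold more than 1/R of all PG
// instances: w_i may not exceed (sum of the other weights) / (R - 1).
// Iterate because lowering one weight tightens the bound on the others.
static void clamp_weights(std::vector<float>& w, int replica_size)
{
  if (replica_size < 2)
    return;
  for (int pass = 0; pass < 10; ++pass) {
    bool changed = false;
    float total = std::accumulate(w.begin(), w.end(), 0.0f);
    for (auto& wi : w) {
      float cap = (total - wi) / (replica_size - 1);
      if (wi > cap) {
        total += cap - wi;          // keep the running total consistent
        wi = cap;
        changed = true;
      }
    }
    if (!changed)
      break;
  }
}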

ghost approved these changes May 10, 2017

The logic looks good. It would benefit from more unit testing, and from splitting the bigger functions into smaller equivalents that are easier to test and understand.

liewegas merged commit 55c47ef into ceph:master May 17, 2017

3 checks passed

Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
default: Build finished.

liewegas deleted the liewegas:wip-upmap branch May 17, 2017
