
osd/OSDMap: improve upmap calculation #14902

Merged: 8 commits into ceph:master from wip-upmap, May 17, 2017

Conversation

@liewegas (Member) commented May 2, 2017

The current version does not remap across hosts or racks (or other intervening
CRUSH bucket types)--it only moves PGs within the same lowest-level bucket referenced
by the rule. This change fixes that.

On a test map (lab cluster) I'm able to get all devices to +/- 1 PG.

Also fix a bug or two.
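
For context on the mechanism being tuned here: calc_pg_upmaps emits pg-upmap-items exceptions, each of which substitutes one OSD for another in a single PG's CRUSH mapping. The following is a minimal sketch of that data shape using simplified stand-in types, not Ceph's actual OSDMap definitions:

```cpp
// Minimal sketch of the pg-upmap-items data shape (simplified stand-in types).
// Each entry says: for this PG, wherever CRUSH chose 'from', place on 'to' instead.
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
  using pg_id = std::string;                      // e.g. "1.7f" (stand-in for pg_t)
  using osd_swap = std::pair<int32_t, int32_t>;   // (from_osd, to_osd)

  std::map<pg_id, std::vector<osd_swap>> pg_upmap_items;
  // Move one replica of PG 1.7f off an overfull device (osd.3, host A)
  // onto an underfull device (osd.12, host B); the rest of its mapping is untouched.
  pg_upmap_items["1.7f"] = {{3, 12}};
  return 0;
}
```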

This eventually needs to get fixed.  Opened
http://tracker.ceph.com/issues/19818

Signed-off-by: Sage Weil <sage@redhat.com>
Previously we relied on identifying target osds of interest by seeing
which ones were touched by at least one PG.  We also assumed that their
target weight was their crush weight, which might not be the case if
multiple pools using different rules were in play.

Address this by using the get_rule_weight_osd_map.  This isn't perfect
as the PGs might be different "sizes," so one should still only calculate
upmap for multiple pools when the pools have "equivlanet" PGs.

Signed-off-by: Sage Weil <sage@redhat.com>
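
A rough sketch of the approach described above, with made-up variable names rather than the actual calc_pg_upmaps code: each OSD's target PG count is derived from its weight in the rule's weight map (as returned by get_rule_weight_osd_map), not from whichever PGs happen to touch it today.

```cpp
// Sketch: derive per-OSD target PG counts from rule weights. Illustrative only;
// the real code obtains the weight map via CrushWrapper::get_rule_weight_osd_map.
#include <iostream>
#include <map>

int main() {
  // Pretend these weights came from the rule's weight map (osd id -> crush weight).
  std::map<int, float> weight = {{0, 1.0f}, {1, 1.0f}, {2, 2.0f}};
  int total_pg_replicas = 256;   // pg_num * pool size for the pools being balanced

  float total_weight = 0;
  for (const auto& kv : weight)
    total_weight += kv.second;

  // Target share is proportional to weight, independent of current placement.
  for (const auto& [osd, w] : weight) {
    float target = total_pg_replicas * w / total_weight;
    std::cout << "osd." << osd << " target ~= " << target << " PGs\n";
  }
  return 0;
}
```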
Use deviation for this item, not the max deviation.

Signed-off-by: Sage Weil <sage@redhat.com>
If we are less than a full PG over the target, we are not overfull.

Signed-off-by: Sage Weil <sage@redhat.com>
The argument is a ratio, not in units of PGs (like the other deviation
values).

Signed-off-by: Sage Weil <sage@redhat.com>
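
Taken together, the two fixes above amount to something like the following sketch (the names and the exact combination of checks are assumptions, not the PR's code): the caller's max_deviation argument is interpreted as a ratio of the target, and a device only counts as overfull once it is at least one whole PG above that target.

```cpp
// Sketch of the overfull test; illustrative only.
//   target              - this OSD's target PG count (from its rule weight)
//   actual              - PGs currently mapped to it
//   max_deviation_ratio - caller-supplied tolerance as a ratio of target, not in PGs
bool is_overfull(float actual, float target, float max_deviation_ratio) {
  float deviation = actual - target;                  // in units of PGs
  if (deviation < 1.0f)
    return false;                                     // less than a full PG over: not overfull
  return deviation > target * max_deviation_ratio;    // convert the ratio into PG units
}
```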
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
The previous code could only swap overfull devices with underfull devices
if they were in the same bucket.  With this change, we can map across
buckets.  For example, if host A has more PGs than host B, then we'll
remap some PGs from devices in host A to devices in host B.

Signed-off-by: Sage Weil <sage@redhat.com>
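
As an illustration of the effect (a toy greedy sketch, not the PR's algorithm): once swaps may cross bucket boundaries, the fullest and emptiest devices can be paired up no matter which host they sit on, which is what lets the whole map converge to within one PG per device.

```cpp
// Sketch: with cross-bucket swaps allowed, repeatedly pick the fullest and
// emptiest devices anywhere in the tree and move one PG between them until
// every device is within one PG of the rest. Illustrative only.
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
  // PG counts per OSD; osd.0-2 live on host A, osd.3-5 on host B (target ~11 each).
  std::vector<int> pgs = {13, 12, 12, 9, 10, 10};

  while (true) {
    auto hi = std::max_element(pgs.begin(), pgs.end());
    auto lo = std::min_element(pgs.begin(), pgs.end());
    if (*hi - *lo <= 1)
      break;            // everyone is within +/- 1 PG
    --*hi;              // conceptually: upmap one PG off the overfull OSD...
    ++*lo;              // ...onto the underfull OSD, even on a different host
  }

  for (size_t i = 0; i < pgs.size(); ++i)
    std::cout << "osd." << i << ": " << pgs[i] << " PGs\n";
  return 0;
}
```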
@liewegas added the core label May 2, 2017
@liewegas requested a review from a user May 2, 2017 01:46
@liewegas changed the title from "osdmap: improve upmap calculation" to "osd/OSDMap: improve upmap calculation" May 2, 2017
@ghost left a comment:
@liewegas It looks like this could end up with a PG being mapped twice to the same host, via the remapping. Or am I missing part of the logic ?

@@ -3407,15 +3407,25 @@ int OSDMap::calc_pg_upmaps(
CephContext *cct,
s/"equivlanet"/"equivalentt"/ in the commit message

// see if alt has the same parent
if (j == 0 ||
get_parent_of_type(o[pos], stack[j-1].first) ==
get_parent_of_type(alt, stack[j-1].first)) {
border case if both get_parent_of_type return 0

@ghost commented May 10, 2017

@liewegas It looks like this could end up with a PG being mapped twice to the same host, via the remapping. Or am I missing part of the logic ? (maybe you replied already as I posted the same question before. I don't remember reading the response though).

@liewegas (Member, Author):
For each type in the "stack" we are building a vector of distinct items (just like CRUSH would). Those items still have to be unique. In _choose_type_stack we try to swap out overfull items (leaves as before, now also buckets) for underfull items, but those swaps still check to make sure the items in the vector are unique. So if the stack is something like [(host,3),(device,1)] we'll still get three unique hosts; it just might be a different host (and device).
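
A hedged sketch of that uniqueness invariant (hypothetical helper; the real logic lives in OSDMap::_choose_type_stack): a candidate replacement is rejected if that item is already present elsewhere in the vector being built for its level, so the PG can never land on the same host twice via the remapping.

```cpp
// Sketch of the uniqueness check described above; not code from the PR.
#include <cstddef>
#include <vector>

// Returns true if 'candidate' can replace the item at 'pos' without
// duplicating anything already chosen at this level of the stack.
bool can_swap_in(const std::vector<int>& chosen, std::size_t pos, int candidate) {
  for (std::size_t i = 0; i < chosen.size(); ++i) {
    if (i == pos)
      continue;                   // the slot we are replacing doesn't count
    if (chosen[i] == candidate)
      return false;               // already used: would map the PG here twice
  }
  return true;
}
```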

@ghost commented May 10, 2017

I missed something indeed, thanks for explaining.

// FIXME: if there are multiple takes that place a different number of
// objects we do not take that into account. (Also, note that doing this
// right is also a function of the pool, since the crush rule
// might choose 2 + choose 2 but pool size may only be 3.)

side note: the border case of weights 5 1 1 1 1 should be handled in this function by reducing the weight 5 to 4
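
One possible reading of that note (an assumption, not something spelled out in the thread): for a 2-replica pool a device can receive at most one replica of each PG, so no device can usefully carry more weight than all the others combined, and weights 5 1 1 1 1 would be capped to 4 1 1 1 1. A minimal sketch under that reading:

```cpp
// Sketch, assuming a 2-replica pool: cap each weight at the sum of the
// other weights, since a device can receive at most one replica per PG.
// This is one reading of the reviewer's note, not code from the PR.
#include <iostream>
#include <numeric>
#include <vector>

int main() {
  std::vector<float> w = {5, 1, 1, 1, 1};
  float total = std::accumulate(w.begin(), w.end(), 0.0f);
  for (auto& wi : w) {
    float others = total - wi;
    if (wi > others)
      wi = others;                 // 5 -> 4 in this example
  }
  for (auto wi : w)
    std::cout << wi << " ";
  std::cout << "\n";
  return 0;
}
```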

@ghost left a comment:
The logic looks good. It would benefit from more unit testing and splitting the bigger functions into smaller, easier to test and understand equivalents.

@liewegas merged commit 55c47ef into ceph:master May 17, 2017
@liewegas deleted the wip-upmap branch May 17, 2017 15:29