osd/OSDMap: improve upmap calculation #14902
Conversation
This eventually needs to get fixed. Opened http://tracker.ceph.com/issues/19818 Signed-off-by: Sage Weil <sage@redhat.com>
Previously we relied on identifying target osds of interest by seeing which ones were touched by at least one PG. We also assumed that their target weight was their crush weight, which might not be the case if multiple pools using different rules were in play. Address this by using get_rule_weight_osd_map. This isn't perfect as the PGs might be different "sizes," so one should still only calculate upmap for multiple pools when the pools have "equivlanet" PGs. Signed-off-by: Sage Weil <sage@redhat.com>
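A rough sketch of the idea (not the actual patch; the surrounding variables pools, osdmap, and crush are assumed context, while get_rule_weight_osd_map is the CrushWrapper helper the commit names): sum the per-rule weight maps over the pools of interest instead of reading each OSD's raw CRUSH weight.
// Hypothetical sketch: build per-OSD target weights by summing the
// weight map of each pool's crush rule, rather than assuming each
// OSD's target weight equals its crush weight.
std::map<int, float> target_weight;   // osd id -> aggregate target weight
for (int64_t pool_id : pools) {       // 'pools' is an assumed input
  const pg_pool_t *pool = osdmap.get_pg_pool(pool_id);
  std::map<int, float> rule_weights;
  crush->get_rule_weight_osd_map(pool->get_crush_rule(), &rule_weights);
  for (auto& p : rule_weights)
    target_weight[p.first] += p.second;
}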
Use deviation for this item, not the max deviation. Signed-off-by: Sage Weil <sage@redhat.com>
If we are less than a full PG over the target, we are not overfull. Signed-off-by: Sage Weil <sage@redhat.com>
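In code terms the fix amounts to something like this (a minimal sketch; pg_count and target are hypothetical names for the per-OSD PG count and target):
// Minimal sketch: an OSD only counts as overfull once it exceeds its
// target by at least one whole PG.
float deviation = pg_count[osd] - target[osd];
bool overfull = deviation >= 1.0;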
The argument is a ratio, not in units of PGs (like the other deviation values). Signed-off-by: Sage Weil <sage@redhat.com>
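A sketch of the distinction (hypothetical names; the real argument is the max-deviation parameter of calc_pg_upmaps): a ratio has to be scaled by the per-OSD target before it can be compared against deviations measured in PGs.
// Hypothetical sketch: scale the ratio into PG units for each OSD
// before comparing it against the PG-count deviation.
float max_dev_pgs = max_deviation_ratio * target[osd];
if (deviation > max_dev_pgs) {
  // this osd is a candidate for remapping
}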
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
The previous code could only swap overfull devices with underfull devices if they were in the same bucket. With this change, we can remap across buckets. For example, if host A has more PGs than host B, then we'll remap some PGs from devices in host A to devices in host B. Signed-off-by: Sage Weil <sage@redhat.com>
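Conceptually (a sketch, not the actual diff; underfull_items_of_type and try_swap are hypothetical helpers), the candidate set for a swap widens from siblings in the same bucket to any underfull item of the same CRUSH type:
// Before: candidates for replacing an overfull item were drawn only from
// the same parent bucket. After: any underfull item of the matching type,
// anywhere in the hierarchy, may be considered.
for (int alt : underfull_items_of_type(type)) {  // cluster-wide scan
  if (try_swap(pg, overfull_item, alt))
    break;
}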
@liewegas It looks like this could end up with a PG being mapped twice to the same host, via the remapping. Or am I missing part of the logic?
@@ -3407,15 +3407,25 @@ int OSDMap::calc_pg_upmaps(
  CephContext *cct,
s/"equivlanet"/"equivalentt"/ in the commit message
// see if alt has the same parent
if (j == 0 ||
    get_parent_of_type(o[pos], stack[j-1].first) ==
    get_parent_of_type(alt, stack[j-1].first)) {
border case if both get_parent_of_type calls return 0
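The concern, sketched as a possible guard (hypothetical; CRUSH bucket ids are negative and get_parent_of_type yields 0 when no parent of the requested type exists, so two unrelated items could both return 0 and pass the equality test):
int p1 = get_parent_of_type(o[pos], stack[j-1].first);
int p2 = get_parent_of_type(alt, stack[j-1].first);
if (j == 0 || (p1 == p2 && p1 != 0)) {
  // only treat the items as siblings when a real common parent was found
}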
@liewegas It looks like this could end up with a PG being mapped twice to the same host, via the remapping. Or am I missing part of the logic? (Maybe you replied already, as I posted the same question before; I don't remember reading the response though.)
For each type in the "stack" we are building a vector of distinct items (just like CRUSH would). Those items still have to be unique. In _choose_type_stack we try to swap out overfull items (leaves as before, now also buckets) for underfull items, but those swaps still check to make sure the items in the vector are unique. So if the stack is something like [(host,3),(device,1)] we'll still get three unique hosts; it just might be a different host (and device).
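A sketch of that uniqueness invariant (hypothetical helper, not the actual _choose_type_stack code): a candidate is rejected if it already occupies another slot of the vector being built for that type, so a stack of [(host,3),(device,1)] still produces three distinct hosts.
#include <vector>
// Hypothetical sketch: reject a swap that would duplicate an item in the
// vector of items chosen for this type.
bool can_swap(const std::vector<int>& chosen, size_t slot, int candidate) {
  for (size_t i = 0; i < chosen.size(); ++i)
    if (i != slot && chosen[i] == candidate)
      return false;  // candidate already used elsewhere in this vector
  return true;
}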
I missed something indeed, thanks for explaining.
// FIXME: if there are multiple takes that place a different number of
// objects we do not take that into account. (Also, note that doing this
// right is also a function of the pool, since the crush rule
// might choose 2 + choose 2 but pool size may only be 3.)
side note: the border case of weights {5, 1, 1, 1, 1} should be handled in this function by reducing the weight 5 to 4
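One way to express that suggestion (a hypothetical helper for the 2-replica case, which is where 5 caps to 4): a device can hold at most one of a PG's two replicas, so its share of placements cannot exceed one half; capping each weight at the sum of the remaining weights turns {5, 1, 1, 1, 1} into {4, 1, 1, 1, 1} and keeps the computed targets achievable.
#include <map>
// Hypothetical sketch: cap any weight that exceeds the sum of the others.
void cap_weights(std::map<int, float>& weight) {
  float total = 0;
  for (auto& p : weight)
    total += p.second;
  for (auto& p : weight) {
    float others = total - p.second;
    if (p.second > others)
      p.second = others;  // e.g. 5 -> 4 when the other weights sum to 4
  }
}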
The logic looks good. It would benefit from more unit testing, and from splitting the bigger functions into smaller equivalents that are easier to test and understand.
The current version does not remap across hosts or racks (or other intervening CRUSH bucket types); it only moves PGs within the same lowest-level bucket referenced by the rule. This change fixes that.
On a test map (lab cluster) I'm able to get all devices to +/- 1 PG.
Also fix a bug or two.