crush changes for erasure coding #869

liewegas · 2013-11-26T01:48:02Z

changes to crush to fix/improve the indep mode

changes to the mon to fix adjusting the crush rule, and to se the crush
rule explicitly (and ignore the 'ruleset' nonsense)

ghost · 2013-12-02T13:25:47Z

src/mon/MonCommands.h

-	"name=type,type=CephString,goodchars=[A-Za-z0-9-_.]",
-	"create crush rule <name> to 'take' from bucket <root> and 'chooseleaf_first' a bucket <type>", \
+	"name=type,type=CephString,goodchars=[A-Za-z0-9-_.] " \
+	"name=mode,type=CephChoices,strings=firstn|indep,goodchars=[A-Za-z0-9-_.],req=false",


goodchars is not needed here

ghost · 2013-12-02T16:24:58Z

src/test/crush/indep.cc

+
+  // mark a bunch of osds out
+  int num = 3*3*3;
+  for (int i=0; i<num / 2; ++i)


If the idea is to make it difficult to find the required items, would it be even better to only leave 9 osds instead of more ? I tried and it does work with ~75 tries. Not sure if that's of any value. What troubles me is that I don't know if there is a way to check under which conditions the results can always be obtained.

it's marking out every other osd, so we are sure not to mark out all osds in a given branch and have no possible solution.

ghost · 2013-12-03T10:32:27Z

For the sake of backward compatibility, wouldn't it be better to make it so indep does not change behavior ? Another crush_choose_foobar could be added instead.

This is only present to size the temporary scratch arrays that we put on the stack. Let the caller allocate them as they wish and remove the limitation. Signed-off-by: Sage Weil <sage@inktank.com>

For firstn mode, if we fail to make a valid placement choice, we just continue and return a short result to the caller. For indep mode, however, we need to make the position stable, and return an undefined value on failed placements to avoid shifting later results to the left. Signed-off-by: Sage Weil <sage@inktank.com>

Signed-off-by: Sage Weil <sage@inktank.com>

This is based on the OSDMap::print_tree(), but is a bit simpler because we do not have up/down information at this level. Signed-off-by: Sage Weil <sage@inktank.com>

Signed-off-by: Sage Weil <sage@inktank.com>

Now that indep is handled by crush_choose_indep, rename crush_choose to crush_choose_firstn and remove all the conditionals. This ends up stripping out *lots* of code. Note that it *also* makes it obvious that the shenanigans we were playing with r' for uniform buckets were broken for firstn mode. This appears to have happened waaaay back in commit dae8bec (or earlier)... 2007. Signed-off-by: Sage Weil <sage@inktank.com>

Signed-off-by: Sage Weil <sage@inktank.com>

If it is a replicated pool, remove and shift to the left. For erasure pools, replace nonexistent items with CRUSH_ITEM_NONE. Signed-off-by: Sage Weil <sage@inktank.com>

Add a mode (firstn or indep) to the create-simple command. Make it optional and default to firstn (for compatiblity and simplicity). Note: a "default=..." option for mon commands would be easier. Signed-off-by: Sage Weil <sage@inktank.com>

Signed-off-by: Sage Weil <sage@inktank.com>

Pass numrep (the width of the result) separately from the number of results we want *this* iteration. This makes things less awkward when we do a recursive call (for chooseleaf) and want only one item. Signed-off-by: Sage Weil <sage@inktank.com>

Pass down the parent's 'r' value so that we will sample different values in the recursive call when the parent tries multiple times. This avoids doing useless work (calling multiple times and trying the same values). Signed-off-by: Sage Weil <sage@inktank.com>

Explicitly control the number of sample attempts, and allow the number of tries in the recursive call to be explicitly controlled via the rule. This is important because the amount of time we want to spend looking for a solution may be rule dependent (e.g., higher for the wide indep pool than the rep pools). (We should do the same for the other tunables, by the way!) Signed-off-by: Sage Weil <sage@inktank.com>

ghost · 2013-12-03T23:36:17Z

unittest_crush_indep
...
(0/27 out) 1 -> [9,7,1,26,13,22,5]
(1/27 out) 1 -> [9,7,1,26,13,22,5]
(2/27 out) 1 -> [9,7,2,26,13,22,5]
 0 moved, 1 changed
(3/27 out) 1 -> [9,7,25,15,13,22,5]
 0 moved, 2 changed
(4/27 out) 1 -> [9,7,25,15,13,22,5]
(5/27 out) 1 -> [9,7,25,15,13,22,5]
(6/27 out) 1 -> [9,7,25,19,13,22,15]
 15 moved from 3 to 6
 1 moved, 2 changed
(7/27 out) 1 -> [9,7,25,19,13,22,15]
...

The optimimum would be

(6/27 out) 1 -> [9,7,25,15,13,22,19]

instead of

(6/27 out) 1 -> [9,7,25,19,13,22,15]

is it expected or should it be always optimum ?

ghost · 2013-12-04T00:01:07Z

src/test/crush/indep.cc

+
+    std::map<int,unsigned> pos;
+    vector<int> prev;
+    for (unsigned i=0; i<weight.size()/2; ++i) {


It passes with weight.size() instead of weight.size()/2 and nicely degrades until there is only one osd rack left with one osd left.

cool, i'll update that test

liewegas · 2013-12-04T00:14:46Z

Some of those shifts are expected. They're not quite optimal, but a natural consequence of the fact that intuitively you have multiple 'actors' (positions in the result) that are pulling choices out of the same bucket (unused osds), and they are choosing in some order. If several ranks groping about for options, and a previous choice is ruled out, a new person groping about may pick something on an early attempt that another guy was previously finding on a later attempt. (This is why we do a breadth-first search; it makes this effect less pronounced because each output rank gets a turn, then they all get a second turn, etc., instead of the first position getting ~100 turns and potentially picking something that a later rank previous got on its first try.)

ghost · 2013-12-04T00:23:16Z

src/test/crush/indep.cc

+      ASSERT_LE(moved, 1);
+      ASSERT_LE(changed, 3);
+
+      // mark another osd out


changing the line below to

weight[(i*2)%weight.size()] = 0;

triggers a failure:

(18/27 out) 3 -> [2147483647,13,19,9,15,25,23] 13 moved from 6 to 1 23 moved from 0 to 6 2 moved, 3 changed test/crush/indep.cc:232: Failure Expected: (moved) <= (1), actual: 2 vs 1
is it to be expected when the missing OSDs are more uniformly distributed among buckets ?

AFAICS this sort of thing is unavoidable; we just try to make it as unlikely as possible. Those asserts are the lowest values I could make the test pass at; they're not strict bounds. If you pull the latest branch and #define DEBUG_INDEP in mapper.c you can see the order that things are chosen and why items are moving around. Note that the 'a' line is the chooseleaf type (the rack) and the 'b' line are the devices nested beneath that appear in the final result.

Again, the intuitive explanation is that the 'indep' mode isn't truly independent, because the same item cannot be chosen twice by two different output positions. So there is always some level of interference, and changing one rank's choice can effect other ranks' past or future choices.

liewegas · 2013-12-04T19:14:27Z

re: indep compatibility, yeah. We could preserve the old indep code, but it would prevent us from cleaning up _firstn, and nobody should have been using it. I'm inclined to kill it and avoid the cruft, but curious what others think!

ghost · 2013-12-05T17:27:57Z

PendingReleaseNotes

+  the rule should be modified to use 'firstn' instead, and the
+  administrator should wait until any data movement completes before
+  upgrading.
+


What is the worst that can happen if someone overlooks this ?

The pool that uses the (old) indep CRUSH rule will fail to peer...

On Thu, 5 Dec 2013, Loic Dachary wrote:

In PendingReleaseNotes:

As part of fix for common: signal_handler: added support for using reentrant strsignal() implementations vs. sys_siglist[] #6796, 'ceph osd pool set ' now receives as an integer instead of a string. This affects how 'hashpspool' flag is set/unset: instead of 'true' or 'false', it now must be '0' or '1'.

+- The behavior of the CRUSH 'indep' choose mode has been changed. No

ceph cluster should have been using this behavior unless someone has

manually extracted a crush map, modified a CRUSH rule to replace

'firstn' with 'indep', recompiled, and reinjected the new map into

the cluster. If the 'indep' mode is currently in use on a cluster,

the rule should be modified to use 'firstn' instead, and the

administrator should wait until any data movement completes before

upgrading.

What is the worst that can happen if someone overlooks this ?

?
Reply to this email directly or view it onGitHub.[E6zJzQGM2pxyQVNszVYQWfR1sZAO2JDA1eXCXjUhutWvmyW3mT07jS6lbd3a0fgl.gi
f]

Fix indentation. Simplify+fix the changed vs moved calculation. Use the new SET_CHOOSE_LEAF_TRIES command. Signed-off-by: Sage Weil <sage@inktank.com>

Parameterize the attempts for the _firstn choose method, and apply the rule-specified tries count to firstn mode as well. Note that we have slightly different behavior here than with indep: If the firstn value is not specified for firstn, we pass through the normal attempt count. This maintains compatibility with legacy behavior. Note that this is usually *not* actually N^2 work, though, because of the descend_once tunable. However, descend_once is unfortunately *not* the same thing as 1 chooseleaf try because it is only checked on a reject but not on a collision. Sigh. In contrast, for indep, if tries is not specified we default to 1 recursive attempt, because that is simply more sane, and we have the option to do so. The descend_once tunable has no effect for indep. Signed-off-by: Sage Weil <sage@inktank.com>

Since we can specify the recursive retries in a rule, we may as well also specify the non-recursive tries too for completeness. Signed-off-by: Sage Weil <sage@inktank.com>

When making a generic indep rule, set the recursive retry to 5. This gives better overall results. Signed-off-by: Sage Weil <sage@inktank.com>

Signed-off-by: Sage Weil <sage@inktank.com>

We need to include the faeture in the mask. Signed-off-by: Sage Weil <sage@inktank.com>

This aligns the internal identifier names with the user-visible names in the decompiled crush map language. Signed-off-by: Sage Weil <sage@inktank.com>

…teps Signed-off-by: Sage Weil <sage@inktank.com>

Signed-off-by: Sage Weil <sage@inktank.com>

...if DEBUG_INDEP is #defined. Signed-off-by: Sage Weil <sage@inktank.com>

athanatos · 2013-12-06T23:13:37Z

This looks right to me.

This will help us catch things if we break the mapping. Signed-off-by: Sage Weil <sage@inktank.com>

ghost · 2013-12-07T00:28:13Z

I would be interested to see how you generated src/test/cli/crushtool/one-hundered-devices-new-tunables.crushmap ( in the commit message maybe ? ).

ghost · 2013-12-07T00:29:13Z

👍 globally

liewegas · 2013-12-07T00:46:37Z

On Fri, 6 Dec 2013, Loic Dachary wrote:

I would be interested to see how you generated
src/test/cli/crushtool/one-hundered-devices-new-tunables.crushmap ( in the
commit message maybe ? ).

I put the first line in the .t file with no output, ran the test, and then
copied the .t.err file over the .t file and committed it.

There is a better way, but it requires remembering how to invoke cram :).

sage

crush changes for erasure coding Reviewed-by: Loic Dachary <loic@dachary.org> Reviewed-by: Samuel Just <sam.just@inktank.com>

Added blogbench.yaml to parallel before and after -x upgrade steps Reviewed-by: Josh Durgin <jdurgin@redhat.com>

ghost reviewed Dec 2, 2013
View reviewed changes

ghost mentioned this pull request Dec 2, 2013

crush unittest: s/number/variable/ #885

Closed

ghost reviewed Dec 2, 2013
View reviewed changes

Sage Weil added 14 commits December 3, 2013 14:41

crush: eliminate CRUSH_MAX_SET result size limitation

1cfe140

This is only present to size the temporary scratch arrays that we put on the stack. Let the caller allocate them as they wish and remove the limitation. Signed-off-by: Sage Weil <sage@inktank.com>

crush: use breadth-first search for indep mode

86e9780

Signed-off-by: Sage Weil <sage@inktank.com>

crush/CrushWrapper: add dump_tree()

7967ae0

This is based on the OSDMap::print_tree(), but is a bit simpler because we do not have up/down information at this level. Signed-off-by: Sage Weil <sage@inktank.com>

crush/mapper: add note about r in recursive choose

4551fee

Signed-off-by: Sage Weil <sage@inktank.com>

crush: add a few unit tests for INDEP mode

0b9bb54

Signed-off-by: Sage Weil <sage@inktank.com>

osd, crush: add 'erasure' pool/pg type

9085286

Signed-off-by: Sage Weil <sage@inktank.com>

osd/OSDMap: do not shift result when removing nonexistent osds

6dc68f4

If it is a replicated pool, remove and shift to the left. For erasure pools, replace nonexistent items with CRUSH_ITEM_NONE. Signed-off-by: Sage Weil <sage@inktank.com>

osd/osd_types: pg_pool_t: fix /// -> ///< comments

d51a219

Signed-off-by: Sage Weil <sage@inktank.com>

crush/mapper: clarify numrep vs endpos

1b567ee

Pass numrep (the width of the result) separately from the number of results we want *this* iteration. This makes things less awkward when we do a recursive call (for chooseleaf) and want only one item. Signed-off-by: Sage Weil <sage@inktank.com>

ghost reviewed Dec 4, 2013
View reviewed changes

ghost reviewed Dec 5, 2013
View reviewed changes

Sage Weil added 4 commits December 6, 2013 14:24

crush/mapper: fix up the indep tests

cb88763

Fix indentation. Simplify+fix the changed vs moved calculation. Use the new SET_CHOOSE_LEAF_TRIES command. Signed-off-by: Sage Weil <sage@inktank.com>

crush/mapper: add SET_CHOOSE_TRIES rule step

d1b9746

Since we can specify the recursive retries in a rule, we may as well also specify the non-recursive tries too for completeness. Signed-off-by: Sage Weil <sage@inktank.com>

crush/CrushWrapper: set chooseleaf_tries to 5 for 'simple' indep rules

09ce7a2

When making a generic indep rule, set the recursive retry to 5. This gives better overall results. Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil added 6 commits December 6, 2013 14:24

crush/CrushCompiler: [de]compile set_choose[leaf]_tries rule step

03911b0

Signed-off-by: Sage Weil <sage@inktank.com>

osd/OSDMap: fix feature calculation for CACHEPOOL

431a13e

We need to include the faeture in the mask. Signed-off-by: Sage Weil <sage@inktank.com>

crush: CHOOSE_LEAF -> CHOOSELEAF throughout

caa0e22

This aligns the internal identifier names with the user-visible names in the decompiled crush map language. Signed-off-by: Sage Weil <sage@inktank.com>

crush: add feature CRUSH_V2 for new indep mode and SET_*_TRIES rule s…

c853019

…teps Signed-off-by: Sage Weil <sage@inktank.com>

PendingReleaseNotes: note change of CRUSH indep mode in release notes

e632a79

Signed-off-by: Sage Weil <sage@inktank.com>

crush/mapper: dump indep partial progression for debugging

384f01d

...if DEBUG_INDEP is #defined. Signed-off-by: Sage Weil <sage@inktank.com>

crush: add indep data set to cli tests

e8fdef2

This will help us catch things if we break the mapping. Signed-off-by: Sage Weil <sage@inktank.com>

liewegas pushed a commit that referenced this pull request Dec 8, 2013

Merge pull request #869 from ceph/wip-crush

94da215

crush changes for erasure coding Reviewed-by: Loic Dachary <loic@dachary.org> Reviewed-by: Samuel Just <sam.just@inktank.com>

liewegas merged commit 94da215 into master Dec 8, 2013

liewegas pushed a commit that referenced this pull request Dec 14, 2016

Merge pull request #869 from ceph/wip-15093-jewel

7e595e1

Added blogbench.yaml to parallel before and after -x upgrade steps Reviewed-by: Josh Durgin <jdurgin@redhat.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

crush changes for erasure coding #869

crush changes for erasure coding #869

liewegas commented Nov 26, 2013

ghost Dec 2, 2013

ghost Dec 2, 2013

liewegas Dec 3, 2013

ghost commented Dec 3, 2013

ghost commented Dec 3, 2013

ghost Dec 4, 2013

liewegas Dec 4, 2013

liewegas commented Dec 4, 2013

ghost Dec 4, 2013

liewegas Dec 4, 2013

liewegas Dec 4, 2013

liewegas commented Dec 4, 2013

ghost Dec 5, 2013

liewegas Dec 5, 2013

athanatos commented Dec 6, 2013

ghost commented Dec 7, 2013

ghost commented Dec 7, 2013

liewegas commented Dec 7, 2013

crush changes for erasure coding #869

crush changes for erasure coding #869

Conversation

liewegas commented Nov 26, 2013

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghost commented Dec 3, 2013

ghost commented Dec 3, 2013

Choose a reason for hiding this comment

Choose a reason for hiding this comment

liewegas commented Dec 4, 2013

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

liewegas commented Dec 4, 2013

Choose a reason for hiding this comment

Choose a reason for hiding this comment

athanatos commented Dec 6, 2013

ghost commented Dec 7, 2013

ghost commented Dec 7, 2013

liewegas commented Dec 7, 2013