Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crush changes for erasure coding #869

Merged
merged 25 commits into from Dec 8, 2013
Merged

crush changes for erasure coding #869

merged 25 commits into from Dec 8, 2013

Conversation

liewegas
Copy link
Member

changes to crush to fix/improve the indep mode

changes to the mon to fix adjusting the crush rule, and to se the crush
rule explicitly (and ignore the 'ruleset' nonsense)

"name=type,type=CephString,goodchars=[A-Za-z0-9-_.]",
"create crush rule <name> to 'take' from bucket <root> and 'chooseleaf_first' a bucket <type>", \
"name=type,type=CephString,goodchars=[A-Za-z0-9-_.] " \
"name=mode,type=CephChoices,strings=firstn|indep,goodchars=[A-Za-z0-9-_.],req=false",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

goodchars is not needed here

@ghost ghost mentioned this pull request Dec 2, 2013

// mark a bunch of osds out
int num = 3*3*3;
for (int i=0; i<num / 2; ++i)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the idea is to make it difficult to find the required items, would it be even better to only leave 9 osds instead of more ? I tried and it does work with ~75 tries. Not sure if that's of any value. What troubles me is that I don't know if there is a way to check under which conditions the results can always be obtained.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's marking out every other osd, so we are sure not to mark out all osds in a given branch and have no possible solution.

@ghost
Copy link

ghost commented Dec 3, 2013

For the sake of backward compatibility, wouldn't it be better to make it so indep does not change behavior ? Another crush_choose_foobar could be added instead.

Sage Weil added 14 commits December 3, 2013 14:41
This is only present to size the temporary scratch arrays that we put on
the stack.  Let the caller allocate them as they wish and remove the
limitation.

Signed-off-by: Sage Weil <sage@inktank.com>
For firstn mode, if we fail to make a valid placement choice, we just
continue and return a short result to the caller.  For indep mode, however,
we need to make the position stable, and return an undefined value on
failed placements to avoid shifting later results to the left.

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
This is based on the OSDMap::print_tree(), but is a bit simpler because
we do not have up/down information at this level.

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Now that indep is handled by crush_choose_indep, rename crush_choose to
crush_choose_firstn and remove all the conditionals.  This ends up
stripping out *lots* of code.

Note that it *also* makes it obvious that the shenanigans we were playing
with r' for uniform buckets were broken for firstn mode.  This appears to
have happened waaaay back in commit dae8bec (or earlier)... 2007.

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
If it is a replicated pool, remove and shift to the left.  For erasure
pools, replace nonexistent items with CRUSH_ITEM_NONE.

Signed-off-by: Sage Weil <sage@inktank.com>
Add a mode (firstn or indep) to the create-simple command.  Make it
optional and default to firstn (for compatiblity and simplicity).

Note: a "default=..." option for mon commands would be easier.

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Pass numrep (the width of the result) separately from the number of results
we want *this* iteration.  This makes things less awkward when we do a
recursive call (for chooseleaf) and want only one item.

Signed-off-by: Sage Weil <sage@inktank.com>
Pass down the parent's 'r' value so that we will sample different values in
the recursive call when the parent tries multiple times.  This avoids doing
useless work (calling multiple times and trying the same values).

Signed-off-by: Sage Weil <sage@inktank.com>
Explicitly control the number of sample attempts, and allow the number of
tries in the recursive call to be explicitly controlled via the rule. This
is important because the amount of time we want to spend looking for a
solution may be rule dependent (e.g., higher for the wide indep pool than
the rep pools).

(We should do the same for the other tunables, by the way!)

Signed-off-by: Sage Weil <sage@inktank.com>
@ghost
Copy link

ghost commented Dec 3, 2013

unittest_crush_indep
...
(0/27 out) 1 -> [9,7,1,26,13,22,5]
(1/27 out) 1 -> [9,7,1,26,13,22,5]
(2/27 out) 1 -> [9,7,2,26,13,22,5]
 0 moved, 1 changed
(3/27 out) 1 -> [9,7,25,15,13,22,5]
 0 moved, 2 changed
(4/27 out) 1 -> [9,7,25,15,13,22,5]
(5/27 out) 1 -> [9,7,25,15,13,22,5]
(6/27 out) 1 -> [9,7,25,19,13,22,15]
 15 moved from 3 to 6
 1 moved, 2 changed
(7/27 out) 1 -> [9,7,25,19,13,22,15]
...

The optimimum would be

(6/27 out) 1 -> [9,7,25,15,13,22,19]

instead of

(6/27 out) 1 -> [9,7,25,19,13,22,15]

is it expected or should it be always optimum ?


std::map<int,unsigned> pos;
vector<int> prev;
for (unsigned i=0; i<weight.size()/2; ++i) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It passes with weight.size() instead of weight.size()/2 and nicely degrades until there is only one osd rack left with one osd left.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, i'll update that test

@liewegas
Copy link
Member Author

liewegas commented Dec 4, 2013

Some of those shifts are expected. They're not quite optimal, but a natural consequence of the fact that intuitively you have multiple 'actors' (positions in the result) that are pulling choices out of the same bucket (unused osds), and they are choosing in some order. If several ranks groping about for options, and a previous choice is ruled out, a new person groping about may pick something on an early attempt that another guy was previously finding on a later attempt. (This is why we do a breadth-first search; it makes this effect less pronounced because each output rank gets a turn, then they all get a second turn, etc., instead of the first position getting ~100 turns and potentially picking something that a later rank previous got on its first try.)

ASSERT_LE(moved, 1);
ASSERT_LE(changed, 3);

// mark another osd out
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changing the line below to

      weight[(i*2)%weight.size()] = 0;

triggers a failure:

(18/27 out) 3 -> [2147483647,13,19,9,15,25,23]
 13 moved from 6 to 1
 23 moved from 0 to 6
 2 moved, 3 changed
test/crush/indep.cc:232: Failure
Expected: (moved) <= (1), actual: 2 vs 1
is it to be expected when the missing OSDs are more uniformly distributed among buckets ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICS this sort of thing is unavoidable; we just try to make it as unlikely as possible. Those asserts are the lowest values I could make the test pass at; they're not strict bounds. If you pull the latest branch and #define DEBUG_INDEP in mapper.c you can see the order that things are chosen and why items are moving around. Note that the 'a' line is the chooseleaf type (the rack) and the 'b' line are the devices nested beneath that appear in the final result.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, the intuitive explanation is that the 'indep' mode isn't truly independent, because the same item cannot be chosen twice by two different output positions. So there is always some level of interference, and changing one rank's choice can effect other ranks' past or future choices.

@liewegas
Copy link
Member Author

liewegas commented Dec 4, 2013

re: indep compatibility, yeah. We could preserve the old indep code, but it would prevent us from cleaning up _firstn, and nobody should have been using it. I'm inclined to kill it and avoid the cruft, but curious what others think!

the rule should be modified to use 'firstn' instead, and the
administrator should wait until any data movement completes before
upgrading.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the worst that can happen if someone overlooks this ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pool that uses the (old) indep CRUSH rule will fail to peer...

On Thu, 5 Dec 2013, Loic Dachary wrote:

In PendingReleaseNotes:

+- The behavior of the CRUSH 'indep' choose mode has been changed. No

  • ceph cluster should have been using this behavior unless someone has
  • manually extracted a crush map, modified a CRUSH rule to replace
  • 'firstn' with 'indep', recompiled, and reinjected the new map into
  • the cluster. If the 'indep' mode is currently in use on a cluster,
  • the rule should be modified to use 'firstn' instead, and the
  • administrator should wait until any data movement completes before
  • upgrading.

What is the worst that can happen if someone overlooks this ?

?
Reply to this email directly or view it onGitHub.[E6zJzQGM2pxyQVNszVYQWfR1sZAO2JDA1eXCXjUhutWvmyW3mT07jS6lbd3a0fgl.gi
f]

Sage Weil added 4 commits December 6, 2013 14:24
Fix indentation.
Simplify+fix the changed vs moved calculation.
Use the new SET_CHOOSE_LEAF_TRIES command.

Signed-off-by: Sage Weil <sage@inktank.com>
Parameterize the attempts for the _firstn choose method, and apply the
rule-specified tries count to firstn mode as well.  Note that we have
slightly different behavior here than with indep:

 If the firstn value is not specified for firstn, we pass through the
 normal attempt count.  This maintains compatibility with legacy behavior.
 Note that this is usually *not* actually N^2 work, though, because of the
 descend_once tunable.  However, descend_once is unfortunately *not* the
 same thing as 1 chooseleaf try because it is only checked on a reject but
 not on a collision.  Sigh.

 In contrast, for indep, if tries is not specified we default to 1
 recursive attempt, because that is simply more sane, and we have the
 option to do so.  The descend_once tunable has no effect for indep.

Signed-off-by: Sage Weil <sage@inktank.com>
Since we can specify the recursive retries in a rule, we may as well also
specify the non-recursive tries too for completeness.

Signed-off-by: Sage Weil <sage@inktank.com>
When making a generic indep rule, set the recursive retry to 5.  This gives
better overall results.

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil added 6 commits December 6, 2013 14:24
Signed-off-by: Sage Weil <sage@inktank.com>
We need to include the faeture in the mask.

Signed-off-by: Sage Weil <sage@inktank.com>
This aligns the internal identifier names with the user-visible names in
the decompiled crush map language.

Signed-off-by: Sage Weil <sage@inktank.com>
…teps

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
...if DEBUG_INDEP is #defined.

Signed-off-by: Sage Weil <sage@inktank.com>
@athanatos
Copy link
Contributor

This looks right to me.

This will help us catch things if we break the mapping.

Signed-off-by: Sage Weil <sage@inktank.com>
@ghost
Copy link

ghost commented Dec 7, 2013

I would be interested to see how you generated src/test/cli/crushtool/one-hundered-devices-new-tunables.crushmap ( in the commit message maybe ? ).

@ghost
Copy link

ghost commented Dec 7, 2013

👍 globally

@liewegas
Copy link
Member Author

liewegas commented Dec 7, 2013

On Fri, 6 Dec 2013, Loic Dachary wrote:

I would be interested to see how you generated
src/test/cli/crushtool/one-hundered-devices-new-tunables.crushmap ( in the
commit message maybe ? ).

I put the first line in the .t file with no output, ran the test, and then
copied the .t.err file over the .t file and committed it.

There is a better way, but it requires remembering how to invoke cram :).

sage

liewegas pushed a commit that referenced this pull request Dec 8, 2013
crush changes for erasure coding

Reviewed-by: Loic Dachary <loic@dachary.org>
Reviewed-by: Samuel Just <sam.just@inktank.com>
@liewegas liewegas merged commit 94da215 into master Dec 8, 2013
liewegas pushed a commit that referenced this pull request Dec 14, 2016
Added blogbench.yaml to parallel before and after -x upgrade steps

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants