New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
crush changes for erasure coding #869
Conversation
"name=type,type=CephString,goodchars=[A-Za-z0-9-_.]", | ||
"create crush rule <name> to 'take' from bucket <root> and 'chooseleaf_first' a bucket <type>", \ | ||
"name=type,type=CephString,goodchars=[A-Za-z0-9-_.] " \ | ||
"name=mode,type=CephChoices,strings=firstn|indep,goodchars=[A-Za-z0-9-_.],req=false", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
goodchars is not needed here
|
||
// mark a bunch of osds out | ||
int num = 3*3*3; | ||
for (int i=0; i<num / 2; ++i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the idea is to make it difficult to find the required items, would it be even better to only leave 9 osds instead of more ? I tried and it does work with ~75 tries. Not sure if that's of any value. What troubles me is that I don't know if there is a way to check under which conditions the results can always be obtained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's marking out every other osd, so we are sure not to mark out all osds in a given branch and have no possible solution.
For the sake of backward compatibility, wouldn't it be better to make it so indep does not change behavior ? Another crush_choose_foobar could be added instead. |
This is only present to size the temporary scratch arrays that we put on the stack. Let the caller allocate them as they wish and remove the limitation. Signed-off-by: Sage Weil <sage@inktank.com>
For firstn mode, if we fail to make a valid placement choice, we just continue and return a short result to the caller. For indep mode, however, we need to make the position stable, and return an undefined value on failed placements to avoid shifting later results to the left. Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
This is based on the OSDMap::print_tree(), but is a bit simpler because we do not have up/down information at this level. Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Now that indep is handled by crush_choose_indep, rename crush_choose to crush_choose_firstn and remove all the conditionals. This ends up stripping out *lots* of code. Note that it *also* makes it obvious that the shenanigans we were playing with r' for uniform buckets were broken for firstn mode. This appears to have happened waaaay back in commit dae8bec (or earlier)... 2007. Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
If it is a replicated pool, remove and shift to the left. For erasure pools, replace nonexistent items with CRUSH_ITEM_NONE. Signed-off-by: Sage Weil <sage@inktank.com>
Add a mode (firstn or indep) to the create-simple command. Make it optional and default to firstn (for compatiblity and simplicity). Note: a "default=..." option for mon commands would be easier. Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Pass numrep (the width of the result) separately from the number of results we want *this* iteration. This makes things less awkward when we do a recursive call (for chooseleaf) and want only one item. Signed-off-by: Sage Weil <sage@inktank.com>
Pass down the parent's 'r' value so that we will sample different values in the recursive call when the parent tries multiple times. This avoids doing useless work (calling multiple times and trying the same values). Signed-off-by: Sage Weil <sage@inktank.com>
Explicitly control the number of sample attempts, and allow the number of tries in the recursive call to be explicitly controlled via the rule. This is important because the amount of time we want to spend looking for a solution may be rule dependent (e.g., higher for the wide indep pool than the rep pools). (We should do the same for the other tunables, by the way!) Signed-off-by: Sage Weil <sage@inktank.com>
unittest_crush_indep ... (0/27 out) 1 -> [9,7,1,26,13,22,5] (1/27 out) 1 -> [9,7,1,26,13,22,5] (2/27 out) 1 -> [9,7,2,26,13,22,5] 0 moved, 1 changed (3/27 out) 1 -> [9,7,25,15,13,22,5] 0 moved, 2 changed (4/27 out) 1 -> [9,7,25,15,13,22,5] (5/27 out) 1 -> [9,7,25,15,13,22,5] (6/27 out) 1 -> [9,7,25,19,13,22,15] 15 moved from 3 to 6 1 moved, 2 changed (7/27 out) 1 -> [9,7,25,19,13,22,15] ... The optimimum would be (6/27 out) 1 -> [9,7,25,15,13,22,19] instead of (6/27 out) 1 -> [9,7,25,19,13,22,15] is it expected or should it be always optimum ? |
|
||
std::map<int,unsigned> pos; | ||
vector<int> prev; | ||
for (unsigned i=0; i<weight.size()/2; ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It passes with weight.size() instead of weight.size()/2 and nicely degrades until there is only one osd rack left with one osd left.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool, i'll update that test
Some of those shifts are expected. They're not quite optimal, but a natural consequence of the fact that intuitively you have multiple 'actors' (positions in the result) that are pulling choices out of the same bucket (unused osds), and they are choosing in some order. If several ranks groping about for options, and a previous choice is ruled out, a new person groping about may pick something on an early attempt that another guy was previously finding on a later attempt. (This is why we do a breadth-first search; it makes this effect less pronounced because each output rank gets a turn, then they all get a second turn, etc., instead of the first position getting ~100 turns and potentially picking something that a later rank previous got on its first try.) |
ASSERT_LE(moved, 1); | ||
ASSERT_LE(changed, 3); | ||
|
||
// mark another osd out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changing the line below to
weight[(i*2)%weight.size()] = 0;
triggers a failure:
(18/27 out) 3 -> [2147483647,13,19,9,15,25,23] 13 moved from 6 to 1 23 moved from 0 to 6 2 moved, 3 changed test/crush/indep.cc:232: Failure Expected: (moved) <= (1), actual: 2 vs 1is it to be expected when the missing OSDs are more uniformly distributed among buckets ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICS this sort of thing is unavoidable; we just try to make it as unlikely as possible. Those asserts are the lowest values I could make the test pass at; they're not strict bounds. If you pull the latest branch and #define DEBUG_INDEP in mapper.c you can see the order that things are chosen and why items are moving around. Note that the 'a' line is the chooseleaf type (the rack) and the 'b' line are the devices nested beneath that appear in the final result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, the intuitive explanation is that the 'indep' mode isn't truly independent, because the same item cannot be chosen twice by two different output positions. So there is always some level of interference, and changing one rank's choice can effect other ranks' past or future choices.
re: indep compatibility, yeah. We could preserve the old indep code, but it would prevent us from cleaning up _firstn, and nobody should have been using it. I'm inclined to kill it and avoid the cruft, but curious what others think! |
the rule should be modified to use 'firstn' instead, and the | ||
administrator should wait until any data movement completes before | ||
upgrading. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the worst that can happen if someone overlooks this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pool that uses the (old) indep CRUSH rule will fail to peer...
On Thu, 5 Dec 2013, Loic Dachary wrote:
In PendingReleaseNotes:
- As part of fix for common: signal_handler: added support for using reentrant strsignal() implementations vs. sys_siglist[] #6796, 'ceph osd pool set ' now
receives as an integer instead of a string. This affects how
'hashpspool' flag is set/unset: instead of 'true' or 'false', it now
must be '0' or '1'.+- The behavior of the CRUSH 'indep' choose mode has been changed. No
- ceph cluster should have been using this behavior unless someone has
- manually extracted a crush map, modified a CRUSH rule to replace
- 'firstn' with 'indep', recompiled, and reinjected the new map into
- the cluster. If the 'indep' mode is currently in use on a cluster,
- the rule should be modified to use 'firstn' instead, and the
- administrator should wait until any data movement completes before
- upgrading.
What is the worst that can happen if someone overlooks this ?
?
Reply to this email directly or view it onGitHub.[E6zJzQGM2pxyQVNszVYQWfR1sZAO2JDA1eXCXjUhutWvmyW3mT07jS6lbd3a0fgl.gi
f]
Fix indentation. Simplify+fix the changed vs moved calculation. Use the new SET_CHOOSE_LEAF_TRIES command. Signed-off-by: Sage Weil <sage@inktank.com>
Parameterize the attempts for the _firstn choose method, and apply the rule-specified tries count to firstn mode as well. Note that we have slightly different behavior here than with indep: If the firstn value is not specified for firstn, we pass through the normal attempt count. This maintains compatibility with legacy behavior. Note that this is usually *not* actually N^2 work, though, because of the descend_once tunable. However, descend_once is unfortunately *not* the same thing as 1 chooseleaf try because it is only checked on a reject but not on a collision. Sigh. In contrast, for indep, if tries is not specified we default to 1 recursive attempt, because that is simply more sane, and we have the option to do so. The descend_once tunable has no effect for indep. Signed-off-by: Sage Weil <sage@inktank.com>
Since we can specify the recursive retries in a rule, we may as well also specify the non-recursive tries too for completeness. Signed-off-by: Sage Weil <sage@inktank.com>
When making a generic indep rule, set the recursive retry to 5. This gives better overall results. Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
We need to include the faeture in the mask. Signed-off-by: Sage Weil <sage@inktank.com>
This aligns the internal identifier names with the user-visible names in the decompiled crush map language. Signed-off-by: Sage Weil <sage@inktank.com>
…teps Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
...if DEBUG_INDEP is #defined. Signed-off-by: Sage Weil <sage@inktank.com>
This looks right to me. |
This will help us catch things if we break the mapping. Signed-off-by: Sage Weil <sage@inktank.com>
I would be interested to see how you generated src/test/cli/crushtool/one-hundered-devices-new-tunables.crushmap ( in the commit message maybe ? ). |
👍 globally |
On Fri, 6 Dec 2013, Loic Dachary wrote:
I put the first line in the .t file with no output, ran the test, and then There is a better way, but it requires remembering how to invoke cram :). sage |
crush changes for erasure coding Reviewed-by: Loic Dachary <loic@dachary.org> Reviewed-by: Samuel Just <sam.just@inktank.com>
Added blogbench.yaml to parallel before and after -x upgrade steps Reviewed-by: Josh Durgin <jdurgin@redhat.com>
changes to crush to fix/improve the indep mode
changes to the mon to fix adjusting the crush rule, and to se the crush
rule explicitly (and ignore the 'ruleset' nonsense)