resolve and document most common erasure coded pool pain points #3194

Merged: 5 commits, Jan 18, 2015
195 changes: 195 additions & 0 deletions doc/rados/troubleshooting/troubleshooting-pg.rst
@@ -359,6 +359,201 @@ monitor hosts to act as peers. See `The Network Time Protocol`_ and Ceph
`Clock Settings`_ for additional details.


Erasure Coded PGs are not active+clean
======================================

When CRUSH fails to find enough OSDs to map to a PG, the missing OSD will
show up as ``2147483647``, which is ``ITEM_NONE``, i.e. ``no OSD found``.
For instance::

[2,1,6,0,5,8,2147483647,7,4]
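
The PGs whose mapping contains this value are reported by ``ceph health
detail``. A minimal way to find them and inspect their mapping (the PG id
``9.7`` below is only a placeholder)::

    ceph health detail
    ceph pg dump_stuck unclean
    ceph pg map 9.7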

Not enough OSDs
---------------

If the Ceph cluster has only 8 OSDs and the erasure coded pool needs
9, the mapping will show this value for the missing OSD. You can either
create another erasure coded pool that requires fewer OSDs::

ceph osd erasure-code-profile set myprofile k=5 m=3
ceph osd pool create erasurepool 16 16 erasure myprofile

or add new OSDs and the PG will automatically use them.
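
To double check how many OSDs a given profile needs per PG, compare its
``k`` and ``m`` values with the number of OSDs that are up (``myprofile``
is the profile name used above)::

    ceph osd erasure-code-profile get myprofile
    ceph osd stat

Each PG of the erasure coded pool needs ``k+m`` OSDs to become
``active+clean``.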

CRUSH constraints cannot be satisfied
-------------------------------------

If the cluster has enough OSDs, it is possible that the CRUSH ruleset
imposes constraints that cannot be satisfied. If there are 10 OSDs on
two hosts and the CRUSH ruleset requires that no two OSDs from the
same host are used in the same PG, the mapping may fail because only
two OSDs will be found. You can check the constraint by displaying the
ruleset::

$ ceph osd crush rule ls
[
"replicated_ruleset",
"erasurepool"]
$ ceph osd crush rule dump erasurepool
{ "rule_id": 1,
"rule_name": "erasurepool",
"ruleset": 1,
"type": 3,
"min_size": 3,
"max_size": 20,
"steps": [
{ "op": "take",
"item": -1,
"item_name": "default"},
{ "op": "chooseleaf_indep",
"num": 0,
"type": "host"},
{ "op": "emit"}]}


You can resolve the problem by creating a new pool in which PGs are allowed
to have OSDs residing on the same host with::

ceph osd erasure-code-profile set myprofile ruleset-failure-domain=osd
ceph osd pool create erasurepool 16 16 erasure myprofile
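
To confirm that the new profile will indeed allow several chunks on the
same host, display it again; the output should contain
``ruleset-failure-domain=osd``::

    ceph osd erasure-code-profile get myprofile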

CRUSH gives up too soon
-----------------------

If the Ceph cluster has just enough OSDs to map the PG (for instance a
cluster with a total of 9 OSDs and an erasure coded pool that requires
9 OSDs per PG), it is possible that CRUSH gives up before finding a
mapping. It can be resolved by:

* lowering the erasure coded pool requirements to use fewer OSDs per PG
  (this requires the creation of another pool, as erasure code profiles
  cannot be dynamically modified).

* adding more OSDs to the cluster (this does not require the erasure
  coded pool to be modified; it will become clean automatically)

* using a hand-made CRUSH ruleset that tries more times to find a good
  mapping (this also requires the creation of another pool, as erasure
  code profiles cannot be dynamically modified). It can be done by
  setting ``set_choose_tries`` to a value greater than the default.

You should first verify the problem with ``crushtool`` after
extracting the crushmap from the cluster, so your experiments do not
modify the Ceph cluster and only work on local files::

$ ceph osd crush rule dump erasurepool
{ "rule_name": "erasurepool",
"ruleset": 1,
"type": 3,
"min_size": 3,
"max_size": 20,
"steps": [
{ "op": "take",
"item": -1,
"item_name": "default"},
{ "op": "chooseleaf_indep",
"num": 0,
"type": "host"},
{ "op": "emit"}]}
$ ceph osd getcrushmap > crush.map
got crush map from osdmap epoch 13
$ crushtool -i crush.map --test --show-bad-mappings \
--rule 1 \
--num-rep 9 \
--min-x 1 --max-x $((1024 * 1024))
bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]

Here ``--num-rep`` is the number of OSDs the erasure code crush
ruleset needs and ``--rule`` is the value of the ``ruleset`` field
displayed by ``ceph osd crush rule dump``. The test will try mapping
one million values (i.e. the range defined by ``[--min-x,--max-x]``)
and must display at least one bad mapping. If it outputs nothing it
means all mappings are successful and you can stop right there: the
problem is elsewhere.
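
If you prefer not to pick the ``ruleset`` number out of the full dump by
eye, it can be extracted with standard shell tools (a sketch; adjust the
rule name to your pool)::

    ceph osd crush rule dump erasurepool | grep '"ruleset"'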

The crush ruleset can be edited by decompiling the crush map::

$ crushtool --decompile crush.map > crush.txt

and adding the following line to the ruleset::

step set_choose_tries 100

The relevant part of the ``crush.txt`` file should look something
like::

rule erasurepool {
ruleset 1
type erasure
min_size 3
max_size 20
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default
step chooseleaf indep 0 type host
step emit
}

It can then be compiled and tested again::

$ crushtool --compile crush.txt -o better-crush.map

When all mappings succeed, a histogram of the number of tries that
were necessary to find all of them can be displayed with the
``--show-choose-tries`` option of ``crushtool``::

$ crushtool -i better-crush.map --test --show-bad-mappings \
--show-choose-tries \
--rule 1 \
--num-rep 9 \
--min-x 1 --max-x $((1024 * 1024))
...
11: 42
12: 44
13: 54
14: 45
15: 35
16: 34
17: 30
18: 25
19: 19
20: 22
21: 20
22: 17
23: 13
24: 16
25: 13
26: 11
27: 11
28: 13
29: 11
30: 10
31: 6
32: 5
33: 10
34: 3
35: 7
36: 5
37: 2
38: 5
39: 5
40: 2
41: 5
42: 4
43: 1
44: 2
45: 2
46: 3
47: 1
48: 0
...
102: 0
103: 1
104: 0
...

It took 11 tries to map 42 PGs, 12 tries to map 44 PGs, and so on. The highest number of tries is the minimum value of ``set_choose_tries`` that prevents bad mappings (i.e. 103 in the above output, because no PG required more than 103 tries to be mapped).
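
Once a satisfying value of ``set_choose_tries`` has been found and the
modified map tests clean, it can be injected back into the cluster (a
sketch, assuming the file name used above)::

    ceph osd setcrushmap -i better-crush.map

The previously incomplete PGs should then peer and eventually become
``active+clean``.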

.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
.. _here: ../../configuration/pool-pg-config-ref
6 changes: 4 additions & 2 deletions src/crush/CrushWrapper.cc
@@ -923,15 +923,17 @@ int CrushWrapper::add_simple_ruleset(string name, string root_name,
   }
   int steps = 3;
   if (mode == "indep")
-    steps = 4;
+    steps = 5;
   int min_rep = mode == "firstn" ? 1 : 3;
   int max_rep = mode == "firstn" ? 10 : 20;
   //set the ruleset the same as rule_id(rno)
   crush_rule *rule = crush_make_rule(steps, rno, rule_type, min_rep, max_rep);
   assert(rule);
   int step = 0;
-  if (mode == "indep")
+  if (mode == "indep") {
     crush_rule_set_step(rule, step++, CRUSH_RULE_SET_CHOOSELEAF_TRIES, 5, 0);
+    crush_rule_set_step(rule, step++, CRUSH_RULE_SET_CHOOSE_TRIES, 100, 0);
+  }
   crush_rule_set_step(rule, step++, CRUSH_RULE_TAKE, root, 0);
   if (type)
     crush_rule_set_step(rule, step++,
5 changes: 5 additions & 0 deletions src/crush/CrushWrapper.h
@@ -709,6 +709,11 @@ class CrushWrapper {
     ruleno = crush_add_rule(crush, n, ruleno);
     return ruleno;
   }
+  int set_rule_mask_max_size(unsigned ruleno, int max_size) {
+    crush_rule *r = get_rule(ruleno);
+    if (IS_ERR(r)) return -1;
+    return r->mask.max_size = max_size;
+  }
   int set_rule_step(unsigned ruleno, unsigned step, int op, int arg1, int arg2) {
     if (!crush) return -ENOENT;
     crush_rule *n = get_rule(ruleno);
3 changes: 3 additions & 0 deletions src/crush/mapper.c
@@ -644,6 +644,9 @@ static void crush_choose_indep(const struct crush_map *map,
 				out2[rep] = CRUSH_ITEM_NONE;
 			}
 		}
+	if (map->choose_tries && ftotal <= map->choose_total_tries)
+		map->choose_tries[ftotal]++;
+
 #ifdef DEBUG_INDEP
 	if (out2) {
 		printf("%u %d a: ", ftotal, left);
4 changes: 3 additions & 1 deletion src/erasure-code/isa/ErasureCodeIsa.cc
@@ -55,8 +55,10 @@ ErasureCodeIsa::create_ruleset(const string &name,
 
   if (ruleid < 0)
     return ruleid;
-  else
+  else {
+    crush.set_rule_mask_max_size(ruleid, get_chunk_count());
     return crush.get_rule_mask_ruleset(ruleid);
+  }
 }
 
 // -----------------------------------------------------------------------------
4 changes: 3 additions & 1 deletion src/erasure-code/jerasure/ErasureCodeJerasure.cc
@@ -46,8 +46,10 @@ int ErasureCodeJerasure::create_ruleset(const string &name,
 			       "indep", pg_pool_t::TYPE_ERASURE, ss);
   if (ruleid < 0)
     return ruleid;
-  else
+  else {
+    crush.set_rule_mask_max_size(ruleid, get_chunk_count());
     return crush.get_rule_mask_ruleset(ruleid);
+  }
 }
 
 void ErasureCodeJerasure::init(const map<string,string> &parameters)
6 changes: 4 additions & 2 deletions src/erasure-code/lrc/ErasureCodeLrc.cc
@@ -62,9 +62,9 @@ int ErasureCodeLrc::create_ruleset(const string &name,
   }
   ruleset = rno;
 
-  int steps = 3 + ruleset_steps.size();
+  int steps = 4 + ruleset_steps.size();
   int min_rep = 3;
-  int max_rep = 30;
+  int max_rep = get_chunk_count();
   int ret;
   ret = crush.add_rule(steps, ruleset, pg_pool_t::TYPE_ERASURE,
 		       min_rep, max_rep, rno);
@@ -73,6 +73,8 @@
 
   ret = crush.set_rule_step(rno, step++, CRUSH_RULE_SET_CHOOSELEAF_TRIES, 5, 0);
   assert(ret == 0);
+  ret = crush.set_rule_step(rno, step++, CRUSH_RULE_SET_CHOOSE_TRIES, 100, 0);
+  assert(ret == 0);
   ret = crush.set_rule_step(rno, step++, CRUSH_RULE_TAKE, root, 0);
   assert(ret == 0);
   // [ [ "choose", "rack", 2 ],
109 changes: 109 additions & 0 deletions src/test/cli/crushtool/show-choose-tries.t
@@ -0,0 +1,109 @@
$ crushtool -c "$TESTDIR/show-choose-tries.txt" -o "$TESTDIR/show-choose-tries.crushmap"
$ FIRSTN_RULESET=0
$ crushtool -i "$TESTDIR/show-choose-tries.crushmap" --test --show-choose-tries --rule $FIRSTN_RULESET --x 1 --num-rep 2
0: 1
1: 1
2: 0
3: 0
4: 0
5: 0
6: 0
7: 0
8: 0
9: 0
10: 0
11: 0
12: 0
13: 0
14: 0
15: 0
16: 0
17: 0
18: 0
19: 0
20: 0
21: 0
22: 0
23: 0
24: 0
25: 0
26: 0
27: 0
28: 0
29: 0
30: 0
31: 0
32: 0
33: 0
34: 0
35: 0
36: 0
37: 0
38: 0
39: 0
40: 0
41: 0
42: 0
43: 0
44: 0
45: 0
46: 0
47: 0
48: 0
49: 0
$ INDEP_RULESET=1
$ crushtool -i "$TESTDIR/show-choose-tries.crushmap" --test --show-choose-tries --rule $INDEP_RULESET --x 1 --num-rep 1
0: 0
1: 1
2: 0
3: 0
4: 0
5: 0
6: 0
7: 0
8: 0
9: 0
10: 0
11: 0
12: 0
13: 0
14: 0
15: 0
16: 0
17: 0
18: 0
19: 0
20: 0
21: 0
22: 0
23: 0
24: 0
25: 0
26: 0
27: 0
28: 0
29: 0
30: 0
31: 0
32: 0
33: 0
34: 0
35: 0
36: 0
37: 0
38: 0
39: 0
40: 0
41: 0
42: 0
43: 0
44: 0
45: 0
46: 0
47: 0
48: 0
49: 0
$ rm -f "$TESTDIR/show-choose-tries.crushmap"
# Local Variables:
# compile-command: "cd ../../.. ; make -j4 crushtool && test/run-cli-tests"
# End: