Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crush: add devices class that rules can use as a filter #13444

Merged
18 commits merged into from Mar 8, 2017

Conversation

Projects
None yet
2 participants
@ghost
Copy link

ghost commented Feb 15, 2017

@ghost ghost added core feature labels Feb 15, 2017

@ghost ghost self-assigned this Feb 15, 2017

@@ -108,7 +108,7 @@ struct crush_grammar : public grammar<crush_grammar>
tunable = str_p("tunable") >> name >> posint;

// devices
device = str_p("device") >> posint >> name;
device = str_p("device") >> posint >> name >> *( str_p("class") >> name );

This comment has been minimized.

Copy link
@liewegas

liewegas Feb 15, 2017

Member

i think * should be ! (* = 0 or more, ! = 0 or 1)

This comment has been minimized.

Copy link
@ghost

ghost Feb 15, 2017

Author

done

if (verbose) err << " class " << " '" << c << "'" << std::endl;
} else {
if (verbose) err << std::endl;
}

This comment has been minimized.

Copy link
@liewegas

liewegas Feb 15, 2017

Member

maybe add the decompile case to this patch too, so that a decompile -> compile cycle will not lose informaiton?

This comment has been minimized.

Copy link
@ghost

ghost Feb 15, 2017

Author

The compilation phase still needs to duplicate the crush trees that the de-compilation will need to guess device classes are in use. It is my understanding that the encoded crush map does not know anything about device classes. It is inferred from the naming of the buckets that contains a special character. Am I going in the wrong direction ?

*
* @returns the size of __buckets__ on success, < 0 on error
*/
extern int crush_find_roots(struct crush_map *map, int **buckets);

This comment has been minimized.

Copy link
@liewegas

liewegas Feb 15, 2017

Member

void find_roots(set& roots) const;
already exists in CrushWrapper

This comment has been minimized.

Copy link
@liewegas

liewegas Feb 15, 2017

Member

if the intention is to have a C implementation, we should consolidate so that that one calls this one.

This comment has been minimized.

Copy link
@ghost

ghost Feb 15, 2017

Author

I had no intention, I just did not look... thanks :-)

@liewegas

This comment has been minimized.

Copy link
Member

liewegas commented Feb 15, 2017

@ghost

This comment has been minimized.

Copy link
Author

ghost commented Feb 15, 2017

I'm unclear how we can add to the existing crushmap encoding. Is there versioning that I don't see ? Or is it via feature bits ?

@liewegas

This comment has been minimized.

Copy link
Member

liewegas commented Feb 15, 2017

@ghost

This comment has been minimized.

Copy link
Author

ghost commented Feb 15, 2017

and what happens when the new code reads an old crushmap ?

@liewegas

This comment has been minimized.

Copy link
Member

liewegas commented Feb 15, 2017

@liewegas

This comment has been minimized.

Copy link
Member

liewegas commented Feb 15, 2017

@ghost

This comment has been minimized.

Copy link
Author

ghost commented Feb 15, 2017

got it, the code speaks for itself indeed. I got confused but now it's clear :-)

@ghost

This comment has been minimized.

Copy link
Author

ghost commented Feb 15, 2017

Should we create a tree for each root and each device class ? Or is it enough to create a tree for each bucket that is referenced by a TAKE rule step ?

@ghost

This comment has been minimized.

Copy link
Author

ghost commented Feb 15, 2017

updated crush: parse "class XXX" after device to compile/decompile as suggested

@ghost

This comment has been minimized.

Copy link
Author

ghost commented Feb 16, 2017

@liewegas 988a8ab is a rough draft implementation. It would be great if you could take a quick look and see if something is obviously going in the wrong direction. Here is what it does:

  • on finalize() create a new bucket tree for each root and each device class, with buckets named like name~deviceclass
  • the bucket tree cloning function mirrors the hierarchy and only keeps devices of the specified class
  • after the bucket tree have been cloned per device class, the rules that have a take operator referencing a device class are modified to reference the new bucket tree (this is a little trickier than it should be since the ruleno is exposed to the user and the modified rule must take the place of the original rule)
  • finalize() must be called prior to encode to ensure the generated buckets reflect the last modifications (this is for OSDMonitor that copies crush maps with encode/decode, other callers know to call finalize after modifying the crush map)

To ensure the crush map consistency, finalize() removes all generated rules / buckets before creating them.

err << " class " << class_name << std::endl;
}

crush.set_rule_step_take(ruleno, step++, item_id[item], c);

This comment has been minimized.

Copy link
@liewegas

liewegas Feb 16, 2017

Member

I think ehre we actaully want to map item to the parallel per-class tree.

We could also pass c into the rule step since it's ignored, but i'd avoid doing that until we have a reason to use it within the rule processing itself..

// Use ri instead of i here: we need the ruleno of the generated
// rule because it is exposed to the user. The original rule is
// given another ruleno, never exposed to the user.
out << "\truleset " << crush.get_rule_mask_ruleset(ri) << "\n";

This comment has been minimized.

Copy link
@liewegas

liewegas Feb 16, 2017

Member

do we actually need to generate new rules? i think the original rule can simply use the generated item for the take call, and when decompiled, the fact that the item is generated will tell us to add the 'class xxx' argument to the take step.

This comment has been minimized.

Copy link
@ghost

ghost Feb 16, 2017

Author

good idea, I'll try that, thanks !

@ghost

This comment has been minimized.

Copy link
Author

ghost commented Feb 18, 2017

@liewegas I'm happy about this series but it's still fresh and I may have overlooked a few things. Hopefully not something big but I'm not yet 100% sure.

I think the goal for this pull request should be to implement and document the "class XXX" for devices & step take only. At this point the users would have no control and would not even be able to see the modified rule step or the generated buckets. But that new feature will save them from the pain of maintaining two parallel trees. Displaying meaningful information about the generated trees via ceph osd crush tree etc. could be implemented afterwards.

@liewegas

This comment has been minimized.

Copy link
Member

liewegas commented Feb 20, 2017

Somewhere we also need a per-bucket map of class -> id so that, given a bucket x, we can find the derivative for a given class. That's needed so that when we decompile we can supplement the 'id NNN' line with 'class-id ' (or similar) so that the derivative bucket ids are preserved across decompile->compile.

If that's in place, I think some of the other parts would be simpler:

  • after compiling the buckets, we can generate per-class buckets for all roots and all classes, fully populating the per-bucket class->id map.
  • the rule compile can then just do that translation as it goes, without the map of ruleno -> step -> class

?

original->hash,
original->type,
0, NULL, NULL);
if(copy == NULL)

This comment has been minimized.

Copy link
@liewegas

liewegas Feb 20, 2017

Member

whitespace

@ghost

This comment has been minimized.

Copy link
Author

ghost commented Feb 20, 2017

the derivative bucket ids are preserved across decompile->compile

Since the derivative bucket id is never exposed, why would we need to preserve it ?

@liewegas

This comment has been minimized.

Copy link
Member

liewegas commented Feb 21, 2017

@ghost

This comment has been minimized.

Copy link
Author

ghost commented Feb 21, 2017

@liewegas I overlooked that, thanks for explaining.

@ghost

This comment has been minimized.

Copy link
Author

ghost commented Feb 21, 2017

@liewegas this is hopefully fixed and it indeed makes things simpler in some places (see c2ab0c9 for the draft implementation, passes manual tests). The syntax is a little different from what you proposed originally to be consistent with the rest:

host host1 {
	id -2		# do not change unnecessarily
	id -7 class ssd		# do not change unnecessarily
	id -12 class hdd		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item device1 weight 1.000
}

It would be great if you could review this draft before I clean it up, in case something significant is missing :-)

@liewegas
Copy link
Member

liewegas left a comment

This looks right the right appraoch to me!

@@ -603,6 +647,9 @@ int CrushCompiler::parse_bucket(iter_t const& i)
//err << "assigned id " << id << std::endl;
}

for (auto &i : class_id)
crush.class_bucket[id][i.first] = i.second;

This comment has been minimized.

Copy link
@liewegas

liewegas Feb 21, 2017

Member

crush.class_bucket[id].swap(class_id);

This comment has been minimized.

Copy link
@ghost

ghost Feb 21, 2017

Author

cool

This comment has been minimized.

Copy link
@ghost

ghost Feb 27, 2017

Author

we don't want to replace all of crush.class_bucket[id], just add to it.

if (crush.class_bucket.count(id) == 0) {
err << "in rule '" << rname << "' step take " << item
<< " has no class information" << std::endl;
return -EINVAL;

This comment has been minimized.

Copy link
@liewegas

liewegas Feb 21, 2017

Member

this is a bug, right? we should have ids for all classes and all buckets at this stage?

This comment has been minimized.

Copy link
@ghost

ghost Feb 21, 2017

Author

yes, it could be an assert

@ghost

This comment has been minimized.

Copy link
Author

ghost commented Feb 21, 2017

@liewegas I thought about how things will be with this implementation from the user perspective (human & not human). Since it's not transparent, ceph osd crush tree will show everything, including the generated buckets & modified rules. That does not strike me as a problem. However the user may be in a position to do a number of things that lead to inconsistency using the ceph osd crush... commands.

For the osd & ceph clients who are not aware of the device classes, all should be ok since the crush map is encoded with all the generated buckets & substituted rule steps. Decoding does not expect anything new.

Does that seem right ?

@ghost

This comment has been minimized.

Copy link
Author

ghost commented Feb 27, 2017

jenkins test this please (log expired)

@ghost ghost added the wip-loic-testing label Feb 27, 2017

@ghost ghost changed the title WIP: crush device classes crush device classes Feb 27, 2017

int c;
int res = crush.split_id_class(step_item, &original_item, &c);
if (res < 0)
return res;

This comment has been minimized.

Copy link
@liewegas

liewegas Feb 27, 2017

Member

if res < 0, doesn't that mean we didn't split, and this is a normal bucket with no class specified?

This comment has been minimized.

Copy link
@liewegas

liewegas Feb 27, 2017

Member

ah, res < 0 is a real error, nevermind.

@@ -668,6 +671,8 @@ int CrushCompiler::parse_bucket(iter_t const& i)

int CrushCompiler::parse_rule(iter_t const& i)
{
crush.populate_classes();

This comment has been minimized.

Copy link
@liewegas

liewegas Feb 27, 2017

Member

maybe it's better to instead put this in parse_crush, guarded by a bool did_first_rule; or similar, so it's only done once before the first rule (and at the end if there were no rules).

This comment has been minimized.

Copy link
@ghost

ghost Feb 28, 2017

Author

although uncommon the buckets may be defined in between rules

      // the whole crush map
      crushmap = *(tunable | device | bucket_type) >> *(bucket | crushrule);
@@ -4551,6 +4551,7 @@ int OSDMonitor::crush_rename_bucket(const string& srcname,
return ret;

pending_inc.crush.clear();
newcrush.cleanup_classes();
newcrush.encode(pending_inc.crush, mon->get_quorum_con_features());

This comment has been minimized.

Copy link
@liewegas

liewegas Feb 27, 2017

Member

These seem very fragile because the map is unusable after this. (Same with decompile_crush(), although there it's rebuilt so just wasted work). Perhaps instead it's better to make encode smart enough to skip all the extra buckets?

This comment has been minimized.

Copy link
@ghost

ghost Feb 28, 2017

Author

Contrary to the first draft implementation I did a week ago, cleanup_classes only removes the buckets that are completely unreferenced and it does not remove the generated buckets that are created by populate_classes and referenced by a rule. I realized that doing so would indeed lead to a non usable encoded crushmap which would not only be fragile but also impose that all client also implement the device class feature.

The cleanup_classes here removes any left overs but the encoded crushmap is still fully functional and compatible with pre-device-classes clients.

It still is fragile because someone could forget to cleanup_classes before calling encode. But the worst that happens in this case is that the crushmap is bloated with unused buckets. And they are likely to be trimmed the next time cleanup_classes is called elsewhere.

My first impulse was to add a call to cleanup_classes in encode. But the function is const and it did not seem right to make it non const for that purpose.

That being said I'm moderately happy with this solution even though it does not seem to be a serious threat and I'd welcome a suggestion to make it better.

This comment has been minimized.

Copy link
@ghost

ghost Feb 28, 2017

Author

Hum. The

      cleanup_classes();
      populate_classes();

in CrushWrapper::decode should actually be

      cleanup_classes();

and we can get rid of mots the other calls in OSDMonitor.

case a) A non device classes aware client gets a crushmap from a device class aware mon, modify it, encode it, send it back to a device class aware mon. OK

case b) A non device class aware client gets a crushmap from a device aware mon including generated buckets, and calls a non device class aware crushtool to decompile it. It will fail to recompile because of the special bucket names. A better error message suggesting an upgrade should be displayed in this case.

case C) A device class aware client sends a crushmap to a non device class aware mon (maybe because it is the leader of a cluster where an upgrade is in progress) the mon will be able to use the crushmap exactly as a device class aware mon would. Only it will never trim unused buckets with special names.

The crush.populate_classes(); in CrushCompiler::decompile should be removed.
The crush.populate_classes(); in CrushCompiler::parse_crush should really be cleanup_classes since populate_classes() is called before parsing a rule.
All cleanup_classes from OSDMonitor should be removed since it is called by decode anyway.
The cleanup_classes from crushtool.cc should be removed since it is is called from CrushCompiler::parse_crush and from CrushCompiler::decompile

I'll do these changes if they makes sense to you. Or if it's all too confusing and implementing them will help clarify ;-)

@liewegas

This comment has been minimized.

Copy link
Member

liewegas commented Feb 28, 2017

@ghost

This comment has been minimized.

Copy link
Author

ghost commented Feb 28, 2017

Perhaps we can limit the callers to just finalize(), decode(), and compile() then? Would that work?

I'll try that, I think it's the way to go.

@ghost

This comment has been minimized.

Copy link
Author

ghost commented Feb 28, 2017

jenkins test this please (ceph_objectstore_tool.py failed)

@ghost

This comment has been minimized.

Copy link
Author

ghost commented Feb 28, 2017

I reworked the "crush: update device classes where relevant" commit and it's now much simpler. Also updated the comment of this commit to explain how it works and why it is compatible with existing clients in detail.

ldachary added some commits Feb 15, 2017

crush: parse "class XXX" after device
Refs: http://tracker.ceph.com/issues/18943

Signed-off-by: Loic Dachary <ldachary@redhat.com>
crush: item_exists is const
Signed-off-by: Loic Dachary <ldachary@redhat.com>
crush: _bucket_is_in_use does not use cct
Signed-off-by: Loic Dachary <ldachary@redhat.com>
crush: emacs comment is obsolete
Signed-off-by: Loic Dachary <ldachary@redhat.com>
crush: parse "class XXX" after step take
Refs: http://tracker.ceph.com/issues/18943

Signed-off-by: Loic Dachary <ldachary@redhat.com>
crush: implement remove_unused_root
Refs: http://tracker.ceph.com/issues/18943

Signed-off-by: Loic Dachary <ldachary@redhat.com>
crush: implement device_class_clone
Refs: http://tracker.ceph.com/issues/18943

Signed-off-by: Loic Dachary <ldachary@redhat.com>
crush: get_item_weight must not fail on root
Signed-off-by: Loic Dachary <ldachary@redhat.com>
crush: implement split_id_class
Refs: http://tracker.ceph.com/issues/18943

Signed-off-by: Loic Dachary <ldachary@redhat.com>
crush: parse "class XXX" after bucket id
Refs: http://tracker.ceph.com/issues/18943

Signed-off-by: Loic Dachary <ldachary@redhat.com>
crush: implement trim_roots_with_class
Refs: http://tracker.ceph.com/issues/18943

Signed-off-by: Loic Dachary <ldachary@redhat.com>
crush: implement {populate,cleanup}_classes
Refs: http://tracker.ceph.com/issues/18943

Signed-off-by: Loic Dachary <ldachary@redhat.com>
crush: a bucket is also busy if a derived bucket uses it
The step take of a rule may reference a bucket that is cloned
from the original one, with the device class added. In this
case both buckets are busy because they are taken by the rule.

Refs: http://tracker.ceph.com/issues/18943

Signed-off-by: Loic Dachary <ldachary@redhat.com>
crush: update device classes where relevant
The device classes are implemented by modifying:

- the argument of step TAKE in rules
- cloning bucket trees when required by a rule step

This happens (via populate_classes):

- before compiling a rule step TAKE

When the crush map is encoded, the device class information is stored
with it, independently from the rules and the buckets, as a map of
classes for each device & bucket and a map of classes for each rule step
TAKE.

The extra buckets created but not used by any rule do not need to be
preserved and they are removed (via cleanup_classes):

- before decompilation
- after compilation
- after decoding

The client and daemons that are not aware of the device classes are
compatible because the crushmap modified with the new buckets is fully
functional. The invalid names used in the for the generated
buckets (bucket~class) can be CrushWrapper::decode by any existing
client because there is no verification of the name validity during
decoding. It can also be CrushWrapper::dump or CrushCompiler::decompile
via ceph osd dump or crushtool. It cannot, however, be compiled again
because CrushCompiler::compile will try to set the name with
CrushWrapper::set_item_name and it will fail with EINVAL because of the
~.

Fixes: http://tracker.ceph.com/issues/18943

Signed-off-by: Loic Dachary <ldachary@redhat.com>
crush: do not erase non existent name
Signed-off-by: Loic Dachary <ldachary@redhat.com>
crush: clear bucket class ids when removed
Refs: http://tracker.ceph.com/issues/18943

Signed-off-by: Loic Dachary <ldachary@redhat.com>
crush: document the class qualifier
The class qualifier is documented for devices and step take
only. Although it shows in the buckets and could be set by the user, it
would be very error prone to do so.

Refs: http://tracker.ceph.com/issues/18943

Signed-off-by: Loic Dachary <ldachary@redhat.com>
crush: device classes mini cluster sanity check
Run a mini cluster and verify that modifying the crushmap,
sending it to the cluster produces the intended effect.

Refs: http://tracker.ceph.com/issues/18943

Signed-off-by: Loic Dachary <ldachary@redhat.com>
@ghost

This comment has been minimized.

Copy link
Author

ghost commented Mar 2, 2017

teuthology-suite -k distro --priority 1000 --suite rados --subset $(expr $RANDOM % 200)/200 --email loic@dachary.org --ceph wip-18943-device-class --machine-type smithi

Re-running failed jobs

Unrelated failures

@ghost ghost merged commit 7590322 into ceph:master Mar 8, 2017

3 checks passed

Signed-off-by all commits in this PR are signed
Details
Unmodifed Submodules submodules for project are unmodified
Details
default Build finished.
Details
@liewegas

This comment has been minimized.

Copy link
Member

liewegas commented Mar 8, 2017

This issue was closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.