Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fs: Mantle: A Programmable Metadata Load Balancer #10887

Merged
merged 3 commits into from Oct 26, 2016
Merged

Conversation

@michaelsevilla
Copy link
Contributor

@michaelsevilla michaelsevilla commented Aug 26, 2016

@smithfarm smithfarm added the feature label Aug 28, 2016
@smithfarm smithfarm changed the title Mantle: A Programmable Metadata Load Balancer fs: Mantle: A Programmable Metadata Load Balancer Aug 28, 2016

#define dout_subsys ceph_subsys_mds
#define dout_subsys ceph_subsys_mds_balancer

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

I'm not sure we want this: iirc the --debug-mds flag isn't a prefixish thing, so this change will silence all the balancer and migrator logs for people who currently set --debug-mds

@@ -330,7 +330,7 @@ COMMAND("mds set_max_mds " \
"set max MDS index", "mds", "rw", "cli,rest")
COMMAND("mds set " \
"name=var,type=CephChoices,strings=max_mds|max_file_size"
"|allow_new_snaps|inline_data|allow_multimds|allow_dirfrags " \
"|allow_new_snaps|inline_data|allow_multimds|allow_dirfrags|balancer " \

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

Please add this to "fs set" instead of "mds set" -- mds set is the legacy mode that operates only on the default filesystem, fs set is the new hotness which lets you work on any filesystem

}

/* execute the balancer */
Mantle *mantle = new Mantle();

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

Might as well just do a "Mantle mantle" and avoid the pointer/deletion?

std::pair < map<mds_rank_t, mds_load_t>::iterator, bool > r(mds_load.insert(val));
mds_load_t &load(r.first->second);

metrics[i].insert(make_pair("auth.meta_load", load.auth.meta_load()));

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

Might be tidier to use a C++11 map literal here if possible

int MDBalancer::localize_balancer(string const balancer)
{
int64_t pool_id = mds->mdsmap->get_metadata_pool();
string fname = "/tmp/" + balancer;

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

Does it have to go into a /tmp/ file? I would expect these scripts are usually small enough to painlessly hold in memory after we've read it from rados

object_locator_t oloc(pool_id);
bufferlist data;
C_SaferCond waiter;
mds->objecter->read(oid, oloc, 0, 0, CEPH_NOSNAP, &data, 0, &waiter);

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

Ouch, this is a blocking OSD read that's done from inside the mds_lock when this gets called from mantle_prep_rebalance. The MDS will die horribly if the OSDs are not responsive at this moment.

I think it would be better to kick off the read when handling the mdsmap (if the balancer has changed), and populate the script in the background when the read completes. If mantle_prep_rebalance runs before that read has completed it can either no-op or fall back to the native balancer.

This comment has been minimized.

@gregsfortytwo

gregsfortytwo Sep 9, 2016
Member

I don't think we can do daemon-local fallbacks like that. The default balancer assumes everybody else will come to basically the same conclusions as the local instance (about which daemons are paired for export/import, etc).
I haven't looked at the code but I presume we'll need to error out from rebalancing and put big fat warnings in the clog/Beacon if anything goes wrong.


int r = mantle_prep_rebalance();
if (!r) {
mds->clog->info() << "mantle succeeded; "

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

Maybe the success message could just go to the MDS log instead of the cluster log? This will get spammy for anyone who likes to "ceph -w"

@@ -487,6 +487,7 @@ void MDSMap::encode(bufferlist& bl, uint64_t features) const
::encode(session_autoclose, bl);
::encode(max_file_size, bl);
::encode(max_mds, bl);
::encode(balancer, bl);

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

This is the legacy encoding for clients too old to handle versioned things, don't add balancer here

@@ -515,6 +516,7 @@ void MDSMap::encode(bufferlist& bl, uint64_t features) const
::encode(session_autoclose, bl);
::encode(max_file_size, bl);
::encode(max_mds, bl);
::encode(balancer, bl);

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

Same: legacy encoding, leave it out

@@ -551,6 +553,7 @@ void MDSMap::encode(bufferlist& bl, uint64_t features) const
::encode(session_autoclose, bl);
::encode(max_file_size, bl);
::encode(max_mds, bl);
::encode(balancer, bl);

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

When you add an item to an encoded representation, you have to add it at the end (i.e. after "damaged" in this case) -- the way older receivers handle newer messages is to ignore trailing bytes

@@ -542,7 +544,7 @@ void MDSMap::encode(bufferlist& bl, uint64_t features) const
return;
}

ENCODE_START(5, 4, bl);

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

You need to bump the "ev" below from 10 to 11 instead of modifying this version (the mdsmap encoding is weird!)

@@ -583,7 +586,7 @@ void MDSMap::decode(bufferlist::iterator& p)
std::map<mds_rank_t,int32_t> inc; // Legacy field, parse and drop

cached_up_features = 0;
DECODE_START_LEGACY_COMPAT_LEN_16(5, 4, 4, p);
DECODE_START_LEGACY_COMPAT_LEN_16(6, 4, 4, p);

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

Leave this alone, and when you read out 'balancer' at the very end make it conditional on ev>=11

@@ -289,6 +290,9 @@ class MDSMap {
mds_rank_t get_max_mds() const { return max_mds; }
void set_max_mds(mds_rank_t m) { max_mds = m; }

std::string get_balancer() const { return balancer; }

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

Could return a const reference here

@@ -1,4 +1,5 @@
# There are no libmds_types so use the full mds library for dencoder for now
DENCODER_SOURCES += $(LIBMDS_SOURCES)
ceph_dencoder_CPPFLAGS = $(LIBLUA_CPPFLAGS)

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

autotools is now gone in master so this part can just be removed

@@ -193,6 +193,7 @@ class MDSMap {
*/

mds_rank_t max_mds; /* The maximum number of active MDSes. Also, the maximum rank. */
string balancer; /* The name and version of the metadata load balancer. */

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

I guess it's the name and version if someone named their objects that way, but really this is just a rados object name, right?

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

Please add this field to MDSMap::dump (both plain text and json)

#include <lua.hpp>
#include <list>
#include <map>
using std::list;

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

Please avoid using directives in headers (we do already have some but I'd rather not add more)




class MDSRank;

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

This block of class statements looks to be unneeded

@@ -0,0 +1,34 @@
#include <lua.hpp>

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

please add #ifdef header guards and the usual modelines+copyright header (copy paste and put your name in). Same in Mantle.cc

@@ -0,0 +1,235 @@
Mantle

This comment has been minimized.

@jcsp

jcsp Sep 9, 2016
Contributor

Wow, I wish every PR came with this much documentation!

@jcsp
Copy link
Contributor

@jcsp jcsp commented Sep 9, 2016

This looks to be exactly what was discussed at the last CDM, so all of my comments are just detail things. At a high level: 👍

Does #7338 need poking along somehow?

@michaelsevilla
Copy link
Contributor Author

@michaelsevilla michaelsevilla commented Sep 9, 2016

I already poked #7338 and @noahdesu is going to take care of it.

Dumb question: is it better to rebase (keep history clean but lose your comments) or just add more commits?

@jcsp
Copy link
Contributor

@jcsp jcsp commented Sep 9, 2016

You can either add more commits (and rebase later before merge) or just do a straight history-squashing rebase.

Usually we leave comments on the "files changed" view (not individual commits) so that they survive a rebase, but I mistakenly put some on commits here -- my fault if they get lost in a rebase.

@gregsfortytwo
Copy link
Member

@gregsfortytwo gregsfortytwo commented Sep 9, 2016

Luckily Github finally fixed that nonsense, at least as long as you navigate from the PR view into the commit (not sure if you just go straight to a commit). You can tell it's working because all the comments are on the diff, not on a commit. ;)

Personally I prefer commits prefixed with "SQUASH" that are intended to be rebase, as it makes life easier on the reviewer when there are follow-up comments, but that may just be me.

@michaelsevilla michaelsevilla force-pushed the michaelsevilla:mantle branch from 3c7c8ff to 5f67c19 Oct 3, 2016
@gregsfortytwo
Copy link
Member

@gregsfortytwo gregsfortytwo commented Oct 18, 2016

I see we did get some updated commits at some point that at least invalidated most comments. @jcsp, want to check this again? Be nice to get it in for @batrick's scaling tests.

@jcsp
Copy link
Contributor

@jcsp jcsp commented Oct 24, 2016

This is good to go now from my point of view (apologies @michaelsevilla for delay). Please could you do a quick rebase as something else has touched CMakeLists.txt in the meantime

Introduces Mantle, a programmable metadata load balancer. Policies for making
migration decisions are written in Lua but the Migrator and Balancer modules
still do fragmentation and migration. If the Lua balancer fails, control falls
back to the original balancer implementation.

Signed-off-by: Michael Sevilla <mikesevilla3@gmail.com>
- add docs and sample balancer (greedy-spill)

Signed-off-by: Michael Sevilla <mikesevilla3@gmail.com>
- fix legacy encoding in mds map and add balancer to dumps
- fix blocking rados read and remove temporary files
- fix beacon message spamming

Signed-off-by: Michael Sevilla <mikesevilla3@gmail.com>
@michaelsevilla michaelsevilla force-pushed the michaelsevilla:mantle branch from 5f67c19 to 5cc43cf Oct 25, 2016
@michaelsevilla
Copy link
Contributor Author

@michaelsevilla michaelsevilla commented Oct 26, 2016

Done -- @batrick: could you send me info on the scaling tests?

@batrick
Copy link
Member

@batrick batrick commented Oct 26, 2016

@jcsp jcsp merged commit 32b15e3 into ceph:master Oct 26, 2016
2 checks passed
2 checks passed
Signed-off-by all commits in this PR are signed
Details
default Build finished.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can’t perform that action at this time.