New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mgr: improved module loading for error reporting etc #19235
Conversation
i am still reviewing it. |
src/messages/MMgrBeacon.h
Outdated
@@ -119,18 +92,11 @@ class MMgrBeacon : public PaxosServiceMessage { | |||
return command_descs; | |||
} | |||
|
|||
std::set<std::string> get_available_modules() const | |||
const std::vector<MgrMap::ModuleInfo> &get_available_modules() const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, trailing space.
DECODE_FINISH(p); | ||
} | ||
|
||
bool have_module(const std::string &module_name) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, could use std::find()
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have to use a find_if because we're comparing the name member rather than the whole object, but yes.
src/mon/MgrMap.h
Outdated
for (const auto &i : available_modules) { | ||
old_available_modules.insert(i.name); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
old_available_modules
is not encoded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, fixed.
src/messages/MMgrBeacon.h
Outdated
::decode(available_modules, p); | ||
std::set<std::string> module_name_list; | ||
::decode(module_name_list, p); | ||
// Only need to unpack this field if we won't have the full |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we still need to read and ignore it as long as header.version >= 3
. because the peer does send it to us even if header.version >= 7
, as long as the compact_version < 7
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's what the code is doing? The ::decode
is happening under the >=3
condition, and the usage of the value is happening under the <7
condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, i must have been reading another commit i made up in my mind. =(
src/mgr/ActivePyModule.h
Outdated
{} | ||
|
||
int load(ActivePyModules *py_modules); | ||
void notify(const std::string ¬ify_type, const std::string ¬ify_id); | ||
void notify_clog(const LogEntry &le); | ||
|
||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not remove this method if it's not used anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/mgr/DaemonServer.cc
Outdated
"command '" << prefix << "'): use `ceph mgr module enable " | ||
<< handler_name << "` to enable it"; | ||
dout(4) << ss.str() << dendl; | ||
// FIXME: ENODEV is weird but it's not like there's an errno for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is EOPNOTSUPP
a better alternative? TBH, i don't know what errno we can use here =)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like EOPNOTSUPP, let's use that.
src/mgr/DaemonServer.cc
Outdated
bool allow_failed_run = (self_test_prefix == prefix); | ||
|
||
// Validate that the module is healthy | ||
if ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, this is difficult to read, how about
if (module->is_loaded()) {
// bypass the test for already loaded module
} else if (module->get_can_run() && !module->is_failed()) {
// healthy module
} else if (self_test_prefix == prefix) {
// run self-test method on unhealthy module
} else {
// ...reply with errno
return;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That block doesn't quite capture the case where we fail a command for !is_loaded(), but I agree that it was hard to read so I've refactored it slightly differently, and taken the opportunity to differentiate between the load error and the runtime error in the messages.
src/mgr/PyModuleRegistry.h
Outdated
static void list_modules(std::set<std::string> *modules); | ||
void list_modules(std::set<std::string> *modules); | ||
|
||
void get_modules(std::list<PyModuleRef> *modules_out) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just return the list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
although not part of this PR the same question applies to the implementation of list_modules()
and get_health_checks()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed this now -- the pointer argument was probably just me following what was there
@@ -181,8 +269,8 @@ class MgrMap | |||
} | |||
f->close_section(); | |||
f->open_array_section("available_modules"); | |||
for (auto& j : available_modules) { | |||
f->dump_string("module", j); | |||
for (const auto& j : available_modules) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this method is marked const
, the loop variable will be const
also.
@@ -234,9 +322,18 @@ class MgrMap | |||
m.print_summary(nullptr, &ss); | |||
return out << ss.str(); | |||
} | |||
|
|||
friend ostream& operator<<(ostream& out, const std::vector<ModuleInfo>& mi) { | |||
for (const auto &i : mi) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, could use
std::copy(begin(mi), end(mi), ceph::make_ostream_joiner(out, ' '));
to avoid the trailing space in the output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it's an array of ModuleInfo (not strings), it's kind of awkward to do something like that (iirc this method is only used in debug output)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh, right!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
src/mgr/DaemonServer.cc
Outdated
std::stringstream ss; | ||
|
||
// Validate that the module is enabled | ||
PyModuleRef module = py_modules.get_module(handler_name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: trailing ws
src/messages/MMgrBeacon.h
Outdated
@@ -87,6 +92,11 @@ class MMgrBeacon : public PaxosServiceMessage { | |||
return command_descs; | |||
} | |||
|
|||
const std::vector<MgrMap::ModuleInfo> &get_available_modules() const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: trainling ws
src/mgr/PyModule.cc
Outdated
} | ||
} | ||
return 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: trainling ws
src/mgr/PyModuleRegistry.cc
Outdated
@@ -484,3 +251,99 @@ void PyModuleRegistry::list_modules(std::set<std::string> *modules) | |||
_list_modules(g_conf->get_val<std::string>("mgr_module_path"), modules); | |||
} | |||
|
|||
int PyModuleRegistry::handle_command( | |||
std::string const &module_name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: consistent position of const keyword
src/mgr/PyModuleRegistry.cc
Outdated
|
||
for (const auto &i : modules) { | ||
auto module = i.second; | ||
if (module->is_enabled() && (!module->get_can_run())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: superfluous parenthesis around !module->get_can_run()
and again two lines below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, we could remove them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/mgr/PyModuleRegistry.h
Outdated
static void list_modules(std::set<std::string> *modules); | ||
void list_modules(std::set<std::string> *modules); | ||
|
||
void get_modules(std::list<PyModuleRef> *modules_out) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
although not part of this PR the same question applies to the implementation of list_modules()
and get_health_checks()
d5d5530
to
2cbb4aa
Compare
src/messages/MMgrBeacon.h
Outdated
::decode(available_modules, p); | ||
std::set<std::string> module_name_list; | ||
::decode(module_name_list, p); | ||
// Only need to unpack this field if we won't have the full |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, i must have been reading another commit i made up in my mind. =(
src/messages/MMgrBeacon.h
Outdated
|
||
// Fill out old-style list of module names (deprecated by | ||
// later field of full ModuleInfos) | ||
std::vector<std::string> module_names; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, might need to use std::set<>
. yeah, i know std::vector<>
encodes just the same way as std::set<>
. but better off being more consistent. what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure - done
accept_command = false; | ||
ss << "Module '" << handler_name << "' failed to load and " | ||
"cannot handle commands: " << module->get_error_string(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, much better!
src/mgr/PyModuleRegistry.cc
Outdated
|
||
for (const auto &i : modules) { | ||
auto module = i.second; | ||
if (module->is_enabled() && (!module->get_can_run())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, we could remove them.
src/mgr/PyModuleRegistry.h
Outdated
* Get references to all modules (whether they have loaded and/or | ||
* errored) or not. | ||
*/ | ||
std::list<PyModuleRef> get_modules() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, could mark this method const
also.
src/mon/MgrMap.h
Outdated
std::set<std::string> old_available_modules; | ||
for (const auto &i : available_modules) { | ||
old_available_modules.insert(i.name); | ||
::encode(old_available_modules, bl); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think we can encode a std::set<>
this way, we would have some extra incomplete available_modules
encoded here before the complete version of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, oops, this should have been outside the loop. fixed.
@@ -234,9 +322,18 @@ class MgrMap | |||
m.print_summary(nullptr, &ss); | |||
return out << ss.str(); | |||
} | |||
|
|||
friend ostream& operator<<(ostream& out, const std::vector<ModuleInfo>& mi) { | |||
for (const auto &i : mi) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh, right!
ee51ca5
to
391a9cf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
c4b5180
to
0f41a0d
Compare
retest this please (jenkins) (the docs build check appears to be a spurious failure downloading a dependency) |
|
||
class PyModule | ||
{ | ||
mutable Mutex lock{"PyModule::lock"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might need use a different name for every instance of PyModule
, otherwise lockdep will be annoyed. see http://pulpito.ceph.com/kchai-2018-01-07_09:47:26-rados-wip-kefu-testing-2018-01-07-1345-distro-basic-mira/2039373/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm not sure if this is because of this lock: we're getting a floating point exception in lockdep_register rather than a proper lockdep error. The backtrace looks like it's happening somewhere being called back from the python code during import (possibly construction of librados/libcephfs bits?). I'm investigating.
Update: I'm having a hard time with this lockdep issue. It seems that the call that's crashing is a lockdep_register call while the dashboard module is in The lockdep globals should be separate for the dynamic libraries (librados) vs. the ceph-mgr binary itself, but when I put in some debug prints to print the address of a global like lockdep_mutex, the address in the librados lockdep calls is the same as from the main binary lockdep calls. The only real change I can see here is the order of events: we are loading the python modules much earlier than we otherwise did. Usually lockdep failures are somewhat deterministic, so it's strange that this was happening in teuthology and not during startup of a mgr when I'm running it locally. I tried to reproduce it by restarting mgr daemons in a loop (using mgr fail), and the daemons do indeed crash eventually, but without a backtrace in the log file. However, the crash is right after "Loading module dashboard", so it is probable that it's the same crash. That intermittent failure would support the hypothesis that this is some kind of timing issue that has been exacerbated by loading modules earlier than we did before. The crash when restarting daemons in a loop is also reproducible in master. |
f259641
to
387c981
Compare
So, judging from this run: http://pulpito.ceph.com/jspray-2018-01-15_12:01:39-rados-wip-mgr-can-run-distro-basic-smithi/ ...putting in a hack to avoid librados being imported by the dashboard module at import time makes the lockdep crashes go away. At this point I'm reasonably sure that the changes in this PR itself aren't really wrong, but there's something subtle going on with lockdep vs. dynamic libs here that I'm not getting. |
564b783
to
ef875a1
Compare
agreed. i checked the build tested by http://pulpito.ceph.com/kchai-2018-01-07_09:47:26-rados-wip-kefu-testing-2018-01-07-1345-distro-basic-mira/2039373/, it does not contain 02a720b, that's why ceph-mgr had two copies of lockdeps table and other global locks in libceph-common dynamically linked by librados and libcommon statically linked by ceph-mgr. in short, the issue was fixed in master =) |
retest this please |
Woohoo! |
I had retested the failures recently (after rebase on master) here: http://pulpito.ceph.com/jspray-2018-01-18_15:07:23-rados-wip-mgr-can-run-distro-basic-smithi/ and the lockdep issue was gone, although there is now an upgrade test failing with "failed to decode message of type 1796 v1: buffer::malformed_input: void MgrMap::ModuleInfo::decode(ceph::buffer::list::iterator&) decode past end of struct encoding" ...so I need to fix that. |
This is to enable us to learn more about the module before it is enabled, such as whether its can_run method return true. We can also use this to enable loading a module's commands before it is enabled, to give the user a better response when they try to use a command belong to a module that is not loaded. Signed-off-by: John Spray <john.spray@redhat.com>
...and transmit the result to the monitor in our beacon. Fixes: http://tracker.ceph.com/issues/21502 Signed-off-by: John Spray <john.spray@redhat.com>
...and for all modules, not just the active ones. This enables us to give better feedback to the user when they try and use a command from a disabled module, and also fixes the race between enabling a module and trying to use its commands. Fixes: http://tracker.ceph.com/issues/21683 Signed-off-by: John Spray <john.spray@redhat.com>
Signed-off-by: John Spray <john.spray@redhat.com>
Signed-off-by: John Spray <john.spray@redhat.com>
Enable people to see can_run failures and the explanatory messages (telling them about a missing dependency) before trying to enable a module. Signed-off-by: John Spray <john.spray@redhat.com>
Signed-off-by: John Spray <john.spray@redhat.com>
list_modules is really about searching for them on disk, so it's now probe_modules and private. Both methods now return values instead of populating an argument, since when called they were always writing into a newly constructed container. Signed-off-by: John Spray <john.spray@redhat.com>
Previously only a service with debug_ms>=1 would dump corrupt messages: in an upgrade test we're *alway* interested in a corrupt message. Signed-off-by: John Spray <john.spray@redhat.com>
Chased down that final issue, now passing: http://pulpito.ceph.com/jspray-2018-01-25_08:42:56-rados-wip-mgr-can-run-distro-basic-smithi/ |
can_run
method to module interface to enable reporting missing dependencies (http://tracker.ceph.com/issues/21502), and expose health check if any can_run=false modules are enabled.There is scope for some follow on work after this:
Happy to have feedback about some of the naming etc, I've gone back and forth about about whether the
can_run
bit should be called something with "dependency" in the name to make it more obvious.