Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mon/MDSMonitor: fix segv when multiple MDSs raise same alert #16302

Merged
merged 1 commit into from Jul 17, 2017
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
2 changes: 1 addition & 1 deletion src/mon/MDSMonitor.cc
Expand Up @@ -203,7 +203,7 @@ void MDSMonitor::encode_pending(MonitorDBStore::TransactionRef t)
}
for (const auto &metric : health.metrics) {
int const rank = info.rank;
health_check_t *check = &new_checks.add(
health_check_t *check = &new_checks.get_or_add(
Copy link
Contributor

@tchaikov tchaikov Jul 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this happens. and should in the case of multi-mds deployment.

but if we always overwrite the health.metrics from one mds instance with those from another. why bothering iterating through the fsmap for all mds instances?

in the multi-mds setting, if all MDSes are in sync in the sense that the health data are same across all active MDSes, we can just collect the health check info from one MDS then.

if they are not in sync, we probably need to find out which copy is authoritative.

but either way, i am just not sure we should do the overwrite.

@jcsp @ukernel what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think with Sage's change we are still accumulating the information about each unhealthy MDS in the detail part of the check. The message in the summary part is overwritten, but it would always be the same because it is just determined from the health check code. That message has a %num in it, so when two MDSs raise the same health metric, we'll get e.g. "2 MDSs behind on trimming" as expected.

We do need to do this across all MDSs, because the checks are daemon-local things, that are not necessarily the same across the cluster. e.g. it is quite possible for one MDS to be behind on trimming its journal but the others not to be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks John!

mds_metric_name(metric.type),
metric.sev,
mds_metric_summary(metric.type));
Expand Down
8 changes: 8 additions & 0 deletions src/mon/health_check.h
Expand Up @@ -100,6 +100,14 @@ struct health_check_map_t {
r.summary = summary;
return r;
}
health_check_t& get_or_add(const std::string& code,
health_status_t severity,
const std::string& summary) {
health_check_t& r = checks[code];
r.severity = severity;
r.summary = summary;
return r;
}

void merge(const health_check_map_t& o) {
for (auto& p : o.checks) {
Expand Down