mon/MDSMonitor: fix segv when multiple MDSs raise same alert #16302

Merged
merged 1 commit into ceph:master from liewegas:wip-mds-dup-alerts on Jul 17, 2017

Conversation

3 participants
@liewegas
Member

liewegas commented Jul 12, 2017

Signed-off-by: Sage Weil <sage@redhat.com>

mon/MDSMonitor: fix segv when multiple MDSs raise same alert
Signed-off-by: Sage Weil <sage@redhat.com>

@liewegas liewegas requested a review from tchaikov Jul 17, 2017

@@ -203,7 +203,7 @@ void MDSMonitor::encode_pending(MonitorDBStore::TransactionRef t)
     }
     for (const auto &metric : health.metrics) {
       int const rank = info.rank;
-      health_check_t *check = &new_checks.add(
+      health_check_t *check = &new_checks.get_or_add(
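
For context, here is a minimal standalone sketch of why this one-word change matters. These are simplified stand-ins, not Ceph's actual types from src/mon/health_check.h, and the crash is modeled with an assert rather than the real segfault; the MDS_TRIM code and %num summary are taken from the discussion below:

```cpp
#include <cassert>
#include <iostream>
#include <list>
#include <map>
#include <string>

// Simplified stand-ins for Ceph's health check types (assumption: the real
// definitions differ in detail).
struct health_check_t {
  std::string severity;
  std::string summary;
  std::list<std::string> detail;
};

struct health_check_map_t {
  std::map<std::string, health_check_t> checks;

  // add() assumes the code is not present yet; a second MDS raising the
  // same alert violates that assumption (modeled here as an assert).
  health_check_t& add(const std::string& code,
                      const std::string& severity,
                      const std::string& summary) {
    assert(checks.count(code) == 0);
    health_check_t& c = checks[code];
    c.severity = severity;
    c.summary = summary;
    return c;
  }

  // get_or_add() returns the existing entry if the code was already raised,
  // so per-MDS detail keeps accumulating under one check.
  health_check_t& get_or_add(const std::string& code,
                             const std::string& severity,
                             const std::string& summary) {
    health_check_t& c = checks[code];  // default-constructs if absent
    c.severity = severity;
    c.summary = summary;  // same template either way, so overwriting is fine
    return c;
  }
};

int main() {
  health_check_map_t new_checks;
  for (int rank : {0, 1}) {  // two MDS ranks raising the same metric
    health_check_t* check = &new_checks.get_or_add(
        "MDS_TRIM", "MDS_HEALTH_WARN", "%num MDSs behind on trimming");
    check->detail.push_back("mds." + std::to_string(rank) +
                            " is behind on trimming");
  }
  std::cout << new_checks.checks["MDS_TRIM"].detail.size()
            << " detail entries under one check\n";  // prints 2
  return 0;
}
```

With add() in the loop, the second MDS raising the same alert code would hit the duplicate-insert path and crash; get_or_add() makes the second insertion a lookup instead.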


@tchaikov

tchaikov Jul 17, 2017

Contributor

Yeah, this happens, and it should in a multi-MDS deployment.

But if we always overwrite the health.metrics from one MDS instance with those from another, why bother iterating through the fsmap for all MDS instances?

In the multi-MDS setting, if all MDSes are in sync, in the sense that the health data are the same across all active MDSes, we could just collect the health check info from one MDS.

If they are not in sync, we probably need to find out which copy is authoritative.

Either way, I am just not sure we should do the overwrite.

@jcsp @ukernel what do you think?


@jcsp

jcsp Jul 17, 2017

Contributor

I think with Sage's change we are still accumulating the information about each unhealthy MDS in the detail part of the check. The message in the summary part is overwritten, but it would always be the same, because it is determined solely by the health check code. That message has a %num in it, so when two MDSs raise the same health metric, we'll get e.g. "2 MDSs behind on trimming" as expected.

We do need to do this across all MDSs, because the checks are daemon-local things that are not necessarily the same across the cluster. For example, it is quite possible for one MDS to be behind on trimming its journal but the others not to be.
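
To make the %num expansion concrete, here is a hedged sketch of the rendering step. The %num token placement mirrors the comment above, but the render() helper and exact formatting are illustrative assumptions, not Ceph's actual API:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Expand a count placeholder in a summary template. This helper is an
// illustrative assumption, not Ceph's exact formatting code.
std::string render(std::string summary, size_t num) {
  const std::string token = "%num";
  auto pos = summary.find(token);
  if (pos != std::string::npos)
    summary.replace(pos, token.size(), std::to_string(num));
  return summary;
}

int main() {
  // Per-daemon details accumulated while iterating the fsmap.
  std::vector<std::string> detail = {
      "mds.a is behind on trimming",
      "mds.b is behind on trimming",
  };
  // The summary template is fixed by the health check code, so it is the
  // same string no matter which MDS wrote it last.
  std::string summary = "%num MDSs behind on trimming";
  std::cout << render(summary, detail.size()) << "\n";  // "2 MDSs behind on trimming"
  for (const auto& d : detail)
    std::cout << "    " << d << "\n";
  return 0;
}
```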


@tchaikov

tchaikov Jul 17, 2017

Contributor

thanks John!

@jcsp

jcsp approved these changes Jul 17, 2017


@tchaikov tchaikov removed the needs-review label Jul 17, 2017

@tchaikov tchaikov merged commit 2b3adf7 into ceph:master Jul 17, 2017

4 checks passed

Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
make check: make check succeeded
make check (arm64): make check succeeded

@liewegas liewegas deleted the liewegas:wip-mds-dup-alerts branch Jul 17, 2017

@jcsp jcsp added the cephfs label Jul 18, 2017
