mon: don't kill MDSs unless some beacons are getting through #15308

Merged
merged 5 commits into from Jun 15, 2017

3 participants
@jcsp
Contributor

jcsp commented May 26, 2017

In the absence of a more scientific way to track how long a mon has been holding up messages in between ticks, this is hopefully a catch-all way to ensure that the MDSMonitor itself is never killing MDSs unless there are some beacons getting through.
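The guard described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `BeaconTracker`, `any_recent_beacon`, `is_laggy`, `should_replace`, and the `grace` and `window` parameters are all hypothetical names, and the exact cutoff the monitor uses is an assumption here. The idea is that a daemon whose own beacon is stale is only replaced if some other beacon has arrived recently, which suggests the mon is processing messages and the daemon really is the laggy party.

```cpp
#include <algorithm>
#include <chrono>
#include <map>
#include <string>

using Clock = std::chrono::steady_clock;
using Seconds = std::chrono::duration<double>;

// Hypothetical stand-in for the monitor's per-daemon beacon bookkeeping.
struct BeaconTracker {
  std::map<std::string, Clock::time_point> last_beacon;

  // Remember when the most recent beacon from `daemon` was processed.
  void record(const std::string& daemon, Clock::time_point when) {
    last_beacon[daemon] = when;
  }

  // True if at least one daemon's beacon was processed within `window` of `now`.
  bool any_recent_beacon(Clock::time_point now, Seconds window) const {
    return std::any_of(last_beacon.begin(), last_beacon.end(),
                       [&](const auto& kv) { return now - kv.second <= window; });
  }

  // True if `daemon` is known and its last beacon is older than `grace`.
  bool is_laggy(const std::string& daemon, Clock::time_point now,
                Seconds grace) const {
    auto it = last_beacon.find(daemon);
    return it != last_beacon.end() && now - it->second > grace;
  }

  // Only act on a laggy daemon when some beacons are getting through;
  // otherwise the mon itself may be the one holding up messages.
  bool should_replace(const std::string& daemon, Clock::time_point now,
                      Seconds grace, Seconds window) const {
    return any_recent_beacon(now, window) && is_laggy(daemon, now, grace);
  }
};
```

With this shape, a mon that has processed no beacons at all for a while sees every daemon as laggy but replaces none of them, which is the catch-all behaviour the description asks for.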

@@ -1895,7 +1907,8 @@ void MDSMonitor::maybe_replace_gid(mds_gid_t gid,
       info.state != MDSMap::STATE_STANDBY_REPLAY &&
       !pending_fsmap.get_filesystem(fscid)->mds_map.test_flag(CEPH_MDSMAP_DOWN) &&
       (sgid = pending_fsmap.find_replacement_for({fscid, info.rank}, info.name,
-                g_conf->mon_force_standby_active)) != MDS_GID_NONE)
+                g_conf->mon_force_standby_active)) != MDS_GID_NONE
+      && may_replace)

@ukernel

ukernel May 30, 2017

Member

I think it's better to check may_replace before calling pending_fsmap.find_replacement_for(). The rest of the change looks good.

@jcsp

jcsp May 31, 2017

Contributor

thanks, updated
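To illustrate why the ordering raised in this review thread matters: `&&` short-circuits, so placing the cheap boolean first means the standby lookup (and the assignment to `sgid` embedded in the condition) never runs when replacement is not allowed. The sketch below is a standalone illustration under assumed names; `StandbyPool`, `find_replacement`, `pick_replacement`, and `kGidNone` are hypothetical stand-ins, not the Ceph types.

```cpp
#include <cstdint>

using Gid = std::uint64_t;
constexpr Gid kGidNone = 0;  // hypothetical sentinel standing in for MDS_GID_NONE

// Hypothetical pool of standby daemons that counts how often it is searched.
struct StandbyPool {
  int lookups = 0;
  Gid next_standby = 42;

  Gid find_replacement() {
    ++lookups;  // stands in for the cost of scanning the map for a standby
    return next_standby;
  }
};

// Guard first: when may_replace is false, && short-circuits, so the lookup
// is skipped and sgid is left untouched.
bool pick_replacement(StandbyPool& pool, bool may_replace, Gid& sgid) {
  return may_replace && (sgid = pool.find_replacement()) != kGidNone;
}
```

Had the guard been the last operand, the result of the condition would be the same, but the lookup would still run and `sgid` would still be overwritten even when no replacement is going to happen.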

@batrick

Member

batrick commented May 30, 2017

Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>

Makes sense to me.

jcsp added some commits May 25, 2017

mon: emit cluster log messages on daemon failures

Two reasons:
 - usability
 - make our tests fail on the WRN message if
   a daemon unexpectedly bounces during a test like in
   http://tracker.ceph.com/issues/19706

Signed-off-by: John Spray <john.spray@redhat.com>

mon: don't kill MDSs unless some beacons are getting through

...to avoid killing MDSs when the MDS is fine, but the mon
is failing to process beacons due to a laggy peer.

Fixes: http://tracker.ceph.com/issues/19706
Signed-off-by: John Spray <john.spray@redhat.com>

qa: update log whitelist for new MDSMonitor messages

Signed-off-by: John Spray <john.spray@redhat.com>

qa: fix daemon restart between tests

Previously, calling mds_stop without mds_fail meant
that if the filesystem creation was not quick, then
we would see those daemons go laggy.  This starts
to trigger failures now that we have cluster log
messages that fire when a daemon gets failed out
due to being laggy.

Signed-off-by: John Spray <john.spray@redhat.com>

qa: whitelist MDS restarts when thrashing

Signed-off-by: John Spray <john.spray@redhat.com>

@jcsp jcsp merged commit 18fbf24 into ceph:master Jun 15, 2017


@jcsp jcsp deleted the jcsp:wip-19706 branch Jun 15, 2017
