mon: support min_down_reporter counted by subtree level
In many cases, OSDs on an isolated node (public/cluster connection lost, but the
osd<->mon connection still good) will report other OSDs down to the monitor,
which usually causes healthy OSDs to be wrongly marked down.

Today "mon_osd_min_down_reporters" counts individual OSDs. We would like to extend
its semantics so that reporters can be counted by host or rack. Users can then
require failure reports from at least two nodes before an OSD is marked down, which
should prevent an isolated host from making trouble for the cluster.

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
xiaoxichen819 committed Nov 30, 2015
1 parent 97affaa commit bcb8f36
Showing 2 changed files with 19 additions and 4 deletions.
3 changes: 2 additions & 1 deletion src/common/config_opts.h
@@ -276,7 +276,8 @@ OPTION(mon_sync_debug_leader, OPT_INT, -1) // monitor to be used as the sync lea
OPTION(mon_sync_debug_provider, OPT_INT, -1) // monitor to be used as the sync provider
OPTION(mon_sync_debug_provider_fallback, OPT_INT, -1) // monitor to be used as fallback if sync provider fails
OPTION(mon_inject_sync_get_chunk_delay, OPT_DOUBLE, 0) // inject N second delay on each get_chunk request
-OPTION(mon_osd_min_down_reporters, OPT_INT, 2) // number of OSDs who need to report a down OSD for it to count
+OPTION(mon_osd_min_down_reporters, OPT_INT, 2) // number of OSDs from different subtrees who need to report a down OSD for it to count
+OPTION(mon_osd_reporter_subtree_level , OPT_STR, "host") // in which level of parent bucket the reporters are counted
OPTION(mon_osd_force_trim_to, OPT_INT, 0) // force mon to trim maps to this point, regardless of min_last_epoch_clean (dangerous, use with care)
OPTION(mon_mds_force_trim_to, OPT_INT, 0) // force mon to trim mdsmaps to this point (dangerous, use with care)

20 changes: 17 additions & 3 deletions src/mon/OSDMonitor.cc
@@ -1635,6 +1635,8 @@ void OSDMonitor::check_failures(utime_t now)

bool OSDMonitor::check_failure(utime_t now, int target_osd, failure_info_t& fi)
{
+set<string> reporters_by_subtree;
+string reporter_subtree_level = g_conf->mon_osd_reporter_subtree_level;
utime_t orig_grace(g_conf->osd_heartbeat_grace, 0);
utime_t max_failed_since = fi.get_failed_since();
utime_t failed_for = now - max_failed_since;
@@ -1663,6 +1665,16 @@ bool OSDMonitor::check_failure(utime_t now, int target_osd, failure_info_t& fi)
for (map<int,failure_reporter_t>::iterator p = fi.reporters.begin();
p != fi.reporters.end();
++p) {
+// get the parent bucket whose type matches with "reporter_subtree_level".
+// fall back to OSD if the level doesn't exist.
+map<string, string> reporter_loc = osdmap.crush->get_full_location(p->first);
+map<string, string>::iterator iter = reporter_loc.find(reporter_subtree_level);
+if (iter == reporter_loc.end()) {
+reporters_by_subtree.insert("osd." + to_string(p->first));
+} else {
+reporters_by_subtree.insert(iter->second);
+}

const osd_xinfo_t& xi = osdmap.get_xinfo(p->first);
utime_t elapsed = now - xi.down_stamp;
double decay = exp((double)elapsed * decay_k);
@@ -1685,15 +1697,17 @@ bool OSDMonitor::check_failure(utime_t now, int target_osd, failure_info_t& fi)
return true;
}


if (failed_for >= grace &&
-((int)fi.reporters.size() >= g_conf->mon_osd_min_down_reporters)) {
+(int)reporters_by_subtree.size() >= g_conf->mon_osd_min_down_reporters) {
dout(1) << " we have enough reporters to mark osd." << target_osd
<< " down" << dendl;
pending_inc.new_state[target_osd] = CEPH_OSD_UP;

mon->clog->info() << osdmap.get_inst(target_osd) << " failed ("
-<< (int)fi.reporters.size() << " reporters after "
-<< failed_for << " >= grace " << grace << ")\n";
+<< (int)reporters_by_subtree.size() << " reporters from different "
+<< reporter_subtree_level << " after "
+<< failed_for << " >= grace " << grace << ")\n";
return true;
}
return false;
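
The counting rule introduced in check_failure() can be sketched outside the monitor as follows. This is only an illustration, not the code from the commit: `Location` is a hypothetical stand-in for the `map<string, string>` that `get_full_location()` returns in the diff, and `count_reporters_by_subtree` is a name invented for this example.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for the map returned by get_full_location() in the
// diff: bucket type (e.g. "host", "rack") -> bucket name, for one OSD.
typedef std::map<std::string, std::string> Location;

// Count reporters by the distinct ancestor bucket found at `level`.
// An OSD that has no bucket of that type counts as itself ("osd.<id>"),
// mirroring the fallback in the diff.
// `reporters` maps reporting OSD id -> that OSD's location (assumed input).
int count_reporters_by_subtree(const std::map<int, Location>& reporters,
                               const std::string& level) {
  std::set<std::string> subtrees;
  for (std::map<int, Location>::const_iterator p = reporters.begin();
       p != reporters.end(); ++p) {
    Location::const_iterator it = p->second.find(level);
    if (it == p->second.end()) {
      subtrees.insert("osd." + std::to_string(p->first));
    } else {
      subtrees.insert(it->second);
    }
  }
  return static_cast<int>(subtrees.size());
}
```

Under these assumptions, three reporting OSDs that all sit in the same "host" bucket yield a count of 1, so with mon_osd_min_down_reporters set to 2 an isolated host cannot mark a peer down on its own. If the configured level does not appear in any reporter's location, every OSD counts separately, which matches the old per-OSD behaviour.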

1 comment on commit bcb8f36

@cfanz cfanz commented on bcb8f36 May 12, 2016

This commit doesn't cope with one situation. For example, a physical host may have more than one OSD, some on SATA disks and others on SSDs. If the SATA OSDs and the SSD OSDs are logically organized under different CRUSH map roots, they will be treated as two different reporters even though they do in fact come from the same host.
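
The situation described in this comment can be sketched with a small example. Everything here is hypothetical and for illustration only: the `Reporter` struct, the bucket names "node1-sata" and "node1-ssd", and both helper functions are assumptions, not Ceph code. It assumes the one machine is represented by two separate "host" buckets, one under each root.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical description of one reporting OSD. `host_bucket` is the name of
// the CRUSH bucket of type "host" that contains the OSD; `physical_host` is
// the machine it actually runs on, which the bucket name alone does not reveal.
struct Reporter {
  int osd;
  std::string host_bucket;
  std::string physical_host;
};

// What a count based on bucket names sees: distinct "host" buckets.
int distinct_host_buckets(const std::vector<Reporter>& reporters) {
  std::set<std::string> seen;
  for (std::size_t i = 0; i < reporters.size(); ++i) {
    seen.insert(reporters[i].host_bucket);
  }
  return static_cast<int>(seen.size());
}

// What the operator actually cares about: distinct physical machines.
int distinct_physical_hosts(const std::vector<Reporter>& reporters) {
  std::set<std::string> seen;
  for (std::size_t i = 0; i < reporters.size(); ++i) {
    seen.insert(reporters[i].physical_host);
  }
  return static_cast<int>(seen.size());
}
```

With one SATA OSD in bucket "node1-sata" and one SSD OSD in bucket "node1-ssd", both on machine "node1", the bucket-based count is 2 while the physical count is 1, so an isolated machine could still satisfy a minimum of two reporters.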
