Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mon/OSDMonitor: Reset grace period if failure interval exceeds a threshold. #35490

Merged
merged 1 commit into from Jun 26, 2020

Conversation

sseshasa
Copy link
Contributor

@sseshasa sseshasa commented Jun 8, 2020

Reset the grace hearbeat timer if there have been no failures since the
set threshold value (48 Hrs). The mon_osd_laggy_halflife value is
leveraged to calculate the threshold.

A couple of helper functions do the following

  • get_grace_interval_threshold():
    Calculates and returns the grace interval threshold value.
  • grace_interval_threshold_exceeded(int):
    Checks if grace interval threshold is exceeded based on the last
    down stamp.
  • set_default_laggy_params(int):
    Resets the laggy_probability and laggy_interval in the
    new_xinfo structure maintained within pending_inc to be applied
    eventually as part of update from paxos.

The threshold value is checked and the laggy parameters are reset at the
following point,

  • encode_pending() - If an existing osd that is experiencing failure
    after an interval exceeding the grace threshold period.

Fixes: https://tracker.ceph.com/issues/45943
Signed-off-by: Sridhar Seshasayee sseshasa@redhat.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard backend
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@sseshasa
Copy link
Contributor Author

sseshasa commented Jun 8, 2020

@jdurgin @neha-ojha PTAL. Please check if the approach to reset the grace interval is acceptable. I will take some more time to test this. Please suggest/add reviewers that you think are necessary. Thanks!

Copy link
Member

@jdurgin jdurgin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the approach looks good to me, will take a closer look later this week

@sseshasa sseshasa force-pushed the wip-grace-interval-timer-reset branch from c961a71 to f51e08d Compare June 15, 2020 13:26
if (grace_interval_threshold_exceeded(last_failure)) {
set_default_laggy_params(target_osd);
}
}
double halflife = (double)g_conf()->mon_osd_laggy_halflife;
decay_k = ::log(.5) / halflife;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double-checked the math and the half-life decay should be effective, assuming the timestamps are accurate -- decay is 0.5 after one halflife, 0.25 after 2, etc. If the failure interval is inaccurate here, that would explain the extra grace period not going away. Are you still seeing more MOSDFailures sent after the osd is brought back in your testing?

Similarly, in the code setting the interval on boot, if the issue you're seeing with down_stamp not updating is present, the interval would always be set to mon_osd_max_laggy_interval, which would explain why rebooting was not helping the user who ran into this.

Your approach here of resetting things entirely is a nice fallback to avoid any further bugs like this.

Copy link
Contributor Author

@sseshasa sseshasa Jun 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jdurgin Apparently my previous fix was introduced in the wrong place. From my investigation and testing, the correct place to check for the threshold breach is when an osd is marked down and its state is being updated/encoded within the pending
incremental map.

When the osd is marked down, its state is set to CEPH_OSD_UP in
OSDMonitor::check_failure(). Eventually, when the incremental map is being encoded before applying it, a check is made to verify if the grace period has exceeding the set
threshold if the new state is set to CEPH_OSD_UP prior to resetting the laggy params.
The down_stamp for the osd is also updated as part of the reset.

@sseshasa sseshasa force-pushed the wip-grace-interval-timer-reset branch from f51e08d to 49ad6c6 Compare June 22, 2020 04:52
Copy link
Member

@neha-ojha neha-ojha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few nits, otherwise looks good.

src/mon/OSDMonitor.cc Outdated Show resolved Hide resolved
src/mon/OSDMonitor.cc Outdated Show resolved Hide resolved
src/mon/OSDMonitor.cc Outdated Show resolved Hide resolved
src/mon/OSDMonitor.cc Outdated Show resolved Hide resolved
@sseshasa sseshasa force-pushed the wip-grace-interval-timer-reset branch from 49ad6c6 to 0339eef Compare June 23, 2020 06:35
@sseshasa sseshasa force-pushed the wip-grace-interval-timer-reset branch from 0339eef to dfb24f1 Compare June 23, 2020 19:34
@sseshasa sseshasa changed the title mon/OSDMonitor: Reset grace hearbeat period if it exceeds a threshold. mon/OSDMonitor: Reset grace period if failure interval exceeds a threshold. Jun 23, 2020
@neha-ojha
Copy link
Member

retest this please

@sseshasa sseshasa force-pushed the wip-grace-interval-timer-reset branch from dfb24f1 to ca39db7 Compare June 24, 2020 10:25
…shold.

Reset the grace hearbeat period if there have been no failures since the
set threshold value (48 Hrs). The mon_osd_laggy_halflife value is
leveraged to calculate the threshold.

A couple of helper functions do the following:
 - get_grace_interval_threshold():
    Calculates and returns the grace interval threshold value.
 - grace_interval_threshold_exceeded(int):
    Checks if grace interval threshold is exceeded based on the last
    down stamp.
 - set_default_laggy_params(int):
     Resets the laggy_probability and laggy_interval in the
     new_xinfo structure maintained within pending_inc to be applied
     eventually as part of update from paxos.

The threshold value is checked and the laggy parameters are reset at the
following point,
 - encode_pending() - If an existing osd is experiencing failure
   after an interval exceeding the failure threshold period.

Fixes: https://tracker.ceph.com/issues/45943
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
@sseshasa
Copy link
Contributor Author

Pulpito Run:
http://pulpito.ceph.com/sseshasa-2020-06-24_17:46:09-rados-wip-sseshasa-testing-2020-06-24-1858-distro-basic-smithi/

There were 14 failures out of 295 tests. Checked all the failed ones and confirmed that none of them are related to this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants