mon/OSDMonitor: Reset grace period if failure interval exceeds a threshold. #35490
Conversation
@jdurgin @neha-ojha PTAL. Please check if the approach to reset the grace interval is acceptable. I will take some more time to test this. Please suggest/add reviewers that you think are necessary. Thanks!
the approach looks good to me, will take a closer look later this week
Force-pushed from c961a71 to f51e08d
src/mon/OSDMonitor.cc
Outdated
  if (grace_interval_threshold_exceeded(last_failure)) {
    set_default_laggy_params(target_osd);
  }
}
double halflife = (double)g_conf()->mon_osd_laggy_halflife;
decay_k = ::log(.5) / halflife;
double-checked the math and the half-life decay should be effective, assuming the timestamps are accurate -- decay is 0.5 after one halflife, 0.25 after 2, etc. If the failure interval is inaccurate here, that would explain the extra grace period not going away. Are you still seeing more MOSDFailures sent after the osd is brought back in your testing?
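For reference, a minimal standalone sketch of that decay math (the one-hour half-life below is illustrative, not the cluster's actual config):

#include <cmath>
#include <cstdio>

int main() {
  const double halflife = 3600.0;  // illustrative: 1 hr half-life
  const double decay_k = std::log(0.5) / halflife;
  for (int n = 0; n <= 3; ++n) {
    const double t = n * halflife;  // elapsed seconds
    std::printf("after %d half-life(s): %.3f\n", n, std::exp(decay_k * t));
  }
  // prints 1.000, 0.500, 0.250, 0.125 -- matching the decay described above
  return 0;
}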
Similarly, in the code setting the interval on boot, if the issue you're seeing with down_stamp not updating is present, the interval would always be set to mon_osd_max_laggy_interval, which would explain why rebooting was not helping the user who ran into this.
Your approach here of resetting things entirely is a nice fallback to avoid any further bugs like this.
@jdurgin Apparently my previous fix was introduced in the wrong place. From my investigation and testing, the correct place to check for the threshold breach is when an osd is marked down and its state is being updated/encoded within the pending incremental map.
When the osd is marked down, its state is set to CEPH_OSD_UP in OSDMonitor::check_failure(). Eventually, when the incremental map is encoded before being applied, and if the new state is set to CEPH_OSD_UP, a check verifies whether the grace period has exceeded the set threshold prior to resetting the laggy params.
The down_stamp for the osd is also updated as part of the reset.
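To make the placement concrete, here is a simplified sketch of the check described above, assuming it runs while encode_pending() walks the pending state map (the loop shape and the way last_failure is derived from down_stamp are illustrative, not the exact diff):

// Inside OSDMonitor::encode_pending() -- sketch only.
for (const auto& [osd, state_xor] : pending_inc.new_state) {
  // new_state holds XOR masks, so a set CEPH_OSD_UP bit means the
  // osd's UP flag is being flipped off, i.e. it is being marked down.
  if (state_xor & CEPH_OSD_UP) {
    // Assumed: interval since the osd's last failure, from its down_stamp.
    const osd_xinfo_t& xi = osdmap.osd_xinfo[osd];
    int last_failure = ceph_clock_now().sec() - xi.down_stamp.sec();
    if (grace_interval_threshold_exceeded(last_failure)) {
      set_default_laggy_params(osd);  // also refreshes the osd's down_stamp
    }
  }
}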
Force-pushed from f51e08d to 49ad6c6
A few nits, otherwise looks good.
Force-pushed from 49ad6c6 to 0339eef
Force-pushed from 0339eef to dfb24f1
retest this please
Force-pushed from dfb24f1 to ca39db7
mon/OSDMonitor: Reset grace period if failure interval exceeds a threshold.

Reset the grace heartbeat period if there have been no failures since the set threshold value (48 hrs). The mon_osd_laggy_halflife value is leveraged to calculate the threshold.

A couple of helper functions do the following:

- get_grace_interval_threshold(): Calculates and returns the grace interval threshold value.
- grace_interval_threshold_exceeded(int): Checks if the grace interval threshold is exceeded based on the last down stamp.
- set_default_laggy_params(int): Resets the laggy_probability and laggy_interval in the new_xinfo structure maintained within pending_inc, to be applied eventually as part of the update from paxos.

The threshold value is checked and the laggy parameters are reset at the following point:

- encode_pending() - If an existing osd is experiencing failure after an interval exceeding the failure threshold period.

Fixes: https://tracker.ceph.com/issues/45943
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Force-pushed from ca39db7 to 9f1d4c1
Pulpito Run: There were 14 failures out of 295 tests. Checked all the failed ones and confirmed that none of them are related to this change.
Reset the grace heartbeat timer if there have been no failures since the
set threshold value (48 hrs). The mon_osd_laggy_halflife value is
leveraged to calculate the threshold.
A couple of helper functions (sketched after this description) do the following:
- get_grace_interval_threshold(): Calculates and returns the grace interval threshold value.
- grace_interval_threshold_exceeded(int): Checks if the grace interval threshold is exceeded based on the last down stamp.
- set_default_laggy_params(int): Resets the laggy_probability and laggy_interval in the new_xinfo structure maintained within pending_inc, to be applied eventually as part of the update from paxos.
The threshold value is checked and the laggy parameters are reset at the following point:
- encode_pending() - If an existing osd is experiencing failure after an interval exceeding the grace threshold period.
Fixes: https://tracker.ceph.com/issues/45943
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
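As a rough illustration (not necessarily the exact upstream code) of how the helpers described above could hang together -- the 48x scaling factor and the exact field updates are assumptions:

int OSDMonitor::get_grace_interval_threshold() const {
  // Scale the laggy half-life (default 1 hr) to derive the threshold;
  // a factor of 48 yields the 48-hr value mentioned above (assumed).
  constexpr int grace_threshold_factor = 48;
  return g_conf()->mon_osd_laggy_halflife * grace_threshold_factor;
}

bool OSDMonitor::grace_interval_threshold_exceeded(int last_failed_interval) const {
  return last_failed_interval > get_grace_interval_threshold();
}

void OSDMonitor::set_default_laggy_params(int target_osd) {
  // Stage the reset in pending_inc so it is applied via paxos.
  osd_xinfo_t& xi = pending_inc.new_xinfo[target_osd];
  xi.down_stamp = pending_inc.modified;  // refresh the down stamp
  xi.laggy_probability = 0.0;            // back to "not laggy" defaults
  xi.laggy_interval = 0;
}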
Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard backend
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox