mon/OSDMonitor: drop stale failure_info
failure_info keeps strong references to the MOSDFailure messages sent by osds or peon monitors. Whenever a monitor starts to handle an MOSDFailure message, it registers the message in its OpTracker. The failure report message is unregistered when the monitor acks it, either by canceling it or by replying to the reporters with a new osdmap that marks the target osd down. But if neither happens, the failure reports just pile up in OpTracker, where the monitor considers them slow ops and reports them as a SLOW_OPS health warning.

In theory, it does not take long to mark an unresponsive osd down if we have enough reporters. But there is a chance that a reporter fails to cancel its report before it reboots, and that the monitor also fails to collect enough reports to mark the target osd down. In that case the target osd never gets an osdmap marking it down, so it never sends the alive message to the monitor that would clear the reports.

In this change, we check for stale failure info in tick() and simply drop the stale reports, so the messages can be released and marked "done".

Fixes: https://tracker.ceph.com/issues/47380
Signed-off-by: Kefu Chai <kchai@redhat.com>
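The cleanup described above can be sketched as a periodic sweep over the pending failure reports. This is a minimal C++ sketch under stated assumptions, not the OSDMonitor implementation: the names `FailureReport`, `FailureInfo` and `drop_stale_failures`, the per-target map of reporters, and the single max-age threshold are all hypothetical. The real change may decide staleness differently, and in the monitor each dropped report would also release its tracked MOSDFailure op.

```cpp
#include <chrono>
#include <cstddef>
#include <map>

// Hypothetical, simplified model of the monitor's per-osd failure bookkeeping.
// None of these names are taken from the Ceph source.
using Clock = std::chrono::steady_clock;

struct FailureReport {
  Clock::time_point reported_at;  // when the reporter's failure report arrived
  // the real monitor would also hold a reference to the tracked message here
};

// target osd id -> (reporter id -> report)
using FailureInfo = std::map<int, std::map<int, FailureReport>>;

// Meant to be called periodically (e.g. from a tick() routine): drop every
// report older than max_age and forget targets with no reports left.
// Returns how many reports were dropped, i.e. how many messages could be
// released and marked "done".
std::size_t drop_stale_failures(FailureInfo& failures,
                                Clock::time_point now,
                                Clock::duration max_age) {
  std::size_t dropped = 0;
  for (auto t = failures.begin(); t != failures.end();) {
    auto& reporters = t->second;
    for (auto r = reporters.begin(); r != reporters.end();) {
      if (now - r->second.reported_at > max_age) {
        r = reporters.erase(r);  // stale: the reporter never canceled it
        ++dropped;
      } else {
        ++r;
      }
    }
    if (reporters.empty()) {
      t = failures.erase(t);  // no live reports left for this target osd
    } else {
      ++t;
    }
  }
  return dropped;
}
```

Sweeping on a timer rather than on message arrival matters here: in the stuck case described above, no further message arrives that could trigger the cleanup.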