New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mon/OSDMonitor: fix process osd failure #12938

Merged
merged 2 commits into from Jan 23, 2017

Conversation

Projects
None yet
2 participants
@LiumxNL
Contributor

LiumxNL commented Jan 16, 2017

No description provided.

@LiumxNL

This comment has been minimized.

Contributor

LiumxNL commented Jan 16, 2017

@tchaikov some cleanup, pls review, thanks!

if (ls.front())
mon->no_reply(ls.front());
ls.pop_front();
MonOpRequestRef record_op = fi.cancel_report(reporter);

This comment has been minimized.

@tchaikov

tchaikov Jan 16, 2017

Contributor

if osd failed finally, this may make these reporters cannot receive lastest
update right away,

could you elaborate this a little bit?

This comment has been minimized.

@LiumxNL

LiumxNL Jan 16, 2017

Contributor

although one reporter has canceled fail, but if some more osds report this osd failed after that, and suppose that mon has enough reporter to mark this osd down, then, in process_failures(), it will call take_all_failures() which get all reporter's report_message, only if reporter has recorded report_message mon will send lastest osdmap to that reporter. so if we drop report_message of all another reporters when one cancel report may cause these reporter cannot receive the update right away, although it will finally acknowledge it by peer's sharing osdmap

This comment has been minimized.

@tchaikov

tchaikov Jan 18, 2017

Contributor

note to myself and posterity:

if osd.42 is reported to as "failed" by its peer. and later the peer receives a ping_reply from osd.42 somehow, so it sends a failure message to monitor to cancel the previous failure report. as a side-effect, on the monitor handling this failure report, all report messages from other OSD peers were also erased from the failure_info. but their osd ids and failed_since are kept around in the failure_info. and they will be taken into consideration in OSDMonitor::check_failure() even the report messages are reset.

if osd.42 is finally marked down, all the osds not reverting their failure report will not be updated with the latest osdmap. but we could.

the related change was introduced by ad12b0d, so we don't leak the failure messages before canceling the report. but failure_report and failure_reporter_t are able to take care of their life cycles just fine without the fix.

// calculate failure time
utime_t now = ceph_clock_now(g_ceph_context);
utime_t failed_since =
m->get_recv_stamp() - utime_t(m->failed_for, 0);

This comment has been minimized.

@tchaikov

tchaikov Jan 16, 2017

Contributor

wrong indent.

@tchaikov tchaikov added the mon label Jan 16, 2017

@tchaikov tchaikov self-assigned this Jan 16, 2017

@liewegas liewegas changed the title from OSDMonitor: fix process osd failure to mon/OSDMonitor: fix process osd failure Jan 17, 2017

if (ls.front())
mon->no_reply(ls.front());
ls.pop_front();
MonOpRequestRef record_op = fi.cancel_report(reporter);

This comment has been minimized.

@tchaikov

tchaikov Jan 18, 2017

Contributor

@LiumxNL

nit, could you s/record_op/report_op/? then it's good to qa run.

@tchaikov tchaikov added the needs-qa label Jan 18, 2017

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Jan 18, 2017

modulo the nit, lgtm. just needs to s/record_op/report_op/ before the merge, i think.

LiumxNL added some commits Jan 13, 2017

OSDMonitor: drop report message from all another reporters is not rea…
…sonable

if osd failed finally, this may make these reporters cannot receive lastest
update right away, besides, it's not effective to make a traverse of all reporters

Signed-off-by: Mingxin Liu <mingxin@xsky.com>
OSDMonitor: calculate failure time only when osd reported failed
Signed-off-by: Mingxin Liu <mingxin@xsky.com>
@LiumxNL

This comment has been minimized.

Contributor

LiumxNL commented Jan 18, 2017

@tchaikov updated.

@tchaikov tchaikov merged commit 1e8ca9b into ceph:master Jan 23, 2017

3 checks passed

Signed-off-by all commits in this PR are signed
Details
Unmodifed Submodules submodules for project are unmodified
Details
default Build finished.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment