mon: mark mgr reports as no_reply #21057

tchaikov · 2018-03-27T09:48:14Z

see also: #20517

Fixes: http://tracker.ceph.com/issues/22114
Signed-off-by: Kefu Chai <kchai@redhat.com>

tchaikov · 2018-03-27T09:51:07Z

it should address the failures in http://pulpito.ceph.com/kchai-2018-03-27_08:29:32-rados-wip-kefu-testing-2018-03-27-1407-distro-basic-smithi/

$ zgrep -A2 'DEBUG SLOW OP' remote/smithi020/log/ceph-mon.b.log.gz|grep desc | cut -d: -f2 | sort | uniq
 "monmgrreport(1 checks)",
 "monmgrreport(2 checks)",
$ zgrep -A2 'DEBUG SLOW OP' remote/smithi008/log/ceph-mon.c.log.gz|grep desc | cut -d: -f2 | sort | uniq
 "monmgrreport(0 checks)",
 "monmgrreport(1 checks)",
 "monmgrreport(2 checks)",
 "monmgrreport(3 checks)",

and http://pulpito.ceph.com/kchai-2018-03-27_12:22:51-rados-wip-kefu-testing-2018-03-27-1754-distro-basic-smithi/2328256/

$ zgrep -A2 'DEBUG SLOW OP' remote/smithi008/log/ceph-mon.c.log.gz | grep desc  | cut -d: -f2 | sort | uniq
 "osd_failure(failed immediate osd.0 172.21.15.8


tchaikov · 2018-03-27T16:29:39Z

http://pulpito.ceph.com/kchai-2018-03-27_15:55:56-rados-wip-kefu-testing-2018-03-27-2116-distro-basic-smithi/

tchaikov · 2018-03-28T10:21:53Z

@jecluis @gregsfortytwo ping?

jecluis

lgtm

jecluis · 2018-03-28T10:47:56Z

@tchaikov is the failed job related to this?

edit: upon further inspection, it seems a dashboard test is failing. I asked @rjfd to check whether this is caused by the dashboard or is a side effect of what @tchaikov is fixing.

tchaikov · 2018-03-28T11:01:05Z

@jecluis @rjfd no. the dashboard test failure is due to a missing zstd compressor plugin on the monitor. i have not looked into that issue further, but i am sure it's not caused by the dashboard or my change.

jecluis · 2018-03-28T11:03:45Z

@tchaikov gotcha. thanks

yuriw · 2018-03-28T23:09:27Z

@theanalyst this needs to be backported, see #21016

tchaikov · 2018-03-29T15:05:20Z

strangely enough, we still have SLOW_OPS in http://pulpito.ceph.com/kchai-2018-03-29_13:20:02-rados-wip-slow-mon-ops-kefu-distro-basic-smithi/2334154/ .

tracked at http://tracker.ceph.com/issues/23511

gregsfortytwo · 2018-03-29T20:17:20Z

Given that we keep adding new "no_reply()" markings, maybe we should mark them that way on construction rather than when the monitor has to handle them? Is that possible?

tchaikov · 2018-03-30T00:58:00Z

@gregsfortytwo i am not sure if that's possible, or whether it would be a better way. as i see it, no_reply() is a decision made by the upper layer of the application stack, not by the messenger, where the messages are decoded.
also, some of the messages are encapsulated in an "MForward" message, and it is again the upper layer (Monitor::handle_forward()) that extracts the encapsulated message and dispatches it manually. so we need to do the marking when we handle those particular messages.

or probably i misread you completely.
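
For readers following the thread, the handler-side marking described in this comment can be sketched roughly as below. This is a minimal illustration only: `MsgType`, `Op`, `dispatch`, and `still_waiting` are hypothetical stand-ins invented for this sketch, not Ceph's actual classes or functions, and the model assumes a forwarded op stays pending until it is either replied to or marked.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; these are NOT Ceph's real types.
enum class MsgType { MgrReport, OsdFailure, Command };

struct Op {
  Op(MsgType t, bool fwd) : type(t), forwarded(fwd) {}
  MsgType type;
  bool forwarded;         // true if a peon relayed it to the leader
  bool no_reply = false;  // set by the handler, never by the decoder
};

// Upper-layer dispatch: the handler for each message type decides whether
// the sender is owed a reply, so the marking lives next to the handling code.
inline void dispatch(Op& op) {
  switch (op.type) {
    case MsgType::MgrReport:
    case MsgType::OsdFailure:
      op.no_reply = true;  // fire-and-forget: nothing is sent back
      break;
    case MsgType::Command:
      break;               // a reply is expected, so leave the flag alone
  }
}

// In this toy model, a forwarded op that has not been marked keeps waiting
// (sending the actual reply is not modelled here).
inline bool still_waiting(const Op& op) {
  return op.forwarded && !op.no_reply;
}

inline void handler_side_example() {
  Op report(MsgType::MgrReport, true);
  dispatch(report);
  assert(!still_waiting(report));

  Op command(MsgType::Command, true);
  dispatch(command);
  assert(still_waiting(command));
}
```

The cost of this layout is the one raised in the thread: every new fire-and-forget message type needs its own case in the handling code, and forgetting one produces no obvious failure.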

jecluis · 2018-03-30T01:12:57Z

If @gregsfortytwo was suggesting having the message marked at construction by the sender, then this could be feasible, but I think the monitors have so many exceptions (drop/no_reply vs. handle) that this may be even trickier than addressing individual instances.

It would be nice if we knew exactly which messages are not expecting a reply, so that we could simply mark them as such in their ctor or something. But I'm guessing those would be a small subset of the messages we actually handle as no_reply() [he said, without actually looking at the code].

On the other hand, if we were to identify those that are typically no_reply, and mark them as such by default, and only reply to them in selected cases... then that may help with things a little. However, this doesn't seem a trivial task to accomplish either.

gregsfortytwo · 2018-03-30T05:55:53Z

Yes, I meant what Joao says: the peon monitor could mark them as no-reply, that gets flagged (in the MRoute or similar) and the leader's PaxosService dispatch machinery can be responsible for sending back the blank "handled" message to the peon once the message is resolved.

This might be useless, in that it merely moves responsibility for noticing it from the recipient to the sender. Or it might be easier, since we can identify categories of messages that don't need a response and set it all up in their constructors as an obvious decision the author needs to make, instead of something to maintain in far-off code without obvious failures. (So far, they are all static: MMonMgrReport, MOSDFailure, MOSDPGCreated, MOSDBeacon, and MMgrBeacon.) (Or maybe now that we are noticing slow mon ops we don't care any more.)
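
The constructor-side alternative discussed above might look roughly like the following sketch. All names here (`Message`, `BeaconLike`, `CommandLike`, `ForwardEnvelope`, `leader_should_ack`) are hypothetical and exist only for illustration; nothing below is Ceph's real API, and the envelope is just a stand-in for "flagged in the MRoute or similar".

```cpp
#include <cassert>

// Hypothetical base class: every message type must state in its
// constructor whether the sender expects a reply.
class Message {
 public:
  explicit Message(bool expects_reply) : expects_reply_(expects_reply) {}
  virtual ~Message() = default;
  bool expects_reply() const { return expects_reply_; }

 private:
  bool expects_reply_;
};

// Fire-and-forget type: the decision is made once, by the message's author.
struct BeaconLike : Message {
  BeaconLike() : Message(false) {}
};

// Request type: the sender waits for an answer.
struct CommandLike : Message {
  CommandLike() : Message(true) {}
};

// Hypothetical routing envelope: the peon copies the flag when forwarding.
struct ForwardEnvelope {
  explicit ForwardEnvelope(const Message& m) : no_reply(!m.expects_reply()) {}
  bool no_reply;
};

// Leader side: for no-reply messages, send a blank "handled" ack to the
// peon once the message is resolved, so the peon can drop its record.
inline bool leader_should_ack(const ForwardEnvelope& env) {
  return env.no_reply;
}

inline void ctor_side_example() {
  BeaconLike beacon;
  CommandLike command;
  assert(leader_should_ack(ForwardEnvelope(beacon)));
  assert(!leader_should_ack(ForwardEnvelope(command)));
}
```

Because the base constructor has no default argument, a new message type cannot compile without making the choice explicitly, which is the "obvious decision the author needs to make" aspect of the proposal.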

tchaikov added bug-fix mon labels Mar 27, 2018

tchaikov requested review from jecluis and gregsfortytwo March 27, 2018 09:48

tchaikov mentioned this pull request Mar 27, 2018

luminous: mon: ops get stuck in "resend forwarded message to leader" #21016

Merged

tchaikov mentioned this pull request Mar 27, 2018

mon,mgr: make osd_metric more popular and report slow ops to mgr #20660

Merged

tchaikov added the wip-kefu-testing label Mar 27, 2018

mon: mark mgr reports and osd_failure as no_reply

0daccfb

see also: ceph#20517
Fixes: http://tracker.ceph.com/issues/22114
Signed-off-by: Kefu Chai <kchai@redhat.com>

tchaikov force-pushed the wip-22114 branch from 21e62f1 to 0daccfb Compare March 27, 2018 13:05

tchaikov added needs-review and removed wip-kefu-testing labels Mar 27, 2018

jecluis approved these changes Mar 28, 2018


tchaikov merged commit e4647dc into ceph:master Mar 28, 2018

tchaikov deleted the wip-22114 branch March 28, 2018 11:02

yuriw added the backport label Mar 28, 2018
