Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mon: mark mgr reports as no_reply #21057

Merged
merged 1 commit into from Mar 28, 2018
Merged

mon: mark mgr reports as no_reply #21057

merged 1 commit into from Mar 28, 2018

Conversation

tchaikov
Copy link
Contributor

see also: #20517

Fixes: http://tracker.ceph.com/issues/22114
Signed-off-by: Kefu Chai kchai@redhat.com

@tchaikov
Copy link
Contributor Author

tchaikov commented Mar 27, 2018

it should address the failures in http://pulpito.ceph.com/kchai-2018-03-27_08:29:32-rados-wip-kefu-testing-2018-03-27-1407-distro-basic-smithi/

$ zgrep -A2 'DEBUG SLOW OP' remote/smithi020/log/ceph-mon.b.log.gz|grep desc | cut -d: -f2 | sort | uniq
 "monmgrreport(1 checks)",
 "monmgrreport(2 checks)",
$ zgrep -A2 'DEBUG SLOW OP' remote/smithi008/log/ceph-mon.c.log.gz|grep desc | cut -d: -f2 | sort | uniq
 "monmgrreport(0 checks)",
 "monmgrreport(1 checks)",
 "monmgrreport(2 checks)",
 "monmgrreport(3 checks)",

and http://pulpito.ceph.com/kchai-2018-03-27_12:22:51-rados-wip-kefu-testing-2018-03-27-1754-distro-basic-smithi/2328256/

$ zgrep -A2 'DEBUG SLOW OP' remote/smithi008/log/ceph-mon.c.log.gz | grep desc  | cut -d: -f2 | sort | uniq
 "osd_failure(failed immediate osd.0 172.21.15.8

see also: ceph#20517

Fixes: http://tracker.ceph.com/issues/22114
Signed-off-by: Kefu Chai <kchai@redhat.com>
@tchaikov
Copy link
Contributor Author

@tchaikov
Copy link
Contributor Author

@jecluis @gregsfortytwo ping?

Copy link
Member

@jecluis jecluis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@jecluis
Copy link
Member

jecluis commented Mar 28, 2018

@tchaikov is the failed job related to this?

edit: upon further inspection, it seems a dashboard test failing. I asked @rjfd to check whether this is caused by the dashboard or a side-effect of what @tchaikov is fixing.

@tchaikov
Copy link
Contributor Author

@jecluis @rjfd no. the dashboard test failure it's due to missing zstd compressor plugin on monitor. i have not looked into that issue further. but i am sure it's not caused by dashboard or my change.

@tchaikov tchaikov merged commit e4647dc into ceph:master Mar 28, 2018
@tchaikov tchaikov deleted the wip-22114 branch March 28, 2018 11:02
@jecluis
Copy link
Member

jecluis commented Mar 28, 2018

@tchaikov gotcha. thanks

@yuriw yuriw added the backport label Mar 28, 2018
@yuriw
Copy link
Contributor

yuriw commented Mar 28, 2018

@theanalyst this needs to be backported see #21016

@tchaikov
Copy link
Contributor Author

tchaikov commented Mar 29, 2018

@gregsfortytwo
Copy link
Member

Given that we keep adding new "no_reply()" markings, maybe we should mark them that way on construction rather than when the monitor has to handle them? Is that possible?

@tchaikov
Copy link
Contributor Author

tchaikov commented Mar 30, 2018

@gregsfortytwo i am not sure if that's possible or a better way. as i think, no_reply() is an decision by the upper layer of the application stack, not the call of messenger, where the messages are decoded.
also, as some of the messages are encapsulated in the "MForward" message, and it is again the upper layer (Monitor::handle_forward()) who extracts the encapsulated message and dispatch it manually. so we need to do the markings when we handle the certain messages.

or probably i misread you completely.

@jecluis
Copy link
Member

jecluis commented Mar 30, 2018

If @gregsfortytwo was suggesting having the message marked at construction by the sender, then this could possibly be feasible, but I think we have too many exceptions in the monitors to drop/no_reply vs handle that this may be even trickier than addressing individual instances.

It would be nice if we knew exactly which messages are not expecting a reply, so that we could simply mark them as such in their ctor or something. But I'm guessing those would be a small subset of the messages we actually handle as no_reply() [he said, without actually looking at the code].

On the other hand, if we were to identify those that are typically no_reply, and mark them as such by default, and only reply to them in selected cases... then that may help with things a little. However, this doesn't seem a trivial task to accomplish either.

@gregsfortytwo
Copy link
Member

Yes, I meant what Joao says: the peon monitor could mark them as no-reply, that gets flagged (in the MRoute or similar) and the leader's PaxosService dispatch machinery can be responsible for sending back the blank "handled" message to the peon once the message is resolved.

This might be useless, in that it merely moves responsibility for noticing it from the recipient to the sender. Or it might be easier, since we can identify categories of messages that don't need a response and set it all up in their constructors as an obvious decision the author needs to make, instead of something to maintain in far-off code without obvious failures. (So far, they are all static: MMonMgrReport, MOSDFailure, MOSDPGCreated, MOSDBeacon, and MMgrBeacon.) (Or maybe now that we are noticing slow mon ops we don't care any more.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants