Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mon: mark mgr reports as no_reply #21057

Merged
merged 1 commit into from Mar 28, 2018

Conversation

Projects
None yet
4 participants
@tchaikov
Copy link
Contributor

commented Mar 27, 2018

see also: #20517

Fixes: http://tracker.ceph.com/issues/22114
Signed-off-by: Kefu Chai kchai@redhat.com

@tchaikov

This comment has been minimized.

Copy link
Contributor Author

commented Mar 27, 2018

it should address the failures in http://pulpito.ceph.com/kchai-2018-03-27_08:29:32-rados-wip-kefu-testing-2018-03-27-1407-distro-basic-smithi/

$ zgrep -A2 'DEBUG SLOW OP' remote/smithi020/log/ceph-mon.b.log.gz|grep desc | cut -d: -f2 | sort | uniq
 "monmgrreport(1 checks)",
 "monmgrreport(2 checks)",
$ zgrep -A2 'DEBUG SLOW OP' remote/smithi008/log/ceph-mon.c.log.gz|grep desc | cut -d: -f2 | sort | uniq
 "monmgrreport(0 checks)",
 "monmgrreport(1 checks)",
 "monmgrreport(2 checks)",
 "monmgrreport(3 checks)",

and http://pulpito.ceph.com/kchai-2018-03-27_12:22:51-rados-wip-kefu-testing-2018-03-27-1754-distro-basic-smithi/2328256/

$ zgrep -A2 'DEBUG SLOW OP' remote/smithi008/log/ceph-mon.c.log.gz | grep desc  | cut -d: -f2 | sort | uniq
 "osd_failure(failed immediate osd.0 172.21.15.8
mon: mark mgr reports and osd_failure as no_reply
see also: #20517

Fixes: http://tracker.ceph.com/issues/22114
Signed-off-by: Kefu Chai <kchai@redhat.com>

@tchaikov tchaikov force-pushed the tchaikov:wip-22114 branch from 21e62f1 to 0daccfb Mar 27, 2018

@tchaikov

This comment has been minimized.

Copy link
Contributor Author

commented Mar 28, 2018

@jecluis
Copy link
Member

left a comment

lgtm

@jecluis

This comment has been minimized.

Copy link
Member

commented Mar 28, 2018

@tchaikov is the failed job related to this?

edit: upon further inspection, it seems a dashboard test failing. I asked @rjfd to check whether this is caused by the dashboard or a side-effect of what @tchaikov is fixing.

@tchaikov

This comment has been minimized.

Copy link
Contributor Author

commented Mar 28, 2018

@jecluis @rjfd no. the dashboard test failure it's due to missing zstd compressor plugin on monitor. i have not looked into that issue further. but i am sure it's not caused by dashboard or my change.

@tchaikov tchaikov merged commit e4647dc into ceph:master Mar 28, 2018

5 checks passed

Docs: build check OK - docs built
Details
Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
make check make check succeeded
Details
make check (arm64) make check succeeded
Details

@tchaikov tchaikov deleted the tchaikov:wip-22114 branch Mar 28, 2018

@jecluis

This comment has been minimized.

Copy link
Member

commented Mar 28, 2018

@tchaikov gotcha. thanks

@yuriw yuriw added the backport label Mar 28, 2018

@yuriw

This comment has been minimized.

Copy link
Contributor

commented Mar 28, 2018

@theanalyst this needs to be backported see #21016

@tchaikov

This comment has been minimized.

Copy link
Contributor Author

commented Mar 29, 2018

@gregsfortytwo

This comment has been minimized.

Copy link
Member

commented Mar 29, 2018

Given that we keep adding new "no_reply()" markings, maybe we should mark them that way on construction rather than when the monitor has to handle them? Is that possible?

@tchaikov

This comment has been minimized.

Copy link
Contributor Author

commented Mar 30, 2018

@gregsfortytwo i am not sure if that's possible or a better way. as i think, no_reply() is an decision by the upper layer of the application stack, not the call of messenger, where the messages are decoded.
also, as some of the messages are encapsulated in the "MForward" message, and it is again the upper layer (Monitor::handle_forward()) who extracts the encapsulated message and dispatch it manually. so we need to do the markings when we handle the certain messages.

or probably i misread you completely.

@jecluis

This comment has been minimized.

Copy link
Member

commented Mar 30, 2018

If @gregsfortytwo was suggesting having the message marked at construction by the sender, then this could possibly be feasible, but I think we have too many exceptions in the monitors to drop/no_reply vs handle that this may be even trickier than addressing individual instances.

It would be nice if we knew exactly which messages are not expecting a reply, so that we could simply mark them as such in their ctor or something. But I'm guessing those would be a small subset of the messages we actually handle as no_reply() [he said, without actually looking at the code].

On the other hand, if we were to identify those that are typically no_reply, and mark them as such by default, and only reply to them in selected cases... then that may help with things a little. However, this doesn't seem a trivial task to accomplish either.

@gregsfortytwo

This comment has been minimized.

Copy link
Member

commented Mar 30, 2018

Yes, I meant what Joao says: the peon monitor could mark them as no-reply, that gets flagged (in the MRoute or similar) and the leader's PaxosService dispatch machinery can be responsible for sending back the blank "handled" message to the peon once the message is resolved.

This might be useless, in that it merely moves responsibility for noticing it from the recipient to the sender. Or it might be easier, since we can identify categories of messages that don't need a response and set it all up in their constructors as an obvious decision the author needs to make, instead of something to maintain in far-off code without obvious failures. (So far, they are all static: MMonMgrReport, MOSDFailure, MOSDPGCreated, MOSDBeacon, and MMgrBeacon.) (Or maybe now that we are noticing slow mon ops we don't care any more.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.