msgr: AsyncMessenger faulted connections metrics #50393
This seems very misleading to me: if the counter is named msgr_faulted_connections, it should report the number of faulted connections (basically, any time fault is called). But here it is bumped in two isolated cases that happen to correspond to two very different timeouts ("connect" timeout which often indicates connectivity issues and "idle" timeout which just frees up OS resources).
It would make much more sense to report these cases as two different counters, so that it's clear which is which IMO.
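To illustrate the split suggested above, here is a minimal sketch, assuming one counter per timeout kind. `TimeoutCounters`, `TimeoutKind` and `record_timeout` are hypothetical names invented for this example and are not part of Ceph's PerfCounters API; only the two counter meanings (ready/connect timeout vs. idle timeout) come from this discussion.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for a per-worker counter set (not Ceph's PerfCounters).
struct TimeoutCounters {
  std::atomic<std::uint64_t> ready_timeouts{0};  // connection never became ready
  std::atomic<std::uint64_t> idle_timeouts{0};   // connection was idle for too long
};

// Hypothetical tag telling the two timeout paths apart.
enum class TimeoutKind { Ready, Idle };

// Bump exactly one counter, so a dashboard can tell connectivity
// problems (Ready) apart from routine resource cleanup (Idle).
inline void record_timeout(TimeoutCounters& counters, TimeoutKind kind) {
  switch (kind) {
  case TimeoutKind::Ready:
    counters.ready_timeouts.fetch_add(1, std::memory_order_relaxed);
    break;
  case TimeoutKind::Idle:
    counters.idle_timeouts.fetch_add(1, std::memory_order_relaxed);
    break;
  }
}
```

With this shape, a rising ready-timeout count can be alerted on separately, while idle timeouts remain informational.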
4aae646 to afb9473
jenkins test api
jenkins test windows
@petrutlucian94 The windows check seems to be failing reliably (unrelated to this PR):
We got two cephfs test failures, will look into it ASAP.
@ljflores Is the teuthology run finished? This should be ready to be merged @rzarzynski
@idryomov @rzarzynski I've changed the method name to
jenkins test make check
Overall this looks fine to me, over to @rzarzynski for approval on punting the work to https://tracker.ceph.com/issues/61500 and @ljflores or @yaarith for telemetry bits.
Add msgr_connection_idle_timeouts and msgr_connection_ready_timeouts labeled perfcounters to keep track of failed connections with prometheus metrics.
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
Fixes: https://tracker.ceph.com/issues/59076
cf35033 to 61be723
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
jenkins test submodules
over to @rzarzynski for approval on punting the work to https://tracker.ceph.com/issues/61500 and @ljflores or @yaarith for telemetry bits.
Note: merging this PR would expose OSD to https://tracker.ceph.com/issues/61587 (currently only the rbd-mirror daemon is exposed). @alimaredia is working on a fix so it's not much of a concern, but worth mentioning.
@rzarzynski: Would you mind reviewing again and updating your thoughts? Thanks
@@ -233,9 +245,11 @@ class Worker {
  Worker& operator=(const Worker&) = delete;

  Worker(CephContext *c, unsigned worker_id)
    : cct(c), perf_logger(NULL), id(worker_id), references(0), center(c) {
Yeah, in the ctor's body we do:
perf_logger = plb.create_perf_counters();

  }
  virtual ~Worker() {
    if (perf_logger) {
      cct->get_perfcounters_collection()->remove(perf_logger);
      delete perf_logger;
    }
    if (perf_labeled_logger) {
      cct->get_perfcounters_collection()->remove(perf_labeled_logger);
      delete perf_labeled_logger;
This follows the convention established by perf_logger. In the future we could move them both to std::unique_ptr<T>.
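As a rough sketch of what that future `std::unique_ptr<T>` change could look like: `Counters`, `Collection` and `WorkerSketch` below are hypothetical stand-ins written for this example, not the real `PerfCounters`, `PerfCountersCollection` or `Worker` classes. The point is only that the explicit `delete` calls disappear while deregistration from the collection stays explicit.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a perf counters object.
struct Counters {};

// Hypothetical stand-in for the registry that tracks live counters.
class Collection {
  std::vector<Counters*> registered;
public:
  void add(Counters* c) { registered.push_back(c); }
  void remove(Counters* c) {
    registered.erase(std::remove(registered.begin(), registered.end(), c),
                     registered.end());
  }
  std::size_t size() const { return registered.size(); }
};

class WorkerSketch {
  Collection& coll;
  std::unique_ptr<Counters> perf_logger;
  std::unique_ptr<Counters> perf_labeled_logger;
public:
  explicit WorkerSketch(Collection& c)
    : coll(c),
      perf_logger(std::make_unique<Counters>()),
      perf_labeled_logger(std::make_unique<Counters>()) {
    coll.add(perf_logger.get());
    coll.add(perf_labeled_logger.get());
  }
  WorkerSketch(const WorkerSketch&) = delete;
  WorkerSketch& operator=(const WorkerSketch&) = delete;
  ~WorkerSketch() {
    // Deregister first; the unique_ptr members free the objects afterwards.
    coll.remove(perf_labeled_logger.get());
    coll.remove(perf_logger.get());
  }
};
```

The destructor body runs before the members are destroyed, so the collection never holds a dangling pointer in this sketch.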
@@ -331,6 +332,12 @@ void MgrClient::_send_report()
      const PerfCounters::perf_counter_data_any_d &ctr,
      const PerfCounters &perf_counters)
  {
    // FIXME: We don't send labeled perf counters to the mgr currently.
Appreciating the comment.
    ldout(cct, 20) << " declare " << path << dendl;
    PerfCounterType type;
    type.path = path;
    if (data.description) {
      type.description = data.description;
    }
    if (data.nick) {
      type.nick = data.nick;
    }
    type.type = data.type;
    type.priority = perf_counters.get_adjusted_priority(data.prio);
    type.unit = data.unit;
    report->declare_types.push_back(std::move(type));
    session->declared.insert(path);
IIUC this is just an indentation fix.
@@ -79,4 +79,4 @@ class PerfCounters(RESTController):
     @EndpointDoc("Display Perf Counters",
                  responses={200: PERF_SCHEMA})
     def list(self):
-        return mgr.get_all_perf_counters()
+        return mgr.get_unlabeled_perf_counters()
Pretty explicit.
Built this branch in vstart and compiled the Telemetry report successfully. Looks good to go for testing from my point of view!
@pereman2 please add the "needs-qa" label when you have gotten all the approvals you need.
From @kamoltat
ref: https://trello.com/c/7y6uj4bo
RADOS approved, all failures/dead jobs are unrelated and known failures.
Failures:
7332480 -> Bug #58946: cephadm: KeyError: 'osdspec_affinity' - Dashboard - Ceph -> cephadm: KeyError: 'osdspec_affinity' - Ceph - Mgr - Dashboard
Deads:
7332357 -> Bug #61164: Error reimaging machines: reached maximum tries (100) after waiting for 600 seconds - Infrastructure - Ceph -> Error reimaging machines: reached maximum tries (100) after waiting for 600 seconds
Add msgr_faulted_connections perfcounters to keep track of failed
connections with prometheus metrics.
Options ms_connection_ready_timeout and ms_connection_idle_timeout have a 15 min default, so they must be adjusted if you want the ticks that check whether a connection failed to run at a higher rate.
Below you can see two instances of worker perfcounter dumps: labeled ones with the new metrics and non-labeled ones with the label embedded in the name.
{
    "AsyncMessenger::Worker": {
        "labels": {
            "id": "0"
        },
        "counters": {
            "msgr_connection_ready_timeouts": 0,
            "msgr_connection_idle_timeouts": 0
        }
    },
    "AsyncMessenger::Worker": {
        "labels": {
            "id": "1"
        },
        "counters": {
            "msgr_connection_ready_timeouts": 0,
            "msgr_connection_idle_timeouts": 0
        }
    },
    "AsyncMessenger::Worker": {
        "labels": {
            "id": "2"
        },
        "counters": {
            "msgr_connection_ready_timeouts": 0,
            "msgr_connection_idle_timeouts": 0
        }
    },
    "AsyncMessenger::Worker-0": {
        "labels": {},
        "counters": {
            "msgr_recv_messages": 51,
            "msgr_send_messages": 219,
            "msgr_recv_bytes": 8336,
            "msgr_send_bytes": 1367984,
            "msgr_created_connections": 15,
            "msgr_active_connections": 3,
            "msgr_running_total_time": 0.061759432,
            "msgr_running_send_time": 0.039364411,
            "msgr_running_recv_time": 0.016690043,
            "msgr_running_fast_dispatch_time": 0.000000000,
            "msgr_send_messages_queue_lat": {
                "avgcount": 219,
                "sum": 0.019063451,
                "avgtime": 0.000087047
            },
            "msgr_handle_ack_lat": {
                "avgcount": 0,
                "sum": 0.000000000,
                "avgtime": 0.000000000
            },
            "msgr_recv_encrypted_bytes": 8336,
            "msgr_send_encrypted_bytes": 1367984
        }
    },
Fixes: https://tracker.ceph.com/issues/59076
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
Available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows