
msgr: AsyncMessenger faulted connections metrics #50393

Merged
merged 2 commits into from Jul 14, 2023

Conversation

pereman2
Contributor

@pereman2 pereman2 commented Mar 6, 2023

Add msgr_faulted_connections perfcounters to keep track of failed
connections with prometheus metrics.

Options ms_connection_ready_timeout and ms_connection_idle_timeout have a 15-minute default, so they must be taken into account if you want a higher rate of ticks checking whether a connection has failed.

Below you can see worker perfcounter dumps of both kinds: labeled ones carrying the new metrics, and non-labeled ones with the worker id embedded in the section name. A declaration sketch follows the dump.

{
    "AsyncMessenger::Worker": {
        "labels": {
            "id": "0"
        },
        "counters": {
            "msgr_connection_ready_timeouts": 0,
            "msgr_connection_idle_timeouts": 0
        }
    },
    "AsyncMessenger::Worker": {
        "labels": {
            "id": "1"
        },
        "counters": {
            "msgr_connection_ready_timeouts": 0,
            "msgr_connection_idle_timeouts": 0
        }
    },
    "AsyncMessenger::Worker": {
        "labels": {
            "id": "2"
        },
        "counters": {
            "msgr_connection_ready_timeouts": 0,
            "msgr_connection_idle_timeouts": 0
        }
    },
    "AsyncMessenger::Worker-0": {
        "labels": {},
        "counters": {
            "msgr_recv_messages": 51,
            "msgr_send_messages": 219,
            "msgr_recv_bytes": 8336,
            "msgr_send_bytes": 1367984,
            "msgr_created_connections": 15,
            "msgr_active_connections": 3,
            "msgr_running_total_time": 0.061759432,
            "msgr_running_send_time": 0.039364411,
            "msgr_running_recv_time": 0.016690043,
            "msgr_running_fast_dispatch_time": 0.000000000,
            "msgr_send_messages_queue_lat": {
                "avgcount": 219,
                "sum": 0.019063451,
                "avgtime": 0.000087047
            },
            "msgr_handle_ack_lat": {
                "avgcount": 0,
                "sum": 0.000000000,
                "avgtime": 0.000000000
            },
            "msgr_recv_encrypted_bytes": 8336,
            "msgr_send_encrypted_bytes": 1367984
        }
    },
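
For context, a minimal sketch of how one of these labeled per-worker counters can be declared (assuming the labeled perfcounters API from common/perf_counters_key.h; the enum range and the helper function below are illustrative, not the literal diff):

    #include <string>
    #include "common/perf_counters.h"
    #include "common/perf_counters_key.h"

    enum {
      l_msgr_labeled_first = 90000,  // illustrative index range
      l_msgr_connection_ready_timeouts,
      l_msgr_connection_idle_timeouts,
      l_msgr_labeled_last,
    };

    // Hypothetical helper: build and register one labeled logger per worker.
    PerfCounters *create_worker_counters(CephContext *cct, unsigned worker_id) {
      // "AsyncMessenger::Worker" plus an "id" label, matching the dump above.
      std::string name = ceph::perf_counters::key_create(
          "AsyncMessenger::Worker", {{"id", std::to_string(worker_id)}});
      PerfCountersBuilder plb(cct, name, l_msgr_labeled_first, l_msgr_labeled_last);
      plb.add_u64_counter(l_msgr_connection_ready_timeouts,
                          "msgr_connection_ready_timeouts",
                          "Connections timed out before becoming ready");
      plb.add_u64_counter(l_msgr_connection_idle_timeouts,
                          "msgr_connection_idle_timeouts",
                          "Connections reaped after being idle too long");
      PerfCounters *logger = plb.create_perf_counters();
      cct->get_perfcounters_collection()->add(logger);
      return logger;  // bump with logger->inc(l_msgr_connection_idle_timeouts)
    }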

Fixes: https://tracker.ceph.com/issues/59076


@pereman2 pereman2 requested a review from jmolmo March 6, 2023 09:29
@pereman2 pereman2 requested a review from a team as a code owner March 6, 2023 09:29
@github-actions github-actions bot added the core label Mar 6, 2023
@pereman2 pereman2 changed the title msgr: labeled asynconnection perfcounters rfc msgr: AsyncMessenger faulted connections metrics Mar 7, 2023
src/msg/async/AsyncConnection.h (outdated review thread, resolved)
src/msg/async/Stack.h (outdated review thread, resolved)
Contributor

@idryomov idryomov left a comment

This seems very misleading to me: if the counter is named msgr_faulted_connections, it should report the number of faulted connections (basically, any time fault is called). But here it is bumped in two isolated cases that happen to correspond to two very different timeouts ("connect" timeout which often indicates connectivity issues and "idle" timeout which just frees up OS resources).

It would make much more sense to report these cases as two different counters, so that it's clear which is which IMO.
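
For illustration, a minimal sketch of the split being requested here, with one counter per timeout path; the tick structure and all names below are assumptions for the sake of the example, not this PR's actual code:

    #include <chrono>

    enum { l_conn_ready_timeouts, l_conn_idle_timeouts };

    struct ConnState {
      std::chrono::steady_clock::time_point last_active;
      bool ready = false;  // has the connection finished its handshake?
    };

    template <typename Logger>
    void tick(ConnState &c, Logger &logger,
              std::chrono::seconds ready_timeout,
              std::chrono::seconds idle_timeout) {
      const auto now = std::chrono::steady_clock::now();
      if (!c.ready && now - c.last_active > ready_timeout) {
        // "connect" timeout: often indicates a real connectivity issue
        logger.inc(l_conn_ready_timeouts);
      } else if (c.ready && now - c.last_active > idle_timeout) {
        // "idle" timeout: healthy connection reaped to free OS resources
        logger.inc(l_conn_idle_timeouts);
      }
    }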

src/msg/async/AsyncConnection.cc (2 outdated review threads, resolved)
jmolmo
jmolmo previously approved these changes Mar 13, 2023
src/msg/async/Stack.h (2 outdated review threads, resolved)
@pereman2 pereman2 force-pushed the mirror-con branch 2 times, most recently from 4aae646 to afb9473 Compare March 14, 2023 09:25
src/msg/async/Stack.h (2 outdated review threads, resolved)
idryomov
idryomov previously approved these changes Mar 14, 2023
rzarzynski
rzarzynski previously approved these changes Mar 14, 2023
@idryomov
Contributor

jenkins test api

@idryomov
Contributor

jenkins test windows

@idryomov
Contributor

@petrutlucian94 The windows check seems to be failing reliably (unrelated to this PR):

[2023-03-14T13:59:57.000Z] [isolated][googletest] ceph_test_libcephfs failed. Error: Command returned non-zero code(1): "cmd /c 'C:\ceph\ceph_test_libcephfs.exe --gtest_output=xml:C:\workspace\test_results\out\ceph_test_libcephfs\ceph_test_libcephfs_results.xml --gtest_filter="*" >> C:\workspace\test_results\out\ceph_test_libcephfs\ceph_test_libcephfs_results.log 2>&1'".

@petrutlucian94
Contributor

We got two cephfs test failures, will look into it ASAP.

[ RUN      ] LibCephFS.SnapdirAttrsOnSnapDelete
2023-03-14T13:55:06.127Coordinated Universal Time 1 -1 asok(0x22cdfc84030) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to 'C:/ProgramData/ceph/out/client.admin.936.asok': (17) File exists
/home/ubuntu/ceph/src/test/libcephfs/test.cc:3816: Failure
Expected: (utime_t(stx_snap_dir_1.stx_mtime)) < (utime_t(stx_snap_dir_2.stx_mtime)), actual: 2023-03-14T13:55:06.159452Coordinated Universal Time vs 2023-03-14T13:55:06.159452Coordinated Universal Time
[  FAILED  ] LibCephFS.SnapdirAttrsOnSnapDelete (63 ms)
[ RUN      ] LibCephFS.SnapdirAttrsOnSnapRename
2023-03-14T13:55:06.222Coordinated Universal Time 1 -1 asok(0x22cdf9f68e0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to 'C:/ProgramData/ceph/out/client.admin.936.asok': (17) File exists
/home/ubuntu/ceph/src/test/libcephfs/test.cc:3841: Failure
Expected equality of these values:
  ceph_mkdir(cmount, dir_path, 0777)
    Which is: -17
  0
[  FAILED  ] LibCephFS.SnapdirAttrsOnSnapRename (94 ms)

@pereman2
Contributor Author

@ljflores Is the teuthology run finished? This should be ready to be merged, @rzarzynski.

@pereman2 pereman2 requested review from cloudbehl and Pegonzal and removed request for a team May 29, 2023 10:54
@pereman2
Contributor Author

@idryomov @rzarzynski I've changed the method name to get_unlabeled_perf_counters as per your request.

src/mgr/MgrClient.cc (3 outdated review threads, resolved)
src/mgr/ActivePyModules.cc (outdated review thread, resolved)
src/pybind/mgr/telemetry/module.py (outdated review thread, resolved)
@ljflores
Contributor

jenkins test make check

Contributor

@idryomov idryomov left a comment

Overall this looks fine to me, over to @rzarzynski for approval on punting the work to https://tracker.ceph.com/issues/61500 and @ljflores or @yaarith for telemetry bits.

src/mgr/MgrClient.cc (2 outdated review threads, resolved)
src/msg/async/Stack.h (outdated review thread, resolved)
Add msgr_connection_idle_timeouts and msgr_connection_ready_timeouts
labeled perfcounters to keep track of failed connections with prometheus metrics.

Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
Fixes: https://tracker.ceph.com/issues/59076
@pereman2 pereman2 force-pushed the mirror-con branch 2 times, most recently from cf35033 to 61be723 Compare June 5, 2023 07:47
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
@idryomov
Contributor

idryomov commented Jun 6, 2023

jenkins test submodules

Contributor

@idryomov idryomov left a comment

over to @rzarzynski for approval on punting the work to https://tracker.ceph.com/issues/61500 and @ljflores or @yaarith for telemetry bits.

@idryomov
Contributor

idryomov commented Jun 6, 2023

Note: merging this PR would expose OSD to https://tracker.ceph.com/issues/61587 (currently only rbd-mirror daemon is exposed). @alimaredia is working on a fix so it's not much of a concern, but worth mentioning.

@jmolmo
Member

jmolmo commented Jun 8, 2023

@rzarzynski: Would you mind reviewing again and updating your thoughts? Thanks

@@ -233,9 +245,11 @@ class Worker {
  Worker& operator=(const Worker&) = delete;

  Worker(CephContext *c, unsigned worker_id)
    : cct(c), perf_logger(NULL), id(worker_id), references(0), center(c) {
Contributor

Yeah, in the ctor's body we do:

    perf_logger = plb.create_perf_counters();

  }
  virtual ~Worker() {
    if (perf_logger) {
      cct->get_perfcounters_collection()->remove(perf_logger);
      delete perf_logger;
    }
    if (perf_labeled_logger) {
      cct->get_perfcounters_collection()->remove(perf_labeled_logger);
      delete perf_labeled_logger;
Contributor

This follows the convention established by perf_logger. In the future we could move them both to std::unique_ptr<T>.
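
A hypothetical sketch of that std::unique_ptr<T> follow-up (explicitly not part of this PR; the member names match the diff, the rest is assumed):

    #include <memory>
    #include "common/perf_counters.h"

    class Worker {
      CephContext *cct;
      std::unique_ptr<PerfCounters> perf_logger;
      std::unique_ptr<PerfCounters> perf_labeled_logger;

    public:
      explicit Worker(CephContext *c) : cct(c) {}

      virtual ~Worker() {
        // deregistration still has to happen before destruction
        for (PerfCounters *l : {perf_logger.get(), perf_labeled_logger.get()}) {
          if (l) {
            cct->get_perfcounters_collection()->remove(l);
          }
        }
        // the unique_ptr members then delete the loggers automatically
      }
    };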

@@ -331,6 +332,12 @@ void MgrClient::_send_report()
    const PerfCounters::perf_counter_data_any_d &ctr,
    const PerfCounters &perf_counters)
{
  // FIXME: We don't send labeled perf counters to the mgr currently.
Contributor

Appreciating the comment.
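
As a side note on that FIXME, a hedged sketch of how labeled instances can be told apart when walking counters for the mgr report; this relies on the documented key format of common/perf_counters_key.h, and the helper name is an assumption:

    #include <string_view>

    // Labeled perfcounter keys are built by ceph::perf_counters::key_create()
    // as "name\0label1\0val1\0...", so an embedded NUL marks a labeled
    // instance that _send_report can skip for now.
    bool is_labeled_counter_path(std::string_view path) {
      return path.find('\0') != std::string_view::npos;
    }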

Comment on lines +377 to +390
  ldout(cct, 20) << " declare " << path << dendl;
  PerfCounterType type;
  type.path = path;
  if (data.description) {
    type.description = data.description;
  }
  if (data.nick) {
    type.nick = data.nick;
  }
  type.type = data.type;
  type.priority = perf_counters.get_adjusted_priority(data.prio);
  type.unit = data.unit;
  report->declare_types.push_back(std::move(type));
  session->declared.insert(path);
Contributor

IIUC this is just an indentation fix.

@@ -79,4 +79,4 @@ class PerfCounters(RESTController):
     @EndpointDoc("Display Perf Counters",
                  responses={200: PERF_SCHEMA})
     def list(self):
-        return mgr.get_all_perf_counters()
+        return mgr.get_unlabeled_perf_counters()
Contributor

Pretty explicit.

Contributor

@ljflores ljflores left a comment

Built this branch in vstart and compiled the Telemetry report successfully. Looks good to go for testing from my point of view!

@ljflores
Contributor

@pereman2 please add the "needs-qa" label when you have gotten all the approvals you need.

@idryomov
Contributor

@pereman2 please add the "needs-qa" label when you have gotten all the approvals you need.

@ljflores I went ahead and added it. I think this is ready; it's just that Radek's review notes tend to generate unresolved conversations ;)

@yuriw
Contributor

yuriw commented Jul 14, 2023

From @kamoltat ref: https://trello.com/c/7y6uj4bo

RADOS approved, all failures/dead jobs are unrelated and known failures.

Failures:

7332480 -> Bug #58946: cephadm: KeyError: 'osdspec_affinity' - Ceph - Mgr - Dashboard
7332558 -> Bug #57302: ERROR: test_get_status (tasks.mgr.dashboard.test_cluster.ClusterTest) -> Test failure: test_create_access_permissions (tasks.mgr.dashboard.test_pool.PoolTest)
7332565 -> Bug #57754: test_envlibrados_for_rocksdb.sh: update-alternatives: error: alternative path /usr/bin/gcc-11 doesn't exist - Infrastructure
7332612 -> Bug #57754: test_envlibrados_for_rocksdb.sh: update-alternatives: error: alternative path /usr/bin/gcc-11 doesn't exist - Infrastructure
7332613 -> Bug #55347: SELinux Denials during cephadm/workunits/test_cephadm - Infrastructure
7332636 -> Bug #58946: cephadm: KeyError: 'osdspec_affinity' - Ceph - Mgr - Dashboard

Deads:

7332357 -> Bug #61164: Error reimaging machines: reached maximum tries (100) after waiting for 600 seconds - Infrastructure
7332405 -> Bug #59380: rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)" - rgw
7332559 -> Bug #59380: rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)" - rgw

@yuriw yuriw merged commit 06d93ef into ceph:main Jul 14, 2023
10 of 13 checks passed