mgr/dashboard:Get different storage class metrics in Prometheus dashboard #47163

aaSharma14 · 2022-07-19T09:05:50Z

Fixes: https://tracker.ceph.com/issues/56625
SIgned-off-by:Aashish Sharma aasharma@redhat.com

New metrics-

Contribution Guidelines

To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

Checklist

Tracker (select at least one)
- References tracker ticket
- Very recent bug; references commit where it was introduced
- New feature (ticket optional)
- Doc update (no ticket needed)
- Code cleanup (no ticket needed)
Component impact
- Affects Dashboard, opened tracker ticket
- Affects Orchestrator, opened tracker ticket
- No impact that needs to be tracked
Documentation (select at least one)
- Updates relevant documentation
- No doc update is appropriate
Tests (select at least one)
- Includes unit test(s)
- Includes integration test(s)
- Includes bug reproducer
- No tests

Show available Jenkins commands

jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows

epuertat

Awesome RFE-to-PR time @aaSharma14 ! Some suggestions over there.

src/pybind/mgr/prometheus/module.py

nizamial09

LGTM! Is there anything left here @aaSharma14 ?

Also, we should squash the commits and update its title..

…hboard Get metrics of the different "HDDRule" and "MixedUse" classes of the "Raw Storage" for their ceph VMs. So that Prometheus can scrape the data and display it to them in grafana Fixes: https://tracker.ceph.com/issues/56625 Signed-off-by: Aashish Sharma <aasharma@redhat.com>

aaSharma14 · 2022-07-20T06:09:01Z

LGTM! Is there anything left here @aaSharma14 ?

Also, we should squash the commits and update its title..

Nothing left from my side here @nizamial09 . I have squashed the commits as well

epuertat

LGTM. Thanks a lot for addressing my comments!

Given we don't have any integration or e2e testing on the mgr/prometheus module, what about running the qa/tasks/mgr/test_prometheus.py test in our API test suite?

I'd suggest adding an additional test there that compares the dump of the tested http://<prometheus>:9283/metrics endpoint with a git-versioned template like this:

# HELP ceph_health_status Cluster health status
# TYPE ceph_health_status untyped
ceph_health_status 1.0
...
# HELP ceph_mon_metadata MON Metadata
# TYPE ceph_mon_metadata untyped
ceph_mon_metadata{ceph_daemon="mon.a",hostname="",public_addr="172.20.0.2",rank="0",ceph_version=""} 1.0
# HELP ceph_mgr_metadata MGR metadata
# TYPE ceph_mgr_metadata gauge
ceph_mgr_metadata{ceph_daemon="mgr.x",hostname="ceph",ceph_version="ceph version 17.0.0-13521-gf9bd8922 (f9bd8922d75a454b4a0ace5a85637213e5ee5700) quincy (dev)"} 1.0
# HELP ceph_mgr_status MGR status (0=standby, 1=active)
# TYPE ceph_mgr_status gauge
ceph_mgr_status{ceph_daemon="mgr.x"} 1.0
...

The above dump was obtained from a run of ceph-dev in my laptop, so many fields will differ from other runs (IPs, timestamps, metric values, etc). One alternative would be going through that dump looking for IP address, hostnames, etc., and replace that with REGEX (e.g.: hostname="172.20.0.2" with hostname=REGEX_HOSTNAME). However that would require a lot of editing and would make it really complicate to update that dump when new fields are added/removed.

So my suggestion would be to implement the 'smartness' in the code instead. Instead of making the changes inline:

We take each metric (including the # HELP and # TYPE metadata) in the tested prometheus and in the reference dump and produce a tuple of three elements to compare:
- HELP: strict comparison
- TYPE: strict comparison
- metric itself: we compare the following:
  - the metric name (all metrics in the reference dump should be present in the tested dump)
  - the metric label names (but not the metric label values)

I'm ok with going with this PR as is, but an interesting example would be to run that test in this PR and verify that it fails because the cluster_by_class_* metrics are missing.

What do you think?

aaSharma14 · 2022-07-20T10:48:20Z

LGTM. Thanks a lot for addressing my comments!

Given we don't have any integration or e2e testing on the mgr/prometheus module, what about running the qa/tasks/mgr/test_prometheus.py test in our API test suite?

I'd suggest adding an additional test there that compares the dump of the tested http://<prometheus>:9283/metrics endpoint with a git-versioned template like this:
# HELP ceph_health_status Cluster health status
# TYPE ceph_health_status untyped
ceph_health_status 1.0
...
# HELP ceph_mon_metadata MON Metadata
# TYPE ceph_mon_metadata untyped
ceph_mon_metadata{ceph_daemon="mon.a",hostname="",public_addr="172.20.0.2",rank="0",ceph_version=""} 1.0
# HELP ceph_mgr_metadata MGR metadata
# TYPE ceph_mgr_metadata gauge
ceph_mgr_metadata{ceph_daemon="mgr.x",hostname="ceph",ceph_version="ceph version 17.0.0-13521-gf9bd8922 (f9bd8922d75a454b4a0ace5a85637213e5ee5700) quincy (dev)"} 1.0
# HELP ceph_mgr_status MGR status (0=standby, 1=active)
# TYPE ceph_mgr_status gauge
ceph_mgr_status{ceph_daemon="mgr.x"} 1.0
...
The above dump was obtained from a run of ceph-dev in my laptop, so many fields will differ from other runs (IPs, timestamps, metric values, etc). One alternative would be going through that dump looking for IP address, hostnames, etc., and replace that with REGEX (e.g.: hostname="172.20.0.2" with hostname=REGEX_HOSTNAME). However that would require a lot of editing and would make it really complicate to update that dump when new fields are added/removed.

So my suggestion would be to implement the 'smartness' in the code instead. Instead of making the changes inline:

We take each metric (including the # HELP and # TYPE metadata) in the tested prometheus and in the reference dump and produce a tuple of three elements to compare:

HELP: strict comparison

TYPE: strict comparison

metric itself: we compare the following:

the metric name (all metrics in the reference dump should be present in the tested dump)

the metric label names (but not the metric label values)

I'm ok with going with this PR as is, but an interesting example would be to run that test in this PR and verify that it fails because the cluster_by_class_* metrics are missing.

What do you think?

Yes, makes sense @epuertat , maybe we can discuss this in the standup call as well to get more clear overview of its implementation

github-actions bot added monitoring pybind labels Jul 19, 2022

aaSharma14 marked this pull request as ready for review July 19, 2022 12:32

aaSharma14 requested review from sunilangadi2, avanthakkar, epuertat and nizamial09 July 19, 2022 12:33

epuertat requested changes Jul 19, 2022

View reviewed changes

src/pybind/mgr/prometheus/module.py Outdated Show resolved Hide resolved

src/pybind/mgr/prometheus/module.py Outdated Show resolved Hide resolved

src/pybind/mgr/prometheus/module.py Outdated Show resolved Hide resolved

aaSharma14 force-pushed the get-storage-class-metrics branch 2 times, most recently from 58742c1 to eb5d69c Compare July 20, 2022 05:30

aaSharma14 requested a review from epuertat July 20, 2022 05:31

nizamial09 approved these changes Jul 20, 2022

View reviewed changes

aaSharma14 force-pushed the get-storage-class-metrics branch from eb5d69c to c4b344e Compare July 20, 2022 05:55

aaSharma14 force-pushed the get-storage-class-metrics branch from c4b344e to 2070567 Compare July 20, 2022 06:08

aaSharma14 changed the title ~~get different storage class metrics~~ mgr/dashboard:Get different storage class metrics in Prometheus dashboard Jul 20, 2022

epuertat approved these changes Jul 20, 2022

View reviewed changes

avanthakkar approved these changes Jul 20, 2022

View reviewed changes

epuertat merged commit 60febf5 into ceph:main Jul 20, 2022

epuertat deleted the get-storage-class-metrics branch July 20, 2022 12:08

aaSharma14 mentioned this pull request Jul 21, 2022

quincy: mgr/dashboard:Get different storage class metrics in Prometheus dashboard #47201

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mgr/dashboard:Get different storage class metrics in Prometheus dashboard #47163

mgr/dashboard:Get different storage class metrics in Prometheus dashboard #47163

aaSharma14 commented Jul 19, 2022 •

edited

epuertat left a comment

nizamial09 left a comment

aaSharma14 commented Jul 20, 2022

epuertat left a comment

aaSharma14 commented Jul 20, 2022

mgr/dashboard:Get different storage class metrics in Prometheus dashboard #47163

mgr/dashboard:Get different storage class metrics in Prometheus dashboard #47163

Conversation

aaSharma14 commented Jul 19, 2022 • edited

Contribution Guidelines

Checklist

epuertat left a comment

Choose a reason for hiding this comment

nizamial09 left a comment

Choose a reason for hiding this comment

aaSharma14 commented Jul 20, 2022

epuertat left a comment

Choose a reason for hiding this comment

aaSharma14 commented Jul 20, 2022

aaSharma14 commented Jul 19, 2022 •

edited