Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mgr/dashboard:Get different storage class metrics in Prometheus dashboard #47163

Merged
merged 1 commit into from Jul 20, 2022

Conversation

aaSharma14
Copy link
Contributor

@aaSharma14 aaSharma14 commented Jul 19, 2022

Fixes: https://tracker.ceph.com/issues/56625
SIgned-off-by:Aashish Sharma aasharma@redhat.com

New metrics-
Screenshot from 2022-07-20 10-57-48

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

Copy link
Member

@epuertat epuertat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome RFE-to-PR time @aaSharma14 ! Some suggestions over there.

src/pybind/mgr/prometheus/module.py Outdated Show resolved Hide resolved
src/pybind/mgr/prometheus/module.py Outdated Show resolved Hide resolved
src/pybind/mgr/prometheus/module.py Outdated Show resolved Hide resolved
@aaSharma14 aaSharma14 force-pushed the get-storage-class-metrics branch 2 times, most recently from 58742c1 to eb5d69c Compare July 20, 2022 05:30
@aaSharma14 aaSharma14 requested a review from epuertat July 20, 2022 05:31
Copy link
Member

@nizamial09 nizamial09 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Is there anything left here @aaSharma14 ?

Also, we should squash the commits and update its title..

…hboard

Get metrics of the different "HDDRule" and "MixedUse" classes of the "Raw Storage" for their ceph VMs. So that Prometheus can scrape the data and display it to them in grafana

Fixes: https://tracker.ceph.com/issues/56625
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
@aaSharma14
Copy link
Contributor Author

LGTM! Is there anything left here @aaSharma14 ?

Also, we should squash the commits and update its title..

Nothing left from my side here @nizamial09 . I have squashed the commits as well

@aaSharma14 aaSharma14 changed the title get different storage class metrics mgr/dashboard:Get different storage class metrics in Prometheus dashboard Jul 20, 2022
Copy link
Member

@epuertat epuertat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks a lot for addressing my comments!

Given we don't have any integration or e2e testing on the mgr/prometheus module, what about running the qa/tasks/mgr/test_prometheus.py test in our API test suite?

I'd suggest adding an additional test there that compares the dump of the tested http://<prometheus>:9283/metrics endpoint with a git-versioned template like this:

# HELP ceph_health_status Cluster health status
# TYPE ceph_health_status untyped
ceph_health_status 1.0
...
# HELP ceph_mon_metadata MON Metadata
# TYPE ceph_mon_metadata untyped
ceph_mon_metadata{ceph_daemon="mon.a",hostname="",public_addr="172.20.0.2",rank="0",ceph_version=""} 1.0
# HELP ceph_mgr_metadata MGR metadata
# TYPE ceph_mgr_metadata gauge
ceph_mgr_metadata{ceph_daemon="mgr.x",hostname="ceph",ceph_version="ceph version 17.0.0-13521-gf9bd8922 (f9bd8922d75a454b4a0ace5a85637213e5ee5700) quincy (dev)"} 1.0
# HELP ceph_mgr_status MGR status (0=standby, 1=active)
# TYPE ceph_mgr_status gauge
ceph_mgr_status{ceph_daemon="mgr.x"} 1.0
...

The above dump was obtained from a run of ceph-dev in my laptop, so many fields will differ from other runs (IPs, timestamps, metric values, etc). One alternative would be going through that dump looking for IP address, hostnames, etc., and replace that with REGEX (e.g.: hostname="172.20.0.2" with hostname=REGEX_HOSTNAME). However that would require a lot of editing and would make it really complicate to update that dump when new fields are added/removed.

So my suggestion would be to implement the 'smartness' in the code instead. Instead of making the changes inline:

  • We take each metric (including the # HELP and # TYPE metadata) in the tested prometheus and in the reference dump and produce a tuple of three elements to compare:
    • HELP: strict comparison
    • TYPE: strict comparison
    • metric itself: we compare the following:
      • the metric name (all metrics in the reference dump should be present in the tested dump)
      • the metric label names (but not the metric label values)

I'm ok with going with this PR as is, but an interesting example would be to run that test in this PR and verify that it fails because the cluster_by_class_* metrics are missing.

What do you think?

@aaSharma14
Copy link
Contributor Author

LGTM. Thanks a lot for addressing my comments!

Given we don't have any integration or e2e testing on the mgr/prometheus module, what about running the qa/tasks/mgr/test_prometheus.py test in our API test suite?

I'd suggest adding an additional test there that compares the dump of the tested http://<prometheus>:9283/metrics endpoint with a git-versioned template like this:

# HELP ceph_health_status Cluster health status
# TYPE ceph_health_status untyped
ceph_health_status 1.0
...
# HELP ceph_mon_metadata MON Metadata
# TYPE ceph_mon_metadata untyped
ceph_mon_metadata{ceph_daemon="mon.a",hostname="",public_addr="172.20.0.2",rank="0",ceph_version=""} 1.0
# HELP ceph_mgr_metadata MGR metadata
# TYPE ceph_mgr_metadata gauge
ceph_mgr_metadata{ceph_daemon="mgr.x",hostname="ceph",ceph_version="ceph version 17.0.0-13521-gf9bd8922 (f9bd8922d75a454b4a0ace5a85637213e5ee5700) quincy (dev)"} 1.0
# HELP ceph_mgr_status MGR status (0=standby, 1=active)
# TYPE ceph_mgr_status gauge
ceph_mgr_status{ceph_daemon="mgr.x"} 1.0
...

The above dump was obtained from a run of ceph-dev in my laptop, so many fields will differ from other runs (IPs, timestamps, metric values, etc). One alternative would be going through that dump looking for IP address, hostnames, etc., and replace that with REGEX (e.g.: hostname="172.20.0.2" with hostname=REGEX_HOSTNAME). However that would require a lot of editing and would make it really complicate to update that dump when new fields are added/removed.

So my suggestion would be to implement the 'smartness' in the code instead. Instead of making the changes inline:

  • We take each metric (including the # HELP and # TYPE metadata) in the tested prometheus and in the reference dump and produce a tuple of three elements to compare:

    • HELP: strict comparison

    • TYPE: strict comparison

    • metric itself: we compare the following:

      • the metric name (all metrics in the reference dump should be present in the tested dump)
      • the metric label names (but not the metric label values)

I'm ok with going with this PR as is, but an interesting example would be to run that test in this PR and verify that it fails because the cluster_by_class_* metrics are missing.

What do you think?

Yes, makes sense @epuertat , maybe we can discuss this in the standup call as well to get more clear overview of its implementation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants