
mgr/dashboard: fix Grafana OSD/host panels #43685

Merged: 4 commits into ceph:master from the fix-grafana-graphs-ceph_daemon branch on Jan 14, 2022

Conversation

@p-se (Contributor) commented Oct 27, 2021

Fix issues with PromQL expressions and vector matching with the
ceph_disk_occupation metric.

As it turns out, ceph_disk_occupation cannot simply be used as
expected: there are edge cases for users that have several OSDs on a
single disk. This leads to issues which cannot be solved by PromQL
alone (many-to-many PromQL errors). The data we expected simply looks
different in some rare cases.

I have not found a PromQL-only solution to this issue. What we basically
need is the following.

  1. Match on the host and instance labels to get one or more OSD names
    from a metadata metric (ceph_disk_occupation), to let a user know
    which OSDs belong to which disk.

  2. Match on the ceph_daemon label of the ceph_disk_occupation metric,
    in which case the value of ceph_daemon must not refer to more than
    a single OSD. The exact opposite of requirement 1.

As both operations are currently performed on a single metric, and there
is no way to satisfy both requirements with a single metric, this commit
extends the metric by providing a similar metric that satisfies one of
the requirements. This enables the queries to differentiate between a
vector matching operation used to show a string to the user (where
ceph_daemon may be osd.1 or osd.1+osd.2) and one that matches a vector
on a single ceph_daemon in the matching condition.

Although the ceph_daemon label is used on a variety of daemons, only
OSDs seem to be affected by this issue (and only if more than one OSD
runs on a single disk). This means that only the ceph_disk_occupation
metadata metric needs to be extended and provided as two metrics.

ceph_disk_occupation is supposed to be used for matching the
ceph_daemon label value.

foo * on(ceph_daemon) group_left ceph_disk_occupation

ceph_disk_occupation_human is supposed to be used for anything where
the resulting data is displayed for human consumption (graphs, alert
messages, etc.).

foo * on(device,instance)
group_left(ceph_daemon) ceph_disk_occupation_human
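
To make the relationship between the two metrics concrete, here is a minimal, self-contained sketch (not the actual mgr/prometheus implementation; the function name, sample layout and the "+" separator are illustrative) of how the human-readable metric can be derived from the per-OSD one by grouping on (device, instance):

from collections import defaultdict
from typing import Dict, Tuple

Labels = Tuple[Tuple[str, str], ...]  # (label name, value) pairs in a fixed order

def derive_human_metric(occupation: Dict[Labels, float], sep: str = "+") -> Dict[Labels, float]:
    # Collapse ceph_disk_occupation samples that share (device, instance)
    # into a single sample whose ceph_daemon lists all co-located OSDs.
    grouped: Dict[Tuple[str, str], set] = defaultdict(set)
    for labels, _value in occupation.items():
        d = dict(labels)
        grouped[(d["device"], d["instance"])].add(d["ceph_daemon"])
    human: Dict[Labels, float] = {}
    for (device, instance), daemons in grouped.items():
        key: Labels = (("ceph_daemon", sep.join(sorted(daemons))),
                       ("device", device),
                       ("instance", instance))
        human[key] = 1.0  # metadata metric: the value itself carries no meaning
    return human

samples = {
    (("ceph_daemon", "osd.1"), ("device", "sdb"), ("instance", "host1")): 1.0,
    (("ceph_daemon", "osd.2"), ("device", "sdb"), ("instance", "host1")): 1.0,
}
print(derive_human_metric(samples))
# {(('ceph_daemon', 'osd.1+osd.2'), ('device', 'sdb'), ('instance', 'host1')): 1.0}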

Fixes: https://tracker.ceph.com/issues/52974

Signed-off-by: Patrick Seidensal pseidensal@suse.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@p-se p-se changed the title Fix grafana graphs ceph daemon mgr/dashboard: fix Grafana OSD/host panels Oct 27, 2021
@p-se p-se requested review from pcuzner, tchaikov, epuertat, aaSharma14 and a team and removed request for tchaikov October 27, 2021 09:05
@p-se p-se added the bug-fix label Oct 27, 2021
@p-se p-se force-pushed the fix-grafana-graphs-ceph_daemon branch from 90e03a8 to 1caecae on October 27, 2021 09:44
@github-actions github-actions bot added this to In progress in Dashboard Oct 27, 2021
@p-se p-se requested a review from alfonsomthd October 28, 2021 08:45
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@p-se p-se force-pushed the fix-grafana-graphs-ceph_daemon branch from 1caecae to 2d93396 on November 8, 2021 09:58
@p-se p-se force-pushed the fix-grafana-graphs-ceph_daemon branch from 2d93396 to 5aac4ed on November 8, 2021 11:28
@p-se p-se marked this pull request as ready for review November 8, 2021 15:27
@alfonsomthd (Contributor) left a comment

UPDATE: see the follow-up comment below.

@p-se I tested it and I got the "many-to-many matching not allowed: matching labels must be unique on one side" error. Can you take a look at it?
Screenshot from 2021-11-18 09-52-56

"found duplicate series for the match group {ceph_daemon="osd.1", device="dm-0", instance="ceph-node-01"} on the right hand-side of the operation: [{name="ceph_disk_occupation", ceph_daemon="osd.1", device="dm-0", device_ids="N/A", devices="vdb", instance="ceph-node-01", job="ceph"}, {name="ceph_disk_occupation", ceph_daemon="osd.1", device="dm-0", instance="ceph-node-01", job="ceph"}];many-to-many matching not allowed: matching labels must be unique on one side"

Dashboard automation moved this from In progress to Review in progress Nov 18, 2021
@alfonsomthd (Contributor) commented Nov 18, 2021

@p-se I tested it and I got the "many-to-many matching not allowed: matching labels must be unique on one side" error. Can

Sorry @p-se I was testing with an outdated grafana panel so ignore my previous comment.

The issue I'm facing now is that ceph_disk_occupation_human metric is being generated
Screenshot from 2021-11-18 11-51-42

Please look at this.

Dashboard automation moved this from Review in progress to Reviewer approved Dec 13, 2021
@p-se p-se force-pushed the fix-grafana-graphs-ceph_daemon branch from 6eca893 to b10fbc1 on December 13, 2021 14:11
@avanthakkar (Contributor)

I tried testing some IOPS panels, but I don't see any data.
Steps:

  1. Used this benchmark for IOPS.
  2. Checked Grafana panels for OSD R/W IOPS (IOPS are shown fine)
    Screenshot from 2021-12-13 22-28-46
  3. Checked Host Details -> OSD Disk Performance Statistics -> No data
    Screenshot from 2021-12-13 22-29-00
    ^^ The ceph_disk_occupation_human Prometheus query shows no data; I think that's why these panels show no data.
  4. Same for Avg. Disk Utilization
    Screenshot from 2021-12-13 22-29-08

I think it's because the per-host (host details) OSD statistics showed no data, so the average will be the same.
The only reason I could find for the "No data" panels is that ceph_disk_occupation_human shows no data,
because other queries grouped with this metric work fine; e.g. node_disk_io_time_seconds_total shows correct data.

@p-se (Contributor, Author) commented Dec 13, 2021

I tried testing some IOPS panels, but I don't see any data. Steps:

Thank you @avanthakkar. I'll check tomorrow and get back to you!

@p-se (Contributor, Author) commented Dec 14, 2021

@avanthakkar I've checked and it makes sense to me. I also do not have a ceph_disk_occupation metric on my vstart cluster, and I didn't have it before I created this PR either. The reason is that, without virtualization, a Ceph cluster doesn't seem to be able to request some required information from my disk. This leads to the mgr/prometheus module not being able to provide data for the ceph_disk_occupation metric, and hence the ceph_disk_occupation_human metric cannot be derived from it. I'm assuming you have the same issue.

So, you'll likely either need to use virtualization (not sure if kcli can already be used for it) or tamper with the mgr/prometheus/module.py file. Unfortunately, I no longer have the patch at hand; I'd need to rebuild Ceph and am having an issue doing so at the moment.

@p-se p-se force-pushed the fix-grafana-graphs-ceph_daemon branch from b10fbc1 to b95b6a0 on December 14, 2021 14:53
@avanthakkar (Contributor)

@avanthakkar I've checked and it makes sense to me. [...]

So, you'll likely either need to use virtualization (not sure if kcli can already be used for it) or tamper with the mgr/prometheus/module.py file. [...]

Yes, it worked with kcli env. Thanks @p-se

@avanthakkar (Contributor)

jenkins test dashboard cephadm

@avanthakkar (Contributor)

jenkins test dashboard

@alfonsomthd alfonsomthd moved this from Reviewer approved to Ready-to-merge in Dashboard Dec 15, 2021
@epuertat (Member) left a comment

Really nice work here @p-se (and über-sorry for the late reply)! I left some comments over there. None of them is a hard blocker, since this fixes an actual issue, but given that it introduces a new metric (and deprecates an existing one), I wondered if we should take the opportunity to clean up the metric in order to simplify the resulting queries.

@@ -236,11 +236,11 @@ drive statistics, special series are output like this:

::

ceph_disk_occupation{ceph_daemon="osd.0",device="sdd", exported_instance="myhost"}
ceph_disk_occupation_human{ceph_daemon="osd.0", device="sdd", exported_instance="myhost"}
Member

Here it mentions that this is a metadata label (a label only used for join-like queries), so, as we're using the _metadata suffix for other metrics like this, why not use it here too?:

Suggested change
ceph_disk_occupation_human{ceph_daemon="osd.0", device="sdd", exported_instance="myhost"}
ceph_disk_metadata{ceph_daemon="osd.0", device="sdd", exported_instance="myhost"}

Or, if we want to keep the _occupation_ part, why not add a _v2? The _human suffix sounds weird to me, and I cannot relate it to the formerly existing ceph_disk_occupation metric:

Suggested change
ceph_disk_occupation_human{ceph_daemon="osd.0", device="sdd", exported_instance="myhost"}
ceph_disk_occupation_v2{ceph_daemon="osd.0", device="sdd", exported_instance="myhost"}

Comment on lines +19 to +22
| node_disk_writes_completed_total{device="sda",instance="localhost:9100"} | 10+60x1 |
| node_disk_writes_completed_total{device="sdb",instance="localhost:9100"} | 10+60x1 |
| ceph_disk_occupation_human{ceph_daemon="osd.0 osd.1 osd.2",device="/dev/sda",instance="localhost:9283"} | 1.0 |
| ceph_disk_occupation_human{ceph_daemon="osd.3 osd.4 osd.5",device="/dev/sdb",instance="localhost:9283"} | 1.0 |
Member

I'm wondering: given we're creating a new metadata metric from the old ceph_disk_occupation, why not take the chance to fix other issues? All the metrics involving ceph_disk_occupation require a doubly nested label_replace():

label_replace(irate(node_disk_read_bytes_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and
  on (instance, device)
  label_replace(
    label_replace(ceph_disk_occupation_human{ceph_daemon=~"$osd"}, "device", "$1", "device", "/dev/(.*)"),
  "instance", "$1", "instance", "([^:.]*).*")

By:

  • Aligning device names (in node_disk_* metrics devices are sd*, while in ceph_disk_* ones they are /dev/sd*). Why not align the new ceph_disk_occupation to use sd* too? That way we could remove several label_replace() calls all over the code,
  • Aligning instance names,

we should expect the same PromQL query to work as:

irate(node_disk_read_bytes_total[1m]) and on (instance, device) ceph_disk_occupation_human{ceph_daemon=~"$osd"}
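
A rough sketch of what that alignment could look like on the exporter side (hypothetical helper, not part of this PR): normalize the label values before exposing them so they match the node_exporter conventions and the doubly nested label_replace() becomes unnecessary.

import re

def normalize_labels(device: str, instance: str) -> tuple:
    # Strip the /dev/ prefix from the device and the domain/port from the
    # instance so that ceph_disk_* labels join directly with node_disk_*.
    device = re.sub(r"^/dev/", "", device)       # "/dev/sda"          -> "sda"
    instance = re.sub(r"[:.].*$", "", instance)  # "ceph-node-01:9283" -> "ceph-node-01"
    return device, instance

print(normalize_labels("/dev/sda", "ceph-node-01.example.com:9283"))
# ('sda', 'ceph-node-01')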

from typing import DefaultDict, Optional, Dict, Any, Set, cast, Tuple, Union, List, Callable

LabelValues = Tuple[str, ...]
Number = Union[int, float]
Member

from numbers import Number:

Suggested change
Number = Union[int, float]

That said, type hints already assume that float includes int, so the alias may not be needed at all.
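
A small standalone illustration of that point (not part of the module): type checkers already accept an int wherever float is annotated, so a separate Number alias is mostly redundant for annotations.

def scale(value: float, factor: float = 2.0) -> float:
    # "float" in annotations also accepts int arguments, so no Union or
    # numbers.Number alias is required for a function like this.
    return value * factor

print(scale(3))    # int passed where float is annotated: prints 6.0
print(scale(1.5))  # prints 3.0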

Comment on lines +375 to +380
def group_by(
self,
keys: List[str],
joins: Dict[str, Callable[[List[str]], str]],
name: Optional[str] = None,
) -> "Metric":
Member

I find this method tries to do too many things (multiple keys to group by, a callable for generating the new label). Can't we make some assumptions and simplify it a bit?:

  • Single label to group by (you can achieve the same behaviour by nesting group_by),
  • Fixed format for the new label.
Suggested change
def group_by(
self,
keys: List[str],
joins: Dict[str, Callable[[List[str]], str]],
name: Optional[str] = None,
) -> "Metric":
def group_by(
self,
key: str,
) -> "Metric":

Comment on lines +384 to +385
Label names not passed are being removed from the resulting metric but
by providing a join function, labels of metrics can be grouped.
Member

What for? What's the use case here?

... ('foo', 'y'): 10,
... }
>>> m.group_by(['label1'], {'id': lambda ids: ','.join(ids)}).value
{('foo', 'x,y'): 555}
Member

So we're actually taking an arbitrary value here... Why not assume this only makes sense for metadata metrics and ignore the value (set it to the default 1.0)?

Comment on lines +423 to +424
for key in keys:
assert key in self.labelnames, "unknown key: {}".format(key)
Member

Why force this check? If we allow all label names to pass...

for key in keys:
assert key in self.labelnames, "unknown key: {}".format(key)
assert joins, "joins must not be empty"
assert all(callable(c) for c in joins.values()), "joins must be callable"
Member

Can we provide a default join behaviour (e.g.: comma-separated), rather than requiring the user to provide their own join method?

Comment on lines +429 to +435
grouped: Dict[LabelValues, List[Tuple[Dict[str, str], Number]]] = defaultdict(list)
for label_values, metric_value in self.value.items():
labels = dict(zip(self.labelnames, label_values))
if not all(k in labels for k in keys):
continue
group_key = tuple(labels[k] for k in keys)
grouped[group_key].append((labels, metric_value))
Member

This is almost the same as invoking itertools.groupby (I think you would need to sort the collection beforehand).

After that, and given the assumptions mentioned above, the only thing left would be collapsing the label values (joining them with commas when they differ):

[','.join(sorted(set(i))) for i in zip(*labels)]
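
A rough, runnable sketch of that idea (sample layout and separator are illustrative, not the module's API):

from itertools import groupby

# Each sample is a tuple of label values in a fixed label order:
# (ceph_daemon, device, instance).
samples = [
    ("osd.1", "sdb", "host1"),
    ("osd.2", "sdb", "host1"),
    ("osd.3", "sdc", "host1"),
]

key = lambda labels: (labels[1], labels[2])  # group by (device, instance)
collapsed = []
for _, group in groupby(sorted(samples, key=key), key=key):
    labels = list(group)
    # Join differing values with commas; identical values stay unchanged.
    collapsed.append(tuple(",".join(sorted(set(values))) for values in zip(*labels)))

print(collapsed)
# [('osd.1,osd.2', 'sdb', 'host1'), ('osd.3', 'sdc', 'host1')]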

Comment on lines +440 to +445
labelnames = tuple(
label for label in self.labelnames if label in keys or label in joins
)
superfluous_labelnames = [
label for label in self.labelnames if label not in labelnames
]
Member

For this you could benefit from using set, which supports intersection and other set operations.
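
For illustration, a minimal sketch of that suggestion (label and variable names are made up):

labelnames = ("ceph_daemon", "device", "device_ids", "devices", "instance")
keys = {"device", "instance"}
joins = {"ceph_daemon"}

kept = set(labelnames) & (keys | joins)   # labels to keep
superfluous = set(labelnames) - kept      # labels to drop
# Preserve the original label order for the kept names.
ordered_kept = tuple(label for label in labelnames if label in kept)

print(ordered_kept)         # ('ceph_daemon', 'device', 'instance')
print(sorted(superfluous))  # ['device_ids', 'devices']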

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@p-se p-se force-pushed the fix-grafana-graphs-ceph_daemon branch from b95b6a0 to 6da154e on January 12, 2022 17:02
@p-se p-se force-pushed the fix-grafana-graphs-ceph_daemon branch from 6da154e to 7d74880 on January 13, 2022 12:28
@p-se p-se requested a review from a team as a code owner January 13, 2022 12:28
@p-se (Contributor, Author) commented Jan 14, 2022

jenkins test dashboard cephadm

1 similar comment
@callithea (Member)

jenkins test dashboard cephadm

@epuertat epuertat merged commit f7cd4b2 into ceph:master Jan 14, 2022
Dashboard automation moved this from Ready-to-merge to Done Jan 14, 2022
Projects: Dashboard (Done)
7 participants