octopus: mgr/dashboard: fix Grafana OSD/host panels #44924

Merged
merged 1 commit into from Mar 10, 2022

p-se
Contributor

@p-se p-se commented Feb 7, 2022

backport tracker: https://tracker.ceph.com/issues/53883


backport of #43685
parent tracker: https://tracker.ceph.com/issues/52974

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/master/src/script/ceph-backport.sh

@p-se p-se requested a review from a team as a code owner February 7, 2022 14:10
@p-se p-se added this to the octopus milestone Feb 7, 2022
@p-se p-se requested review from sunilangadi2 and Sarthak0702 and removed request for a team February 7, 2022 14:10
@p-se p-se added the dashboard label Feb 7, 2022
@p-se p-se requested review from alfonsomthd, aaSharma14 and a team February 7, 2022 14:14
@pereman2
Contributor

pereman2 commented Feb 8, 2022

jenkins test make check

@p-se
Contributor Author

p-se commented Feb 8, 2022

jenkins test make check

Fix issues with PromQL expressions and vector matching with the
`ceph_disk_occupation` metric.

As it turns out, `ceph_disk_occupation` cannot simply be used as
expected, because there are edge cases for users who run several OSDs
on a single disk.  These lead to issues that cannot be solved with
PromQL alone (many-to-many PromQL errors).  In these rare cases, the
data simply differs from what we expected.

I have not found a PromQL-only solution to this issue. What we basically
need is the following.

1. Match on labels `host` and `instance` to get one or more OSD names
   from a metadata metric (`ceph_disk_occupation`) to let a user know
   about which OSDs belong to which disk.

2. Match on the `ceph_daemon` label of the `ceph_disk_occupation` metric,
   in which case the value of `ceph_daemon` must not refer to more than
   a single OSD. This is the exact opposite of requirement 1.

As both operations are currently performed on a single metric, and there
is no way to satisfy both requirements on a single metric, this commit
adds a second, similar metric so that each metric satisfies one of the
requirements. This enables queries to differentiate between a vector
matching operation that shows a string to the user (where `ceph_daemon`
could be `osd.1` or `osd.1+osd.2`) and one that matches vectors on a
single `ceph_daemon` value.

Although the `ceph_daemon` label is used on a variety of daemons, only
OSDs seem to be affected by this issue (only if more than one OSD is run
on a single disk).  This means that only the `ceph_disk_occupation`
metadata metric seems to need to be extended and provided as two
metrics.
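
To illustrate the split, here is a minimal Python sketch of how the two
label sets could be derived from the same inventory. It is not the code in
`src/pybind/mgr/prometheus/module.py`; the function name
`build_occupation_metrics` and the sample records are hypothetical. The
only details taken from the description above are the per-OSD
`ceph_daemon` value and the `+`-joined form such as `osd.1+osd.2`.

```python
from collections import defaultdict


def build_occupation_metrics(records):
    """Derive two label sets from (ceph_daemon, device, instance) tuples.

    Hypothetical helper for illustration only.
    """
    # One series per OSD: `ceph_daemon` always names exactly one daemon,
    # so it is safe to match on (like `ceph_disk_occupation`).
    per_daemon = [
        {"ceph_daemon": daemon, "device": device, "instance": instance}
        for daemon, device, instance in records
    ]

    # One series per (device, instance): all OSDs sharing the disk are
    # joined with "+" for display (like `ceph_disk_occupation_human`).
    grouped = defaultdict(list)
    for daemon, device, instance in records:
        grouped[(device, instance)].append(daemon)
    human = [
        {"ceph_daemon": "+".join(daemons), "device": device, "instance": instance}
        for (device, instance), daemons in grouped.items()
    ]
    return per_daemon, human


# Assumed sample data: two OSDs share /dev/sdb on node1.
records = [
    ("osd.1", "sdb", "node1"),
    ("osd.2", "sdb", "node1"),
    ("osd.3", "sdc", "node1"),
]
per_daemon, human = build_occupation_metrics(records)
for labels in human:
    print(labels)
```

In this sketch the per-OSD list keeps three entries, while the grouped list
collapses the two OSDs on `sdb` into a single entry.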

`ceph_disk_occupation` is supposed to be used for matching the
`ceph_daemon` label value.

    foo * on(ceph_daemon) group_left ceph_disk_occupation

`ceph_disk_occupation_human` is supposed to be used for anything where
the resulting data is displayed to be consumed by humans (graphs, alert
messages, etc).

    foo * on(device,instance)
    group_left(ceph_daemon) ceph_disk_occupation_human
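
The difference between the two queries can be sketched with a simplified
Python model of many-to-one vector matching. This is only an approximation
of what Prometheus does (sample values are ignored, and the helper
`group_left_join` and all sample series are hypothetical), but it shows why
the "one" side of a `group_left` match must be unique per matching key.

```python
def group_left_join(many, one, on, copy_labels=()):
    """Simplified model of `many * on(...) group_left(...) one`."""
    index = {}
    for series in one:
        key = tuple(series[label] for label in on)
        if key in index:
            # Prometheus rejects this case as many-to-many matching.
            raise ValueError(f"many-to-many matching for {dict(zip(on, key))}")
        index[key] = series

    result = []
    for series in many:
        match = index.get(tuple(series[label] for label in on))
        if match is None:
            continue  # no partner on the "one" side, series is dropped
        joined = dict(series)
        for label in copy_labels:
            joined[label] = match[label]
        result.append(joined)
    return result


# Assumed sample series: two OSDs share /dev/sdb on node1.
occupation = [
    {"ceph_daemon": "osd.1", "device": "sdb", "instance": "node1"},
    {"ceph_daemon": "osd.2", "device": "sdb", "instance": "node1"},
]
occupation_human = [
    {"ceph_daemon": "osd.1+osd.2", "device": "sdb", "instance": "node1"},
]
disk_series = [{"device": "sdb", "instance": "node1"}]
osd_series = [{"ceph_daemon": "osd.1"}, {"ceph_daemon": "osd.2"}]

# Matching on ceph_daemon works with the per-OSD metric.
by_daemon = group_left_join(osd_series, occupation, on=("ceph_daemon",))

# Matching on (device, instance) works with the human-readable metric.
by_disk = group_left_join(
    disk_series, occupation_human,
    on=("device", "instance"), copy_labels=("ceph_daemon",),
)

# The same match against the per-OSD metric is ambiguous and fails.
try:
    group_left_join(disk_series, occupation, on=("device", "instance"))
    ambiguous = False
except ValueError:
    ambiguous = True
```

Under these assumptions, each metric serves exactly one of the two matching
patterns, and using the per-OSD metric for the disk-level join fails.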

Fixes: https://tracker.ceph.com/issues/52974

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
(cherry picked from commit 18d3a71)

Conflicts:
        monitoring/grafana/dashboards/host-details.json
        monitoring/grafana/dashboards/hosts-overview.json
        monitoring/grafana/dashboards/jsonnet/grafana_dashboards.jsonnet
        monitoring/grafana/dashboards/osd-device-details.json
        monitoring/grafana/dashboards/tests/features/hosts_overview.feature
        src/pybind/mgr/prometheus/module.py

- Octopus does not generate Grafana dashboards using jsonnet, hence
  grafana_dashboards.jsonnet was removed.
- Octopus does not support features, hence hosts_overview.feature was
  removed.
- Features implemented in prometheus/module.py that were never
  backported to Octopus were removed.
- `tox.ini` file adapted to include mgr/prometheus tests introduced by
  the backport.
- Add `cherrypy` to src/pybind/mgr/requirements.txt to fix Prometheus
  unit testing.
@nizamial09 nizamial09 added this to In progress in Dashboard via automation Feb 15, 2022
@p-se p-se requested a review from a team February 16, 2022 15:43
@pereman2 pereman2 moved this from In progress to needs-qa in Dashboard Feb 22, 2022
@pereman2 pereman2 moved this from needs-qa to Ready-to-merge in Dashboard Feb 22, 2022
@p-se p-se added the bug-fix label Mar 3, 2022
@epuertat epuertat merged commit 51c9d99 into ceph:octopus Mar 10, 2022
Dashboard automation moved this from Ready-to-merge to Done Mar 10, 2022