mgr/prometheus: expose ceph healthchecks as metrics #43293
Conversation
@p-se @s0nea here's the draft PR covering the inclusion of the alert changes. I think it covers everything we discussed.

Here's an example of the CLI output:

```
[ceph: root@c8-node1 /]# ceph healthcheck history ls
Healthcheck Name    First Seen (UTC)     Last seen (UTC)      Count  Active
MON_DISK_CRIT       2021/09/24 02:11:24  2021/09/24 02:11:24  1      No
MON_DISK_LOW        2021/09/24 02:10:54  2021/09/24 02:10:54  1      No
OSDMAP_FLAGS        2021/09/16 03:17:47  2021/09/21 23:29:05  9      No
OSD_DOWN            2021/09/17 00:11:59  2021/09/17 00:11:59  1      No
PG_DEGRADED         2021/09/17 00:11:59  2021/09/17 00:11:59  1      No
5 health check(s) listed
```

If you're happy that all the pieces are there, I'll move out of draft to review.
Really promising work and nice refresh of the alerts! Thanks a lot @pcuzner !
I wonder if it'd make sense to have most of this built into the core, extending the mon/health_check.h:health_check_t class to include the missing fields (timestamps), and return the health check stats from the ceph -s command itself, rather than exposing them in the Prom exporter.
I'm a bit wary of persisting this in the mon KV store though... If persisted on the monmap/mgrmap, what else are we missing (inactive alerts? fire timestamp?).
One last thing: I think it would be great if we could somehow display the alerts/events as Grafana annotations, so you can easily correlate how these alerts impact the timeseries (it seems there's an official alpha Alertmanager data source for Grafana, and also a non-official connector).
```
[ceph: root@c8-node1 /]# ceph healthcheck history ls
Healthcheck Name  First Seen (UTC)     Last seen (UTC)      Count  Active
OSDMAP_FLAGS      2021/09/16 03:17:47  2021/09/16 22:07:40  2      No
OSD_DOWN          2021/09/17 00:11:59  2021/09/17 00:11:59  1      Yes
PG_DEGRADED       2021/09/17 00:11:59  2021/09/17 00:11:59  1      Yes
3 health check(s) listed
```
Wouldn't it make sense to expose this in "ceph -s" (or maybe with an extra flag "-c")?
```
# ceph -s
  cluster:
    id:     26df92e6-d9a8-4542-a0f4-712a4b303e6a
    health: HEALTH_WARN
            Healthcheck Name  First Seen   Last seen     Count  Active
            OSDMAP_FLAGS      1 day ago    23 hours ago  2      No
            OSD_DOWN          3 hours ago  now           1      Yes
            PG_DEGRADED       2 hours ago  now           1      Yes

  services:
    mon: 1 daemons, quorum a (age 104s)
    mgr: x(active, since 92s)
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:
```
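The mocked-up output above renders timestamps as relative ages ("1 day ago", "now"). A minimal sketch of such a humanizer, in plain Python — the function name and thresholds here are my own illustration, not code from the PR:

```python
from datetime import datetime, timedelta

def humanize_age(ts: datetime, now: datetime) -> str:
    """Render a timestamp as a rough relative age, e.g. '2 hours ago'."""
    secs = int((now - ts).total_seconds())
    if secs < 60:
        return "now"
    for unit_secs, name in ((86400, "day"), (3600, "hour"), (60, "minute")):
        if secs >= unit_secs:
            n = secs // unit_secs
            return f"{n} {name}{'s' if n > 1 else ''} ago"

now = datetime(2021, 9, 24, 12, 0, 0)
print(humanize_age(now - timedelta(hours=3), now))    # 3 hours ago
print(humanize_age(now - timedelta(seconds=30), now)) # now
```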
I don't think so. Typically 'ceph -s' is focused on current state. Including historical information doesn't make sense IMO.
src/pybind/mgr/prometheus/module.py
```python
class Format(enum.Enum):
    plain = 'plain'
    json = 'json'
    json_pretty = 'json-pretty'
```
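The enum maps the user-supplied format string onto a member via `Format('json-pretty')`-style lookup. A self-contained usage sketch (the enum is redefined locally, and `render` is a hypothetical consumer, not the PR's actual code):

```python
import enum
import json

class Format(enum.Enum):
    plain = 'plain'
    json = 'json'
    json_pretty = 'json-pretty'

def render(data: dict, fmt: Format) -> str:
    # Hypothetical renderer showing how a module might consume Format.
    if fmt is Format.json:
        return json.dumps(data)
    if fmt is Format.json_pretty:
        return json.dumps(data, indent=2)
    return "\n".join(f"{k}: {v}" for k, v in data.items())

# Look up a member from the CLI-supplied string, then render:
fmt = Format('json-pretty')
print(render({"health": "HEALTH_OK"}, fmt))
```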
As this is useful for other modules, let's have it in mgr_module.py.
I agree - the Format class is defined in the Orchestrator module too, with yaml support - but I don't think this PR should be fixing other issues.
```yaml
# @@ -44,6 +45,61 @@ groups:
        - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }}
        {{- end }}

    - alert: Monitor down
      expr: ceph_health_detail{name="MON_DOWN"} == 1 and count(ceph_mon_quorum_status == 1) >= (floor(count(ceph_mon_metadata) / 2) + 1)
      for: 5m
```
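The `floor(count(ceph_mon_metadata) / 2) + 1` term in the alert expression is the monitor quorum majority. A quick sanity check of the arithmetic, mirroring the PromQL functions in plain Python:

```python
import math

def quorum_majority(num_mons: int) -> int:
    """Minimum number of monitors that must be in quorum,
    mirroring floor(count(ceph_mon_metadata) / 2) + 1 in the alert expr."""
    return math.floor(num_mons / 2) + 1

for n in (1, 3, 5):
    print(f"{n} mons -> majority of {quorum_majority(n)}")
```

So with 3 monitors the alert only fires for MON_DOWN while at least 2 remain in quorum; total quorum loss is left to a separate condition.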
Just curious (I know it's not specific to this alert alone), but based on your experience, isn't 5 mins too long? I recently had to reproduce a field issue (an OSD down) and this "delay" (it was the 15min that you changed below) made the alerting feel laggy... Shouldn't these be on the minute scale?
I wondered about where to add this, but the fundamental requirement is to expose this from mgr/prometheus, so that's where I started. The CLI was a bonus to get the most out of the fact that I have to persist healthcheck state - so that's a "value add".

jenkins test dashboard cephadm
Looking forward to seeing it merged!
@epuertat

I do think annotations are great and would also like to see them, however they would need to be implemented in panels, and we'd have to think about where to put them. We do not have a landing page that shows Grafana graphs, and no dedicated panel that shows how Ceph health checks behave. Do you already have an idea where they could be added?

I remember @jecluis mentioning the impact of frequent updates to the mon KV store as a practice to avoid. As for the alternatives: if we want this data to survive across cluster reboots it clearly cannot be in-memory, but OSD storage might be unavailable in the very cases where some of these alerts matter most... Not sure TBH.

I know I have 2 alert rules not unit tested... there may be more. I'm currently checking and will resolve ASAP.

I was just thinking of making them available to the Grafana panels. Enriching timeseries with system events (OSD down, HEALTH_WARN, etc.) helps better correlate cause/effect issues.
```yaml
# CEPHADM orchestrator alert triggers
- interval: 30s
  input_series:
    - series: 'ceph_health_detail{name="UPGRADE_EXCEPTION"}'
      values: '1+0x40'
  promql_expr_test:
    - expr: ceph_health_detail{name="UPGRADE_EXCEPTION"} > 0
      eval_time: 2m
      exp_samples:
        - labels: '{__name__="ceph_health_detail", name="UPGRADE_EXCEPTION"}'
          value: 1
  alert_rule_test:
    - eval_time: 1m
      alertname: Cluster upgrade has failed
    - eval_time: 5m
      alertname: Cluster upgrade has failed
```
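In promtool's unit-test notation, `values: '1+0x40'` means start at 1 and add 0 per step, 40 more times — i.e. 41 flat samples at the test interval. A small sketch of that expansion (my own helper for illustration, not part of promtool):

```python
import re

def expand_series(notation: str) -> list:
    """Expand promtool series notation 'a+bxn' into its samples:
    start value a, increment b per step, repeated n further times
    (n + 1 samples in total)."""
    m = re.fullmatch(r'(-?\d+(?:\.\d+)?)\+(-?\d+(?:\.\d+)?)x(\d+)', notation)
    if not m:
        raise ValueError(f"unsupported notation: {notation}")
    start, step, count = float(m[1]), float(m[2]), int(m[3])
    return [start + i * step for i in range(count + 1)]

samples = expand_series('1+0x40')
print(len(samples), samples[0], samples[-1])  # 41 1.0 1.0
```

With a 30s interval that holds the UPGRADE_EXCEPTION check active for 20 minutes, comfortably covering both the 1m and 5m evaluation points in the test.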
Nice improvement! Do you find the promql_expr_test useful? I discussed with @aaSharma14 whether we could just go for testing via alert_rule_test. For promql_expr_test I'd like to find a way to extract those from the Grafana JSON/jsonnet files and feed them with some captured metrics (rather than synthetic ones).
@aaSharma14 could you please have a look at this one and share your thoughts?

BTW @pcuzner as this PR will now forward the cephadm MTU health check as an alert, please remove the existing MTU alert from the alertmanager. Thanks!

@epuertat tricky one, that. The current MTU check is against the raw metrics, so it will work with both rook and cephadm. Given that, I don't think it makes sense to remove the current check and replace it with a cephadm-centric one.

This is the complete list of alerts:

@sebastian-philipp Also, just to cover the MTU conflict you mention: the config checker is disabled by default, and its MTU check can be disabled independently should the admin consider it a pain - so at worst you have to opt in to get the conflict, and if you do you can simply disable the MTU check. The MTU check within the config checker layer covers cases where the admin installs without 'our' monitoring and opts for the influx or telegraf modules.
Looks great! FYI, I think ceph/src/pybind/mgr/cephadm/serve.py lines 587 to 590 in 84d37ec are missing from this PR.

And the config checker is only usable for cephadm deployments, which means those two checks have somewhat distinct use cases.
Really superb work productizing the alerts @pcuzner ! I just left a few comments about leveraging the existing JSON Schema for the Alertmanager rules file, and also some questions about URLs.
```python
def _check_doclink(self):
    annotations = self.rule.get('annotations', {})
    doclink = annotations.get(DOCLINK_NAME, '')
```
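The snippet pulls the documentation link out of an alert rule's annotations. A simplified, self-contained sketch of the kind of validation being discussed — the `DOCLINK_NAME` value, the function name, and the specific checks here are illustrative, and (unlike the PR's validator) this only inspects the URL shape locally, it does not fetch anything:

```python
from urllib.parse import urlparse

DOCLINK_NAME = 'documentation'  # annotation key; illustrative value

def check_doclink(rule: dict) -> list:
    """Return a list of problems with a rule's documentation annotation."""
    errors = []
    doclink = rule.get('annotations', {}).get(DOCLINK_NAME, '')
    if not doclink:
        errors.append(f"missing '{DOCLINK_NAME}' annotation")
        return errors
    parsed = urlparse(doclink)
    if parsed.scheme not in ('http', 'https'):
        errors.append(f"doclink has unexpected scheme: {doclink}")
    if parsed.netloc != 'docs.ceph.com':
        errors.append(f"doclink does not point at docs.ceph.com: {doclink}")
    return errors

rule = {'alert': 'Monitor down',
        'annotations': {DOCLINK_NAME:
                        'https://docs.ceph.com/en/latest/rados/operations/health-checks/'}}
print(check_doclink(rule))  # []
```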
My main concern about this approach is how to deal with cross-version issues: when this becomes quincy (or gets backported to pacific), it'll still fetch https://docs.ceph.com/en/latest/.... If a check is dropped or renamed in master, this will fail in pacific/quincy... This is where I think we should get involvement from the docs team (@zdover23), @djgalloway or @tchaikov, to have https://docs.ceph.com/en/quincy/ (or the ongoing release) available together with latest.

Regarding downstream productization, one possibility would be to have a ceph_default_alerts.yml.in (with some variable expansion for the documentation URL) or, given we're already using jsonnet for Grafana, rely on that (as Kubernetes does).

Additionally, this will require an internet connection during building, and I remember issues reported from SUSE/FreeBSD about caged build environments (for Promtool/Grafonnet I remember those were disabled until explicitly enabled, which only happened in the Jenkins make check).
@epuertat Agree - and this is something that will need some attention, since we need the alerts backported to Pacific!
This patch creates a health history object maintained in the module's KV store. The history and current health checks are used to create a metric per healthcheck whilst also providing a history feature. Two new commands are added:

    ceph healthcheck history ls
    ceph healthcheck history clear

In addition to the new commands, the additional metrics have been used to update the prometheus alerts.

Fixes: https://tracker.ceph.com/issues/52638
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
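As a rough illustration of the commit's approach — tracking first-seen/last-seen/count per healthcheck and persisting it via the mgr module KV store — here is a self-contained sketch. The store is reduced to a plain dict standing in for the module's set_store/get_store interface, and all names and the exact count semantics are illustrative, not the PR's actual code:

```python
import json
import time

class HealthHistory:
    """Track first-seen/last-seen/count per healthcheck, persisted as JSON.
    `kv` stands in for the mgr module KV store."""
    KEY = 'health_history'

    def __init__(self, kv: dict):
        self.kv = kv
        # Reload any previously persisted history on module start.
        self.checks = json.loads(kv.get(self.KEY, '{}'))

    def note(self, name: str, now: float) -> None:
        """Record an occurrence of a healthcheck and persist the history."""
        entry = self.checks.setdefault(
            name, {'first_seen': now, 'last_seen': now, 'count': 0})
        entry['last_seen'] = now
        entry['count'] += 1
        self.kv[self.KEY] = json.dumps(self.checks)

kv = {}
h = HealthHistory(kv)
h.note('OSD_DOWN', time.time())
h.note('OSD_DOWN', time.time())
# A fresh instance sees the persisted state:
print(HealthHistory(kv).checks['OSD_DOWN']['count'])  # 2
```

This persistence-on-update pattern is also what the KV-store write-frequency concern raised earlier in the thread is about.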
Focus all tests inside a tests directory, and use pytest/tox to perform validation of the overall content. The tox tests also use promtool, if available, to provide rule checks and unittest runs. In addition to these checks, a validate_rules script provides format and content checks against all rules - this is also called via tox (but can be run independently too). Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Signed-off-by: Sebastian Wagner <sewagner@redhat.com>
jenkins retest this please
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Temporary removal of the cmake test integration Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
jenkins retest this please

jenkins test make check
```
pyyaml
bs4
```
BTW I suggest you pin these to specific versions: the Dashboard requirements-lint.txt broke the "make check" test last week because of unpinned deps (lesson learnt).
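For instance, the pinned equivalent might look like the following (the version numbers here are placeholders to show the syntax, not recommendations):

```
pyyaml==5.4.1
bs4==0.0.1
```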

This patch creates a health history object maintained in the module's KV store. The
history and current health checks are used to create a metric per healthcheck whilst
also providing a history feature. Two new commands are added:

    ceph healthcheck history ls
    ceph healthcheck history clear
Fixes: https://tracker.ceph.com/issues/52638
Signed-off-by: Paul Cuzner pcuzner@redhat.com
Checklist
Show available Jenkins commands
- jenkins retest this please
- jenkins test classic perf
- jenkins test crimson perf
- jenkins test signed
- jenkins test make check
- jenkins test make check arm64
- jenkins test submodules
- jenkins test dashboard
- jenkins test dashboard cephadm
- jenkins test api
- jenkins test docs
- jenkins render docs
- jenkins test ceph-volume all
- jenkins test ceph-volume tox