
mgr/prometheus: expose ceph healthchecks as metrics #43293

Merged
merged 5 commits into ceph:master on Oct 28, 2021

Conversation

Contributor

@pcuzner pcuzner commented Sep 24, 2021

This patch creates a health history object maintained in the module's KV store. The
history and the current health checks are used to create a metric per healthcheck,
while also providing a history feature. Two new commands are added:

  • ceph healthcheck history ls
  • ceph healthcheck history clear

Fixes: https://tracker.ceph.com/issues/52638

Signed-off-by: Paul Cuzner pcuzner@redhat.com
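The description leaves the shape of the stored record open. As a hedged illustration only (the field names, the "health_history" key, and the plain-dict stand-in for the mgr KV store are assumptions of mine, not the PR's actual schema), a per-healthcheck history entry might look like:

```python
import json
from dataclasses import dataclass, asdict

# Illustrative sketch only: field names, the "health_history" key, and the
# plain dict standing in for the mgr KV store are assumptions, not the PR's code.
@dataclass
class HealthHistoryEntry:
    first_seen: float  # epoch seconds when the healthcheck first fired
    last_seen: float   # epoch seconds when it most recently fired
    count: int         # how many times it has fired
    active: bool       # whether it is currently raised

def record_check(store, name, now):
    """Create or update the history entry for healthcheck `name`."""
    raw = store.get("health_history")
    history = json.loads(raw) if raw else {}
    if name in history:
        entry = history[name]
        entry["last_seen"] = now
        entry["count"] += 1
        entry["active"] = True
    else:
        history[name] = asdict(HealthHistoryEntry(now, now, 1, True))
    # In a real mgr module this would be self.set_store("health_history", ...)
    store["health_history"] = json.dumps(history)

store = {}
record_check(store, "OSD_DOWN", 1631844719.0)
record_check(store, "OSD_DOWN", 1631931119.0)
print(json.loads(store["health_history"])["OSD_DOWN"]["count"])  # 2
```

Serializing the whole history under one key keeps each update to a single KV write, at the cost of rewriting the full blob on every change.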

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

Contributor Author

pcuzner commented Sep 24, 2021

@p-se @s0nea here's the draft PR covering the alert changes. I think it covers everything we discussed:

  • new metrics,
  • history of healthchecks
  • CLI interaction to view prior healthchecks
  • updated default alerts.

Here's an example of the CLI output:

[ceph: root@c8-node1 /]# ceph healthcheck history ls 
Healthcheck Name          First Seen (UTC)      Last seen (UTC)       Count  Active
MON_DISK_CRIT             2021/09/24 02:11:24   2021/09/24 02:11:24       1    No  
MON_DISK_LOW              2021/09/24 02:10:54   2021/09/24 02:10:54       1    No  
OSDMAP_FLAGS              2021/09/16 03:17:47   2021/09/21 23:29:05       9    No  
OSD_DOWN                  2021/09/17 00:11:59   2021/09/17 00:11:59       1    No  
PG_DEGRADED               2021/09/17 00:11:59   2021/09/17 00:11:59       1    No  
5 health check(s) listed

If you're happy that all the pieces are there, I'll move this out of draft and into review.

@pcuzner pcuzner requested a review from s0nea September 24, 2021 02:41
Member

@epuertat epuertat left a comment


Really promising work and nice refresh of the alerts! Thanks a lot @pcuzner !

I wonder if it'd make sense to have most of this built into the core, extending the mon/health_check.h:health_check_t class to include the missing fields (timestamps), and return the health check stats from the ceph -s command itself, rather than exposing them in the Prometheus exporter.

I'm a bit wary of persisting this in the mon KV store though... If persisted on the monmap/mgrmap, what else are we missing (inactive alerts? fire timestamp?).

One last thing: I think it would be great if we could somehow display the alerts/events as Grafana annotations, so you can easily correlate how these alerts impact the timeseries (it seems there's an official alpha Alertmanager data source for Grafana, and also a non-official connector).

src/pybind/mgr/prometheus/module.py (outdated, resolved)
Comment on lines +130 to +135
[ceph: root@c8-node1 /]# ceph healthcheck history ls
Healthcheck Name First Seen (UTC) Last seen (UTC) Count Active
OSDMAP_FLAGS 2021/09/16 03:17:47 2021/09/16 22:07:40 2 No
OSD_DOWN 2021/09/17 00:11:59 2021/09/17 00:11:59 1 Yes
PG_DEGRADED 2021/09/17 00:11:59 2021/09/17 00:11:59 1 Yes
3 health check(s) listed
Member


Wouldn't it make sense to expose this in "ceph -s" (or maybe with an extra flag "-c")?

# ceph -s
  cluster:
    id:     26df92e6-d9a8-4542-a0f4-712a4b303e6a
    health: HEALTH_WARN
        Healthcheck Name          First Seen            Last seen             Count  Active
        OSDMAP_FLAGS              1 day ago             23 hours ago              2      No
        OSD_DOWN                  3 hours ago           now                       1     Yes
        PG_DEGRADED               2 hours ago           now                       1     Yes
 
  services:
    mon: 1 daemons, quorum a (age 104s)
    mgr: x(active, since 92s)
    osd: 0 osds: 0 up, 0 in
 
  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:  

Contributor Author

@pcuzner pcuzner Sep 28, 2021


I don't think so. Typically 'ceph -s' is focused on current state. Including historical information doesn't make sense IMO.

Comment on lines 142 to 145
class Format(enum.Enum):
plain = 'plain'
json = 'json'
json_pretty = 'json-pretty'
Member


As this is useful for other modules, let's have it in mgr_module.py.

Contributor Author


I agree: the Format class is also defined in the Orchestrator module (with YAML support), but I don't think this PR should be fixing other issues.

monitoring/prometheus/alerts/ceph_default_alerts.yml (outdated, resolved)
monitoring/prometheus/alerts/ceph_default_alerts.yml (outdated, resolved)
@@ -44,6 +45,61 @@ groups:
- {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }}
{{- end }}

- alert: Monitor down
expr: ceph_health_detail{name="MON_DOWN"} == 1 and count(ceph_mon_quorum_status == 1) >= (floor(count(ceph_mon_metadata) / 2) + 1)
for: 5m
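The second clause of this expression is majority-quorum arithmetic: the alert fires for a down monitor only while at least floor(n/2)+1 monitors remain in quorum (total quorum loss is covered by the separate "quorum is at risk" alert). A small sketch of that logic:

```python
import math

def quorum_majority(total_mons):
    """Smallest monitor count that still forms a majority: floor(n/2) + 1."""
    return math.floor(total_mons / 2) + 1

def mon_down_fires(mons_in_quorum, total_mons):
    """Mirror of the alert's second clause: quorum must still hold."""
    return mons_in_quorum >= quorum_majority(total_mons)

print(quorum_majority(3), quorum_majority(5))      # 2 3
print(mon_down_fires(2, 3), mon_down_fires(1, 3))  # True False
```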
Member


Just curious (I know it's not specific to this alert alone), but based on your experience, isn't 5 mins too long? I recently had to reproduce a field issue (OSD down) and this "delay" (it was the 15 min one that you changed below) felt like laggy alerting... Shouldn't these be on the minute scale?

src/pybind/mgr/prometheus/module.py (resolved)
@pcuzner
Contributor Author

pcuzner commented Sep 28, 2021

Really promising work and nice refresh of the alerts! Thanks a lot @pcuzner !

I wonder if it'd make sense to have most of this built into the core, extending the mon/health_check.h:health_check_t class to include the missing fields (timestamps), and return the health check stats from the ceph -s command itself, rather than exposing them in the Prometheus exporter.

I'm a bit wary of persisting this in the mon KV store though... If persisted on the monmap/mgrmap, what else are we missing (inactive alerts? fire timestamp?).

One last thing: I think it would be great if we could somehow display the alerts/events as Grafana annotations, so you can easily correlate how these alerts impact the timeseries (it seems there's an official alpha Alertmanager data source for Grafana, and also a non-official connector).

I wondered about where to add this, but the fundamental requirement is to expose this from mgr/prometheus, so that's where I started. The CLI was a bonus to get the most out of the fact that I have to persist healthcheck state - so that's a "value add".
What's the issue with using the KV store? That's the established mgr module pattern, right? So what's the alternative?

@nizamial09
Member

jenkins test dashboard cephadm

Contributor

@p-se p-se left a comment


Looking forward to seeing it merged!

@epuertat
I do think annotations are great and would also like to see them; however, they need to be implemented in panels, and we'd have to think about where to put them. We don't have a landing page that shows Grafana graphs, nor a dedicated panel that shows how Ceph health checks behave. Do you already have an idea where they could be added?

src/pybind/mgr/prometheus/module.py (outdated, resolved)
src/pybind/mgr/prometheus/module.py (outdated, resolved)
@epuertat
Member

I wondered about where to add this, but the fundamental requirement is to expose this from mgr/prometheus, so that's where I started. The CLI was a bonus to get the most out of the fact that I have to persist healthcheck state - so that's a "value add". What's the issue with using the KV store? That's the established mgr module pattern, right? So what's the alternative?

I remember @jecluis mentioning that frequent updates to the mon KV store are not a recommended practice. As for alternatives: if we want this data to survive cluster reboots it clearly can't be in-memory, but OSD storage might be unavailable in exactly the cases where some of these alerts matter most... Not sure TBH.
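One possible mitigation for the frequent-update concern, sketched here purely as an assumption (nothing like this is claimed to be in the PR), is to batch dirty state in memory and flush it to the KV store at most once per interval:

```python
import json
import time

# Hypothetical sketch: throttle KV-store writes by batching them in memory.
# The `_backing` dict stands in for MgrModule.set_store(); the class name and
# interval are illustrative, not the PR's implementation.
class ThrottledStore:
    def __init__(self, flush_interval=60.0):
        self.flush_interval = flush_interval
        self._backing = {}   # stand-in for the real mgr KV store
        self._pending = {}   # dirty keys waiting to be flushed
        self._last_flush = time.monotonic()

    def put(self, key, value):
        self._pending[key] = value
        if time.monotonic() - self._last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        for k, v in self._pending.items():
            self._backing[k] = json.dumps(v)  # self.set_store(k, ...) in a mgr module
        self._pending.clear()
        self._last_flush = time.monotonic()

s = ThrottledStore(flush_interval=0.0)  # interval 0: flush on every put
s.put("health_history", {"OSD_DOWN": {"count": 1}})
print("health_history" in s._backing)   # True
```

The trade-off is that up to one interval's worth of history can be lost on an unclean mgr restart.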

@pcuzner
Contributor Author

pcuzner commented Oct 1, 2021

I know I have two alert rules that aren't unit tested... there may be more. I'm currently checking and will resolve ASAP.

@epuertat
Member

epuertat commented Oct 1, 2021

@epuertat I do think annotations are great and would also like to see them; however, they need to be implemented in panels, and we'd have to think about where to put them. We don't have a landing page that shows Grafana graphs, nor a dedicated panel that shows how Ceph health checks behave. Do you already have an idea where they could be added?

I was just thinking of making them available to the Grafana panels. Enriching timeseries with system events (OSD down, HEALTH_WARN, etc.) helps better correlate cause and effect.

[screenshot: Grafana panel showing alert annotations overlaid on a timeseries]
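For reference, Grafana's annotation HTTP API (POST /api/annotations) takes a JSON body with a millisecond timestamp, text, and tags. A hedged sketch of building such a payload for a healthcheck event follows; actually sending it would need a Grafana URL and API token (not shown), and the text/tag choices are mine:

```python
import json

def annotation_payload(check_name, epoch_ms):
    """Build the JSON body Grafana's POST /api/annotations endpoint expects."""
    return json.dumps({
        "time": epoch_ms,  # annotation timestamp, in milliseconds
        "text": "Ceph healthcheck fired: %s" % check_name,
        "tags": ["ceph", "healthcheck", check_name],
    })

payload = annotation_payload("OSD_DOWN", 1632449519000)
print(json.loads(payload)["tags"])  # ['ceph', 'healthcheck', 'OSD_DOWN']
```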

Comment on lines 803 to 818
# CEPHADM orchestrator alert triggers
- interval: 30s
input_series:
- series: 'ceph_health_detail{name="UPGRADE_EXCEPTION"}'
values: '1+0x40'
promql_expr_test:
- expr: ceph_health_detail{name="UPGRADE_EXCEPTION"} > 0
eval_time: 2m
exp_samples:
- labels: '{__name__="ceph_health_detail", name="UPGRADE_EXCEPTION"}'
value: 1
alert_rule_test:
- eval_time: 1m
alertname: Cluster upgrade has failed
- eval_time: 5m
alertname: Cluster upgrade has failed
Member


Nice improvement! Do you find the promql_expr_test useful? I discussed with @aaSharma14 whether we could just go with the alert_rule_test. For the promql_expr_test, I'd like to find a way to extract those expressions from the Grafana JSON/jsonnet files and feed them with captured metrics (rather than synthetic ones).
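These promql_expr_test / alert_rule_test stanzas are executed by promtool's rule unit-test runner (`promtool test rules <file>`). A minimal sketch of wrapping that in Python, with an illustrative file name and skipping when the binary is absent:

```python
import shutil
import subprocess

def promtool_test_cmd(test_file):
    """Command line for promtool's rule unit-test runner."""
    return ["promtool", "test", "rules", test_file]

cmd = promtool_test_cmd("test_alerts.yml")  # file name is illustrative
if shutil.which("promtool"):
    subprocess.run(cmd, check=True)  # raises CalledProcessError if any test fails
else:
    print("promtool not found; skipping rule unit tests")
```

This mirrors the PR's approach of running promtool only when it's available, so caged build environments without the binary still pass.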

@epuertat
Member

epuertat commented Oct 1, 2021

@aaSharma14 could you please have a look at this one and share your thoughts?

@epuertat
Member

epuertat commented Oct 4, 2021

BTW @pcuzner as this PR will now forward the cephadm MTU health check as an alert, please remove the existing MTU alert from the alertmanager. Thanks!

@pcuzner
Contributor Author

pcuzner commented Oct 4, 2021

BTW @pcuzner as this PR will now forward the cephadm MTU health check as an alert, please remove the existing MTU alert from the alertmanager. Thanks!

@epuertat Tricky one, that. The current MTU check is against the raw metrics, so it works with both Rook and cephadm. Given that, I don't think it makes sense to remove the current check and replace it with a cephadm-centric one.

@pcuzner
Contributor Author

pcuzner commented Oct 6, 2021

This is the complete list of alerts:
- alert: health error
- alert: health warn
- alert: Monitor down, quorum is at risk
- alert: Monitor down
- alert: Ceph mon disk space critically low
- alert: Ceph mon disk space running low
- alert: Clock skew detected across Ceph Monitor daemons
- alert: 10% OSDs down
- alert: OSD Host is down
- alert: OSD down
- alert: OSDs near full
- alert: OSD Full
- alert: OSD unable to perform rebalance
- alert: OSD too many read repairs
- alert: OSD heartbeats running slow (frontend)
- alert: OSD heartbeats running slow (backend)
- alert: OSD disk size mismatch
- alert: Device failure predicted
- alert: Too many devices predicted to fail
- alert: Device failure predicted, but automatic drain is incomplete
- alert: Flapping OSD
- alert: OSD Read errors
- alert: high pg count deviation
- alert: Ceph Filesystem damage detected
- alert: Ceph Filesystem switched to READ ONLY
- alert: mgr module failure
- alert: mgr prometheus module is not active
- alert: pgs inactive
- alert: pgs unclean
- alert: Placement Group (PG) damaged
- alert: Recovery at risk, cluster too full
- alert: I/O blocked to some data
- alert: Cluster too full, automatic data recovery impaired
- alert: Placement Group(s) have not been scrubbed
- alert: Placement Group(s) have not been 'DEEP' scrubbed
- alert: root volume full
- alert: network packets dropped
- alert: network packet errors
- alert: storage filling up
- alert: MTU Mismatch
- alert: pool full
- alert: pool filling up (growth forecast)
- alert: Ceph pool is too full for recovery/rebalance
- alert: Ceph pool is full - writes blocked
- alert: Ceph pool is approaching full
- alert: Slow OSD Ops
- alert: Cluster upgrade has failed
- alert: A daemon managed by cephadm is down
- alert: cephadm management has been paused
- alert: Scrape job is missing
- alert: Data not found/missing

@pcuzner pcuzner requested a review from b-ranto October 6, 2021 01:28
@pcuzner pcuzner marked this pull request as ready for review October 6, 2021 01:28
@pcuzner pcuzner requested a review from a team as a code owner October 6, 2021 01:28
@pcuzner pcuzner requested a review from p-se October 6, 2021 01:28
@liewegas liewegas changed the title mgr/prometheus:expose ceph healthchecks as metrics mgr/prometheus: expose ceph healthchecks as metrics Oct 6, 2021
@pcuzner
Contributor Author

pcuzner commented Oct 19, 2021

@sebastian-philipp Also, just to cover the MTU conflict you mention.

The config checker is disabled by default, and its MTU check can be disabled independently should the admin consider it a pain - so you have to opt in to even get the conflict, and if you do, you can simply disable the MTU check.

The MTU check within the config checker layer covers cases where the admin installs without 'our' monitoring and opts for the influx or telegraf modules.

@sebastian-philipp
Contributor

Latest commit provides the test/validation environment with a README. The current alert rules do violate some of the standards being enforced by validation; I've left them in to help review. If you choose to move forward with this PR, I'll fix them in a follow-up (I have to update the MIB content anyway!)

looks great!

FYI, I think

self.mgr.set_health_warning('CEPHADM_APPLY_SPEC_FAIL',
f"Failed to apply {len(self.mgr.apply_spec_fails)} service(s): {','.join(x[0] for x in self.mgr.apply_spec_fails)}",
len(self.mgr.apply_spec_fails),
warnings)

is missing from this PR.

@sebastian-philipp Also, just to cover the MTU conflict you mention.

The config checker is disabled by default, and its MTU check can be disabled independently should the admin consider it a pain - so you have to opt in to even get the conflict, and if you do, you can simply disable the MTU check.

The MTU check within the config checker layer covers cases where the admin installs without 'our' monitoring and opts for the influx or telegraf modules.

and the config checker is only usable for cephadm deployments. Which means those two checks have somewhat distinct use cases.

Member

@epuertat epuertat left a comment


Really superb work productizing the alerts @pcuzner ! I just left a few comments about leveraging the existing JSON Schema for the Alertmanager rules file, and also some questions about URLs.

monitoring/prometheus/tests/README.md (resolved)
monitoring/prometheus/tests/validate_rules.py (resolved)
Comment on lines +132 to +136
def _check_doclink(self):
annotations = self.rule.get('annotations', {})
doclink = annotations.get(DOCLINK_NAME, '')
Member


My main concern about this approach is how to deal with cross-version issues: when this becomes quincy (or gets backported to pacific), it'll still fetch https://docs.ceph.com/en/latest/.... If a check is dropped or renamed in master, this will fail in pacific/quincy... This is where I think we should get the docs team involved (@zdover23), or @djgalloway or @tchaikov, to have a https://docs.ceph.com/en/quincy/ (or the ongoing release) available alongside latest.

Regarding downstream productization, one possibility would be to have a ceph_default_alerts.yml.in (with some variable expansion for the Documentation URL) or given we're already using jsonnet for Grafana, rely on that (as Kubernetes does).

Additionally, this will require internet connection during building, and I remember issues reported from SUSE/FreeBSD on caged build environments (for Promtool/Grafonnet I remember those were disabled until explicitly enabled, which only happened in Jenkins make check).
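One possible answer to the cross-version concern, offered only as an assumption (the function and rewrite rule are mine, not anything the PR implements), is rewriting "latest" doc links to a pinned release branch at build or backport time:

```python
from urllib.parse import urlparse

def pin_doc_url(url, release):
    """Rewrite a docs.ceph.com '/en/latest/' link to a specific release branch."""
    if urlparse(url).netloc == "docs.ceph.com":
        return url.replace("/en/latest/", "/en/%s/" % release, 1)
    return url  # leave non-docs.ceph.com URLs untouched

print(pin_doc_url("https://docs.ceph.com/en/latest/rados/operations/health-checks/",
                  "pacific"))
# https://docs.ceph.com/en/pacific/rados/operations/health-checks/
```

This is essentially the ceph_default_alerts.yml.in variable-expansion idea above, done with a trivial string rewrite instead of a templating layer.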

Contributor Author


@epuertat Agree - and this is something that will need some attention, since we need the alerts backported to Pacific!

Dashboard automation moved this from In progress to Reviewer approved Oct 19, 2021
pcuzner and others added 3 commits October 22, 2021 13:32
This patch creates a health history object maintained in
the modules kvstore.  The history and current health
checks are used to create a metric per healthcheck whilst
also providing a history feature. Two new commands are added:
ceph healthcheck history ls
ceph healthcheck history clear

In addition to the new commands, the additional metrics
have been used to update the prometheus alerts

Fixes: https://tracker.ceph.com/issues/52638

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Focus all tests inside a tests directory, and use pytest/tox to
perform validation of the overall content. tox tests also use
promtool if available to provide rule checks and unittest runs.

In addition to these checks a validate_rules script provides the
format, and content checks against all rules - which is also
called via tox (but can be run independently too)

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Signed-off-by: Sebastian Wagner <sewagner@redhat.com>
@pcuzner
Contributor Author

pcuzner commented Oct 22, 2021

jenkins retest this please

1 similar comment
@pcuzner
Contributor Author

pcuzner commented Oct 22, 2021

jenkins retest this please

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
@github-actions github-actions bot added the tests label Oct 25, 2021
@sebastian-philipp sebastian-philipp added the wip-swagner-testing My Teuthology tests label Oct 26, 2021
Temporary removal of the cmake test integration

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
@pcuzner
Contributor Author

pcuzner commented Oct 26, 2021

jenkins retest this please

2 similar comments
@pcuzner
Contributor Author

pcuzner commented Oct 27, 2021

jenkins retest this please

@pcuzner
Contributor Author

pcuzner commented Oct 27, 2021

jenkins retest this please

@sebastian-philipp
Contributor

jenkins test make check

@sebastian-philipp sebastian-philipp added ready-to-merge and removed wip-swagner-testing My Teuthology tests labels Oct 28, 2021
Comment on lines +1 to +2
pyyaml
bs4
Member


BTW I suggest you pin these to specific versions: the Dashboard requirements-lint.txt broke the "make check" test last week because of unpinned deps (lesson learnt).

@sebastian-philipp sebastian-philipp merged commit aae2ea3 into ceph:master Oct 28, 2021
8 of 9 checks passed
Dashboard automation moved this from Reviewer approved to Done Oct 28, 2021
7 participants