mon: add ability to mute health alerts #29422
retest this please
@LenzGr do we need a dashboard integration?
Definitely!
Comment for 0ef1b8d should say "unmute" not "unmount"
1490f4c to fa511c0
Needs a rebase, and I left a few nits; otherwise looks great! (Sorry for the delay in reviewing this.)
| MGR_DOWN
| ________
|
| All manager` daemons are currently down. The cluster should normally
nit: extra "`"
|
| All manager` daemons are currently down. The cluster should normally
| have at least one running manager (``ceph-mgr``) daemon. If no
| manager daemon is running, the clusters ability to monitor itself will
nit: cluster's
| ``mon_osd_snap_trim_queue_warn_on`` option (default: 32768).
|
| This warning may trigger if OSDs are under excessive load and unable
| to keep up with their background work, or if the OSDs internal
nit: OSD's or OSDs'
| ``mon_data_size_warn`` (default: 15 GiB).
|
| A large database is unusual, but may not necessarily indicate a
| problem. Monitor databases may grow in size of there are placement
s/of/when/
doc/rados/operations/monitoring.rst
Outdated
| ceph health unmute OSD_DOWN
|
| A health check mute may optionally have a TTL (time to live)
| associated with it, sucht that the mute will automatically expire
s/sucht/such
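To illustrate the TTL behaviour described in the documentation hunk above, here is a minimal Python sketch. It is not the monitor's C++ implementation. The `Mute` class, the `parse_ttl`, `mute`, and `expire_mutes` helpers, and the `<number><unit>` TTL syntax (for example `4h`) are all assumptions made for illustration; the PR itself only shows that `ttl` is passed as an optional string.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical unit table; the TTL syntax the monitor really accepts may differ.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_ttl(ttl: str) -> int:
    """Convert a string such as '90s' or '4h' into a number of seconds."""
    m = re.fullmatch(r"(\d+)([smhd])", ttl.strip())
    if m is None:
        raise ValueError(f"unrecognized TTL: {ttl!r}")
    return int(m.group(1)) * _UNITS[m.group(2)]


@dataclass
class Mute:
    code: str
    expires_at: Optional[float] = None  # None means the mute has no TTL


def mute(mutes: Dict[str, Mute], code: str, now: float,
         ttl: Optional[str] = None) -> None:
    """Record a mute for `code`, with an expiry time if a TTL was given."""
    expires = now + parse_ttl(ttl) if ttl is not None else None
    mutes[code] = Mute(code, expires)


def expire_mutes(mutes: Dict[str, Mute], now: float) -> List[str]:
    """Drop every mute whose TTL has elapsed and return the dropped codes."""
    expired = [code for code, m in mutes.items()
               if m.expires_at is not None and m.expires_at <= now]
    for code in expired:
        del mutes[code]
    return expired
```

A mute created without a TTL has `expires_at` set to `None`, so `expire_mutes` never removes it; it goes away only through an explicit unmute or another clearing rule.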
src/mon/MonCommands.h
Outdated
| @@ -211,6 +211,11 @@ COMMAND_WITH_FLAG("injectargs " \
| COMMAND("status", "show cluster status", "mon", "r")
| COMMAND("health name=detail,type=CephChoices,strings=detail,req=false", \
| "show cluster health", "mon", "r")
| COMMAND("health mute name=code,type=CephString name=ttl,type=CephString,req=false " \
nit: having each name in a fresh line may be easier to read
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Get rid of single caller helpers. Instead, assimilate all the checks together at once, and have two separate blocks, one for formatted, and one for plaintext output. Much easier to follow! Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
- de-escalate severity
- mark mutes in structured output
- note mutes in summary text output
- mark mutes in detail text output

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
…ears If the alert goes away, drop the mute. Signed-off-by: Sage Weil <sage@redhat.com>
This operates exclusively on HealthMonitor members. Make public member private again. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Count goes first. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Count goes first. Signed-off-by: Sage Weil <sage@redhat.com>
If the summary starts with a digit, parse a count. If the count goes up, clear the mute. If the count goes down, update the mute so that we ratchet the threshold down. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Use dump() member instead of duplicating! The only reason we had this before was because the detail portion was optional. Signed-off-by: Sage Weil <sage@redhat.com>
0 means this is a singleton. Otherwise, we can sum this up, either via merge() or get_or_add(). We always structure this so the count goes toward zero (more healthy), so if a value is too low, then we count how much too low it is. Signed-off-by: Sage Weil <sage@redhat.com>
…string This is more explicit and robust, and works with the PG warnings, which don't conform to the "%d ..." form that the other messages do. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
The mitigation steps are weak, but it's not clear what concrete guidance to provide. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Make sure mute and unmute work. Make sure sticky is sticky. Make sure counts can go down, but if they go up the mute clears. Signed-off-by: Sage Weil <sage@redhat.com>
I think someday the docs for how health alerts work (here) and the enumeration of all actual alerts should be restructured. For now this is the simplest place to fit this! Signed-off-by: Sage Weil <sage@redhat.com>
Also fix the 'checks' field, which is a list of objects, not strings. (The test doesn't notice because it's empty.) Signed-off-by: Sage Weil <sage@redhat.com>
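The commit messages above describe how a mute reacts when the underlying alert changes: a non-sticky mute is dropped when the alert clears, a muted count may fall (ratcheting the threshold down), and the mute clears if the count rises. The Python sketch below models those rules. It is not the HealthMonitor code: the `Mute` class and `update_mutes` function are hypothetical names, and exempting sticky mutes from the count check is an assumption.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Mute:
    code: str
    count: int = 0        # alert count when muted; 0 means a singleton alert
    sticky: bool = False  # sticky mutes survive the alert clearing


def update_mutes(mutes: Dict[str, Mute], active: Dict[str, int]) -> None:
    """Reconcile mutes against the active alerts (alert code -> count)."""
    for code in list(mutes):
        m = mutes[code]
        if code not in active:
            # The alert went away: drop the mute unless it is sticky.
            if not m.sticky:
                del mutes[code]
            continue
        if m.sticky:
            # Assumption: sticky mutes are not subject to the count check.
            continue
        count = active[code]
        if m.count and count > m.count:
            # The situation got worse, so the mute no longer applies.
            del mutes[code]
        elif count < m.count:
            # Ratchet the threshold down; rising back to the old count
            # would now clear the mute.
            m.count = count
```

For example, a mute recorded with a count of 3 survives the count dropping to 2, but is then cleared if the count returns to 3, because the stored threshold has been lowered to 2.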
* refs/pull/29422/head:
qa/tasks/mgr/dashboard/test_health: update schema
doc/rados/operations/monitoring: document muting health alerts
qa/standalone/mon/health-mutes: add tests
doc/rados/operations/health-checks: document MON_DISK_{LOW,CRIT,BIG}
doc/rados/operations/health-checks: document OSD_NO_DOWN_OUT_INTERVAL
doc/rados/operations/health-checks: document AUTH_BAD_CAPS
doc/reados/operations/health-checks: document PG_SLOW_SNAP_TRIMMING
doc/rados/operations/health-checks: document MGR_DOWN
mon/HealthCheck: check mutes based on count, not parsing the summary string
mon/health_checks: associate a count with health_alert_t
mon/HealthMonitor: simplify health alert dump
mon/PGMap: use nice timespan for PG stuck warnings
mon/HealthMonitor: allow muted alert counts to decrease but not increase
mon/PGMap: fix summary form for bluestore health alerts
doc/rados/operations/health-alerts: document BLUESTORE_NO_COMPRESSION
mon/PGMap: fix summary form for POOL_APP_NOT_ENABLED
mon/HealthMonitor: persist summary for non-sticky mutes
mon/HealthMonitor: move get_health_status()
mon/HealthMonitor: automatically clear non-sticky mutes when alert clears
mon/HealthMonitor: add gather_all_health_checks helper
mon/HealthMonitor: add sticky flag to mutes
mon/HealthMonitor: expire mutes based on ttl
mon: apply mutes to health [detail]
mon/HealthMonitor: implement mute and unmount commands
mon/HealthMonitor: maintain list of mutes
mon: refactor/simplify health [detail]
mon/health_checks: format 'health summary' with a colon
mon/health_checks: drop dump_summary_compat
Reviewed-by: Neha Ojha <nojha@redhat.com>
|
@liewegas (echoing https://tracker.ceph.com/issues/40420#note-2) Since https://tracker.ceph.com/issues/40420 references this PR as its fix, I added that URL to the description. Should this feature be backported to nautilus? |
https://tracker.ceph.com/issues/40420