mon: add ability to mute health alerts #29422

Merged
liewegas merged 28 commits into ceph:master on Aug 15, 2019

@liewegas (Member) commented Jul 31, 2019

https://tracker.ceph.com/issues/40420


TODO

  • add qa/standalone test(s)
  • document mute/unmute commands, behavior

@liewegas (Member Author) commented Aug 1, 2019

retest this please

@sebastian-philipp (Contributor) commented

@LenzGr do we need a dashboard integration?

@liewegas (Member Author) commented Aug 1, 2019

> @LenzGr do we need a dashboard integration?

Definitely!

@liewegas changed the title from "mon: add ability to mute health aleeeerts" to "mon: add ability to mute health alerts" on Aug 5, 2019
@liewegas liewegas requested a review from dzafman August 6, 2019 22:08
@dzafman (Contributor) left a comment

Comment for 0ef1b8d should say "unmute" not "unmount"

Review threads (resolved):
qa/standalone/mon/health-mute.sh
qa/standalone/mon/health-mute.sh (outdated)
src/mon/HealthMonitor.cc
@neha-ojha (Member) left a comment

Needs a rebase, and I left a few nits; otherwise looks great! (Sorry for the delay in reviewing this.)

MGR_DOWN
________

All manager` daemons are currently down. The cluster should normally

nit: extra "`"


All manager` daemons are currently down. The cluster should normally
have at least one running manager (``ceph-mgr``) daemon. If no
manager daemon is running, the clusters ability to monitor itself will

nit: cluster's

``mon_osd_snap_trim_queue_warn_on`` option (default: 32768).

This warning may trigger if OSDs are under excessive load and unable
to keep up with their background work, or if the OSDs internal

nit: OSD's or OSDs'

``mon_data_size_warn`` (default: 15 GiB).

A large database is unusual, but may not necessarily indicate a
problem. Monitor databases may grow in size of there are placement

s/of/when/

ceph health unmute OSD_DOWN

A health check mute may optionally have a TTL (time to live)
associated with it, sucht that the mute will automatically expire

s/sucht/such
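
The TTL behavior described in the documentation excerpt above can be illustrated with a small sketch. This is not the Ceph implementation (which lives in C++ in the monitor); the `Mute` class, `add_mute`, and `expire_mutes` are hypothetical names, and the TTL is taken as plain seconds purely for simplicity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Mute:
    """Hypothetical stand-in for a stored mute; not the Ceph data structure."""
    code: str
    expires_at: Optional[float] = None  # None means the mute has no TTL


def add_mute(mutes: Dict[str, Mute], code: str, now: float,
             ttl_seconds: Optional[float] = None) -> None:
    """Record a mute, turning an optional TTL into an absolute expiry time."""
    expires_at = None if ttl_seconds is None else now + ttl_seconds
    mutes[code] = Mute(code=code, expires_at=expires_at)


def expire_mutes(mutes: Dict[str, Mute], now: float) -> List[str]:
    """Drop every mute whose TTL has elapsed and return the expired codes."""
    expired = [code for code, mute in mutes.items()
               if mute.expires_at is not None and mute.expires_at <= now]
    for code in expired:
        del mutes[code]
    return expired
```

A mute created without a TTL is never touched by `expire_mutes`; it stays until it is removed explicitly, as with `ceph health unmute OSD_DOWN`.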

@@ -211,6 +211,11 @@ COMMAND_WITH_FLAG("injectargs " \
COMMAND("status", "show cluster status", "mon", "r")
COMMAND("health name=detail,type=CephChoices,strings=detail,req=false", \
"show cluster health", "mon", "r")
COMMAND("health mute name=code,type=CephString name=ttl,type=CephString,req=false " \

nit: having each name in a fresh line may be easier to read

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Get rid of single caller helpers.  Instead, assimilate all the checks
together at once, and have two separate blocks, one for formatted, and
one for plaintext output.  Much easier to follow!

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
- de-escalate severity
- mark mutes in structured output
- note mutes in summary text output
- mark mutes in detail text output

Signed-off-by: Sage Weil <sage@redhat.com>
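
One way to picture the first two bullets of this commit message (de-escalating severity and marking mutes in structured output) is the sketch below. It assumes the overall status is the worst severity among unmuted checks; `summarize` and the shape of its report are hypothetical and only illustrate the idea, not the monitor's actual output format.

```python
from typing import Dict, Set, Tuple

# Severities ordered from healthiest to worst.
SEVERITY_ORDER = ["HEALTH_OK", "HEALTH_WARN", "HEALTH_ERR"]


def summarize(checks: Dict[str, str],
              muted: Set[str]) -> Tuple[str, Dict[str, dict]]:
    """Return (overall status, per-check report).

    `checks` maps a health code to its severity. Muted checks are still
    reported, flagged as muted, but do not raise the overall status.
    """
    overall = "HEALTH_OK"
    report = {}
    for code, severity in checks.items():
        is_muted = code in muted
        report[code] = {"severity": severity, "muted": is_muted}
        worse = SEVERITY_ORDER.index(severity) > SEVERITY_ORDER.index(overall)
        if not is_muted and worse:
            overall = severity
    return overall, report
```

With `MON_DISK_CRIT` muted and `OSD_DOWN` active, the overall status in this sketch drops from `HEALTH_ERR` to `HEALTH_WARN`, while both checks remain visible in the report.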
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
…ears

If the alert goes away, drop the mute.

Signed-off-by: Sage Weil <sage@redhat.com>
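
A minimal sketch of the rule in this commit message, assuming each mute carries a boolean sticky flag (the commit list for this PR mentions one). `prune_mutes` is a hypothetical helper, not a function from the Ceph source.

```python
from typing import Dict, Set


def prune_mutes(mutes: Dict[str, bool], active_codes: Set[str]) -> Dict[str, bool]:
    """Keep a mute if it is sticky or its alert is still active.

    `mutes` maps a health code to its sticky flag. A non-sticky mute whose
    alert has gone away is dropped.
    """
    return {code: sticky for code, sticky in mutes.items()
            if sticky or code in active_codes}
```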
This operates exclusively on HealthMonitor members.  Make public member
private again.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Count goes first.

Signed-off-by: Sage Weil <sage@redhat.com>
Count goes first.

Signed-off-by: Sage Weil <sage@redhat.com>
If the summary starts with a digit, parse a count.

If the count goes up, clear the mute.

If the count goes down, update the mute so that we ratchet the threshold
down.

Signed-off-by: Sage Weil <sage@redhat.com>
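
The ratchet described in this commit message can be sketched as follows. The names `CountMute`, `leading_count`, and `update_mute` are hypothetical, and the sketch follows the summary-parsing approach from the message itself rather than the actual C++ code.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class CountMute:
    """Hypothetical mute record remembering the count it was muted at."""
    code: str
    count: int


def leading_count(summary: str) -> Optional[int]:
    """Parse a count if the summary starts with a digit, else return None."""
    match = re.match(r"\d+", summary)
    return int(match.group()) if match else None


def update_mute(mute: CountMute, summary: str) -> Optional[CountMute]:
    """Return the mute to keep, or None if the mute should be cleared."""
    current = leading_count(summary)
    if current is None:
        return mute  # no count to compare against; leave the mute alone
    if current > mute.count:
        return None  # the alert got worse: clear the mute
    if current < mute.count:
        return CountMute(mute.code, current)  # ratchet the threshold down
    return mute
```

Because the stored count only ever moves down, a mute placed at "3 osds down" survives a drop to 2 but is cleared if the count later climbs back to 3.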
Signed-off-by: Sage Weil <sage@redhat.com>
Use dump() member instead of duplicating!  The only reason we had this
before was because the detail portion was optional.

Signed-off-by: Sage Weil <sage@redhat.com>
0 means this is a singleton.  Otherwise, we can sum this up, either
via merge() or get_or_add().

We always structure this so the count goes toward zero (more healthy), so
if a value is too low, then we count how much too low it is.

Signed-off-by: Sage Weil <sage@redhat.com>
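
The count convention in this commit message can be illustrated with a short sketch. `Alert`, `merge`, and `shortfall` here are hypothetical Python stand-ins chosen for illustration; they are not the signatures of the `merge()` or `get_or_add()` helpers the message refers to.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert record; a count of 0 marks a singleton alert."""
    summary: str
    count: int = 0


def merge(a: Alert, b: Alert) -> Alert:
    """Combine two reports of the same alert by summing their counts."""
    return Alert(a.summary, a.count + b.count)


def shortfall(value: int, minimum: int) -> int:
    """Count how far below a minimum a value is, so that 0 means healthy."""
    return max(0, minimum - value)
```

Expressing a too-low value as a shortfall keeps every count moving toward zero as the cluster gets healthier, which is what lets a single "count went up" comparison work for all alerts.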
…string

This is more explicit and robust, and works with the PG warnings, which
don't conform to the "%d ..." form that the other messages do.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
The mitigation steps are weak, but it's not clear what concrete guidance to
provide.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Make sure mute and unmute work.  Make sure sticky is sticky.  Make sure
counts can go down, but if they go up the mute clears.

Signed-off-by: Sage Weil <sage@redhat.com>
I think someday the docs for how health alerts work (here) and the
enumeration of all actual alerts should be restructured.  For now this
is the simplest place to fit this!

Signed-off-by: Sage Weil <sage@redhat.com>
Also fix the 'checks' field, which is a list of objects, not strings.  (The
test doesn't notice because it's empty.)

Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas liewegas merged commit 403f119 into ceph:master Aug 15, 2019
liewegas added a commit that referenced this pull request Aug 15, 2019
* refs/pull/29422/head:
	qa/tasks/mgr/dashboard/test_health: update schema
	doc/rados/operations/monitoring: document muting health alerts
	qa/standalone/mon/health-mutes: add tests
	doc/rados/operations/health-checks: document MON_DISK_{LOW,CRIT,BIG}
	doc/rados/operations/health-checks: document OSD_NO_DOWN_OUT_INTERVAL
	doc/rados/operations/health-checks: document AUTH_BAD_CAPS
	doc/reados/operations/health-checks: document PG_SLOW_SNAP_TRIMMING
	doc/rados/operations/health-checks: document MGR_DOWN
	mon/HealthCheck: check mutes based on count, not parsing the summary string
	mon/health_checks: associate a count with health_alert_t
	mon/HealthMonitor: simplify health alert dump
	mon/PGMap: use nice timespan for PG stuck warnings
	mon/HealthMonitor: allow muted alert counts to decrease but not increase
	mon/PGMap: fix summary form for bluestore health alerts
	doc/rados/operations/health-alerts: document BLUESTORE_NO_COMPRESSION
	mon/PGMap: fix summary form for POOL_APP_NOT_ENABLED
	mon/HealthMonitor: persist summary for non-sticky mutes
	mon/HealthMonitor: move get_health_status()
	mon/HealthMonitor: automatically clear non-sticky mutes when alert clears
	mon/HealthMonitor: add gather_all_health_checks helper
	mon/HealthMonitor: add sticky flag to mutes
	mon/HealthMonitor: expire mutes based on ttl
	mon: apply mutes to health [detail]
	mon/HealthMonitor: implement mute and unmount commands
	mon/HealthMonitor: maintain list of mutes
	mon: refactor/simplify health [detail]
	mon/health_checks: format 'health summary' with a colon
	mon/health_checks: drop dump_summary_compat

Reviewed-by: Neha Ojha <nojha@redhat.com>
@smithfarm (Contributor) commented Sep 17, 2019

@liewegas (echoing https://tracker.ceph.com/issues/40420#note-2) Since https://tracker.ceph.com/issues/40420 references this PR as its fix, I added that URL to the description.

Should this feature be backported to nautilus?
