mon: revamp health check/warning system #15643

Merged
merged 46 commits into from Jul 12, 2017

8 participants
@liewegas
Member

liewegas commented Jun 12, 2017

Implements new health checks with more structure. Old checks are left untouched but will only be used for the duration of the upgrade to luminous; once the upgrade completes the new checks are used instead.

Caveats

  • Monitor::log_health should atomically include the generated log msgs in the commit.
  • It is somewhat less convenient to log problems with the quorum itself in the official health history. This is partly because before it was a total mess and now we have a structure, so that's not necessarily a bad thing. Mostly it highlights the fact that we're tying health alerts explicitly to paxos consensus state.
  • the format of 'ceph health detail' (plaintext) is different: we indent to group details by summary. If you need something that is compatible with pre-luminous you can enable mon_health_preluminous_compat=true and the json/xml output will include both smooshed together (plaintext will look roughly like the old format).
  • the structured format of 'ceph health [detail] -f whatever' has lost the extra weird crap about timechecks etc that didn't look like other health checks.
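
The indented plaintext layout mentioned in the caveats can be sketched roughly as follows. This is an illustration only, not the code in this PR: `check_t` and `render_plain` are hypothetical names, and only the output shape (a status line, then one `CODE message` line per check with its details indented by four spaces) is taken from the sample output posted in this conversation.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for one health check; not the PR's actual type.
struct check_t {
  std::string severity;             // e.g. "HEALTH_WARN"
  std::string message;              // one-line summary
  std::vector<std::string> detail;  // per-item detail lines
};

// Render checks the way the new plaintext 'ceph health detail' groups them:
// an overall line, then each check code with its details indented beneath it.
std::string render_plain(const std::string& status,
                         const std::map<std::string, check_t>& checks) {
  std::ostringstream out;
  out << status;
  bool first = true;
  for (const auto& kv : checks) {
    // Summaries are joined with "; " on the first line.
    out << (first ? " " : "; ") << kv.second.message;
    first = false;
  }
  out << "\n";
  for (const auto& kv : checks) {
    out << kv.first << " " << kv.second.message << "\n";
    for (const auto& line : kv.second.detail) {
      out << "    " << line << "\n";
    }
  }
  return out.str();
}
```

Keying the map by check code keeps each group of detail lines attached to the summary it explains, which is the point of the indentation.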

TODO

  • OSDMonitor checks
  • PGMap to generate a health_check_map_t
  • update mgr -> mon messaging and MgrStatMonitor health structs to be health_check_map_t
  • MDSMonitor new-style checks

That's about it!

qa

  • rados
  • rgw
  • rbd
  • fs

@liewegas liewegas requested review from jcsp and jdurgin Jun 12, 2017

@liewegas liewegas changed the title from mon: revamp health check/warning system to RFC mon: revamp health check/warning system Jun 12, 2017

#include <stdlib.h>
#include <limits.h>
#include <sstream>
#include <regex>

@tchaikov

@wjwithagen

wjwithagen Jun 13, 2017

Contributor

@tchaikov @liewegas
I would also be worried about differences in regex implementations between Linux and the BSDs; I've already run into some. Chances are that Boost::Regex is going to be a more stable denominator.

friend bool operator==(const health_check_map_t& l,
const health_check_map_t& r) {
if (l.checks.size() != r.checks.size()) {

@tchaikov

tchaikov Jun 13, 2017

Contributor

std::map offers operator==(), which does exactly the same thing.

}
return true;
}
friend bool operator!=(const health_check_map_t& l,

@tchaikov

tchaikov Jun 13, 2017

Contributor

and also operator!=.
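
A minimal sketch of the simplification suggested in the two review comments above, assuming the wrapper holds a `std::map` member named `checks` (as the quoted hunk suggests) and that the per-check type is itself equality-comparable. The `health_check_t` fields shown here are assumptions made for illustration and may not match the real struct.

```cpp
#include <cassert>
#include <list>
#include <map>
#include <string>

// Assumed shape of a single check; the real struct may differ.
struct health_check_t {
  int severity = 0;
  std::string summary;
  std::list<std::string> detail;

  friend bool operator==(const health_check_t& l, const health_check_t& r) {
    return l.severity == r.severity &&
           l.summary == r.summary &&
           l.detail == r.detail;
  }
};

struct health_check_map_t {
  std::map<std::string, health_check_t> checks;

  // std::map's operator== compares the sizes and then each key/value pair
  // in order, so a hand-written size check and loop are not needed.
  friend bool operator==(const health_check_map_t& l,
                         const health_check_map_t& r) {
    return l.checks == r.checks;
  }
  friend bool operator!=(const health_check_map_t& l,
                         const health_check_map_t& r) {
    return !(l == r);
  }
};
```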

@dmick

Member

dmick commented Jun 13, 2017

You probably are, but, be aware that deployment tools, custom monitoring scripts, health checks, etc. are all going to be impacted, particularly if you change the formatted output (which is assumed to be more stable).

@knandya

knandya commented Jun 13, 2017

I agree with @dmick's opinion on this. Consumers of the current health checks have made custom changes for their monitoring (Sensu, Nagios, etc.) and created their own alerts.

@liewegas

Member

liewegas commented Jun 14, 2017

There is a middle ground: the JSON fields of the old format and the new format are mostly non-overlapping, so we could include them both for the next cycle. The overlap currently is that the 'detail' section previously was an array of strings and now it's a dict of error codes to arrays. We could rename it, or (perhaps better) move the 'detail' list for each health alert into that alert's own entry instead of keeping a separate top-level structure. (This is better for programmatic parsing IMO, but a bit less friendly for humans reading the JSON.)

@leseb

Contributor

leseb commented Jun 14, 2017

@liewegas can we get a sample output once you're done?
Thanks!

@liewegas

Member

liewegas commented Jun 14, 2017

Before:

{
    "summary": [
        {
            "severity": "HEALTH_WARN",
            "summary": "8 pgs degraded"
        },
        {
            "severity": "HEALTH_WARN",
            "summary": "8 pgs stale"
        },
        {
            "severity": "HEALTH_WARN",
            "summary": "8 pgs undersized"
        },
        {
            "severity": "HEALTH_WARN",
            "summary": "1 root (2 osds) down"
        },
        {
            "severity": "HEALTH_WARN",
            "summary": "1 host (2 osds) down"
        },
        {
            "severity": "HEALTH_WARN",
            "summary": "1 osds down"
        }
    ],
    "overall_status": "HEALTH_WARN",
    "detail": [ 
        "pg 0.7 is stale+active+undersized+degraded, acting [0]",
        "pg 0.6 is stale+active+undersized+degraded, acting [0]",
        "pg 0.5 is stale+active+undersized+degraded, acting [0]",
        "pg 0.4 is stale+active+undersized+degraded, acting [0]",
        "pg 0.0 is stale+active+undersized+degraded, acting [0]",
        "pg 0.1 is stale+active+undersized+degraded, acting [0]",
        "pg 0.2 is stale+active+undersized+degraded, acting [0]",
        "pg 0.3 is stale+active+undersized+degraded, acting [0]",
        "root default (2 osds) is down",
        "host gnit (root=default) (2 osds) is down",
        "osd.0 (root=default,host=gnit) is down"
    ]
}
HEALTH_WARN 8 pgs degraded; 8 pgs stale; 8 pgs undersized; 1 root (2 osds) down; 1 host (2 osds) down; 1 osds down
pg 0.7 is stale+active+undersized+degraded, acting [0]
pg 0.6 is stale+active+undersized+degraded, acting [0]
pg 0.5 is stale+active+undersized+degraded, acting [0]
pg 0.4 is stale+active+undersized+degraded, acting [0]
pg 0.0 is stale+active+undersized+degraded, acting [0]
pg 0.1 is stale+active+undersized+degraded, acting [0]
pg 0.2 is stale+active+undersized+degraded, acting [0]
pg 0.3 is stale+active+undersized+degraded, acting [0]
root default (2 osds) is down
host gnit (root=default) (2 osds) is down
osd.0 (root=default,host=gnit) is down

After:

{
    "checks": {
        "PG_DEGRADED": {
            "severity": "HEALTH_WARN",
            "message": "8 pgs degraded"
        },
        "PG_STALE": {
            "severity": "HEALTH_WARN",
            "message": "8 pgs stale"
        },
        "PG_UNDERSIZED": {
            "severity": "HEALTH_WARN",
            "message": "8 pgs undersized"
        }
    },
    "status": "HEALTH_WARN",
    "detail": { 
        "PG_DEGRADED": [
            "pg 0.7 is stale+active+undersized+degraded, acting [0]",
            "pg 0.6 is stale+active+undersized+degraded, acting [0]",
            "pg 0.5 is stale+active+undersized+degraded, acting [0]",
            "pg 0.4 is stale+active+undersized+degraded, acting [0]",
            "pg 0.0 is stale+active+undersized+degraded, acting [0]",
            "pg 0.1 is stale+active+undersized+degraded, acting [0]",
            "pg 0.2 is stale+active+undersized+degraded, acting [0]",
            "pg 0.3 is stale+active+undersized+degraded, acting [0]"
        ],
        "PG_STALE": [
            "pg 0.7 is stale+active+undersized+degraded, acting [0]",
            "pg 0.6 is stale+active+undersized+degraded, acting [0]",
            "pg 0.5 is stale+active+undersized+degraded, acting [0]",
            "pg 0.4 is stale+active+undersized+degraded, acting [0]",
            "pg 0.0 is stale+active+undersized+degraded, acting [0]",
            "pg 0.1 is stale+active+undersized+degraded, acting [0]",
            "pg 0.2 is stale+active+undersized+degraded, acting [0]",
            "pg 0.3 is stale+active+undersized+degraded, acting [0]"
        ],
        "PG_UNDERSIZED": [
            "pg 0.7 is stale+active+undersized+degraded, acting [0]",
            "pg 0.6 is stale+active+undersized+degraded, acting [0]",
            "pg 0.5 is stale+active+undersized+degraded, acting [0]",
            "pg 0.4 is stale+active+undersized+degraded, acting [0]",
            "pg 0.0 is stale+active+undersized+degraded, acting [0]",
            "pg 0.1 is stale+active+undersized+degraded, acting [0]",
            "pg 0.2 is stale+active+undersized+degraded, acting [0]",
            "pg 0.3 is stale+active+undersized+degraded, acting [0]"
        ]
    }
}

HEALTH_WARN 2 osds down; 1 host (2 osds) down; 1 root (2 osds) down; 8 pgs stale
OSD_DOWN 2 osds down
    osd.0 (root=default,host=gnit) is down
    osd.1 (root=default,host=gnit) is down
OSD_HOST_DOWN 1 host (2 osds) down
    host gnit (root=default) (2 osds) is down
OSD_ROOT_DOWN 1 root (2 osds) down
    root default (2 osds) is down
PG_STALE 8 pgs stale
    pg 0.7 is stale+active+clean, acting [1,0]
    pg 0.6 is stale+active+clean, acting [0,1]
    pg 0.5 is stale+active+clean, acting [0,1]
    pg 0.4 is stale+active+clean, acting [0,1]
    pg 0.0 is stale+active+clean, acting [0,1]
    pg 0.1 is stale+active+clean, acting [1,0]
    pg 0.2 is stale+active+clean, acting [0,1]
    pg 0.3 is stale+active+clean, acting [0,1]

@liewegas liewegas added the build/ops label Jun 14, 2017

@leseb

Contributor

leseb commented Jun 14, 2017

@liewegas thanks!

@liewegas

Member

liewegas commented Jun 14, 2017

Okay, added a mon_health_preluminous_compat option that is off by default but can be enabled. Also moved detail into each health section; I think this is simpler.

Normal mode:

{
    "checks": {
        "OSD_DOWN": {
            "severity": "HEALTH_WARN",
            "message": "2 osds down",
            "detail": [
                "osd.0 (root=default,host=gnit) is down",
                "osd.1 (root=default,host=gnit) is down"
            ]
        },
        "OSD_HOST_DOWN": {
            "severity": "HEALTH_WARN",
            "message": "1 host (2 osds) down",
            "detail": [
                "host gnit (root=default) (2 osds) is down"
            ]
        },
        "OSD_ROOT_DOWN": {
            "severity": "HEALTH_WARN",
            "message": "1 root (2 osds) down",
            "detail": [
                "root default (2 osds) is down"
            ]
        },
        "PG_DEGRADED": {
            "severity": "HEALTH_WARN",
            "message": "8 pgs degraded",
            "detail": [
                "pg 0.7 is stale+active+undersized+degraded, acting [0]",
                "pg 0.6 is stale+active+undersized+degraded, acting [0]",
                "pg 0.5 is stale+active+undersized+degraded, acting [0]",
                "pg 0.4 is stale+active+undersized+degraded, acting [0]",
                "pg 0.0 is stale+active+undersized+degraded, acting [0]",
                "pg 0.1 is stale+active+undersized+degraded, acting [0]",
                "pg 0.2 is stale+active+undersized+degraded, acting [0]",
                "pg 0.3 is stale+active+undersized+degraded, acting [0]"
            ]
        },
        "PG_STALE": {
            "severity": "HEALTH_WARN",
            "message": "8 pgs stale",
            "detail": [
                "pg 0.7 is stale+active+undersized+degraded, acting [0]",
                "pg 0.6 is stale+active+undersized+degraded, acting [0]",
                "pg 0.5 is stale+active+undersized+degraded, acting [0]",
                "pg 0.4 is stale+active+undersized+degraded, acting [0]",
                "pg 0.0 is stale+active+undersized+degraded, acting [0]",
                "pg 0.1 is stale+active+undersized+degraded, acting [0]",
                "pg 0.2 is stale+active+undersized+degraded, acting [0]",
                "pg 0.3 is stale+active+undersized+degraded, acting [0]"
            ]
        },
        "PG_UNDERSIZED": {
            "severity": "HEALTH_WARN",
            "message": "8 pgs undersized",
            "detail": [
                "pg 0.7 is stale+active+undersized+degraded, acting [0]",
                "pg 0.6 is stale+active+undersized+degraded, acting [0]",
                "pg 0.5 is stale+active+undersized+degraded, acting [0]",
                "pg 0.4 is stale+active+undersized+degraded, acting [0]",
                "pg 0.0 is stale+active+undersized+degraded, acting [0]",
                "pg 0.1 is stale+active+undersized+degraded, acting [0]",
                "pg 0.2 is stale+active+undersized+degraded, acting [0]",
                "pg 0.3 is stale+active+undersized+degraded, acting [0]"
            ]
        }
    },
    "status": "HEALTH_WARN"
}
HEALTH_WARN 2 osds down; 1 host (2 osds) down; 1 root (2 osds) down; 8 pgs degraded; 8 pgs stale; 8 pgs undersized
OSD_DOWN 2 osds down
    osd.0 (root=default,host=gnit) is down
    osd.1 (root=default,host=gnit) is down
OSD_HOST_DOWN 1 host (2 osds) down
    host gnit (root=default) (2 osds) is down
OSD_ROOT_DOWN 1 root (2 osds) down
    root default (2 osds) is down
PG_DEGRADED 8 pgs degraded
    pg 0.7 is stale+active+undersized+degraded, acting [0]
    pg 0.6 is stale+active+undersized+degraded, acting [0]
    pg 0.5 is stale+active+undersized+degraded, acting [0]
    pg 0.4 is stale+active+undersized+degraded, acting [0]
    pg 0.0 is stale+active+undersized+degraded, acting [0]
    pg 0.1 is stale+active+undersized+degraded, acting [0]
    pg 0.2 is stale+active+undersized+degraded, acting [0]
    pg 0.3 is stale+active+undersized+degraded, acting [0]
PG_STALE 8 pgs stale
    pg 0.7 is stale+active+undersized+degraded, acting [0]
    pg 0.6 is stale+active+undersized+degraded, acting [0]
    pg 0.5 is stale+active+undersized+degraded, acting [0]
    pg 0.4 is stale+active+undersized+degraded, acting [0]
    pg 0.0 is stale+active+undersized+degraded, acting [0]
    pg 0.1 is stale+active+undersized+degraded, acting [0]
    pg 0.2 is stale+active+undersized+degraded, acting [0]
    pg 0.3 is stale+active+undersized+degraded, acting [0]
PG_UNDERSIZED 8 pgs undersized
    pg 0.7 is stale+active+undersized+degraded, acting [0]
    pg 0.6 is stale+active+undersized+degraded, acting [0]
    pg 0.5 is stale+active+undersized+degraded, acting [0]
    pg 0.4 is stale+active+undersized+degraded, acting [0]
    pg 0.0 is stale+active+undersized+degraded, acting [0]
    pg 0.1 is stale+active+undersized+degraded, acting [0]
    pg 0.2 is stale+active+undersized+degraded, acting [0]
    pg 0.3 is stale+active+undersized+degraded, acting [0]

Compat mode:

gnit:build (wip-health) 11:54 AM $ bin/ceph health detail -f json-pretty ; bin/ceph health detail

{
    "checks": {
        "OSD_DOWN": {
            "severity": "HEALTH_WARN",
            "message": "2 osds down",
            "detail": [
                "osd.0 (root=default,host=gnit) is down",
                "osd.1 (root=default,host=gnit) is down"
            ]
        },
        "OSD_HOST_DOWN": {
            "severity": "HEALTH_WARN",
            "message": "1 host (2 osds) down",
            "detail": [
                "host gnit (root=default) (2 osds) is down"
            ]
        },
        "OSD_ROOT_DOWN": {
            "severity": "HEALTH_WARN",
            "message": "1 root (2 osds) down",
            "detail": [
                "root default (2 osds) is down"
            ]
        },
        "PG_DEGRADED": {
            "severity": "HEALTH_WARN",
            "message": "8 pgs degraded",
            "detail": [
                "pg 0.7 is stale+active+undersized+degraded, acting [0]",
                "pg 0.6 is stale+active+undersized+degraded, acting [0]",
                "pg 0.5 is stale+active+undersized+degraded, acting [0]",
                "pg 0.4 is stale+active+undersized+degraded, acting [0]",
                "pg 0.0 is stale+active+undersized+degraded, acting [0]",
                "pg 0.1 is stale+active+undersized+degraded, acting [0]",
                "pg 0.2 is stale+active+undersized+degraded, acting [0]",
                "pg 0.3 is stale+active+undersized+degraded, acting [0]"
            ]
        },
        "PG_STALE": {
            "severity": "HEALTH_WARN",
            "message": "8 pgs stale",
            "detail": [
                "pg 0.7 is stale+active+undersized+degraded, acting [0]",
                "pg 0.6 is stale+active+undersized+degraded, acting [0]",
                "pg 0.5 is stale+active+undersized+degraded, acting [0]",
                "pg 0.4 is stale+active+undersized+degraded, acting [0]",
                "pg 0.0 is stale+active+undersized+degraded, acting [0]",
                "pg 0.1 is stale+active+undersized+degraded, acting [0]",
                "pg 0.2 is stale+active+undersized+degraded, acting [0]",
                "pg 0.3 is stale+active+undersized+degraded, acting [0]"
            ]
        },
        "PG_UNDERSIZED": {
            "severity": "HEALTH_WARN",
            "message": "8 pgs undersized",
            "detail": [
                "pg 0.7 is stale+active+undersized+degraded, acting [0]",
                "pg 0.6 is stale+active+undersized+degraded, acting [0]",
                "pg 0.5 is stale+active+undersized+degraded, acting [0]",
                "pg 0.4 is stale+active+undersized+degraded, acting [0]",
                "pg 0.0 is stale+active+undersized+degraded, acting [0]",
                "pg 0.1 is stale+active+undersized+degraded, acting [0]",
                "pg 0.2 is stale+active+undersized+degraded, acting [0]",
                "pg 0.3 is stale+active+undersized+degraded, acting [0]"
            ]
        }
    },
    "status": "HEALTH_WARN",
    "summary": [
        {
            "severity": "HEALTH_WARN",
            "summary": "2 osds down"
        },
        {
            "severity": "HEALTH_WARN",
            "summary": "1 host (2 osds) down"
        },
        {
            "severity": "HEALTH_WARN",
            "summary": "1 root (2 osds) down"
        },
        {
            "severity": "HEALTH_WARN",
            "summary": "8 pgs degraded"
        },
        {
            "severity": "HEALTH_WARN",
            "summary": "8 pgs stale"
        },
        {
            "severity": "HEALTH_WARN",
            "summary": "8 pgs undersized"
        }
    ],
    "overall_status": "HEALTH_WARN",
    "detail": [
        "osd.0 (root=default,host=gnit) is down",
        "osd.1 (root=default,host=gnit) is down",
        "host gnit (root=default) (2 osds) is down",
        "root default (2 osds) is down",
        "pg 0.7 is stale+active+undersized+degraded, acting [0]",
        "pg 0.6 is stale+active+undersized+degraded, acting [0]",
        "pg 0.5 is stale+active+undersized+degraded, acting [0]",
        "pg 0.4 is stale+active+undersized+degraded, acting [0]",
        "pg 0.0 is stale+active+undersized+degraded, acting [0]",
        "pg 0.1 is stale+active+undersized+degraded, acting [0]",
        "pg 0.2 is stale+active+undersized+degraded, acting [0]",
        "pg 0.3 is stale+active+undersized+degraded, acting [0]",
        "pg 0.7 is stale+active+undersized+degraded, acting [0]",
        "pg 0.6 is stale+active+undersized+degraded, acting [0]",
        "pg 0.5 is stale+active+undersized+degraded, acting [0]",
        "pg 0.4 is stale+active+undersized+degraded, acting [0]",
        "pg 0.0 is stale+active+undersized+degraded, acting [0]",
        "pg 0.1 is stale+active+undersized+degraded, acting [0]",
        "pg 0.2 is stale+active+undersized+degraded, acting [0]",
        "pg 0.3 is stale+active+undersized+degraded, acting [0]",
        "pg 0.7 is stale+active+undersized+degraded, acting [0]",
        "pg 0.6 is stale+active+undersized+degraded, acting [0]",
        "pg 0.5 is stale+active+undersized+degraded, acting [0]",
        "pg 0.4 is stale+active+undersized+degraded, acting [0]",
        "pg 0.0 is stale+active+undersized+degraded, acting [0]",
        "pg 0.1 is stale+active+undersized+degraded, acting [0]",
        "pg 0.2 is stale+active+undersized+degraded, acting [0]",
        "pg 0.3 is stale+active+undersized+degraded, acting [0]"
    ]
}
HEALTH_WARN 2 osds down; 1 host (2 osds) down; 1 root (2 osds) down; 8 pgs degraded; 8 pgs stale; 8 pgs undersized
osd.0 (root=default,host=gnit) is down
osd.1 (root=default,host=gnit) is down
host gnit (root=default) (2 osds) is down
root default (2 osds) is down
pg 0.7 is stale+active+undersized+degraded, acting [0]
pg 0.6 is stale+active+undersized+degraded, acting [0]
pg 0.5 is stale+active+undersized+degraded, acting [0]
pg 0.4 is stale+active+undersized+degraded, acting [0]
pg 0.0 is stale+active+undersized+degraded, acting [0]
pg 0.1 is stale+active+undersized+degraded, acting [0]
pg 0.2 is stale+active+undersized+degraded, acting [0]
pg 0.3 is stale+active+undersized+degraded, acting [0]
pg 0.7 is stale+active+undersized+degraded, acting [0]
pg 0.6 is stale+active+undersized+degraded, acting [0]
pg 0.5 is stale+active+undersized+degraded, acting [0]
pg 0.4 is stale+active+undersized+degraded, acting [0]
pg 0.0 is stale+active+undersized+degraded, acting [0]
pg 0.1 is stale+active+undersized+degraded, acting [0]
pg 0.2 is stale+active+undersized+degraded, acting [0]
pg 0.3 is stale+active+undersized+degraded, acting [0]
pg 0.7 is stale+active+undersized+degraded, acting [0]
pg 0.6 is stale+active+undersized+degraded, acting [0]
pg 0.5 is stale+active+undersized+degraded, acting [0]
pg 0.4 is stale+active+undersized+degraded, acting [0]
pg 0.0 is stale+active+undersized+degraded, acting [0]
pg 0.1 is stale+active+undersized+degraded, acting [0]
pg 0.2 is stale+active+undersized+degraded, acting [0]
pg 0.3 is stale+active+undersized+degraded, acting [0]
@liewegas

Member

liewegas commented Jun 14, 2017

The detail output may get some duplication and the new checks' detail messages may not be fully explanatory (they are assumed to be nested beneath a summary message), but this is at least in a form that won't break a tool.
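
To illustrate where the duplication comes from, here is a rough sketch of flattening per-check records into the pre-luminous summary/detail shape. It is not the PR's implementation: `check_t`, `legacy_t`, and `to_legacy` are hypothetical names, and only the resulting layout follows the compat-mode output shown earlier in this conversation.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one new-style health check.
struct check_t {
  std::string severity;
  std::string message;
  std::vector<std::string> detail;
};

// Hypothetical stand-in for the old-style output.
struct legacy_t {
  // (severity, summary) pairs, one per check.
  std::vector<std::pair<std::string, std::string>> summary;
  // Every check's detail lines concatenated into one flat list.
  std::vector<std::string> detail;
};

legacy_t to_legacy(const std::map<std::string, check_t>& checks) {
  legacy_t out;
  for (const auto& kv : checks) {
    const check_t& c = kv.second;
    out.summary.emplace_back(c.severity, c.message);
    // The check code is dropped here, so a PG that is listed under several
    // checks appears once per check in the flat list.
    out.detail.insert(out.detail.end(), c.detail.begin(), c.detail.end());
  }
  return out;
}
```

Because the flat list no longer records which check a line came from, a detail line that relies on its parent summary for context loses that context in compat mode.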

@liewegas liewegas changed the title from RFC mon: revamp health check/warning system to mon: revamp health check/warning system Jun 15, 2017

@liewegas

Member

liewegas commented Jun 15, 2017

@jcsp added MDS checks. It was a little tricky to group the MDSHealth events since we only have a single layer of grouping overall and we have items from multiple FSs and MDSs, but I think the final result makes sense...

@gregsfortytwo

I didn't go over the whole implementation but I like the overall structure and the new (to me?) warning messages.

// OSD_DOWN
// OSD_$subtree_DOWN
// OSD_ORPHAN
if (num_osds >= 0) {

@gregsfortytwo

gregsfortytwo Jun 16, 2017

Member

This if block is a little weird following the 'goto out' above, especially because it's so long and covers a bunch of different tests differently than the following blocks.

continue;
if (!osdmap.is_up(i)) {
down_in_osds.insert(i);
int parent_id = 0;

@gregsfortytwo

gregsfortytwo Jun 16, 2017

Member

Is this indentation single-space instead of 2?

<< roundf(max_osd_usage*1000.0)/100.0
<< "%) osd usage " << roundf(diff*1000.0)/100.0 << "% > "
<< roundf(cct->_conf->mon_warn_osd_usage_min_max_delta*1000.0)/100.0
<< " (mon_warn_osd_usage_min_max_delta)";

@gregsfortytwo

gregsfortytwo Jun 16, 2017

Member

This should probably include a link to something explaining PG and OSD balancing.

@liewegas

Member

liewegas commented Jun 16, 2017

retest this please

}
}
out:

@jcsp

jcsp Jun 20, 2017

Contributor

out label unused

@jcsp

Contributor

jcsp commented Jun 20, 2017

I think the 'message' field on each check should be a dict called 'summary' with a 'message' item in it. Then, we can add more fields there with machine-readable summary info. The items in detail should be dicts too, so it's like this:

"PG_STUCK_UNCLEAN": {
  "severity": "HEALTH_ERR",
  "summary": {"message": "13 pgs stuck unclean", "pool_ids": [4], "pg_count": 13},
  "detail": [{"message": "pg blahblah", "pg_id": "4.3"}, ...]
}

The alternative would be to keep the "message" string field and add a separate "metadata" field for the summary metadata, but having one 'summary' object that contains both the human and machine readable toplevel info feels more natural to me.

For example, the summary part of a PG error state might be used to indicate which pools are affected (I claim that while having O(pgs) items in a summary structure is bad, having O(pools) is okay). Ideally, callers would have enough information in the summary dict to reconstruct the message string for themselves if they want to.
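
As a rough sketch of that idea, a structured summary could carry just enough fields for a caller to rebuild the message itself. The type and function names here (`pg_check_summary_t`, `make_message`) and the `state` field are hypothetical; only `pool_ids`, `pg_count`, and the example message come from the proposal above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical machine-readable summary for a PG-related check.
struct pg_check_summary_t {
  std::string state;              // e.g. "stuck unclean" (assumed field)
  uint64_t pg_count = 0;          // number of affected PGs
  std::vector<int64_t> pool_ids;  // O(pools) entries, never O(pgs)
};

// Rebuild the human-readable message from the structured fields alone.
std::string make_message(const pg_check_summary_t& s) {
  return std::to_string(s.pg_count) + " pgs " + s.state;
}
```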

The text of the log messages when health conditions come and go needs to be a lot friendlier, but we can do that in a followup.

@liewegas

Member

liewegas commented Jun 20, 2017

I think that works. One way to enforce the scale of the summary info is to make the summary string include all of the summary info. If pools are enumerated, list them in the summary string too.

I'm not sure of the best way to include this free-form info in the health records. Perhaps we can just encode JSON in a string?

liewegas and others added some commits Jun 28, 2017

mon/PGMap: do not warn about recovering, peering, stale
Wait for stuck before complaining.  These aren't scary in and of
themselves.

Signed-off-by: Sage Weil <sage@redhat.com>
mon/PGMap: some stuck warnings are err, some warn
inactive and stale -> error
degraded, unclean, undersized -> warning

Signed-off-by: Sage Weil <sage@redhat.com>
mon/PGMap: only warn about too few pgs after >0 pools exist
Signed-off-by: Sage Weil <sage@redhat.com>
messages/MMonMgrReport: show health check count
Signed-off-by: Sage Weil <sage@redhat.com>
mon/MgrStatMonitor: show health check count on receipt
Signed-off-by: Sage Weil <sage@redhat.com>
mgr/DaemonServer: debug log health checks
Signed-off-by: Sage Weil <sage@redhat.com>
mon/PGMap: rename a few health checks
Signed-off-by: Sage Weil <sage@redhat.com>
mon: prefix periodic health reminder with 'overall'
...so we can whitelist it.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/OSDMap: rename a few health checks
Signed-off-by: Sage Weil <sage@redhat.com>
mgr: fix spurious PG health messages on mgr restart
Previously, the mgr would send MMonMgrReport indicating
a very unhappy PGMap to the mon right after startup.

This is a change to hold off on sending that report until
all the OSDs have reported in, or until some time has passed.

Signed-off-by: John Spray <john.spray@redhat.com>
mon: prettify health check log messages
Add a "Cluster is now healthy" to give clarity
after a series of "health check cleared" that
they were the last ones.

Convert certain health check messages into
well formed sentences.

Don't print severity in the log string (it's already
expressed in the severity of the log entry).

Signed-off-by: John Spray <john.spray@redhat.com>
mon: demote cluster map prints to DEBUG level
The PaxosService subclasses should be writing out
informative log messages, and not relying on
a stream of map summary prints to communicate
changes.

Signed-off-by: John Spray <john.spray@redhat.com>
mgr/dashboard: update for new style health checks
Signed-off-by: John Spray <john.spray@redhat.com>
mon: simplify PG health checks
Instead of a distinct health check for each possible
PG state, group the states into categories for availability,
degraded, damage, and report on that.

That way, while a PG/pool is suffering from one of those
bad PG states, health conditions don't keep toggling on and
off as we transition from one unavailable state to another
unavailable state.

Signed-off-by: John Spray <john.spray@redhat.com>
osd: don't log per-PG backfill messages at INFO level
This behaviour led to way too many messages going to
the cluster log when an OSD is marked in.  Retain
the messages at debug level.

Signed-off-by: John Spray <john.spray@redhat.com>
mon: clean up `osd out` messages
Cleaner prose for the auto-out case, and add
a cluster log message for OSDs that go out
at the behest of the administrator.

Signed-off-by: John Spray <john.spray@redhat.com>
mon/MgrMonitor: clear last_beacon after mon election
The last_beacon map is local to an election interval; if there is a new
election completed we should reset it or else we may kill an apparently
laggy mgr that hasn't been able to get a beacon processed due to the mon
quorum changing, or had its beacon processed on a different leader.

Signed-off-by: Sage Weil <sage@redhat.com>
qa: whitelist health warnings
Signed-off-by: Sage Weil <sage@redhat.com>
qa/workunits/cephtool/test.sh: adjust for new health error codes
Signed-off-by: Sage Weil <sage@redhat.com>
qa/suites/rbd: whitelist health messages
Signed-off-by: Sage Weil <sage@redhat.com>
qa/suites/rgw/thrash: whitelist
Signed-off-by: Sage Weil <sage@redhat.com>
qa/suites/fs: whitelist health warnings
Signed-off-by: Sage Weil <sage@redhat.com>
qa/tasks/ceph_test_case.py: update health check helpers
Signed-off-by: Sage Weil <sage@redhat.com>
qa/tasks/ceph: wait for osds to come up before creating pool
Avoid health warnings.

Signed-off-by: Sage Weil <sage@redhat.com>
qa/workunits/cephtool/test.sh: adjust full tests to avoid races
OSDs may report fullness in any order.

Signed-off-by: Sage Weil <sage@redhat.com>
mon/PGMap: adjust scrub checks to avoid overflow for future stamps
Avoid an overflow (and false warning) when scrub stamps are in the future.

Signed-off-by: Sage Weil <sage@redhat.com>

@liewegas liewegas merged commit 8859627 into ceph:master Jul 12, 2017

2 of 4 checks passed

  • make check: running make check
  • make check (arm64): running make check
  • Signed-off-by: all commits in this PR are signed
  • Unmodified Submodules: submodules for project are unmodified

@liewegas liewegas deleted the liewegas:wip-health branch Jul 12, 2017
