Dashboard updates #90

pcuzner · 2017-08-02T02:45:08Z

Dashboard fixes and enhancements moving towards 1.0, including an update to the default alert triggers.
When deploying, consider the following:

The default alerts have changed, so please remove the existing alert-status dashboard before deploying the updated set.
Some dashboards have been removed or renamed. To prevent old dashboards being retained, run dashUpdater in refresh mode and manually remove the following redundant dashboards from Grafana: ceph-rados, ceph-osd-latency, ceph-frontend.
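If the manual clean-up needs to be scripted, it could look roughly like the sketch below. It assumes the slug-based delete endpoint of the Grafana HTTP API (`DELETE /api/dashboards/db/<slug>`), that each dashboard's slug matches its name, and an API key with rights to delete dashboards. None of this is part of the PR, and newer Grafana releases address dashboards by uid, so check it against your version first. `build_delete_requests` is a hypothetical helper that only builds the requests; it does not send them.

```python
import urllib.request

# Dashboards made redundant by this change (from the list above).
REDUNDANT = ["ceph-rados", "ceph-osd-latency", "ceph-frontend"]


def build_delete_requests(base_url, api_key, slugs=REDUNDANT):
    """Build (but do not send) one DELETE request per dashboard slug.

    Assumption: the dashboard slug is the same as the dashboard name.
    """
    requests = []
    for slug in slugs:
        url = "%s/api/dashboards/db/%s" % (base_url.rstrip("/"), slug)
        req = urllib.request.Request(url, method="DELETE")
        req.add_header("Authorization", "Bearer %s" % api_key)
        requests.append(req)
    return requests
```

Each returned request can then be passed to `urllib.request.urlopen()` once the URLs have been checked.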

In addition to the updates, the dashboards are now stored as formatted JSON to help readability, and tools/prettyJSON.py has been added to create the formatted JSON from the exported (API) output of Grafana.

Changes are as follows:
- ceph-osd-latency --> ceph-osd-information
- ceph-rados --> ceph-cluster
- ceph-frontend --> ceph-pools
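The renames above can be captured as a simple lookup, for example when updating links or configuration that still use the old names. This is a minimal sketch; `RENAMED` and `current_name` are illustrative names and not part of this change.

```python
# Old dashboard name -> new dashboard name, as listed in this change.
RENAMED = {
    "ceph-osd-latency": "ceph-osd-information",
    "ceph-rados": "ceph-cluster",
    "ceph-frontend": "ceph-pools",
}


def current_name(name):
    """Return the new name for a renamed dashboard, else the name unchanged."""
    return RENAMED.get(name, name)
```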

Added triggers for OSD node loss and long OSD response times. Removed the IO Stall trigger, which was firing too often during scrub.

Graphs now aggregate across ceph roles, and provide drill down by role to show network load at each layer.
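As a rough illustration of "aggregate by role, then drill down", the sketch below groups made-up per-host samples by Ceph role. The host names, rates and helper names are hypothetical; the real dashboards do this with Grafana queries rather than Python.

```python
from collections import defaultdict

# Made-up samples: (host, ceph role, network bytes/sec).
samples = [
    ("mon-1", "mon", 1200),
    ("osd-1", "osd", 98000),
    ("osd-2", "osd", 102000),
    ("rgw-1", "rgw", 45000),
]


def load_by_role(rows):
    """Aggregate view: total network load per role."""
    totals = defaultdict(int)
    for _host, role, rate in rows:
        totals[role] += rate
    return dict(totals)


def hosts_for_role(rows, role):
    """Drill-down view: per-host load for a single role."""
    return {host: rate for host, r, rate in rows if r == role}
```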

Renamed to ceph-cluster to better reflect the metrics and data shown on the dashboard

Changes include:
- support for osd and disk name to provide filters to the graphs
- add osd overview row containing
  - raw capacity panel
  - host.disk -> disk size table
  - host.disk -> osd id
- shortcut links added to other overview type dashboards

…ance data

Summary row shows:
- OSD count
- OSD up count
- OSDs down
- disk size summary (pie chart showing what sizes of disk are in the cluster)
- table of OSD to disk size
- OSD encryption summary (how many of my OSDs are encrypted?)
- OSD type status (how many OSDs are filestore vs bluestore)

Panel includes an OSD id which is used as a filter for the filestore performance row.

The performance row now shows average OSD performance for a single OSD or all OSDs. This can then be used for side-by-side comparison with OSD performance across the cluster at the 95%ile.
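To make the side-by-side comparison concrete, the sketch below puts one OSD's value next to the cluster-wide 95th percentile. The latency figures are made up, and the nearest-rank percentile used here is an assumption; the dashboards rely on whatever percentile function the metrics backend provides.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, int(math.ceil(pct / 100.0 * len(ordered))))
    return ordered[rank - 1]


# Made-up per-OSD latencies in ms, keyed by OSD id.
latency_ms = {0: 4.0, 1: 5.5, 2: 4.8, 3: 21.0, 4: 5.1}


def compare(osd_id, samples, pct=95):
    """Return (value for one OSD, cluster-wide percentile) for side-by-side display."""
    return samples[osd_id], percentile(list(samples.values()), pct)
```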

OSD capacity by Host table added to the summary row, showing capacity by host. Performance data split into two rows, allowing each one to be hidden independently.

In addition to the link updates:
- scrub state reflects both scrub and deep-scrub status
- the scrub panel no longer turns to "warn" when scrub is active; it's a natural feature of the cluster and not a problem! However, it will turn red if scrub or deep-scrub is disabled
- disk util panel changed to a graph and switched to average and 95%ile (95%ile on a busy cluster just shows too much variance)
- OSDs panel now links to the ceph-osd-information dashboard
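The scrub panel rule described above can be written down as a small decision function. This is only a sketch of the stated behaviour; the function name, its arguments and the state labels are hypothetical, and the real panel is driven by Grafana thresholds.

```python
def scrub_panel_state(scrub_enabled, deep_scrub_enabled, scrub_active=False):
    """Return a hypothetical panel state: 'ok' or 'critical'.

    scrub_active is accepted only to show that it no longer affects
    the result: an active scrub is normal cluster behaviour.
    """
    if not (scrub_enabled and deep_scrub_enabled):
        return "critical"  # shown red: scrub or deep-scrub is disabled
    return "ok"
```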

zmc · 2017-08-02T05:29:55Z

tools/prettyJSON.py

@@ -0,0 +1,77 @@
+#!/usr/bin/env python2


We don't actually need this; one can simply cat foo.json | python -m 'json.tool'

Agree. The code I added validates that the file has a standard name, checks that the JSON itself is valid, and outputs the result to a _new.json version of the file. I'll drop the commit.
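For reference, the behaviour described in this thread (check the name, check the JSON parses, write a _new.json copy) fits in a few lines. This is a minimal sketch and not the removed tools/prettyJSON.py; the `prettify` name and the `.json` suffix check are assumptions standing in for the "standard name" validation.

```python
import json
import os


def prettify(path):
    """Write an indented copy of a JSON file next to the original.

    Hypothetical helper: returns the path of the new '<name>_new.json' file.
    """
    root, ext = os.path.splitext(path)
    if ext != ".json":
        raise ValueError("expected a .json file: %s" % path)

    with open(path) as src:
        data = json.load(src)  # raises ValueError if the JSON is invalid

    new_path = root + "_new.json"
    with open(new_path, "w") as dst:
        json.dump(data, dst, indent=4, sort_keys=True)
        dst.write("\n")
    return new_path
```

The `python -m json.tool` one-liner suggested above covers the formatting step; the sketch only adds the name check and the separate output file.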

pcuzner · 2017-08-02T07:17:45Z

@zmc prettyJSON.py removed

Initial commit did not use the right metric - missed due to low load, more obvious on larger systems!

In a 600+ OSD environment the charts were based on averageSeries, which was taking a long time. This has now been changed so that the comparison chart only shows current values for a given OSD.
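The cost difference is easy to see in a simplified model: a point-by-point average has to read every datapoint of every series, while the current value of one OSD reads a single series. The sketch below assumes equal-length series with no gaps and is not how Graphite implements averageSeries.

```python
def average_series(series_list):
    """Point-by-point mean across all series (simplified averageSeries).

    Work grows with (number of series) x (points per series).
    """
    return [sum(points) / float(len(points)) for points in zip(*series_list)]


def current_value(series):
    """Most recent datapoint of a single series; independent of cluster size."""
    return series[-1]
```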

pcuzner added 12 commits August 2, 2017 13:59

dashboard.yml updated to reflect the new dashboard names

6df7a78

Changes are as follows:
- ceph-osd-latency --> ceph-osd-information
- ceph-rados --> ceph-cluster
- ceph-frontend --> ceph-pools

alert-status : dashboard updated - 2 triggers added

746f08a

Added triggers for OSD node loss and long OSD response times. Removed the IO Stall trigger, which was firing too often during scrub.

network-usage dashboard updated to show load by ceph role

9e52364

Graphs now aggregate across ceph roles, and provide drill down by role to show network load at each layer.

minor updates adding a shortcuts widget to each dashboard

57ea9c7

removed dashboards

2cbef6f

diagram updated to reflect name changes

476e600

replacement version for ceph-rados dashboard

4845cd3

Renamed to ceph-cluster to better reflect the metrics and data shown on the dashboard

ceph-pools: rename of the dashboard that provides pool performance data

ec24b5e

ceph-backend-storage: rows reorganized and additional host tables added

f9faba0

OSD capacity by Host table added to the summary row, showing capacity by host. Performance data split into two rows, allowing each one to be hidden independently.

pcuzner requested a review from zmc August 2, 2017 02:45

This was referenced Aug 2, 2017

Add a per-host filter on OSD node detail? #70

Closed

provide a widget to show ratio of bluestore vs filestore OSDs #83

Closed

zmc requested changes Aug 2, 2017


pcuzner force-pushed the dashboard-updates branch from 41bd829 to 72af517 Compare August 2, 2017 07:12

pcuzner added 2 commits August 3, 2017 16:20

network-usage : fix yaxis units on mon/rgw and osd graphs

be12878

Initial commit did not use the right metric - missed due to low load, more obvious on larger systems!

osd-information: minor fixes for larger environments

5f8992b

In a 600+ OSD environment the charts were based on averageSeries, which was taking a long time. This has now been changed so that the comparison chart only shows current values for a given OSD.

zmc approved these changes Aug 4, 2017


zmc merged commit 86c775d into master Aug 4, 2017

zmc deleted the dashboard-updates branch August 4, 2017 04:18
