Dashboard updates #90

Merged
merged 14 commits into from
Aug 4, 2017

pcuzner
Contributor

@pcuzner pcuzner commented Aug 2, 2017

Dashboard fixes and enhancements moving towards 1.0, including an update to the default alert triggers.
When deploying, consider the following:

  1. The default alerts have changed, so please remove the existing alert-status dashboard before deploying the updated set.
  2. Some dashboards have been removed or renamed. To prevent old dashboards being retained, run dashUpdater in refresh mode and manually remove the following redundant dashboards from Grafana: ceph-rados, ceph-osd-latency, ceph-frontend.
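
The manual removal in step 2 could also be scripted. The following is a minimal sketch, not part of this PR, assuming a Grafana release that still exposes the slug-based `DELETE /api/dashboards/db/:slug` endpoint, that each dashboard's slug matches the name listed above, and that an API key is available. The host name, the key, and the helper `build_delete_requests` are hypothetical. It is written for Python 3 and only builds the requests; nothing is sent unless the `urlopen` line is uncommented.

```python
import urllib.request

# Dashboards named in this PR as redundant after the rename.
REDUNDANT = ["ceph-rados", "ceph-osd-latency", "ceph-frontend"]


def build_delete_requests(base_url, api_key, slugs=REDUNDANT):
    """Build (but do not send) one DELETE request per dashboard slug.

    Assumption: the slug is identical to the dashboard name.
    """
    requests = []
    for slug in slugs:
        url = "{}/api/dashboards/db/{}".format(base_url.rstrip("/"), slug)
        req = urllib.request.Request(url, method="DELETE")
        req.add_header("Authorization", "Bearer " + api_key)
        requests.append(req)
    return requests


if __name__ == "__main__":
    # Hypothetical host and key; replace with real values before sending.
    for req in build_delete_requests("http://grafana.example.com:3000", "API_KEY"):
        print(req.get_method(), req.full_url)
        # urllib.request.urlopen(req)  # uncomment to actually delete
```

Building the requests separately from sending them makes it easy to review exactly which dashboards would be deleted before touching the Grafana instance.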

In addition to the updates, the dashboards are now stored as formatted JSON to help readability, and tools/prettyJSON.py has been added to create the formatted JSON from the exported (API) output of Grafana.

Changes are as follows:
ceph-osd-latency --> ceph-osd-information
ceph-rados --> ceph-cluster
ceph-frontend --> ceph-pools
Added triggers for OSD node loss and long OSD response times
Removed the IO Stall trigger - it was firing too often during scrub
Graphs now aggregate across ceph roles, and provide drill down by role
to show network load at each layer.
Renamed to ceph-cluster to better reflect the metrics and data shown on
the dashboard
Changes include:
- support for osd and disk name to provide filters to the graphs
- add osd overview row containing
  - raw capacity panel
  - host.disk -> disk size table
  - host.disk -> osd id
- shortcut links added to other overview type dashboards
…ance data

Summary row shows;
- osd count
- osd up count
- osd down count
- disk size summary (pie chart showing what sizes of disk are in the cluster)
- table of osd to disk size
- OSD encryption summary (how many of my OSDs are encrypted?)
- OSD type status (how many OSDs are filestore vs bluestore?)

Panel includes an OSD id which is used as a filter for the filestore
performance row

The performance row now shows average OSD performance for a single OSD or
all OSDs. This can then be used for side-by-side comparison with OSD
performance across the cluster at the 95%ile.
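
To illustrate the comparison described here, the sketch below computes an average and a 95th percentile over a set of per-OSD latency samples. It is an illustration only: the dashboards compute these values in the metrics backend, not in Python, and the nearest-rank percentile method, the function names, and the sample values are all assumptions made for this example.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty list of samples."""
    return sum(values) / float(len(values))


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, int(math.ceil(pct * len(ordered) / 100.0)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical per-OSD latencies in milliseconds; one slow outlier.
    latencies = [4, 5, 5, 6, 6, 7, 7, 8, 9, 40]
    print("average:", mean(latencies))
    print("95%ile :", percentile(latencies, 95))
```

Placing the two figures side by side shows how a single slow OSD moves the 95th percentile far more than it moves the average.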
OSD capacity by Host table added to the summary row, showing capacity by
host

Performance data split into two rows, allowing each one to be hidden
independently
In addition to the link updates
- scrub state reflects both scrub and deep-scrub status
- the scrub panel no longer turns to "warn" when scrub is active - it's
  a natural feature of the cluster and not a problem! However, it will
  turn red if scrub or deep-scrub is disabled
- disk util panel changed to a graph and switched to average and 95%ile
  (95%ile on a busy cluster just shows too much variance)
- OSDs panel now links to the ceph-osd-information dashboard
@@ -0,0 +1,77 @@
#!/usr/bin/env python2
Member

We don't actually need this; one can simply cat foo.json | python -m 'json.tool'

Contributor Author

Agree. The code I added validates that the file has a standard name, checks that the JSON itself is valid, and writes the result to a _new.json version of the file. I'll drop the commit.
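
For reference, the three behaviours described in this comment can be sketched in a few lines. This is a hypothetical reconstruction, not the removed tools/prettyJSON.py: the function name `pretty_copy` is invented here, and "standard name" is assumed to mean a `.json` extension.

```python
import json
import os


def pretty_copy(path):
    """Validate the name and the JSON, then write an indented copy.

    Returns the path of the new file, e.g. foo.json -> foo_new.json.
    """
    root, ext = os.path.splitext(path)
    if ext != ".json":  # assumption: a "standard name" ends in .json
        raise ValueError("expected a .json file: " + path)
    with open(path) as src:
        data = json.load(src)  # raises ValueError if the JSON is invalid
    new_path = root + "_new.json"
    with open(new_path, "w") as dst:
        json.dump(data, dst, indent=2)
        dst.write("\n")
    return new_path
```

When only the formatting is needed, the `json.tool` one-liner suggested in the review does the same job without any extra code.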

@pcuzner
Contributor Author

pcuzner commented Aug 2, 2017

@zmc prettyJSON.py removed

Initial commit did not use the right metric - missed due to low load, more
obvious on larger systems!
In a 600+ OSD environment the charts were based on averageSeries, which
was taking a long time. This has now been changed, so the comparison
chart only shows current values for a given OSD.
@zmc zmc merged commit 86c775d into master Aug 4, 2017
@zmc zmc deleted the dashboard-updates branch August 4, 2017 04:18