Dashboard updates #90
Conversation
Changes are as follows:
- ceph-osd-latency --> ceph-osd-information
- ceph-rados --> ceph-cluster
- ceph-frontend --> ceph-pools
Added triggers for OSD node loss and long OSD response times; removed the IO Stall trigger, which was firing too often during scrub.
Graphs now aggregate across Ceph roles, and provide drill-down by role to show network load at each layer.
Renamed to ceph-cluster to better reflect the metrics and data shown on the dashboard.
Changes include:
- support for OSD and disk name to provide filters to the graphs
- an OSD overview row containing:
  - raw capacity panel
  - host.disk -> disk size table
  - host.disk -> osd id
- shortcut links added to other overview-type dashboards
…ance data
Summary row shows:
- OSD count
- OSD up count
- OSDs down
- disk size summary (pie chart showing which sizes of disk are in the cluster)
- table of OSD to disk size
- OSD encryption summary (how many of my OSDs are encrypted?)
- OSD type status (how many OSDs are filestore vs bluestore)
The panel includes an OSD id which is used as a filter for the filestore performance row. The performance row now shows average OSD performance for a single OSD or all OSDs. This can then be used for side-by-side comparison with OSD performance across the cluster at the 95%ile.
OSD capacity by host table added to the summary row, showing capacity by host. Performance data is split into two rows, allowing each one to be hidden independently.
In addition to the link updates:
- scrub state reflects both scrub and deep-scrub status
- the scrub panel no longer turns to "warn" when scrub is active - it's a natural feature of the cluster and not a problem! However, it will turn red if scrub or deep-scrub is disabled
- the disk util panel changed to a graph and switched to average and 95%ile (the 95%ile on a busy cluster just shows too much variance)
- the OSDs panel now links to the ceph-osd-information dashboard
tools/prettyJSON.py
Outdated
@@ -0,0 +1,77 @@
#!/usr/bin/env python2
We don't actually need this; one can simply cat foo.json | python -m 'json.tool'
Agree. The code I added validates that the file has a standard name, checks that the JSON itself is valid, and writes the result to a _new.json version of the file. I'll drop the commit.
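For illustration, a minimal sketch of the validate-and-format step described above (the original script was dropped from this PR); the ceph- filename prefix check is an assumption, not taken from the removed code:

```python
#!/usr/bin/env python
# Sketch only: validate the filename, validate the JSON, and write a
# formatted copy alongside the original as <name>_new.json.
import json
import os
import sys


def pretty_print(path):
    name = os.path.basename(path)
    # Assumed "standard name" check: a ceph-* dashboard exported as .json
    if not (name.startswith("ceph-") and name.endswith(".json")):
        sys.exit("Unexpected dashboard filename: {}".format(name))

    with open(path) as f:
        try:
            dashboard = json.load(f)  # loading also validates the JSON
        except ValueError as err:
            sys.exit("Invalid JSON in {}: {}".format(path, err))

    new_path = path[:-len(".json")] + "_new.json"
    with open(new_path, "w") as f:
        json.dump(dashboard, f, indent=2, sort_keys=True)
        f.write("\n")
    print("Wrote {}".format(new_path))


if __name__ == "__main__":
    pretty_print(sys.argv[1])
```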
Force-pushed from 41bd829 to 72af517
@zmc prettyJSON.py removed
The initial commit did not use the right metric - this was missed due to low load, but is more obvious on larger systems.
In a 600+ OSD environment the charts were based on averageSeries, which was taking a long time. This has now been changed, so the comparison chart only shows current values for a given OSD.
Dashboard fixes and enhancements moving towards a 1.0 release, which includes an update to the default alert triggers.
When deploying, consider the following:
In addition to the updates, the dashboards are now stored as formatted JSON to aid readability, and a tools/prettyJSON.py script has been added to create the formatted JSON from the exported (API) output of Grafana.
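As a rough sketch of that export-and-format workflow: the Grafana URL, API key and dashboard slug below are placeholders, and the /api/dashboards/db/<slug> endpoint matches Grafana 4.x-era deployments, so it may differ on newer versions.

```python
#!/usr/bin/env python
# Sketch: pull a dashboard from the Grafana HTTP API and save it as
# indented JSON, ready to be committed to the repository.
import json

import requests

GRAFANA_URL = "http://grafana.example.com:3000"  # placeholder
API_KEY = "REPLACE_WITH_API_KEY"                 # placeholder
SLUG = "ceph-cluster"                            # example dashboard slug


def export_dashboard(slug):
    resp = requests.get(
        "{}/api/dashboards/db/{}".format(GRAFANA_URL, slug),
        headers={"Authorization": "Bearer {}".format(API_KEY)},
    )
    resp.raise_for_status()
    # The API wraps the definition in a {"meta": ..., "dashboard": ...} object
    dashboard = resp.json()["dashboard"]

    out_file = "{}.json".format(slug)
    with open(out_file, "w") as f:
        json.dump(dashboard, f, indent=2, sort_keys=True)
        f.write("\n")
    print("Wrote {}".format(out_file))


if __name__ == "__main__":
    export_dashboard(SLUG)
```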