Latest updates covering feedback and RFE's #79

Merged
merged 29 commits into from
Jul 24, 2017

Conversation

pcuzner
Contributor

@pcuzner pcuzner commented Jul 21, 2017

Here's a quick summary of the changes in this PR

  • osd collector now reads the 'type' file on the OSD mountpoint to determine whether the OSD is filestore or bluestore. This change allows the osd collector to report OSD stats for the block and block.wal devices used on bluestore based clusters. This change is likely to require an update to the selinux policy
  • mon collector has support for the Luminous release, but requires the monitors to use the "mon_health_preluminous_compat=true" flag
  • PR includes a new dashboard called alert-status. dashUpdater/dashboard.yml have been changed to bypass further updates to this dashboard once it is loaded into grafana - this allows users to customise triggers and update the dashboards without losing their alert thresholds etc
  • cluster flags are now extracted from ceph health and shown in the ceph-rados dashboard (probably not the best place...but baby steps!)
  • nb. a cephmetrics-to-grafana notifier must be defined prior to loading the new configuration - this needs to be done by dashUpdater, but will have to wait until next week!
  • logging is now simplified - each module writes to the same log file, and the log entries provide module/lineno/function name to aid debugging

A separate logging class was used previously, but with it in place the
filename/funcName attributes always showed the method in the logging
class, not the file/function of the calling module.

This base class now just grabs the 'cephmetrics' named logger set up in
the collectd parent module, which is inherited by the OSDs/RGW and Mon
classes.
During collectd registration, the module defines the cephmetrics
logger, which is then used by the OSDs/RGW and Mon collectors. Now all
collectors write to a single logging instance.
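The shared-logger pattern described above can be sketched as follows. This is a minimal illustration, not the project's actual code; the log file path and class name here are hypothetical:

```python
import logging
import os
import tempfile

# The collectd parent module configures the 'cephmetrics' logger once.
# The log path is illustrative; the real module logs to its own file.
LOG_FILE = os.path.join(tempfile.gettempdir(), 'cephmetrics.log')

logger = logging.getLogger('cephmetrics')
handler = logging.FileHandler(LOG_FILE)
handler.setFormatter(logging.Formatter(
    '%(asctime)s %(module)s:%(lineno)d %(funcName)s %(message)s'))
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)


class BaseCollector(object):
    """Collectors (OSDs/RGW/Mon) grab the shared logger instead of
    defining their own logging class, so %(module)s/%(funcName)s in a
    record reflect the calling module, not a logging wrapper."""
    def __init__(self):
        self.logger = logging.getLogger('cephmetrics')
```

Because `logging.getLogger` returns the same instance for a given name, every collector inheriting from the base class writes through the one handler.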
…fo added

Instead of using 'ceph osd tree' to determine the unique hosts in the
cluster, 'ceph osd dump' is used. The dump provides the IP address of
the owner of each OSD, so this can be used with a python set to derive
the unique IPs - which implies the number of physical hosts in the
cluster.

'osd tree' is problematic when the crush map is used for things like
pools, since the hostnames in the map are logical to crush, not
physical names within the cluster.
num_osds is now stored for each OSD host. This allows further analysis
in Grafana, e.g. disk-to-host ratios
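A minimal sketch of the set-based host derivation described above, assuming the JSON shape of `ceph osd dump --format json` (an `osds` array whose entries carry a `public_addr` of the form `ip:port/nonce`); the function name is hypothetical:

```python
import json

def unique_osd_hosts(osd_dump_json):
    """Derive hosts from 'ceph osd dump' output: collect the IP
    (minus port/nonce) of each OSD's public address, counting OSDs
    per host so disk-to-host ratios can be charted."""
    dump = json.loads(osd_dump_json)
    hosts = {}
    for osd in dump.get('osds', []):
        # "10.0.0.1:6800/1234" -> "10.0.0.1"
        ip = osd['public_addr'].split(':')[0]
        hosts[ip] = hosts.get(ip, 0) + 1   # num_osds per host
    return hosts

# Illustrative input: 3 OSDs spread across 2 physical hosts
sample = json.dumps({'osds': [
    {'osd': 0, 'public_addr': '10.0.0.1:6800/1'},
    {'osd': 1, 'public_addr': '10.0.0.1:6804/1'},
    {'osd': 2, 'public_addr': '10.0.0.2:6800/1'},
]})
```

The number of unique keys is the physical host count, regardless of how the crush map renames hosts logically.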
- Client and Disk IOPS metric type changed to None
- Query for RGW hosts and OSD Hosts corrected
previous value was 2, used for testing
- dropped queue len (not showing any useful data under load)
- switched failed http requests from singlestat to a graph to show ALL
  failures by host, so you can see which host may be failing more
  requests (in overview row)
- switched network singlestat to a graph to allow user to hover over
  the panel to get stats
- layout order changes to minimize layout issues with graphs changing
  dimensions with larger legend sizes
- fix requests size graph - to report per sec, the value needs to be
  scaled by 0.1 (i.e. divide by 10)
- switched failed requests in the detailed row from a singlestat to
  a graph
Initially the intention was that this dashboard would be opened by
drilling through higher level dashboards. However, it is being used
directly, so the osd_servers and disks template variables needed to
be exposed to make the charts usable. Without these variables selectable
the charts are unreadable.

In addition, an OSD count and a breakdown of the disk -> OSD id are
provided, tied to the same templating variables
- Pool Name variable now defined instead of showing as pool_name
- top 5 pool metrics repositioned to the 2nd row on the dashboard for
  easier visibility of high demand pools
- pool_name now part of the metric names in the overview row queries,
  so a change to the template variable zooms in both the overview and
  the per pool row information
On a healthy osd host, an admin socket should exist for all OSD
daemons. However, if there are problems the osd daemon may not have
an admin socket - if this is the case we raise an exception back to
the caller (e.g. collectd).
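The check could look something like the sketch below. The function name and `RuntimeError` choice are illustrative, not the project's actual code; the default socket path follows the usual ceph convention:

```python
import os

def get_admin_socket(osd_id, socket_dir='/var/run/ceph'):
    """Return the admin socket path for an OSD. If the socket is
    missing, the daemon likely has a problem, so raise rather than
    silently reporting incomplete stats - the caller (e.g. collectd)
    then surfaces the failure."""
    path = os.path.join(socket_dir, 'ceph-osd.{}.asok'.format(osd_id))
    if not os.path.exists(path):
        raise RuntimeError(
            'admin socket missing for osd.{} - daemon problem?'.format(osd_id))
    return path
```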
Capacity Utilisation was being calculated incorrectly; this update fixes
that problem.

A new pie-chart panel showing reads vs writes has been added to help
understand the workload ratio - the downside is the apply+commit ratio
panel is now smaller, so additional help text has been added.
Bluestore OSDs are now detected and the relevant device link followed
to the backing device and journal. With this commit the perf stats are
*not* included - these will follow in a subsequent commit.

NB. bluestore perf counters are different from filestore, so once the
counters are supported a new dashboard will be needed specific to
bluestore OSDs
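Following the device links can be sketched as below. This assumes the standard bluestore on-disk layout, where `block` (and optionally `block.wal`) under the OSD mountpoint are symlinks to the backing devices; the function name is hypothetical:

```python
import os

def bluestore_devices(osd_path):
    """Resolve the 'block' and (optional) 'block.wal' symlinks under a
    bluestore OSD mountpoint to their backing devices, so stats can be
    reported against the real block devices."""
    devices = {}
    for link in ('block', 'block.wal'):
        link_path = os.path.join(osd_path, link)
        if os.path.islink(link_path):
            devices[link] = os.path.realpath(link_path)
    return devices
```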
Dashboard contains 7 triggers for different scenarios, linking to a
'cephmetrics' notifier. It's worth noting that even if the notification
definition for cephmetrics is not in place, the alerts still show.
…ently

The dashboard.yml file now contains an _alert_dashboard setting which
identifies the dashboard that will hold the alert triggers. Once deployed,
this dashboard is excluded from further runs of dashUpdater, so that users
may add their own triggers and alerts without them being overwritten by
a dashboard refresh
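The skip logic might look like the following sketch. The function name and config shape here are illustrative (only the `_alert_dashboard` key comes from the commit message):

```python
def dashboards_to_update(config, existing_in_grafana):
    """Return the dashboards dashUpdater should (re)load. The dashboard
    named by _alert_dashboard is skipped once it is already present in
    grafana, so user-defined triggers and thresholds survive refreshes."""
    alert_dash = config.get('_alert_dashboard')
    updates = []
    for dash in config.get('dashboards', []):
        if dash == alert_dash and dash in existing_in_grafana:
            continue   # never overwrite a deployed alert dashboard
        updates.append(dash)
    return updates
```

On first run the alert dashboard is loaded like any other; on subsequent runs its presence in grafana keeps it out of the update list.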
Changes to the collector are as follows:
- rbd scan now skips default.rgw* pools...no point trying!
- added 1s timeout to the rbd scan thread (stops blocking on unresponsive mon)
- ceph version is detected
- health checks tweaked based on pre or post Luminous release (v12)
- added cluster flags like noout/norecover/noscrub
- added pg stat information to determine num_pgs_stuck value
- status panel time override now set to 90s (60s not enough on some environments)
  nb. the time override prevents values from being interpolated by the
      fetch in the status panel code
- recovery panel shrunk to span=1, making room for deep scrub status
- singlestat panel added for the status of deep scrub
- singlestat spark lines now all blue for consistency
- fixed pie chart colors by removing the 'other' category
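The pre/post Luminous branch mentioned above could be sketched as a simple major-version check. This is an illustrative helper, not the collector's actual code; it assumes the `ceph version` output format, e.g. `ceph version 12.1.0 (hash)`:

```python
def health_check_style(version_str):
    """Pick the health-check parsing style by release: Luminous (v12)
    changed the 'ceph health' output, so pre-v12 clusters need the old
    parsing (or the mon_health_preluminous_compat flag)."""
    # "ceph version 12.1.0 (hash)" -> "12.1.0" -> 12
    major = int(version_str.split()[2].split('.')[0])
    return 'luminous' if major >= 12 else 'pre-luminous'
```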
Query was showing OSDs that have been removed from the system
Cluster flags/features show as singlestat panels alongside the monitor
state. The flag states are 0 (enabled/inactive), 1 (active) and 2 (disabled).

Thresholds are used on these panels to indicate the above states
Spark lines now blue matching the at-a-glance view for consistency
Member

@zmc zmc left a comment


Seeing things like:

Jul 21 21:35:21 magna126 collectd[8721]: Unhandled python exception in read callback: IOError: [Errno 2] No such file or directory: '/var/lib/ceph/osd/ceph-7/type'

Templating had a reference to a test server hard coded, resulting in
failed queries. This fix replaces it with $domain
The presence of the type file was being relied upon across versions.
However, not all versions provide this file (10.2.2 did, 10.2.7 didn't!), so
this fix looks for type and uses it if it's there; if not, it looks
for the presence of the journal link to determine if the osd
is filestore. It is assumed that bluestore will 'always' use the type
file.
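The fallback described above can be sketched as follows. The function name is hypothetical, but the logic mirrors the commit message: prefer the `type` file, fall back to the `journal` link, and raise the `Unrecognised OSD type` error seen in the review otherwise:

```python
import os

def osd_store_type(osd_path):
    """Determine filestore vs bluestore for an OSD mountpoint.
    Prefer the 'type' file (assumed always present for bluestore);
    on versions without it (e.g. 10.2.7), the presence of a journal
    link implies filestore."""
    type_file = os.path.join(osd_path, 'type')
    if os.path.exists(type_file):
        with open(type_file) as f:
            return f.read().strip()
    if os.path.exists(os.path.join(osd_path, 'journal')):
        return 'filestore'
    raise ValueError('Unrecognised OSD type')
```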
@pcuzner
Contributor Author

pcuzner commented Jul 21, 2017

@zmc see last commit for a fix to the type problem. On my local system (10.2.2) it's there, not sure why it's not there on 10.2.7? Anyway, the code now makes a check first and falls back to the presence of the journal file to determine whether the osd is filestore or not.

@zmc
Member

zmc commented Jul 21, 2017

@pcuzner Jul 21 22:29:44 magna126 collectd[463]: Unhandled python exception in read callback: ValueError: Unrecognised OSD type

@zmc
Member

zmc commented Jul 21, 2017

[root@magna126 ~]# ls -l /var/lib/ceph/osd/*/journal
lrwxrwxrwx. 1 ceph ceph 58 Dec  1  2014 /var/lib/ceph/osd/ceph-36/journal -> /dev/disk/by-partuuid/38f66a37-12ee-432a-9456-bd935a5d02b2
lrwxrwxrwx. 1 ceph ceph 58 Dec  1  2014 /var/lib/ceph/osd/ceph-38/journal -> /dev/disk/by-partuuid/8c146771-85bd-4e8d-a89a-82d10c4f41e6
lrwxrwxrwx. 1 ceph ceph 58 Dec  1  2014 /var/lib/ceph/osd/ceph-39/journal -> /dev/disk/by-partuuid/c204b98e-f124-46a9-9541-2bbe5d7804fb
lrwxrwxrwx. 1 ceph ceph 58 Dec  1  2014 /var/lib/ceph/osd/ceph-40/journal -> /dev/disk/by-partuuid/2aad7274-c05e-4a84-92c7-28c24841e728
lrwxrwxrwx. 1 ceph ceph 58 Oct 23  2014 /var/lib/ceph/osd/ceph-6/journal -> /dev/disk/by-partuuid/8f49e5cc-af74-4f63-b074-0641d995da37
lrwxrwxrwx. 1 ceph ceph 58 Oct 23  2014 /var/lib/ceph/osd/ceph-7/journal -> /dev/disk/by-partuuid/1515d232-7bd5-46b9-84b8-8465f208f931
lrwxrwxrwx. 1 ceph ceph 58 Oct 23  2014 /var/lib/ceph/osd/ceph-8/journal -> /dev/disk/by-partuuid/77bb4a4e-390d-43fe-a84c-dc2b324c7a1e
[root@magna126 ~]# file -L /var/lib/ceph/osd/*/journal
/var/lib/ceph/osd/ceph-36/journal: block special
/var/lib/ceph/osd/ceph-38/journal: block special
/var/lib/ceph/osd/ceph-39/journal: block special
/var/lib/ceph/osd/ceph-40/journal: block special
/var/lib/ceph/osd/ceph-6/journal:  block special
/var/lib/ceph/osd/ceph-7/journal:  block special
/var/lib/ceph/osd/ceph-8/journal:  block special

@pcuzner
Contributor Author

pcuzner commented Jul 22, 2017

@zmc SELinux fun: switching to permissive, the collector is fine.

@zmc
Member

zmc commented Jul 22, 2017 via email

dashUpdater has been updated to automatically set up a cephmetrics
notifications channel (if it's not already there), and the alert-status
dashboard is loaded, which references the cephmetrics channel.

The ansible templates have been updated to reflect the introduction of the
alert-status dashboard
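Setting up the channel presumably means posting a notification definition to grafana's alerting API (POST `/api/alert-notifications` in grafana 4.x). A sketch of building that payload; the channel type and settings here are assumptions, only the channel name 'cephmetrics' comes from the commits:

```python
import json

def notification_payload(name='cephmetrics'):
    """Build the JSON body for creating a grafana notification channel.
    dashUpdater would first list existing channels and only create this
    one 'if it's not already there'; the alert-status dashboard then
    references the channel by name."""
    return json.dumps({
        'name': name,
        'type': 'email',     # assumed channel type; deployment-specific
        'isDefault': False,
        'settings': {},      # e.g. addresses - deployment-specific
    })
```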
@pcuzner
Contributor Author

pcuzner commented Jul 24, 2017

@zmc please review. The last commit brings in the alert-status dashboard and creates a default cephmetrics notification channel.

zmc added 2 commits July 24, 2017 16:18
So that our SELinux policy can properly allow collectors to detect
whether an OSD uses filestore or bluestore

Signed-off-by: Zack Cerza <zack@redhat.com>
The collectors need to be able to determine whether an OSD uses
filestore or bluestore

Signed-off-by: Zack Cerza <zack@redhat.com>
@zmc
Member

zmc commented Jul 24, 2017

Looks like my commits resolve the SELinux issue.

Member

@zmc zmc left a comment


Thanks @pcuzner !
