Latest updates covering feedback and RFEs #79
Conversation
A separate logging class was used, but with this in place the filename/funcName parameters always showed the method in the logging class, not the file/function of the calling module. This base class now just grabs the cephmetrics named logger set up in the collectd parent module, which is inherited by the OSDs/RGW and Mon classes.
During collectd registration, the module defines the cephmetrics logger which is then used by the OSDs/RGW and Mon collectors. Now all collectors write to a single logging instance.
References to the separate logging module have been removed.
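For reference, a minimal sketch of the pattern described above: the parent module defines a single named logger at registration time and the base class simply looks it up. The class name and handler details here are illustrative, not the project's exact code.

```python
import logging

# In the collectd parent module, during plugin registration: define the
# shared 'cephmetrics' logger once.
logger = logging.getLogger('cephmetrics')
logger.setLevel(logging.INFO)

# In the base class inherited by the OSDs/RGW/Mon collectors: re-use that
# logger rather than wrapping it in a separate logging class, so the
# %(filename)s / %(funcName)s format fields point at the calling collector.
class BaseCollector(object):   # hypothetical class name
    def __init__(self):
        self.logger = logging.getLogger('cephmetrics')

    def log_sample(self):
        self.logger.debug("collector sample written")
```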
…fo added
Instead of using 'ceph osd tree' to determine the unique hosts in the cluster, 'ceph osd dump' is used. The dump provides the IP address of the owner of each OSD, so a python set can be used to derive the unique IPs, which gives the number of physical hosts in the cluster. 'osd tree' is problematic when the crush map is used for things like pools, since the hostnames in the map are logical to crush, not physical names within the cluster.
num_osds is now stored for each OSD host. This allows further analysis in Grafana, e.g. disk-to-host ratios.
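A rough sketch of the approach, assuming `ceph osd dump --format json` output; the field names follow the standard osd dump schema, but the helper itself is illustrative rather than the collector's actual code.

```python
import json
import subprocess

def osd_host_ips():
    """Return the set of unique OSD host IPs reported by 'ceph osd dump'."""
    raw = subprocess.check_output(['ceph', 'osd', 'dump', '--format', 'json'])
    dump = json.loads(raw.decode('utf-8'))
    ips = set()
    for osd in dump.get('osds', []):
        # public_addr looks like "10.0.0.1:6800/12345" - keep only the IP,
        # so each physical host is counted once regardless of its OSD count.
        addr = osd.get('public_addr', '')
        if addr:
            ips.add(addr.split(':')[0])
    return ips

# len(osd_host_ips()) then gives the number of physical OSD hosts.
```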
- Client and Disk IOPS metric type changed to None
- Query for RGW hosts and OSD hosts corrected
previous value was 2, used for testing
- Dropped queue length (not showing any useful data under load)
- Switched failed HTTP requests from a singlestat to a graph showing ALL failures by host, so you can see which host may be failing more requests (in the overview row)
- Switched the network singlestat to a graph so the user can hover over the panel to get stats
- Layout order changes to minimize layout issues with graphs changing dimensions with larger legend sizes
- Fixed the request size graph: to report per second, the value needs to be scaled by 0.1 (i.e. divided by 10)
- Switched failed requests in the detailed row from a singlestat to a graph
Initially the intention was that this dashboard would be opened by drilling through higher-level dashboards. However, it is being used directly, so the osd_servers and disks template variables needed to be exposed to make the charts usable; without these variables selectable the charts are unreadable. In addition, an OSD count and a disk-to-OSD-id breakdown are provided, tied to the same template variables.
- Pool Name variable is now defined instead of showing as pool_name
- Top 5 pool metrics repositioned to the 2nd row of the dashboard for easier visibility of high-demand pools
- pool_name is now part of the metric names in the overview row queries, so a change to the template variable zooms in both the overview and the per-pool row information
On a healthy OSD host, an admin socket should exist for all OSD daemons. However, if there are problems an OSD daemon may not have an admin socket; if this is the case we raise an exception back to the caller (e.g. collectd).
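As a rough illustration (the socket path pattern follows the usual /var/run/ceph convention, and the exception type is a placeholder, not necessarily what the collector raises):

```python
import os

def admin_socket(osd_id, socket_dir='/var/run/ceph'):
    """Return the admin socket path for an OSD, or raise if it is missing."""
    path = os.path.join(socket_dir, 'ceph-osd.{}.asok'.format(osd_id))
    if not os.path.exists(path):
        # No admin socket means the OSD daemon isn't healthy/running, so
        # surface that to the caller (collectd) instead of silently skipping.
        raise RuntimeError("admin socket missing for osd.{}".format(osd_id))
    return path
```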
Capacity Utilisation was being calculated incorrectly; this update fixes that problem. A new pie-chart panel showing reads vs writes has been added to help understand the workload ratio. The downside is that the apply+commit ratio panel is now smaller, so additional help text has been added.
Bluestore OSDs are now detected and the relevant device link followed to the backing device and journal. With this commit the perf stats are *not* included - this will be a subsequent commit. NB. bluestore perf counters are different to filestore, so once the counters are supported a new dashboard will be needed specific to bluestore OSDs.
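Roughly, the link-following works as below; the link names ('block', 'block.db', 'block.wal') are the standard bluestore OSD directory entries, but the helper is a sketch rather than the code in this commit.

```python
import os

def bluestore_devices(osd_path):
    """Map each bluestore link in the OSD directory to its backing device."""
    devices = {}
    for link in ('block', 'block.db', 'block.wal'):
        link_path = os.path.join(osd_path, link)
        if os.path.islink(link_path):
            # Resolve the symlink to the real block device, e.g. /dev/sdb2
            devices[link] = os.path.realpath(link_path)
    return devices
```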
The dashboard contains 7 triggers for different scenarios, linking to a 'cephmetrics' notifier. It's worth noting that if the notification definition is not in place for cephmetrics, the alerts still show.
…ently
The dashboard.yml file now contains an _alert_dashboard setting which identifies the dashboard that will hold the alert triggers. Once deployed, this dashboard is excluded from further runs of dashUpdater so that users may add their own triggers and alerts without them being overwritten by a dashboard refresh.
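A hypothetical sketch of the exclusion, assuming dashboard.yml lists the dashboards under a 'dashboards' key (that key, and the helper itself, are assumptions; only the _alert_dashboard setting comes from the text above):

```python
import yaml

def dashboards_to_refresh(config_path='dashboard.yml', already_deployed=()):
    """Return the dashboards a dashUpdater run should refresh, skipping the
    alert dashboard once it has been deployed."""
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    alert_dash = cfg.get('_alert_dashboard')
    refresh = []
    for name in cfg.get('dashboards', []):   # 'dashboards' key is an assumption
        if name == alert_dash and name in already_deployed:
            continue                         # don't overwrite user-added triggers
        refresh.append(name)
    return refresh
```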
Changes to the collector are as follows:
- rbd scan now skips default.rgw* pools...no point trying!
- Added a 1s timeout to the rbd scan thread, which stops it blocking on an unresponsive mon (see the sketch below)
- Ceph version is detected
- Health checks tweaked based on pre- or post-Luminous release (v12)
- Added cluster flags like noout/norecover/noscrub
- Added pg stat information to determine the num_pgs_stuck value
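A minimal sketch of the 1s-timeout idea from the list above: run the scan in a worker thread and give up if it hasn't returned in time. The function and parameter names are placeholders, not the collector's real code.

```python
import threading

def scan_with_timeout(scan_fn, pool, timeout=1.0):
    """Run scan_fn(pool) in a thread; return None if it exceeds the timeout."""
    result = {}

    def worker():
        result['value'] = scan_fn(pool)

    t = threading.Thread(target=worker)
    t.daemon = True           # don't let a stuck scan keep the process alive
    t.start()
    t.join(timeout)
    if t.is_alive():
        # Scan is blocked (e.g. unresponsive mon) - give up rather than
        # stalling the whole collectd read callback.
        return None
    return result.get('value')
```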
- Status panel time override now set to 90s (60s was not enough on some environments); nb. the time override prevents values from being interpolated by the fetch in the status panel code
- Recovery panel shrunk to span=1, making room for deep scrub status
- Singlestat panel added for the status of deep scrub
- Singlestat spark lines are now all blue for consistency
- Fixed colors in the pie chart by removing the 'other' category
The query was showing OSDs that have been removed from the system.
Cluster flags/features are shown as singlestat panels alongside the monitor state. The flag states are 0 (enabled/inactive), 1 (active) and 2 (disabled); thresholds are used on these panels to indicate those states.
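One possible encoding of those states on the collector side (purely illustrative; the actual mapping used by the panels may differ):

```python
def flag_state(flag, active_flags, disabled_features):
    """Encode a cluster flag/feature as 0 (enabled/inactive), 1 (active),
    or 2 (disabled) so singlestat thresholds can colour it."""
    if flag in disabled_features:
        return 2
    if flag in active_flags:
        return 1
    return 0
```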
Spark lines now blue matching the at-a-glance view for consistency
Seeing things like:
Jul 21 21:35:21 magna126 collectd[8721]: Unhandled python exception in read callback: IOError: [Errno 2] No such file or directory: '/var/lib/ceph/osd/ceph-7/type'
The templating had a hard-coded reference to a test server, resulting in failed queries. This fix replaces it with $domain.
The presence of the type file was being relied upon across versions. However, not all versions provide this file (10.2.2 did, 10.2.7 didn't!), so this fix looks for type and uses it if it's there; if not, it looks for the presence of the journal link to determine whether the OSD is filestore. It is assumed that bluestore will 'always' use the type file.
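In outline, the fallback looks something like this (a sketch of the logic described above, not the exact patch):

```python
import os

def osd_object_store(osd_path):
    """Return 'filestore', 'bluestore' or 'unknown' for an OSD directory."""
    type_file = os.path.join(osd_path, 'type')
    if os.path.exists(type_file):
        # Most OSDs record their object store type here.
        with open(type_file) as f:
            return f.read().strip()          # 'filestore' or 'bluestore'
    # Some filestore OSDs (e.g. 10.2.7) lack the type file, but they always
    # have a journal link; bluestore is assumed to always write 'type'.
    if os.path.exists(os.path.join(osd_path, 'journal')):
        return 'filestore'
    return 'unknown'
```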
@zmc see the last commit for a fix to the type problem. On my local system (10.2.2) it's there; not sure why it's not there on 10.2.7? Anyway, the code now makes a check first and falls back to the presence of the journal file to determine whether the OSD is filestore or not.
@pcuzner
@zmc SELinux fun. Switching to permissive, the collector is fine.
@pcuzner oh! I missed that somehow on this error. I'll come up with a fix
for that.
dashUpdater has been updated to automatically set up a cephmetrics notification channel (if it's not already there), and the alert-status dashboard is loaded, which references the cephmetrics channel. The ansible templates have been updated to reflect the introduction of the alert-status dashboard.
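For illustration, an idempotent channel setup might look like the sketch below, using Grafana's alert-notification HTTP API; the URL, credentials and notifier settings are placeholders, and the exact calls dashUpdater makes may differ.

```python
import requests

GRAFANA = 'http://localhost:3000'     # placeholder URL
AUTH = ('admin', 'admin')             # placeholder credentials

def ensure_cephmetrics_channel():
    """Create the 'cephmetrics' notification channel if it doesn't exist."""
    existing = requests.get(GRAFANA + '/api/alert-notifications', auth=AUTH).json()
    if any(ch.get('name') == 'cephmetrics' for ch in existing):
        return                        # channel already present - nothing to do
    requests.post(GRAFANA + '/api/alert-notifications', auth=AUTH, json={
        'name': 'cephmetrics',
        'type': 'email',              # assumed notifier type
        'settings': {'addresses': 'admin@example.com'},
    })
```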
@zmc please review. The last commit brings in the alert-status dashboard and creates a default cephmetrics notification channel.
So that our SELinux policy can properly allow collectors to detect whether an OSD uses filestore or bluestore.
Signed-off-by: Zack Cerza <zack@redhat.com>
The collectors need to be able to determine whether an OSD uses filestore or bluestore.
Signed-off-by: Zack Cerza <zack@redhat.com>
Looks like my commits resolve the SELinux issue.
Thanks @pcuzner !
Here's a quick summary of the changes in this PR