update documentation for new Python package

1 parent b124744 commit 0227dc1fd4088a02b895ee12ccb4f85ee0fd0157 Michael Conigliaro committed May 10, 2011
Showing with 2 additions and 583 deletions.
  1. +2 −199 nagios/README.rst
  2. +0 −384 nagios/check_ganglia_metric
201 nagios/README.rst
@@ -2,202 +2,5 @@
check_ganglia_metric
====================
-
-Introduction
-------------
-
-**check_ganglia_metric** is a `Nagios <http://nagios.org/>`_ plugin that allows
-you to trigger alerts on any Ganglia metric. This plugin was heavily inspired
-by `Vladimir Vuksan <http://vuksan.com>`_'s check_ganglia_metric.php, but it
-comes with a number of improvements.
-
-
-Requirements
-------------
-
-#. Python >= 2.6
-#. `NagAconda <http://pypi.python.org/pypi/NagAconda>`_ >= 0.1.4
-
-To check which version of Python you have:
-
-::
-
- python -V
-
-To install NagAconda:
-
-::
-
- pip install NagAconda
-
-...or:
-
-::
-
- easy_install NagAconda
-
-
-
-Ganglia Configuration
----------------------
-
-Unless your Nagios server and Ganglia Meta Daemon are running on the same host,
-you probably need to edit your **gmetad.conf** to allow remote connections from
-your Nagios server.
-
-To allow connections from **nagios-server.example.com**:
-
-::
-
- trusted_hosts nagios-server.example.com
-
-To allow connections from **all hosts** (probably a security risk):
-
-::
-
- all_trusted on
-
-
-Testing on the Command Line
----------------------------
-
-First, let's see if **check_ganglia_metric** can communicate with the Ganglia
-Meta Daemon:
-
-::
-
- $ check_ganglia_metric --gmetad_host=gmetad-server.example.com \
- --metric_host=host.example.com --metric_name=cpu_idle
- Status Ok, CPU Idle = 99.3 %|cpu_idle=99.3%;;;;
-
-The "Status Ok" message indicates that **check_ganglia_metric** is working. If
-you're having trouble getting this to work, try again with verbose logging
-enabled (``--verbose``) in order to gain better insight into what's going
-wrong.
-
-Now let's try setting an alert threshold:
-
-::
-
- $ check_ganglia_metric --gmetad_host=gmetad-server.example.com \
- --metric_host=host.example.com --metric_name=cpu_idle --critical=99
- Status Critical, CPU Idle = 99.6 %|cpu_idle=99.6%;;99;;
-
-We told **check_ganglia_metric** to return a "Critical" status if the Idle CPU
-was greater than 99. The "Status Critical" message indicates that it worked.
-Note that **check_ganglia_metric** parses ranges and thresholds according to
-the `official Nagios plugin development guidelines
-<http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT>`_.
-
-To see a complete list of command line options with brief explanations, run
-**check_ganglia_metric** with the ``--help`` option.
-
-
-Nagios Configuration
---------------------
-
-First, create a command definition:
-
-::
-
-    define command {
-        command_name check_ganglia_metric
-        command_line /usr/lib/nagios/plugins/check_ganglia_metric --gmetad_host=gmetad-server.example.com --metric_host=$HOSTADDRESS$ --metric_name=$ARG1$ --warning=$ARG2$ --critical=$ARG3$
-    }
-
-Now you can use the above command in your service definitions:
-
-::
-
-    define service {
-        service_description CPU idle - Ganglia
-        use some_template
-        check_command check_ganglia_metric!cpu_idle!0:20!0:0
-        host_name host.example.com
-    }
-
-This will work fine until something goes wrong with **check_ganglia_metric**
-(e.g. the cache file can't be read/written to, the Ganglia Meta Daemon can't be
-reached, etc.). At that point, every service that relies on
-**check_ganglia_metric** will fail, possibly inundating you with alerts. We can
-prevent this through the use of `service dependencies <http://nagios.sourceforge.net/docs/3_0/dependencies.html>`_.
-
-The first thing we need is a command definition for checking the age of a file:
-
-::
-
-    define command {
-        command_name check_file_age
-        command_line /usr/lib/nagios/plugins/check_file_age -f $ARG1$ -w $ARG2$ -c $ARG3$
-    }
-
-Next, we define a service which checks the age of **check_ganglia_metric**'s
-cache file. Note that in order to be truly effective, this service needs to be
-checked at least as frequently as (and preferably more frequently than) all the
-other checks that rely on **check_ganglia_metric**:
-
-::
-
-    define service {
-        service_description Cache for check_ganglia_metric
-        use some_template
-        check_command check_file_age!/var/lib/nagios/.check_ganglia_metric.cache!60!120
-        host_name localhost
-        check_interval 1
-        max_check_attempts 1
-    }
-
-
-
-And finally, we set up the actual service dependency. Note that I've enabled
-**use_regexp_matching** in Nagios, which allows me to use regular expressions
-in my directives. By sticking "- Ganglia" at the end of every service that
-relies on **check_ganglia_metric**, I can save myself a lot of effort:
-
-::
-
-    define servicedependency {
-        host_name localhost
-        service_description Cache for check_ganglia_metric
-        dependent_host_name .*
-        dependent_service_description .* \- Ganglia$
-        execution_failure_criteria c,p
-    }
-
-Now if something goes wrong with **check_ganglia_metric**, only one alert will
-be sent out about the cache file, and all dependent service checks will be
-paused until you fix the problem that caused **check_ganglia_metric** to fail.
-Once the problem is fixed, you'll need to update the timestamp on the cache
-file in order to put the "Cache for check_ganglia_metric" service back into an
-OK state (which will allow dependent service checks to continue):
-
-::
-
- $ touch /var/lib/nagios/.check_ganglia_metric.cache
-
-
-Tips and Tricks
----------------
-
-It's possible to get a complete list of available hosts and metrics by enabling
-"more verbose" logging (``-vv``). Since the metric_host and metric_name options
-are required, you have a little bit of a "chicken and egg" problem here, but
-that's OK. Just supply some dummy data. The plugin will error out at the end
-with a "host/metric not found" error, but not before it dumps its cache:
-
-::
-
- $ check_ganglia_metric --gmetad_host=gmetad-server.example.com \
- --metric_host=dummy --metric_name=dummy -vv
-
-
-Known Issues
-------------
-
-- Doesn't work with Python 2.4
-
-
-Author
--------
-
-Michael T. Conigliaro <mike [at] conigliaro [dot] org>
+`check_ganglia_metric <http://pypi.python.org/pypi/check_ganglia_metric/>`_ is
+a Nagios plugin that allows you to trigger alerts on any Ganglia metric.
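
Note on the removed README: it says thresholds such as `--critical=99` or `0:20` follow the Nagios plugin development guidelines, but it does not spell the rules out. The sketch below is one reading of those range rules, written in Python 3 for illustration. It is not NagAconda's implementation, and the names `parse_range` and `should_alert` are hypothetical.

```python
def parse_range(spec):
    """Split a Nagios-style range string into (low, high, alert_inside).

    Assumed rules, taken from the plugin guidelines:
      "10"     alert if value < 0 or value > 10
      "10:"    alert if value < 10
      "~:10"   alert if value > 10
      "10:20"  alert if value is outside 10..20
      "@10:20" alert if value is inside 10..20
    """
    alert_inside = spec.startswith('@')
    if alert_inside:
        spec = spec[1:]
    if ':' in spec:
        low_s, high_s = spec.split(':', 1)
    else:
        low_s, high_s = '0', spec          # a bare number means 0..number
    if low_s == '~':
        low = float('-inf')                # "~" stands for negative infinity
    else:
        low = float(low_s) if low_s else 0.0
    high = float(high_s) if high_s else float('inf')
    return low, high, alert_inside


def should_alert(value, spec):
    """Return True when value violates the range described by spec."""
    low, high, alert_inside = parse_range(spec)
    in_range = low <= value <= high
    return in_range if alert_inside else not in_range


# The README's example: cpu_idle at 99.6 with --critical=99 goes critical.
print(should_alert(99.6, '99'))
```

Under this reading, `--critical=99` means "alert when the value leaves the range 0 to 99", which matches the README's description of a Critical status once idle CPU exceeds 99.
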
384 nagios/check_ganglia_metric
@@ -1,384 +0,0 @@
-#!/usr/bin/env python
-# -*- coding: utf-8 -*-
-#
-# Ganglia metric check plugin for Nagios
-#
-# Copyright (C) 2011 by Michael T. Conigliaro <mike [at] conigliaro [dot] org>.
-# All rights reserved.
-#
-# Permission is hereby granted, free of charge, to any person obtaining a copy
-# of this software and associated documentation files (the "Software"), to deal
-# in the Software without restriction, including without limitation the rights
-# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-# copies of the Software, and to permit persons to whom the Software is
-# furnished to do so, subject to the following conditions:
-#
-# The above copyright notice and this permission notice shall be included in
-# all copies or substantial portions of the Software.
-#
-# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
-# THE SOFTWARE.
-#
-
-import logging
-import os
-import pickle
-import pprint
-import random
-import socket
-import sys
-import tempfile
-import time
-from xml.etree.cElementTree import XML
-
-try:
-    from NagAconda import Plugin
-except ImportError, e:
-    print('%s (Hint: "pip install NagAconda" or "easy_install NagAconda")' % e)
-    sys.exit(2)
-
-
-__version__ = '2011.05.03'
-
-
-class GangliaMetrics(object):
-    """Ganglia metric check class"""
-
-    class Error(Exception):
-        """Base exception for all GangliaMetrics errors"""
-
-        def __init__(self, message, log_level='error'):
-            """Log all exception messages"""
-
-            getattr(logging.getLogger(__name__), log_level)(message)
-            super(GangliaMetrics.Error, self).__init__(message)
-
-    class GmetadError(Error):
-        """Base class for all gmetad errors"""
-
-    class GmetadNetworkError(GmetadError):
-        """Raised on gmetad network errors"""
-
-    class GmetadNoDataError(GmetadError):
-        """Raised when no data is received from gmetad"""
-
-    class GmetadXmlError(GmetadError):
-        """Raised on gmetad XML parse errors"""
-
-    class CacheError(Error):
-        """Base class for all cache errors"""
-
-    class CacheExpiredError(CacheError):
-        """Raised on cache expiration"""
-
-        def __init__(self, message):
-            """Override log level for these errors"""
-
-            super(GangliaMetrics.CacheExpiredError, self).__init__(message, 'info')
-
-    class CacheReadError(CacheError):
-        """Raised on cache read errors"""
-
-    class CacheWriteError(CacheError):
-        """Raised on cache write errors"""
-
-    class CacheLockError(CacheWriteError):
-        """Raised on cache lock errors"""
-
-        def __init__(self, message):
-            """Override log level for these errors"""
-
-            super(GangliaMetrics.CacheLockError, self).__init__(message, 'info')
-
-    class StaleCacheLockError(CacheWriteError):
-        """Raised when stale cache lock is detected"""
-
-    class CacheUnlockError(CacheWriteError):
-        """Raised on cache unlock errors"""
-
-    class MetricNotFoundError(Error):
-        """Raised when metric host/name is not found"""
-
-    def __init__(self, gmetad_host, gmetad_port, gmetad_timeout, cache_path,
-                 cache_ttl, cache_ttl_splay, cache_grace, debug_level):
-        """Initialization"""
-
-        self.gmetad_host = gmetad_host
-        self.gmetad_port = int(gmetad_port)
-        self.gmetad_timeout = float(gmetad_timeout)
-        self.cache_path = cache_path
-        self.cache_lock_path = '%s.lock' % cache_path
-
-        splay_secs = float(cache_ttl) * float(cache_ttl_splay) / 2
-        self.cache_ttl = random.uniform(float(cache_ttl) - splay_secs,
-                                        float(cache_ttl) + splay_secs)
-        self.cache_grace = float(cache_grace)
-
-        # Configure debug logging
-        self.log = logging.getLogger(__name__)
-        if debug_level:
-            console_logger = logging.StreamHandler()
-            console_logger.setFormatter(
-                logging.Formatter("%(asctime)s %(levelname)s: %(message)s"))
-            self.log.addHandler(console_logger)
-            try:
-                log_level = getattr(logging, [None, 'INFO', 'DEBUG'][debug_level])
-            except IndexError:
-                log_level = logging.DEBUG
-            self.log.setLevel(log_level)
-        else:
-            self.log.addHandler(logging.FileHandler(os.devnull))
-
-    def get_value(self, metric_host, metric_name):
-        """Return a value for the specified metric host/name"""
-
-        try:
-            metrics = self._cache_read()
-        except self.CacheError:
-            try:
-                self._cache_lock()
-                try:
-                    metrics = self._gmetad_parse(self._gmetad_read())
-                    self._cache_write(metrics)
-                finally:
-                    self._cache_unlock()
-            except (self.CacheLockError, self.GmetadError):
-                self.log.info('Attempting to force read from cache')
-                metrics = self._cache_read(ignore_expiration=True)
-
-        self.log.info('Found metrics for %d hosts', len(metrics))
-
-        if self.log.isEnabledFor(logging.DEBUG):
-            self.log.debug("Dumping metrics\n%s", pprint.pformat(metrics))
-
-        if metric_host not in metrics:
-            raise self.MetricNotFoundError('Host "%s" not found' % metric_host)
-        elif metric_name not in metrics[metric_host]:
-            raise self.MetricNotFoundError('Metric "%s" for host "%s" not found' %
-                                           (metric_name, metric_host))
-
-        return metrics[metric_host][metric_name]
-
-    def _gmetad_read(self):
-        """Read XML data from Ganglia meta daemon (gmetad)"""
-
-        self.log.info('Connecting to gmetad at %s:%d',
-                      self.gmetad_host, self.gmetad_port)
-
-        try:
-            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
-        except StandardError, e:
-            raise self.GmetadNetworkError('Error while creating socket: %s' % e)
-
-        try:
-            sock.settimeout(self.gmetad_timeout)
-            sock.connect((self.gmetad_host, self.gmetad_port))
-        except StandardError, e:
-            raise self.GmetadNetworkError('Error while connecting to gmetad at %s:%d: %s' %
-                                          (self.gmetad_host, self.gmetad_port, e))
-
-        self.log.info('Reading gmetad XML')
-
-        try:
-            xml_data = ''
-            buffer = sock.recv(4096)
-            while len(buffer):
-                xml_data += buffer
-                buffer = sock.recv(4096)
-            sock.close()
-            msg = 'Read %s bytes from gmetad' % len(xml_data)
-            if len(xml_data):
-                self.log.info(msg)
-            else:
-                raise self.GmetadNoDataError('%s (Hint: Check trusted_hosts and/or all_trusted in gmetad.conf)' % msg)
-        except StandardError, e:
-            raise self.GmetadNetworkError('Error while reading gmetad XML from %s:%d: %s' %
-                                          (self.gmetad_host, self.gmetad_port, e))
-
-        return xml_data
-
-    def _gmetad_parse(self, xml_data):
-        """Parse metrics from XML data"""
-
-        self.log.info('Parsing %d bytes of gmetad XML', len(xml_data))
-
-        metrics = {}
-        try:
-            for host in XML(xml_data).findall('GRID/CLUSTER/HOST'):
-                host_name = host.get('NAME')
-                metrics[host_name] = {}
-                for metric in host.findall('METRIC'):
-                    metric_name = metric.get('NAME')
-                    metrics[host_name][metric_name] = {
-                        'units': metric.get('UNITS'),
-                        'value': metric.get('VAL')
-                    }
-                    for extra_data in metric.findall('EXTRA_DATA/EXTRA_ELEMENT'):
-                        if extra_data.get('NAME') == 'TITLE':
-                            metrics[host_name][metric_name]['title'] = extra_data.get('VAL')
-                            break  # No need for further searching
-
-        except Exception, e:
-            raise self.GmetadXmlError('Error while parsing gmetad XML: %s' % e)
-
-        return metrics
-
-    def _cache_read(self, ignore_expiration=False):
-        """Read metrics from cache"""
-
-        self.log.info('Checking cache at %s', self.cache_path)
-
-        try:
-            cache_age = time.time() - os.path.getmtime(self.cache_path)
-        except StandardError, e:
-            raise self.CacheReadError('Error while checking age of cache at %s: %s' %
-                                      (self.cache_path, e))
-
-        if cache_age > self.cache_ttl + self.cache_grace or \
-                (cache_age > self.cache_ttl and not ignore_expiration):
-            raise self.CacheExpiredError('Cache is expired by %f seconds' %
-                                         (cache_age - self.cache_ttl))
-        else:
-            self.log.info('Cache expires in %f seconds' %
-                          (self.cache_ttl - cache_age))
-            try:
-                cache = open(self.cache_path, 'rb')
-                metrics = pickle.load(cache)
-                cache.close()
-            except StandardError, e:
-                raise self.CacheReadError('Error while reading from cache at %s: %s' %
-                                          (self.cache_path, e))
-
-        return metrics
-
-    def _cache_write(self, metrics):
-        """Write metrics to cache"""
-
-        self.log.info('Updating cache at %s', self.cache_path)
-
-        try:
-            cache_tmp = tempfile.mkstemp(dir=os.path.dirname(self.cache_path))
-        except StandardError, e:
-            raise self.CacheWriteError('Error while creating temp file: %s' % e)
-
-        try:
-            self.log.info('Writing %d bytes to cache', os.write(cache_tmp[0],
-                          pickle.dumps(metrics, pickle.HIGHEST_PROTOCOL)))
-            os.close(cache_tmp[0])
-            os.rename(cache_tmp[1], self.cache_path)
-        except StandardError, e:
-            os.unlink(cache_tmp[1])
-            raise self.CacheWriteError('Error while updating cache at %s: %s' %
-                                       (self.cache_path, e))
-
-    def _cache_lock(self):
-        """Create the cache lock"""
-
-        self.log.info('Creating cache lock at %s', self.cache_lock_path)
-
-        try:
-            os.mkdir(self.cache_lock_path)
-        except StandardError, e:
-            try:
-                cache_lock_age = time.time() - os.path.getmtime(self.cache_lock_path)
-            except StandardError, e:
-                raise self.CacheLockError('Error while checking age of cache lock at %s: %s' %
-                                          (self.cache_lock_path, e))
-
-            if cache_lock_age > self.cache_ttl + self.cache_grace:
-                try:
-                    os.utime(self.cache_lock_path, None)
-                except StandardError:
-                    raise self.StaleCacheLockError('Stale cache lock found at %s' %
-                                                   self.cache_lock_path)
-            else:
-                raise self.CacheLockError('Error while creating cache lock at %s: %s' %
-                                          (self.cache_lock_path, e))
-
-    def _cache_unlock(self):
-        """Remove the cache lock"""
-
-        self.log.info('Removing cache lock at %s', self.cache_lock_path)
-
-        try:
-            os.rmdir(self.cache_lock_path)
-        except StandardError, e:
-            if os.path.exists(self.cache_lock_path):
-                raise self.CacheUnlockError('Error while removing cache lock at %s: %s' %
-                                            (self.cache_lock_path, e))
-
-
-if __name__ == '__main__':
-
-    # Initialize plugin
-    plugin = Plugin("Ganglia metric check plugin for Nagios", __version__)
-    cache_path = os.path.join(os.path.expanduser('~'), '.check_ganglia_metric.cache')
-    plugin.add_option('d', 'gmetad_host',
-                      'Ganglia meta daemon host (default: localhost)',
-                      default='localhost')
-    plugin.add_option('p', 'gmetad_port',
-                      'Ganglia meta daemon port (default: 8651)',
-                      default=8651)
-    plugin.add_option('t', 'gmetad_timeout',
-                      'Ganglia meta daemon connection/read timeout in seconds (default: 2)',
-                      default=2)
-    plugin.add_option('f', 'cache_path',
-                      'Metric cache path (default: %s)' % cache_path,
-                      default=cache_path)
-    plugin.add_option('l', 'cache_ttl',
-                      'Metric cache TTL in seconds (default: 60)',
-                      default=60)
-    plugin.add_option('s', 'cache_ttl_splay',
-                      'Metric cache TTL splay factor (default: 0.5)',
-                      default=0.5)
-    plugin.add_option('g', 'cache_grace',
-                      'Metric cache grace period in seconds (default: 60)',
-                      default=60)
-    plugin.add_option('a', 'metric_host',
-                      'Metric host address',
-                      required=True)
-    plugin.add_option('m', 'metric_name', 'Metric name', required=True)
-
-    plugin.enable_status('warning')
-    plugin.enable_status('critical')
-
-    plugin.start()
-
-    # Execute check
-    try:
-        metrics = GangliaMetrics(gmetad_host=plugin.options.gmetad_host,
-                                 gmetad_port=plugin.options.gmetad_port,
-                                 gmetad_timeout=plugin.options.gmetad_timeout,
-                                 cache_path=plugin.options.cache_path,
-                                 cache_ttl=plugin.options.cache_ttl,
-                                 cache_ttl_splay=plugin.options.cache_ttl_splay,
-                                 cache_grace=plugin.options.cache_grace,
-                                 debug_level=plugin.options.verbose)
-
-        value = metrics.get_value(metric_host=plugin.options.metric_host,
-                                  metric_name=plugin.options.metric_name)
-
-        plugin.set_status_message('%s = %s %s' % (value['title'],
-                                                  value['value'],
-                                                  value['units']))
-
-        if value['units'].upper() in ('B', 'KB', 'MB', 'GB', 'TB') or \
-           value['units'].lower() in ('s', 'ms', 'us', 'ns', '%'):
-            plugin.set_value(plugin.options.metric_name, value['value'],
-                             scale=value['units'])
-        else:
-            plugin.set_value(plugin.options.metric_name, value['value'])
-
-    except (GangliaMetrics.MetricNotFoundError), e:
-        plugin.unknown_error(str(e))
-    except (Exception), e:
-        print(e)
-        sys.exit(2)
-
-    # Print results
-    plugin.finish()
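
Note on the removed script's cache timing: the constructor randomizes the TTL around `cache_ttl` by a splay factor, and `_cache_read` then treats the cache as expired in two cases. The sketch below restates that arithmetic as standalone Python 3 functions so the numbers are easier to check. The function names `splayed_ttl` and `is_expired` are hypothetical. The formulas are copied from the removed code.

```python
import random


def splayed_ttl(cache_ttl, cache_ttl_splay):
    """Pick a TTL uniformly within +/- (ttl * splay / 2) of the nominal TTL."""
    splay_secs = float(cache_ttl) * float(cache_ttl_splay) / 2
    return random.uniform(float(cache_ttl) - splay_secs,
                          float(cache_ttl) + splay_secs)


def is_expired(cache_age, ttl, grace, ignore_expiration=False):
    """Mirror the expiry test in _cache_read.

    The cache is always expired once it is older than ttl + grace. Between
    ttl and ttl + grace it is expired only for a normal read; a forced read
    (ignore_expiration=True) may still use it.
    """
    return cache_age > ttl + grace or (cache_age > ttl and not ignore_expiration)


# With the script's defaults (TTL 60 s, splay 0.5) the effective TTL falls
# somewhere between 45 and 75 seconds.
ttl = splayed_ttl(60, 0.5)
print(45.0 <= ttl <= 75.0)
```

Because each run draws its own TTL, concurrent checks presumably do not all see the cache expire at the same instant, so fewer of them try to refresh it from gmetad at once.
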

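Note on the removed `_gmetad_parse` method: it walks `GRID/CLUSTER/HOST` elements and collects each `METRIC`'s `UNITS`, `VAL`, and optional `TITLE`. The sketch below runs the same traversal in Python 3 against a small hand-written document. `SAMPLE_XML` is a hypothetical, heavily trimmed stand-in for real gmetad output, which carries many more attributes, and `parse_metrics` is an illustrative name rather than part of the plugin.

```python
from xml.etree.ElementTree import XML

# Hypothetical, trimmed sample. Only the element and attribute names that the
# removed parser reads are included.
SAMPLE_XML = """<GANGLIA_XML>
  <GRID NAME="example-grid">
    <CLUSTER NAME="example-cluster">
      <HOST NAME="host.example.com">
        <METRIC NAME="cpu_idle" VAL="99.3" UNITS="%">
          <EXTRA_DATA>
            <EXTRA_ELEMENT NAME="GROUP" VAL="cpu"/>
            <EXTRA_ELEMENT NAME="TITLE" VAL="CPU Idle"/>
          </EXTRA_DATA>
        </METRIC>
      </HOST>
    </CLUSTER>
  </GRID>
</GANGLIA_XML>"""


def parse_metrics(xml_data):
    """Build {host: {metric: {'units', 'value', 'title'}}} from gmetad-style XML."""
    metrics = {}
    for host in XML(xml_data).findall('GRID/CLUSTER/HOST'):
        host_metrics = metrics.setdefault(host.get('NAME'), {})
        for metric in host.findall('METRIC'):
            entry = {'units': metric.get('UNITS'), 'value': metric.get('VAL')}
            # The human-readable title lives in an EXTRA_ELEMENT named TITLE.
            for extra in metric.findall('EXTRA_DATA/EXTRA_ELEMENT'):
                if extra.get('NAME') == 'TITLE':
                    entry['title'] = extra.get('VAL')
                    break
            host_metrics[metric.get('NAME')] = entry
    return metrics


print(parse_metrics(SAMPLE_XML)['host.example.com']['cpu_idle'])
```

Values stay as strings here, as they do in the removed code. Any numeric conversion happens later, when the value is handed to the plugin framework.
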