Nagios Integration

Vladimir Vuksan edited this page Jul 20, 2016 · 5 revisions

Ganglia Nagios Integration

NOTE: Please see the monitor-core wiki page on Integrating Ganglia with Nagios for an overview of different approaches to letting the two pieces of software communicate. This page is specifically about using the ganglia-web Nagios integration.

Ganglia Nagios integration is a new feature included with Ganglia Web 2.2.0+. It is based on the following implementation:

http://vuksan.com/linux/nagios_scripts.html#check_ganglia_metrics

with the exception that it uses a shell script wrapper, which is more efficient because the PHP interpreter doesn't need to be spawned each time a metric is checked.
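To make the wrapper idea concrete, here is a minimal sketch of what a curl-based wrapper could look like. It is not the script shipped in the tarball. The function names `build_query`, `status_to_exit_code` and `run_check` are hypothetical, and the response format (a status word followed by `|`) is an assumption made for illustration. The exit codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/sh
# Sketch of a curl-based wrapper; all function names are hypothetical.

GANGLIA_URL="http://localhost/ganglia2/nagios/check_heartbeat.php"

# Join key=value arguments into a URL query string: a=1 b=2 -> a=1&b=2
build_query() {
    query=""
    for arg in "$@"; do
        if [ -z "$query" ]; then
            query="$arg"
        else
            query="${query}&${arg}"
        fi
    done
    printf '%s\n' "$query"
}

# Map a status word to the standard Nagios plugin exit code.
status_to_exit_code() {
    case "$1" in
        OK)       echo 0 ;;
        WARNING)  echo 1 ;;
        CRITICAL) echo 2 ;;
        *)        echo 3 ;;  # UNKNOWN, or anything unexpected
    esac
}

# Fetch the check result over HTTP and exit with the matching code.
# Assumption: the response starts with a status word followed by "|".
run_check() {
    result=$(curl -s "${GANGLIA_URL}?$(build_query "$@")")
    status=${result%%|*}
    echo "$result"
    exit "$(status_to_exit_code "$status")"
}

# Example (needs a reachable Ganglia Web install):
#   run_check host=web-01 threshold=50
```

The sketch does no URL encoding, so argument values containing characters such as `&` or spaces would need to be encoded first.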

There are 5 different Ganglia checks:

  • Check heartbeat
  • Check single metric on a specific host
  • Check multiple metrics on a specific host
  • Check multiple metrics on a range of hosts defined with a regular expression
  • Check that one or more metrics have the same value on a set of hosts

Check Heartbeat

Ganglia uses heartbeat packets to determine if a machine has gone down. The time since the last heartbeat is reset every time a new packet is received. This check saves you from having to use things like check_ping to make sure a machine is alive. To use this check, copy the check_heartbeat.sh script from the nagios subdirectory in the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default:

GANGLIA_URL="http://localhost/ganglia2/nagios/check_heartbeat.php"

Define the check command in Nagios. The threshold is the number of seconds since the last reported heartbeat after which a critical alert is raised.

define command {
  command_name  check_ganglia_heartbeat
  command_line  /bin/sh /var/www/html/ganglia/nagios/check_heartbeat.sh host=$HOSTNAME$ threshold=$ARG1$
}

Now, for every host you want monitored, change check_command to:

check_command	check_ganglia_heartbeat!50

This will mark any node that reported to Ganglia 50 seconds or more ago as CRITICAL.
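The threshold comparison can be sketched as follows. This is an illustration only: `heartbeat_state` is a hypothetical helper, and the "greater than or equal" comparison is inferred from the "50 seconds or more" wording above rather than taken from the shipped script.

```shell
# Hypothetical helper: classify a host from the age of its last heartbeat.
# $1 = seconds since the last heartbeat, $2 = threshold in seconds
heartbeat_state() {
    if [ "$1" -ge "$2" ]; then
        echo "CRITICAL"
    else
        echo "OK"
    fi
}

heartbeat_state 12 50   # prints OK
heartbeat_state 50 50   # prints CRITICAL
heartbeat_state 300 50  # prints CRITICAL
```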

Check single metric on a specific host

To use it, copy the check_ganglia_metric.sh script from the nagios subdirectory in the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default:

GANGLIA_URL="http://localhost/ganglia2/nagios/check_metric.php"

The Nagios configuration consists of defining the following command:

define command {
 command_name  check_ganglia_metric
 command_line  /bin/sh /var/www/html/ganglia/nagios/check_ganglia_metric.sh host=$HOSTNAME$ metric_name=$ARG1$ operator=$ARG2$ critical_value=$ARG3$
}

Now you can use it in a service check. For instance, to be alerted when the 1-minute load average goes over 5, add the following directive:

    check_command			check_ganglia_metric!load_one!more!5

To be alerted when free disk space drops below 10 GB:

    check_command			check_ganglia_metric!disk_free!less!10

Keep in mind that the operator defines what counts as the "critical" state. For instance, if you use notequal, the state is critical when the value is NOT equal to critical_value.
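The operator semantics can be sketched like this. `metric_state` is a hypothetical helper, not part of Ganglia Web. The operators `more`, `less` and `notequal` appear on this page, while `equal` and the UNKNOWN fallback for unrecognized operators are assumptions of the sketch.

```shell
# Hypothetical helper: decide the state for one metric value.
# $1 = current value, $2 = operator, $3 = critical_value
metric_state() {
    case "$2" in
        more)     cond='v > c' ;;
        less)     cond='v < c' ;;
        equal)    cond='v == c' ;;
        notequal) cond='v != c' ;;
        *)        echo "UNKNOWN"; return ;;
    esac
    # awk handles floating-point values such as a load average of 5.25.
    if awk -v v="$1" -v c="$3" "BEGIN { exit !($cond) }"; then
        echo "CRITICAL"
    else
        echo "OK"
    fi
}

metric_state 7.2 more 5     # prints CRITICAL (load_one above 5)
metric_state 42 less 10     # prints OK (42 GB free is not below 10)
metric_state 3 notequal 4   # prints CRITICAL (value differs from 4)
```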

Check multiple metrics on a specific host

Check multiple metrics is a modification of the check single metric script. It checks multiple metrics on the same host. For example, instead of having separate checks for disk utilization on /, /tmp and /var, which may produce three separate alerts, you get a single alert any time one of them crosses its threshold.

To use it, copy the check_multiple_metrics.sh script from the nagios subdirectory in the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default:

GANGLIA_URL="http://localhost/ganglia2/nagios/check_multiple_metrics.php"

Then define a check command in Nagios

define command {
   command_name  check_ganglia_multiple_metrics
   command_line  /bin/sh /var/www/html/ganglia/nagios/check_multiple_metrics.sh host=$HOSTNAME$ checks='$ARG1$'
}

Then add a list of checks delimited with a colon (:). Each check consists of:

metric_name,operator,critical_value 

e.g.

check_command		check_ganglia_multiple_metrics!disk_free_rootfs,less,10:disk_free_tmp,less,20
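To illustrate the format, the checks string from the example above can be split first on `:` and then on `,`. The helper `parse_checks` below is hypothetical and only shows how the string breaks down. It is not how the PHP side is implemented.

```shell
# Hypothetical helper: split a checks string into one line per check.
# $1 = checks string, e.g. "disk_free_rootfs,less,10:disk_free_tmp,less,20"
parse_checks() {
    printf '%s\n' "$1" | tr ':' '\n' |
    while IFS=',' read -r name op value; do
        printf 'metric_name=%s operator=%s critical_value=%s\n' \
            "$name" "$op" "$value"
    done
}

parse_checks 'disk_free_rootfs,less,10:disk_free_tmp,less,20'
# metric_name=disk_free_rootfs operator=less critical_value=10
# metric_name=disk_free_tmp operator=less critical_value=20
```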

WARNING: A drawback of checking multiple metrics in one check is that in certain instances you may not be aware of the scale of a problem. For example, say you get an alert for /tmp nearing full. You get this alert over the weekend, so you figure it's not THAT critical. After the alert, /var starts rapidly filling up, which may be really serious. Unfortunately, you will not get another alert (unless you have an aggressive notification interval). Beware.

Check multiple metrics on a range of hosts defined with a regular expression

Use this check to check one or more metrics on a range of hosts defined using a regular expression. This is useful when you want to get a single alert if a particular metric is critical across a number of hosts.

To use it, copy the check_host_regex.sh script from the nagios subdirectory in the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default:

GANGLIA_URL="http://localhost/ganglia2/nagios/check_host_regex.php"

Then define a check command in Nagios

define command {
  command_name  check_ganglia_host_regex
  command_line  /bin/sh /usr/share/ganglia-web2/nagios/check_host_regex.sh hreg='$ARG1$' checks='$ARG2$'
}

Then add a list of checks delimited with a colon (:). Each check consists of:

metric_name,operator,critical_value 

For example, to check free space on / and /tmp for any machine whose name starts with web- or app-, you would use something like this:

check_command		check_ganglia_host_regex!^web-|^app-!disk_free_rootfs,less,10:disk_free_tmp,less,10
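To preview which hosts a pattern would select, you can try it against a list of host names. The sketch below uses `grep -E`, whose extended regular expressions may differ in detail from the regex engine used on the PHP side. The helper `match_hosts` and the host names are made up for illustration.

```shell
# Hypothetical helper: keep only the host names (one per line on stdin)
# that match the given extended regular expression.
match_hosts() {
    grep -E -- "$1"
}

printf '%s\n' web-01 app-02 db-01 webcache-01 | match_hosts '^web-|^app-'
# web-01
# app-02
```

Note that webcache-01 is not selected, because the pattern requires the hyphen right after web.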

DOWNSIDES: As with checking multiple metrics on a single host, the downside of this approach is that in certain situations the scale of a problem may not be apparent, since only a single alert is generated. Also, because Nagios and Ganglia are currently decoupled, you may get an alert while a machine is under scheduled maintenance if, for example, you start writing to /tmp.

Check value(s) is same on a set of hosts

Use this check to verify that one or more metrics have the same value across a range of hosts. For example, say you want to make sure the SVN revision of the deployed code is the same across all servers. You would send the SVN revision to Ganglia as, for example, a string metric, then list it as a metric that needs to be the same everywhere.

To use it, copy the check_value_same_everywhere.sh script from the nagios subdirectory in the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default:

GANGLIA_URL="http://localhost/ganglia2/nagios/check_value_same_everywhere.php"

Then define a check command in Nagios

define command {
  command_name  check_ganglia_value_same_everywhere
  command_line  /bin/sh /usr/share/ganglia-web2/nagios/check_value_same_everywhere.sh hreg='$ARG1$' checks='$ARG2$'
}

For example:

check_command		check_ganglia_value_same_everywhere!^web-|^app-!svn_revision,num_config_files
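The idea behind this check can be sketched as counting the distinct values reported for a metric. The helper `values_same_state` below is hypothetical, as are the host names and revision numbers. Returning UNKNOWN when no hosts report is a choice made for this sketch, not documented behavior.

```shell
# Hypothetical helper: read "host value" pairs on stdin and report whether
# every host has the same value for the metric.
values_same_state() {
    awk '
        !($2 in seen) { seen[$2] = 1; distinct++ }
        END {
            if (NR == 0) { print "UNKNOWN" }         # no hosts reported
            else if (distinct == 1) { print "OK" }
            else { print "CRITICAL" }
        }
    '
}

printf '%s\n' 'web-01 1482' 'web-02 1482' 'app-01 1482' | values_same_state
# OK
printf '%s\n' 'web-01 1482' 'web-02 1482' 'app-01 1479' | values_same_state
# CRITICAL
```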