This repository has been archived by the owner on Apr 30, 2024. It is now read-only.
Mauricio Teixeira edited this page Dec 12, 2018 · 6 revisions

Welcome to the tower-nagios-integration wiki!

This repository will contain various documentation and scripts that help integrate Ansible Tower with Nagios.

tower_handler.py

A script to be used as a Nagios event handler to trigger jobs in Ansible Tower.

Overview

Many people use event handlers in Nagios to fix problems preemptively, before anyone is alerted. There are limitations on how those handlers can or should be deployed, and on what kinds of actions they can execute. Using Ansible Tower to execute recovery tasks gives much more flexibility and provides better integration with the established automation environment. In addition, statistics can easily be generated using Tower's built-in capabilities.

This script runs on the Nagios server and uses tower-cli to trigger jobs in Ansible Tower. Since those jobs are standard Ansible playbooks running from within Ansible Tower, they can easily serve as a service self-healing method: they run the same playbooks that your operations or DevOps team would already use to recover the service. And since those playbooks run outside the failed host, they can be used to reboot, re-provision or even auto-scale (provided your Ansible Tower has already been properly configured for those tasks).

Red Hat IT developed this script to reduce the burden on the operations team by automatically fixing problems without human intervention and shortening the time to recover.

This is not a silver bullet; it will not solve all your problems. It is merely a tool to help you automate your event management and service recovery.

Software requirements

  • Python 2.7
  • Nagios 3.5 or higher
  • Ansible Tower 3.2 or higher

Configuration requirements

By the time you arrive here, you may already have everything you need to run this script. The requirements are listed below, but this document does not explain how to meet them. Please refer to the documentation of each technology.

  • Ansible Tower
    • Username/password to be used by Nagios.
    • At least one inventory and one job template.
    • It is highly advisable, though not required, that your job template has the inventory "prompt on launch" check box marked.
  • Nagios
    • tower-cli installed and configured with the proper credentials.
      • HINT: On RHEL7 you can install python2-ansible-tower-cli from EPEL

Installation

Copy tower_handler.py into the directory where your event handler scripts should run (as defined by your configuration).

Test your environment

First of all, make sure tower-cli is working properly. The minimum viable test is this:

# tower-cli job list
===== ============ ======================== ========== ======= 
 id   job_template         created            status   elapsed 
===== ============ ======================== ========== ======= 
    1           1  2018-10-03T18:30:00.000Z successful  42.000
===== ============ ======================== ========== =======

To confirm that the handler itself is working, you can trigger a job from the command line:

# /path/to/tower_handler.py --template <my_template> --inventory <my_inventory> --attempt 2

If successful, the script will not produce any output, but you will see a job on your Ansible Tower Jobs tab (or in the job list, if you repeat the tower-cli command above).

Command line options

Although this script was written to be used as a Nagios event handler, it can also be used from the command line (though that is a little more complicated than using tower-cli directly).

It is important to know all the available command line options, because you will need them to define your own Nagios handlers. How you use those options determines how easy or hard the handler is to consume.

# /path/to/tower_handler.py --help
usage: tower_handler.py [-h] --template TEMPLATE --inventory INVENTORY
                        [--playbook PLAYBOOK] [--extra_vars EXTRA_VARS]
                        [--limit LIMIT] [--state STATE] [--attempt ATTEMPT]
                        [--downtime DOWNTIME] [--host_downtime DOWNTIME]
                        [--service SERVICE]
                        [--hostname HOSTNAME] [--warning]

optional arguments:
  -h, --help               show this help message and exit
  --template TEMPLATE      Job template (number or name)
  --inventory INVENTORY
                           Inventory (number or name)
  --playbook PLAYBOOK      Playbook to run (yaml file inside template)
  --extra_vars EXTRA_VARS
                           Extra variables (JSON)
  --limit LIMIT            Limit run to these hosts (group name, or comma
                           separated hosts)
  --state STATE            Nagios check state
  --attempt ATTEMPT        Nagios check attempt
  --downtime DOWNTIME      Nagios service downtime check
  --host_downtime DOWNTIME Nagios host downtime check
  --service SERVICE        Nagios alerting service
  --hostname HOSTNAME      Nagios alerting hostname
  --warning                Trigger on WARNING (otherwise just CRITICAL and
                           UNKNOWN)
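
For orientation, the option list above could be declared with Python's argparse roughly as follows. This is only a sketch reconstructed from the help text, not the script's actual source: the function name build_parser is hypothetical, and leaving every value as a plain string (including --attempt) is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented tower_handler.py options.

    Hypothetical reconstruction from the --help output; the real script
    may declare its options differently.
    """
    parser = argparse.ArgumentParser(prog="tower_handler.py")
    # The usage line shows these two without brackets, so they are required.
    parser.add_argument("--template", required=True,
                        help="Job template (number or name)")
    parser.add_argument("--inventory", required=True,
                        help="Inventory (number or name)")
    parser.add_argument("--playbook",
                        help="Playbook to run (yaml file inside template)")
    parser.add_argument("--extra_vars", help="Extra variables (JSON)")
    parser.add_argument("--limit",
                        help="Limit run to these hosts (group name, or "
                             "comma separated hosts)")
    parser.add_argument("--state", help="Nagios check state")
    parser.add_argument("--attempt", help="Nagios check attempt")
    parser.add_argument("--downtime", help="Nagios service downtime check")
    parser.add_argument("--host_downtime", metavar="DOWNTIME",
                        help="Nagios host downtime check")
    parser.add_argument("--service", help="Nagios alerting service")
    parser.add_argument("--hostname", help="Nagios alerting hostname")
    # A flag without a value: present means True, absent means False.
    parser.add_argument("--warning", action="store_true",
                        help="Trigger on WARNING (otherwise just CRITICAL "
                             "and UNKNOWN)")
    return parser
```

With this sketch, build_parser().parse_args(["--template", "My Template", "--inventory", "My Inventory", "--attempt", "2"]) yields a namespace whose warning attribute is False and whose limit attribute is None.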

Nagios configuration

There are many ways to configure Nagios to use this script. Here are some suggestions.

Example 1 - short call to the handler, wide impact

This will trigger the job run against all the hosts in the specified inventory.

/etc/nagios/conf.d/eventhandlers.cfg
define command {
    command_name        tower-handler-min
    # when playbook does not require extra_vars, and you want to run on full inventory
    command_line        $HANDLERS$/tower_handler.py --state '$SERVICESTATE$' --attempt '$SERVICEATTEMPT$' --downtime '$SERVICEDOWNTIME$' --host_downtime '$HOSTDOWNTIME$' --service '$SERVICEDESC$' --hostname '$HOSTADDRESS$' --template '$ARG1$' --inventory '$ARG2$'
}

/etc/nagios/hosts.d/server01.example.com.cfg
define service {
    use                         generic-service
    host_name                   server01.example.com
    service_description         MyAppService
    contact_groups              it-production
    check_command               check_myappservice
    event_handler               tower-handler-min!My Template!My Inventory
}

Example 2 - longer call to the handler, more precise action

This allows the use of all parameters during the handler call, which provides more information to the job template, allowing for more precise action.

/etc/nagios/conf.d/eventhandlers.cfg
define command {
    command_name        tower-handler-full
    command_line        $HANDLERS$/tower_handler.py --state '$SERVICESTATE$' --attempt '$SERVICEATTEMPT$' --downtime '$SERVICEDOWNTIME$' --host_downtime '$HOSTDOWNTIME$' --service '$SERVICEDESC$' --hostname '$HOSTADDRESS$' --template '$ARG1$' --inventory '$ARG2$' --extra_vars '$ARG3$' --limit '$ARG4$'
}

/etc/nagios/hosts.d/server01.example.com.cfg
define service {
    use                         generic-service
    host_name                   server01.example.com
    service_description         MyAppService
    contact_groups              it-production
    check_command               check_myappservice
    event_handler               tower-handler-full!My Template!My Inventory!my_variable: value!<fqdn>
}

Note: in this case, <fqdn> can be either the host itself, or a totally different host, as long as it exists in the inventory.
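
To make the argument passing concrete: Nagios splits the event_handler value on "!"; the first field names the command, and the remaining fields become $ARG1$, $ARG2$ and so on, which are substituted into command_line. The Python sketch below imitates that substitution for illustration only. expand_handler is a hypothetical helper, not part of Nagios or tower_handler.py; it handles only $ARGn$ macros, and it assumes a missing argument expands to an empty string. The host server01.example.com stands in for <fqdn>.

```python
import re


def expand_handler(event_handler, command_lines):
    """Imitate how Nagios fills $ARGn$ macros from an event_handler value.

    command_lines maps a command_name to its command_line template.
    Other macros (for example $SERVICESTATE$) are left untouched.
    """
    name, *args = event_handler.split("!")
    template = command_lines[name]

    def substitute(match):
        index = int(match.group(1)) - 1
        # Assumption: an $ARGn$ with no matching field becomes empty.
        return args[index] if 0 <= index < len(args) else ""

    return re.sub(r"\$ARG(\d+)\$", substitute, template)


# Shortened command_line template, keeping only the $ARGn$ parts.
commands = {
    "tower-handler-full": (
        "tower_handler.py --template '$ARG1$' --inventory '$ARG2$' "
        "--extra_vars '$ARG3$' --limit '$ARG4$'"
    ),
}

print(expand_handler(
    "tower-handler-full!My Template!My Inventory"
    "!my_variable: value!server01.example.com",
    commands,
))
```

Because the last two arguments are optional on the Nagios side, a shorter event_handler value with this command definition would pass empty strings to --extra_vars and --limit, which is one reason to keep a separate minimal command such as tower-handler-min.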

Useful variations

  • Run against the host itself -- By adding --limit '$HOSTADDRESS$' to the command definition, the job will run only against the host which called the handler.
  • Run in WARNING state -- By default, the script only runs when the alert is in CRITICAL or UNKNOWN state. Adding --warning to the command definition will allow it to trigger during a WARNING state.
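
The WARNING rule above amounts to a small state check. A minimal sketch, assuming the script compares the Nagios state string directly: should_trigger is a hypothetical name, and the real script may also take the attempt and downtime values into account, which are not modelled here.

```python
def should_trigger(state, warning=False):
    """Return True if a job should be launched for this Nagios state.

    Hypothetical helper; it models only the documented state rule.
    """
    # By default only CRITICAL and UNKNOWN trigger a job.
    triggering = {"CRITICAL", "UNKNOWN"}
    if warning:
        # The --warning flag additionally allows WARNING.
        triggering.add("WARNING")
    return state in triggering
```

Under these assumptions, should_trigger("WARNING") is False while should_trigger("WARNING", warning=True) is True, and an OK state never triggers a job.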