Nagios and NagiosGrapher
http://DEFLECT_CONTROLLER/nagios3/index.php - System monitoring and
Nagios runs as a daemon in two places, and Nagios agents run on every other host (i.e. the edges). The first Nagios instance is on DEFLECT CONTROLLER; it collects data about all services and alerts us when things go down or cross configured thresholds. The second Nagios server instance is still to be set up on backup, first to monitor DEFLECT CONTROLLER's Nagios and eventually to become a live standby.
The most important thing for us to monitor is availability of content, over HTTP, on a per-edge basis.
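A per-edge availability check could be sketched as a small Nagios-style plugin. This is a minimal sketch, not our deployed check: it assumes each edge serves a site when the site's name is sent in the Host header, and the function names, default thresholds and message wording are made up for illustration. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin convention.

```python
import http.client
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(status, elapsed, warn_s=2.0, crit_s=5.0):
    """Map an HTTP status and response time to a Nagios state and message.

    The warn_s/crit_s defaults are illustrative, not our configured values.
    """
    if status is None:
        return CRITICAL, "HTTP CRITICAL - no response"
    if status >= 400:
        return CRITICAL, f"HTTP CRITICAL - status {status}"
    if elapsed >= crit_s:
        return CRITICAL, f"HTTP CRITICAL - {elapsed:.2f}s response time"
    if elapsed >= warn_s:
        return WARNING, f"HTTP WARNING - {elapsed:.2f}s response time"
    return OK, f"HTTP OK - status {status} in {elapsed:.2f}s"


def check_via(host, site, path="/", timeout=10.0):
    """Fetch `path` for `site` by connecting to `host` (an edge or the origin)."""
    start = time.monotonic()
    try:
        conn = http.client.HTTPConnection(host, timeout=timeout)
        # Send the site's name in the Host header so that `host` serves that site.
        conn.request("GET", path, headers={"Host": site})
        status = conn.getresponse().status
        conn.close()
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure or a malformed response.
        status = None
    return classify(status, time.monotonic() - start)


def main(argv):
    """Plugin entry point: argv is [program, HOST, SITE, optional PATH]."""
    if len(argv) < 3:
        print("usage: check_edge_http.py HOST SITE [PATH]")
        return UNKNOWN
    code, message = check_via(*argv[1:4])
    print(message)
    return code
```

To use it as a plugin, end the script with `sys.exit(main(sys.argv))` and define one service per edge, plus one that points at the origin directly. Nagios reads the first line of output and the exit code.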
NagiosGrapher collects data using RRD and presents it in graph format.
Monitor an object on the origin too, both directly and via each edge.
Either the HTTP object check or the availability of the NRPE agent should be configured as the parent for each host. A parent test works like this: if 'children' fail while their parent is down, they do not issue alerts, though they still show as RED on the Nagios web interface. This streamlines diagnosis when a lot of tests fail at once.
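The parent rule can be illustrated with a few lines of Python. This is only a model of the behaviour described above, not Nagios's own implementation (in Nagios the relationship is declared in the object configuration). The check names are hypothetical, and walking the whole ancestor chain rather than just the direct parent is an assumption of this sketch.

```python
def alerts_to_send(failed, parent_of):
    """Return the failed checks that should actually raise an alert.

    failed    -- set of check names currently failing
    parent_of -- dict mapping a check name to its parent check name
    """
    notify = []
    for check in sorted(failed):
        # Walk up the parent chain; any failing ancestor suppresses the alert.
        ancestor = parent_of.get(check)
        seen = set()  # guards against accidental cycles in the mapping
        suppressed = False
        while ancestor is not None and ancestor not in seen:
            if ancestor in failed:
                suppressed = True
                break
            seen.add(ancestor)
            ancestor = parent_of.get(ancestor)
        if not suppressed:
            notify.append(check)
    return notify


# Hypothetical check names for one edge, with the NRPE agent as parent.
parent_of = {
    "edge1/disk": "edge1/nrpe",
    "edge1/logs": "edge1/nrpe",
    "edge1/cache": "edge1/nrpe",
}
failed = {"edge1/nrpe", "edge1/disk", "edge1/logs"}
print(alerts_to_send(failed, parent_of))  # ['edge1/nrpe']
```

All three checks are failing, but only the parent alerts, which points straight at the likely root cause.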
DNS check against the nameservers, possibly an integrated DNS-then-HTTP check
disk space, log size, log currency, cache size, traffic-to-origin (per origin?), traffic-to-edge, variation in traffic levels to different edges
already being recorded per minute, but not yet fed into monitoring.
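Most of the metrics listed above reduce to comparing a number against warning and critical thresholds. A minimal sketch for disk space follows, assuming that higher values are worse. The 80% and 90% thresholds and the function names are illustrative only. The text after the `|` follows the standard Nagios performance data format (`label=value[UOM];warn;crit`).

```python
import shutil

# Standard Nagios plugin exit codes (UNKNOWN = 3 is not needed here).
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def state_for(value, warn, crit):
    """Return the Nagios state for a metric where higher values are worse."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn_pct=80.0, crit_pct=90.0):
    """Check the percentage of space used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    state = state_for(used_pct, warn_pct, crit_pct)
    # Everything after "|" is performance data that graphing tools can record.
    message = (
        f"DISK {LABELS[state]} - {used_pct:.1f}% used"
        f" | used_pct={used_pct:.1f}%;{warn_pct};{crit_pct}"
    )
    return state, message
```

The same `state_for` helper would serve for log size or cache size. Metrics where low values are the problem, such as traffic dropping to one edge, would need the comparison reversed.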
The second instance runs (will run!) on backup, and its job is just to check that the first instance of Nagios is functioning. The configuration of the two Nagios instances will be such that if the primary goes down, the secondary continues monitoring in the interim. For the time being, the secondary just monitors the primary and alerts if it goes down.
HTTP request volumes
logstash and kibana
Under development. We use a wiki-based incident system and email reporting of other events.