Infrastructure Monitoring As Code


Critical

Monitoring should be a layer in the stack, not an application.

Installing

gem install rspec

git clone git://github.com/danielsdeleo/critical.git

Manifesto

Critical is my take on network/infrastructure monitoring. Here are the big ideas:

  • Infrastructure as code: The monitoring system should be an internal DSL so it can natively interact with any part of your infrastructure you can find or write a library for. You should also be able to productively alter its guts if you need to. This is a monitoring system for ops people who write code and coders who do ops.

  • Client-based: This scales better, and is actually easier to configure if you use configuration management, which you should be doing anyway.

  • Continuous verification: Critical has a single-shot mode in addition to the typical daemonized operation. This lets you verify the configuration on a host after making changes and then continuously monitor the state of the system using the same verification tests.

  • Declarative: Declare what the state of your system is supposed to be.

  • Alerting and Trending together: a client/agent can do both of these at the same time with less configuration overhead. It makes sense to keep them separate on the server side.

  • Licensing: “Do what thou wilt shall be the whole of the law,” except for patent trolls, etc. So, Apache 2.0 it is.

Design

Critical runs as a cluster of daemons. The master process handles scheduling and assigns tasks to workers over a UNIX domain socket. The workers listen on the socket and process tasks as they arrive. I also considered an evented architecture (using EventMachine), but that would have required users to write plugins using only EM-based libraries or risk running into problems with blocking IO.
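The master/worker split can be sketched in plain Ruby. This is only an illustration of the idea, not Critical's actual code: the method names `worker_loop` and `dispatch_tasks` are hypothetical, the "work" is just an echo, and each worker gets its own socket pair instead of sharing one listening socket.

```ruby
require 'socket'

# Hypothetical worker loop: read one task name per line and reply with a
# result line. A real worker would run the metric collection here.
def worker_loop(socket)
  while (task = socket.gets)
    socket.puts("done:#{task.chomp}:#{Process.pid}")
  end
ensure
  socket.close
end

# Hypothetical master: fork +worker_count+ workers, hand tasks out
# round-robin over UNIX domain sockets, and collect one reply per task.
def dispatch_tasks(tasks, worker_count)
  workers = []
  worker_count.times do
    master_end, worker_end = UNIXSocket.pair
    pid = fork do
      # The child only needs its own end of its own pair, so close the
      # master ends it inherited (otherwise EOF is never seen later).
      master_end.close
      workers.each { |w| w[:socket].close }
      worker_loop(worker_end)
      exit!(0)
    end
    worker_end.close
    workers << { :pid => pid, :socket => master_end }
  end

  results = tasks.each_with_index.map do |task, i|
    worker = workers[i % worker_count]
    worker[:socket].puts(task)
    worker[:socket].gets.chomp
  end

  workers.each do |w|
    w[:socket].close       # EOF ends the worker loop
    Process.wait(w[:pid])
  end
  results
end
```

Because each worker is a separate process, a plugin that blocks on IO only stalls its own worker, which is the property the evented design would not have given for free.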

Metric DSL

Critical provides a DSL for writing metric gathering code. It looks like this:

Metric(:memory_utilization) do
  case RUBY_PLATFORM
  when /darwin/
    # omitted...

  when /linux/

    collects 'free -b'

    # +free -b+ reports bytes; free memory sits in the column after used.
    reports(:bytes_free => :int) do
      result.line(2).split[3]
    end

    reports(:bytes_used => :int) do
      result.line(2).split[2]
    end

  else
    raise UnsupportedPlatform, "memory_utilization does not have an implementation for your platform yet :("
  end

  reports(:kb_free => :integer) do
    bytes_free / 1024
  end

  reports(:kb_used => :integer) do
    pp :kb_used => (bytes_used / 1024)
    bytes_used / 1024
  end

  reports(:mb_free => :float) do
    kb_free / 1024.0
  end

  reports(:mb_used => :float) do
    kb_used.to_f / 1024.0
  end

end
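To make the unit conversions concrete, here is a standalone sketch in plain Ruby that does not use Critical's DSL at all. The sample output and the helper name `memory_report` are assumptions for illustration; the column layout of `free -b` varies between procps versions, so the field positions below are not guaranteed on every system.

```ruby
# Made-up sample in the shape of `free -b` output (values are in bytes).
SAMPLE_FREE_OUTPUT = <<~OUT
               total        used        free      shared  buff/cache   available
  Mem:    8589934592  4294967296  2147483648    10485760  2147483648  4000000000
  Swap:   2147483648           0  2147483648
OUT

# Hypothetical helper: pick the "Mem:" row and derive the same values the
# metric above reports (bytes, then KB, then MB).
def memory_report(free_output)
  mem_row = free_output.lines.find { |line| line.lstrip.start_with?('Mem:') }
  raise ArgumentError, 'no Mem: row found' unless mem_row

  fields = mem_row.split            # ["Mem:", total, used, free, ...]
  bytes_used = fields[2].to_i
  bytes_free = fields[3].to_i
  kb_used = bytes_used / 1024       # integer division, like the DSL example
  kb_free = bytes_free / 1024

  {
    :bytes_used => bytes_used,
    :bytes_free => bytes_free,
    :kb_used    => kb_used,
    :kb_free    => kb_free,
    :mb_used    => kb_used / 1024.0,
    :mb_free    => kb_free / 1024.0
  }
end

report = memory_report(SAMPLE_FREE_OUTPUT)
```

Each derived value is built from the previous one, which mirrors how the later `reports` blocks in the metric reuse the earlier ones.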

Using Metrics

To configure Critical to monitor your metrics, use the monitor DSL:

require_metric 'disk_utilization'
require_metric 'memory_utilization'
require_metric 'cpu_utilization'
require_metric 'cluster'

# Monitors are also where you define your scheduling.
Monitor(:system) do

  # Monitor statements can be nested; the nesting is included in the
  # collected data for tracking/tagging purposes.
  Monitor(hostname) do # includes the hostname in the namespace

    # Specify collection intervals with +every+ or +collect_every+
    # The +every+ form takes a block, each monitor you define inside the block
    # will be scheduled to run at that interval.
    every(10=>:seconds) do

      disk_utilization('/') { track :percentage }

      memory_utilization { track :bytes_used }

      cpu_utilization { track :percent_used }

      cluster("critical : worker") do |c|
        c.track :processes
        c.track :total_cpu
        c.track :total_rss
        c.track :uptime
      end

    end
  end
end
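One way to picture how nesting and +every+ could combine is the toy class below. It is a sketch under stated assumptions, not Critical's implementation: the class `NamespaceSketch`, its `monitor`, `every` and `check` methods, and the path format are all hypothetical.

```ruby
# Toy model of nested monitors: each +monitor+ block pushes a name onto a
# namespace stack, and +every+ sets the interval for checks inside it.
class NamespaceSketch
  UNIT_SECONDS = { :seconds => 1, :minutes => 60, :hours => 3600 }.freeze

  attr_reader :scheduled

  def initialize
    @namespace = []
    @interval = nil
    @scheduled = []
  end

  def monitor(name, &block)
    @namespace.push(name.to_s)
    instance_eval(&block)
  ensure
    @namespace.pop
  end

  # Accepts a one-entry hash such as 10 => :seconds.
  def every(spec, &block)
    previous = @interval
    quantity, unit = spec.first
    @interval = quantity * UNIT_SECONDS.fetch(unit)
    instance_eval(&block)
  ensure
    @interval = previous
  end

  # Records a check with its full namespace path and current interval.
  def check(metric, target = nil)
    label = target ? "#{metric}(#{target})" : metric.to_s
    @scheduled << { :path => (@namespace + [label]).join('/'),
                    :interval => @interval }
  end
end

sketch = NamespaceSketch.new
sketch.monitor(:system) do
  monitor('web01') do
    every(10 => :seconds) do
      check(:disk_utilization, '/')
      check(:memory_utilization)
    end
  end
end
```

After this runs, `sketch.scheduled` holds two entries, each carrying the `system/web01` prefix and a 10 second interval, which is the kind of tagging the nested Monitor statements are described as providing.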

Running Critical:

See bin/critical --help and the examples directory.

Project Status

Initial work focused on the alerting half of the alerting/trending combo that comprises “monitoring.” I've pivoted and am currently focusing on making it dead simple to get data into Graphite. Alerting is still a long-term priority.

License and Copyright

Distributed under the terms of the Apache 2.0 license. © 2010,2011 Daniel DeLeo
