Infrastructure Monitoring As Code



Monitoring should be a layer in the stack, not an application.


gem install rspec

git clone git://


Critical is my take on network/infrastructure monitoring. Here are the big ideas:

  • Infrastructure as code: The monitoring system should be an internal DSL so it can natively interact with any part of your infrastructure you can find or write a library for. You should also be able to productively alter its guts if you need to. This is a monitoring system for ops people who write code and coders who do ops.

  • Client-based: This scales better, and is actually easier to configure if you use configuration management, which you should be doing anyway.

  • Continuous verification: Critical has a single shot mode in addition to the typical daemonized operation. This allows you to verify the configuration on a host after making changes and then continuously monitor the state of the system using the same verification tests.

  • Declarative: Declare what the state of your system is supposed to be.

  • Alerting and Trending together: a client/agent can do both of these at the same time with less configuration overhead. It makes sense to keep them separate on the server side.

  • Licensing: “Do what thou wilt shall be the whole of the law,” except for patent trolls, etc. So, Apache 2.0 it is.


Critical runs as a cluster of daemons. The master process does the scheduling and assigns tasks to workers over a UNIX domain socket. The workers listen on the socket and process tasks as they arrive. I also considered an evented architecture (using EventMachine), but that would have required users to write plugins using only EM-based libraries or risk running into problems with blocking I/O.
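
The master/worker hand-off described above can be sketched in a few lines of plain Ruby. This is a toy under stated assumptions, not Critical's code: the method name run_tasks, the round-robin "scheduler", the "task:done" reply format, and the use of one socket pair per worker (rather than a single shared socket) are all made up for illustration.

```ruby
require 'socket'

# Hypothetical sketch: a master forks workers and feeds them task names
# over UNIX domain sockets, reading back one result line per task.
def run_tasks(tasks, worker_count = 2)
  workers = []
  worker_count.times do
    master_end, worker_end = UNIXSocket.pair
    pid = fork do
      master_end.close
      # Drop inherited copies of sockets that belong to earlier workers,
      # otherwise those workers would never see EOF.
      workers.each { |w| w[:socket].close }
      # Worker loop: block on the socket and handle tasks as they arrive.
      while (line = worker_end.gets)
        worker_end.puts("#{line.chomp}:done")
      end
      worker_end.close
      exit!(0) # skip at_exit hooks inherited from the parent
    end
    worker_end.close
    workers << { pid: pid, socket: master_end }
  end

  # "Scheduling" here is plain round-robin assignment (an assumption).
  results = tasks.each_with_index.map do |task, i|
    socket = workers[i % worker_count][:socket]
    socket.puts(task)
    socket.gets.chomp
  end

  workers.each { |w| w[:socket].close } # EOF tells each worker to stop
  workers.each { |w| Process.wait(w[:pid]) }
  results
end
```

Because each worker is a separate process doing ordinary blocking reads, a slow or blocking plugin only stalls its own worker, which is the trade-off the paragraph above describes relative to an evented design.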

Metric DSL

Critical provides a DSL for writing metric gathering code. It looks like this:

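Metric(:memory_utilization) do
  # Platform dispatch. Matching on RUBY_PLATFORM is an assumption; the
  # original case expression is missing from this listing.
  case RUBY_PLATFORM
  when /darwin/
    # omitted...

  when /linux/

    collects 'free -b'

    reports(:total_memory_in_kb => :int) do
      # omitted...
    end

    reports(:bytes_used => :int) do
      # omitted...
    end

  else
    raise UnsupportedPlatform, "memory_utilization does not have an implementation for your platform yet :("
  end

  # Derived values are built from the platform-specific reports above.
  reports(:kb_free => :integer) do
    bytes_free / 1024
  end

  reports(:kb_used => :integer) do
    pp :kb_used => (bytes_used / 1024) # debug output
    bytes_used / 1024
  end

  reports(:mb_free => :float) do
    kb_free / 1024.0
  end

  reports(:mb_used => :float) do
    kb_used / 1024.0
  end
end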
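
To give a feel for how a reports-style declaration can work as an internal DSL, here is a minimal sketch. It is not Critical's implementation: the class MiniMetric, the define helper, the coercion table, and the hard-coded byte count (standing in for parsed command output) are assumptions made for illustration.

```ruby
# Hypothetical miniature of a metric DSL.
class MiniMetric
  # Type names mapped to coercion lambdas (assumed set of types).
  COERCIONS = {
    int: ->(v) { Integer(v) },
    integer: ->(v) { Integer(v) },
    float: ->(v) { Float(v) }
  }.freeze

  # Build an anonymous subclass whose body is the DSL block.
  def self.define(&block)
    Class.new(self, &block)
  end

  # reports(:name => :type) { ... } defines a reader method that runs the
  # block against the instance and coerces the result to the declared type.
  def self.reports(spec, &block)
    name, type = spec.first
    coerce = COERCIONS.fetch(type)
    define_method(name) { coerce.call(instance_exec(&block)) }
  end
end

MemorySketch = MiniMetric.define do
  reports(:bytes_used => :int) { 3 * 1024 * 1024 } # stand-in for real data
  reports(:kb_used => :integer) { bytes_used / 1024 }
  reports(:mb_used => :float) { kb_used / 1024.0 }
end

sample = MemorySketch.new
sample.kb_used # 3072
sample.mb_used # 3.0
```

Since each report becomes an ordinary method, later reports can build on earlier ones by name, which is what lets the derived kb and mb values above stay one line each.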


Using Metrics

To configure Critical to monitor your metrics, use the monitor DSL:

require_metric 'disk_utilization'
require_metric 'memory_utilization'
require_metric 'cpu_utilization'
require_metric 'cluster'

# Monitors are also where you define your scheduling.
Monitor(:system) do

  # Monitor statements can be nested, this nesting will be included in the
  # collected data for tracking/tagging purposes.
  Monitor(hostname) do # includes the hostname in the namespace

    # Specify collection intervals with +every+ or +collect_every+
    # The +every+ form takes a block, each monitor you define inside the block
    # will be scheduled to run at that interval.
    every(10=>:seconds) do

      disk_utilization('/') { track :percentage }

      memory_utilization { track :bytes_used }

      cpu_utilization { track :percent_used }

      cluster("critical : worker") do |c|
        c.track :processes
        c.track :total_cpu
        c.track :total_rss
        c.track :uptime
      end
    end
  end
end
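
As a rough illustration of how nested Monitor blocks and every(10=>:seconds) could translate into a namespaced schedule, consider the sketch below. Everything in it is hypothetical: the class MonitorSketch, the check method (a stand-in for metric calls such as disk_utilization), the "/" path separator, and the :minutes unit are assumptions, not Critical's API.

```ruby
# Hypothetical sketch of namespace nesting and interval scoping.
class MonitorSketch
  UNIT_SECONDS = { seconds: 1, minutes: 60 }.freeze # assumed units

  attr_reader :schedule

  def initialize
    @namespace = []
    @interval = nil
    @schedule = []
  end

  # Push a namespace segment for the duration of the block.
  def Monitor(name, &block)
    @namespace.push(name.to_s)
    instance_eval(&block)
  ensure
    @namespace.pop
  end

  # every(10 => :seconds) { ... } sets the interval for enclosed checks.
  def every(spec, &block)
    count, unit = spec.first
    previous, @interval = @interval, count * UNIT_SECONDS.fetch(unit)
    instance_eval(&block)
  ensure
    @interval = previous
  end

  # Record one scheduled check under the current namespace and interval.
  def check(name)
    path = (@namespace + [name.to_s]).join('/')
    @schedule << { path: path, interval: @interval }
  end
end

sketch = MonitorSketch.new
sketch.Monitor(:system) do
  Monitor('web01') do
    every(10 => :seconds) do
      check :disk_utilization
      check :memory_utilization
    end
  end
end
```

After this runs, sketch.schedule holds one entry per check, each carrying its full namespace path and its interval in seconds.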


Running Critical

See bin/critical --help and the examples directory.

Project Status

Initial work focused on the alerting half of the alerting/trending combo that comprises “monitoring.” I've pivoted and am currently focusing on making it dead simple to get data into Graphite. Alerting is still a long-term priority.
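
For context on what "getting data into Graphite" involves at the wire level: Graphite's plaintext protocol accepts one data point per line, made up of a metric path, a value, and a Unix timestamp separated by spaces, conventionally on TCP port 2003. The helper below is a hypothetical sketch, not part of Critical; the function name and the example metric path are made up.

```ruby
# Hypothetical helper: format one data point for Graphite's plaintext protocol.
def graphite_line(path, value, time = Time.now)
  "#{path} #{value} #{time.to_i}\n"
end

line = graphite_line('system.web01.memory.bytes_used', 42, Time.at(1_300_000_000))

# Sending it would look roughly like this (host name is an assumption):
#   require 'socket'
#   TCPSocket.open('graphite.example.com', 2003) { |s| s.write(line) }
```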

License and Copyright

Distributed under the terms of the Apache 2.0 license. © 2010,2011 Daniel DeLeo
