Skip to content

Simple DSL for writing metrics collector agents

License

Notifications You must be signed in to change notification settings

divanikus/salus

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

54 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Salus

Salus is a simple DSL for writing collector agents for different monitoring systems. I'm just tired of rewriting those primitives from scratch for every new check I'm willing to add to a monitoring system.

This is alpha quality software right now, but you might help to improve it

Installation

Add this line to your application's Gemfile:

gem 'salus'

And then execute:

$ bundle

Or install it yourself as:

$ gem install salus

Usage

The gem can be used from your own script or by using CLI.

Quick sample:

require "json"
default ttl: 60
var state_file: "system.state.yml"

group "cpu" do
  data = File.open("/proc/stat").read.split(/\n/).grep(/^cpu /)
  data.each do |l|
    name, user, nice, csystem, idle, iowait, irq, softirq = l.split(/\s+/)

    busy  = user.to_i + nice.to_i + csystem.to_i + iowait.to_i + irq.to_i + softirq.to_i
    total = busy + idle.to_i

    counter "busy", value: busy, mute: true
    counter "total", value: total, mute: true
    gauge "usage" do
      value("busy") / value("total") * 100
    end
  end
end

group "memory" do
  data = File.open("/proc/meminfo").read.split(/\n/).grep(/^Mem/)
  data.each do |l|
    name, value = l.match(/Mem(?<name>.+):\s+(?<value>\d+)\s/)[1,2]
    gauge name.downcase, value: value.to_i, mute: true
  end
  gauge "usage" do
    (value("total") - value("free")) / value("total").to_f * 100
  end
end

render do |data|
  iterate(data) do |name, metric|
    puts ({:name => name, :value => metric.value, :timestamp => metric.timestamp, :ttl => metric.ttl}.to_json)
  end
end

Save it as sample.salus and run salus.

$ salus -f sample.salus
{"name":"cpu.usage","value":6.433521607455635,"timestamp":1510699623.8592708,"ttl":60}
{"name":"memory.usage","value":25.60121068625779,"timestamp":1510699623.863265,"ttl":60}

Because cpu.usage is made of counters, you'll have to run the command at least twice. Be aware of salus making system.state.yml for persisting it's state. You may redefine file name and location with -s switch.

You can also run in infinite loop mode

$ salus loop -f sample.salus
{"name":"cpu.usage","value":12.528473606797895,"timestamp":1510700076.4455612,"ttl":60}
{"name":"memory.usage","value":25.57683405209171,"timestamp":1510700076.447059,"ttl":60}
{"name":"cpu.usage","value":8.239946939924048,"timestamp":1510700106.3334389,"ttl":60}
{"name":"memory.usage","value":25.535366304014968,"timestamp":1510700106.33444,"ttl":60}
{"name":"cpu.usage","value":2.840158185017875,"timestamp":1510700136.364328,"ttl":60}
{"name":"memory.usage","value":25.537265316671707,"timestamp":1510700136.36533,"ttl":60}
^C

You may invoke several salus scripts at once, just specify all of them in a space delimited list. You might also specify a directory with Salusfile or many *.salus files. By default salus does search *.salus and Salusfile in the current directory.

Primitives

Group

Group is the base unit of work. You should write your metric collecting code inside groups. Groups can be nested. Top level groups are run in separate threads. Group might include metrics. Groups must be named.

group "test" do
  gauge "test1", value: 10
  gauge "test2", value: 30, mute: true
end

Metrics

Could be one of the following (mimicking RRDtool data sources):

  • Gauge: values stored as is
  • Derive: a rate of something per second, best suites for values that rarely overflow
  • Counter: almost same as derive, but with overflow detection for 32- and 64-bit counters
  • Absolute: a rate of a counter, which resets on reading
  • Text: just a text stored as is

A metric should have a name and a value at the very minimum. You might also specify a custom or default TTL. A metric can be also mute, which means it wouldn't appear in the output, unless told to do so.

An expired metric is considered invalid, so if you use counter or derive with ttl less than collecting interval, you'll always get nils.

group "test" do
  gauge   "test1", value: 10, ttl: 50
  counter "test2", value: 10, mute: true
  derive  "test3", value: 30, timestamp: Time.now.to_f + 10
end

You might also use a block to calculate metrics value. In this case any exception which happened in the block would be muted, and the nil value returned instead.

group "test" do
  # this would always produce nil
  gauge "division by zero" do
    100 / 0
  end
end

Renderers

A renderer is a class, which is used to render the actual output, would it be just STDOUT, a file or tcp/ip service.

render(StdoutRenderer.new(separator: "/"))

This code would produce something like that

[2017-11-15 02:25:16 +0300] cpu/usage - 4.00
[2017-11-15 02:25:16 +0300] memory/usage - 26.23

You may add more than one renderer at once and send your data to as many monitoring services as you want.

Check sample renderers for examples.

Pipeline

Salus pipeline is rather straightforward and consists of two stages:

  • Collect data (execute groups' code)
  • Send data (execute renderers' code)

You might run it once by cron or in infinite loop mode. Each stage is executed using embed thread pool with pre-set timeouts.

Thread pool and timeouts could be configured using configure

Salus.configure do |config|
  # Thread pool settings
  config.min_threads = (CPU.count / 2 == 0) ? 1 : CPU.count / 2
  config.max_threads = CPU.count * 2

  config.interval = 30 # Interval between runs in loop mode
  config.tick_timeout   = 15 # Data collection timeout
  config.render_timeout = 10 # Data rendering timeout
  config.logger   = Logger.new(STDERR) # Default logger
end

Zabbix

Zabbix uses two stage collecting. First of all, it queries (discovers) the list of objects to be checked. Next, it would ask for exact values of specified metrics of an object one by one. Sometimes this means making a lot of requests to a monitored service. So many script writers use some kind of result caching to lower unnecessary work. Salus also writes a result cache file (-c flag). Cache TTL for a metric is a half of it's real TTL or 60 seconds if TTL is unspecified. Upon parameter request it is loaded from cache and if it's expired, whole cache is invalidated and recalculated using salus script.

Sample Salus script for collecting CPU usage ratio on Linux for Zabbix Agent is something like that:

require "salus/zabbix"

default ttl: 60
var state_file: "cpu.state.yml"
var zabbix_cache_file: "cpu.cache.yml"

discover "cpus" do |data|
  stat = File.open("/proc/stat").read.split(/\n/).grep(/^cpu\d/)
  stat.each do |l|
    name, = l.split(/\s+/)
    data << {"\#{CPUNAME}" => name}
  end
end

group "cpu" do
  stat = File.open("/proc/stat").read.split(/\n/).grep(/^cpu\d/)
  stat.each do |l|
    name, user, nice, csystem, idle, iowait, irq, softirq = l.split(/\s+/)

    busy  = user.to_i + nice.to_i + csystem.to_i + iowait.to_i + irq.to_i + softirq.to_i
    total = busy + idle.to_i

    counter "busy[#{name}]", value: busy, mute: true
    counter "total[#{name}]", value: total, mute: true
    gauge "usage[#{name}]" do
      value("busy[#{name}]") / value("total[#{name}]") * 100
    end
  end
end

You can run it using command salus.

$ salus zabbix discover cpus -f zabbix.salus
{"data":[{"#{CPUNAME}":"cpu0"},{"#{CPUNAME}":"cpu1"},{"#{CPUNAME}":"cpu2"},{"#{CPUNAME}":"cpu3"},{"#{CPUNAME}":"cpu4"},{"#{CPUNAME}":"cpu5"},{"#{CPUNAME}":"cpu6"},{"#{CPUNAME}":"cpu7"}]}

Later you can get a cpu usage ratio

$ salus zabbix parameter cpu.usage[cpu5] -f zabbix.salus && sleep 31 && salus zabbix parameter cpu.usage[cpu5] -f zabbix.salus
8.619550926524335

NOTE! You won't get the result on the first run, because cpu usage on Linux needs to get two points in time to be calculated. zabbix subcommand uses caching of the results, so you have to wait for 30 seconds to get next result. But if you'll wait for more than TTL (60 seconds), you'll get empty result again.

New dependant items mode of Zabbix 3.4+ is also supported. Output is in JSON.

$ salus zabbix bulk cpu -f zabbix.salus
{"usage[cpu0]":8.362702709145317,"usage[cpu1]":2.2395325673158424,"usage[cpu2]":3.570268951237728,"usage[cpu3]":4.641349636608478,"usage[cpu4]":3.7313418117547834,"usage[cpu5]":4.023359928822469,"usage[cpu6]":3.0834133549568365,"usage[cpu7]":4.47470699450407}

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

Special thanks

Salus uses portions of code (meh's thread pool and future implementation) and concepts from both project and thor for CLI implementation.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/divanikus/salus.