Overview: Ganglia Dynamic Metrics?

Dave Carroll edited this page Dec 29, 2016 · 6 revisions

Ganglia Dynamic Metrics

Ganglia is great for long-term trending of metrics across your enterprise. I've used it at several companies, authoring hundreds of modules to cover a wide variety of metric-gathering situations. One area where Ganglia is less than ideal is short-term dynamic metrics, and getting them to work takes some effort. What do I mean by "short-term dynamic metrics"? Here are some examples:

What if I'd like a metric graph showing which processes on a system have swapped? This data changes as processes are swapped out or cleared. I just want to see what has swapped over the past 60 seconds, not any sort of long-term trend.

In another example, I have farms of web servers acting as an API endpoint that handle a billion hits a day from customers around the globe. At any given time I'd like to see which customers are sending the highest volume of a given type of connection, ranked by a customer id provided in the request. Specifically, I'd like to know the top 5 senders of traffic by volume every 60 seconds. This sort of metric is highly dynamic: the ranking can shift every 60 seconds. Some might argue Ganglia is not the best platform for this, but I like everything consolidated into as few monitoring platforms as possible. Let's use this scenario as our example of a dynamic metric in Ganglia.
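To make the "top 5 every 60 seconds" idea concrete, here is a minimal sketch of the ranking step. It assumes the hits have already been parsed into (epoch seconds, customer id) pairs; the function name, the input shape, and the sample ids are all hypothetical and not taken from any Ganglia tooling.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 60  # assumed window size, matching the 60-second example


def top_senders_per_window(hits, n=5, window=WINDOW_SECONDS):
    """Rank customers by hit count within each time window.

    hits: iterable of (epoch_seconds, customer_id) pairs (hypothetical shape).
    Returns {window_start: [(customer_id, count), ...]}, highest count first.
    """
    buckets = defaultdict(Counter)
    for ts, customer_id in hits:
        # Align each timestamp to the start of its window.
        window_start = int(ts) - int(ts) % window
        buckets[window_start][customer_id] += 1
    return {start: counts.most_common(n) for start, counts in buckets.items()}


# Hypothetical sample: two 60-second windows with made-up customer ids.
sample_hits = [(0, "acme"), (10, "globex"), (59, "acme"),
               (60, "globex"), (61, "globex"), (119, "initech")]
ranking = top_senders_per_window(sample_hits, n=5)
```

Because the ranking is recomputed per window, the set of customers it returns can be completely different from one minute to the next, which is exactly what makes the metric "dynamic".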

The primary issue in my example is not gathering the metrics but displaying them. I am not dealing with a small, fixed number of customers or endpoints, so as I identify connecting sources and pump them, along with their number of "connections", into my Ganglia instance, I quickly find that hundreds or maybe thousands of .rrd files, and therefore metric graphs, are being created. Not only does this get messy quickly, it can make my dynamically drawn node pages take an eternity to load. This is not ideal at all. In addition, I have to scour thousands of metric graphs manually to find the top 5 by volume. There must be a better way. It turns out that, if you are willing to be "hacky", there is. As with anything, there are probably many ways to accomplish this, but this article describes how I have it working. It requires some unconventional approaches to Ganglia, plus jobs running locally on your Ganglia Web system that kill off stale .rrd files and produce dynamic .json files for graphs.
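As an illustration of the "dynamic .json files" idea, the sketch below builds one graph definition that overlays a handful of metrics on a single graph. The key names (report_name, report_type, title, vertical_label, series) follow the JSON report layout as I understand it from Ganglia Web's graph.d directory; treat them as an assumption and check them against your Ganglia Web version. The function name, the color palette, and the metric names are hypothetical, and this is not the DynamicGraph implementation.

```python
import json

# Hypothetical color palette, cycled when there are more metrics than colors.
PALETTE = ["cc3333", "33cc33", "3333cc", "cc9933", "9933cc"]


def build_report(report_name, title, metric_names, vertical_label="connections"):
    """Return a dict describing one graph that overlays the given metrics.

    The key names are assumed to match Ganglia Web's JSON report format.
    """
    series = [
        {
            "metric": name,
            "color": PALETTE[i % len(PALETTE)],
            "label": name,
            "line_width": "2",
            "type": "line",
        }
        for i, name in enumerate(metric_names)
    ]
    return {
        "report_name": report_name,
        "report_type": "standard",
        "title": title,
        "vertical_label": vertical_label,
        "series": series,
    }


# Hypothetical metric names for the current top senders.
report = build_report("top_senders_report", "Top senders (last 60s)",
                      ["conn_acme", "conn_globex"])
report_json = json.dumps(report, indent=2)
```

A periodic job could regenerate this file each time the ranking changes, so a single graph always shows the current top senders instead of thousands of per-customer graphs.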

What began as a fun challenge turned, over time, into a very useful extension of Ganglia for my business. The solution eventually came to consist of:

  • A sending program on each web server that gathers metrics from log files in near real time
  • A program called DynamicGraph that sweeps my rrd tree for metrics, aging them out on a schedule I define
  • A local SQLite database, later replaced by Redis, to cache required information
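The aging-out step in the list above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not DynamicGraph's actual code: it assumes dynamic metrics share a filename prefix and that a file's modification time is a good enough proxy for "last updated". The function name, the prefix, and the path in the usage comment are hypothetical.

```python
import os
import time


def sweep_rrds(rrd_root, prefix, max_age_seconds, now=None):
    """Delete stale dynamic-metric .rrd files under rrd_root.

    A file is removed when its name starts with `prefix`, ends with ".rrd",
    and its mtime is more than max_age_seconds in the past.
    Returns the sorted list of removed paths.
    """
    now = time.time() if now is None else now
    removed = []
    for dirpath, _dirnames, filenames in os.walk(rrd_root):
        for filename in filenames:
            if not (filename.startswith(prefix) and filename.endswith(".rrd")):
                continue  # leave ordinary long-term metrics alone
            path = os.path.join(dirpath, filename)
            if now - os.path.getmtime(path) > max_age_seconds:
                os.remove(path)
                removed.append(path)
    return sorted(removed)


# Example call (path and prefix are assumptions; adjust to your install):
# sweep_rrds("/var/lib/ganglia/rrds", "conn_", max_age_seconds=300)
```

Restricting the sweep to a prefix keeps the long-term trending metrics untouched while the short-lived ones are cleaned up.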

This post offers some hints about how I went about tweaking Ganglia to offer my team more than trending. I include some code snippets but leave it to the reader to create their own solution. You can now install Ganglia DynamicGraph from the Python Package Index (PyPI) or review my source. Installing is very easy: 'pip install ganglia-dyngraph'

To begin, let's look at an example of dynamic graph usage from the gmond side of things.

Part I: Dealing With High Volume Access-Logs