GangliaRest API: Part 2


See Part 1 if you arrived here without reading it

After completing the first pass of the project to expose Ganglia metrics via HTTP calls, I wondered whether it would scale to the level we required. Its ease of integration into existing monitoring, and the new possibilities it offered, meant it was seeing more and more use. I am unsure how well the web.py module will scale, and I may need to consider fronting it with Apache; time will tell. While considering other bottlenecks I could improve, I figured the easiest to address was the directory and file scanning that happens while searching for the correct host and metric across the rrd tree.

My idea was to use an approach similar to the one I use to cache information for building dynamic graphs. Redis was already running locally to support that and other initiatives, so I decided to create a new instance to hold cached metric location data.

To start, I added a new class, specific to this process, to the Redis module in my ganglia_tools package (shown with some debugging statements still enabled):

import os
import re
import redis

# cfg and loglib are assumed to come from elsewhere in the ganglia_tools
# package (cfg holds the parsed /etc/GangliaRest.cfg settings).

class Check_Redis_GangliaRest(object):
    ''' This class is responsible for handling our Ganglia RRD locations.
        We use a Redis DB instance to cache file system locations to lessen
        walking the filesystem when locating metrics under the rrd tree. '''

    rootDir = cfg.rrdDir
    logfile = cfg.logfile

    def __init__(self, hostname):
        ''' We are looking for our hostnames in our rrd tree '''

        self.hostname = hostname

        ''' We have to acquire both hostname and fqdn to test '''

        if self.hostname.endswith(cfg.domain):
            self.non_fqdn = self.hostname.split('.', 1)[0]
            self.fqdn = self.hostname

        else:
            self.non_fqdn = self.hostname
            self.fqdn = self.hostname + '.' + cfg.domain

    def is_redis_available(self, conn):
        ''' Check if Redis is running and available - if not, write a hint to the logfile '''

        try:
            conn.get(None)  # getting None returns None or throws an exception

        except (redis.exceptions.ConnectionError,
                redis.exceptions.BusyLoadingError):
            if os.path.isfile('/var/run/redis/redis.pid'):
                loglib(self.logfile, "WARN: Redis pid file exists but problem connecting to Redis")
            else:
                loglib(self.logfile, "ERROR: Redis is not responding and no pidfile was found. Ensure Redis is running")

        try:
            conn.ping()

        except redis.exceptions.RedisError:
            loglib(self.logfile, "ERROR: Redis is not reachable. Ensure it is configured correctly, and check /etc/GangliaRest.cfg for proper configuration.")
            return False

        return True

    def redis_lookup(self):

        r = redis.Redis(
            host=cfg.redisHost,
            port=cfg.redisPort,
            db=cfg.redisDb,
            password=cfg.redisAuth)

        if not self.is_redis_available(r):
            return None

        # Check our cache for the host location
        if r.exists(self.fqdn):
            self.location = r.get(self.fqdn)
            loglib(self.logfile, "CACHE HIT: Redis: returning %s in Check_Redis_GangliaRest:redis_lookup" % self.location)
            return(self.location)

        elif r.exists(self.non_fqdn):
            self.location = r.get(self.non_fqdn)
            loglib(self.logfile, "CACHE HIT: Redis: returning %s in Check_Redis_GangliaRest:redis_lookup" % self.location)
            return(self.location)

        else:
            ''' Location not found in cache; find it on the filesystem and load
                it into the cache to improve lookup performance for API requests.
                Because we may have been passed either a hostname or an fqdn to
                search for, we have to look for both. '''

            try:
                for dirName, subdirList, fileList in os.walk(self.rootDir):
                    for host in subdirList:
                        ''' We have to account for hosts that were named by fqdn
                            or simply by hostname, or we will fail to locate them '''
                        m = re.match(self.non_fqdn, host)
                        if m:
                            self.hostname = host  # reset the host to the correct name on the filesystem
                            #loglib(self.logfile,"INFO: Host being set as %s" % self.hostname)
                            self.location = os.path.abspath(dirName + '/' + self.hostname)

                            ''' Now we need to add to our Redis cache with a TTL of one day
                                to account for transitory systems like VMs that spin down '''
                            try:
                                #print("Setting key for %s and val %s" % (self.hostname,self.location))
                                # Legacy redis.Redis client signature: setex(name, value, time)
                                r.setex(self.hostname, self.location, cfg.redisTtl)
                                return(self.location)

                            except redis.exceptions.RedisError:
                                loglib(self.logfile, "ERROR: Error writing to Redis under Check_Redis_GangliaRest:redis_lookup")
                                exit(1)

                        else:
                            #loglib(self.logfile,"ERROR: Unable to match requested host %s using match %s in rrdtree under Check_Redis_GangliaRest:redis_lookup" % (host,self.non_fqdn))
                            continue  # we don't want to log every miss

            except Exception:
                loglib(self.logfile, "ERROR: Unable to find host %s in the rrd tree or write to Redis under Check_Redis_GangliaRest:redis_lookup" % self.non_fqdn)

As you can see in the code above, this class takes the hostname that was passed in on the initial HTTP request and checks the local Redis cache for a matching key. If found, it returns the location of that host within the rrd tree; if not, it scans the filesystem for that directory, inserts the location into Redis for one day, and returns it to the main program.

First, however, we needed to handle the fact that some hosts are registered by a single short hostname while others are configured with an fqdn. We therefore have to search both the Redis keys and the directory names for either web1.example.com or web1.
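For illustration, resolving either form lands on the same cached entry. Here web1 and example.com are hypothetical, standing in for a real host in the rrd tree and the cfg.domain setting:

locator = Check_Redis_GangliaRest('web1')                # short hostname
print locator.redis_lookup()                             # e.g. /var/lib/ganglia/rrds/webapp_cluster/web1

locator = Check_Redis_GangliaRest('web1.example.com')    # fqdn form
print locator.redis_lookup()                             # same path, now answered from the cache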

To recap: when a request arrives that looks like http://my_gweb:8659/node/my_hostname/get_metric/load_one, we want to take the hostname passed in and locate its directory buried under /var/lib/ganglia/rrds. As we have hundreds of systems across dozens of clusters, it can take dozens of directory scans to find the actual host directory, after which we search all the metric files therein for the desired metric. So we employ Redis to house locations in a key/value manner that looks like:

k=webapp1 v='/var/lib/ganglia/rrds/webapp_cluster/webapp1'

Knowing the above, my app can go straight to the path stored in the value. From observed behavior, this provided a 5x to 10x speedup. I cache the location in Redis for 86400 seconds (one day) to allow dynamic hosts to come and go as they are spun up or down, and to prevent stale entries. This can of course be adjusted as needed.
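As a quick sketch of what such a cache entry looks like, assuming a local Redis on the default port and db 1 (note the legacy redis.Redis client used throughout this project takes setex arguments as name, value, time; redis.StrictRedis and newer clients use name, time, value):

import redis

r = redis.Redis(host='localhost', port=6379, db=1)

# Cache the host's rrd directory for one day (legacy arg order: name, value, time)
r.setex('webapp1', '/var/lib/ganglia/rrds/webapp_cluster/webapp1', 86400)

print r.get('webapp1')   # /var/lib/ganglia/rrds/webapp_cluster/webapp1
print r.ttl('webapp1')   # seconds remaining before the key expires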

OK, so with our new class handling the Redis work, we need to adjust the original GangliaRest program to stop simply tearing through the filesystem and instead call the new class to handle searching. The original code had a method called locate_file, which performed the less efficient lookup. We're going to bypass that method by adjusting our GET method a bit. Our GET method should now look something more like:

def GET(self,node='None',req='None'):
    ''' pass metric req list to get_metric_value '''

    self.metric_list = []
    self.node = node
    self.req = req+'.rrd'

    loglib(logfile,'REQUEST: request for metric %s on node %s' % (self.req,self.node))

    # Resolve the host's rrd directory via the Redis-backed lookup class
    locateDir = Check_Redis_GangliaRest(self.node)
    self.hostLocation = locateDir.redis_lookup()
    self.metric = [i for i in os.listdir(self.hostLocation) if i == self.req]

    #loglib(logfile,'INFO: Found location as %s' % self.hostLocation)
    #loglib(logfile,'INFO: Checking metric %s' % self.metric)

    GetMetric.reqsCount +=1

    self.metric_list.append(self.req)

    #loglib(logfile,'Appending %s to metric_list' % self.metric)
    #loglib(logfile,'Sending %s and %s to gmv' % (self.hostLocation,self.metric_list))

    ans = gmv.GetMetricValue(self.hostLocation,self.metric_list)

    v = None   # ensure v is defined even if the metric lookup below fails
    try:
        for k,v in ans.to_sort.items():
            print k,v
            loglib(logfile,"RESPONSE: returning value of %s for metric %s" % (k,v))
            GetMetric.respCount +=1
    except Exception as e:
        GetMetric.errorCount +=1
        print(e)
        loglib(logfile,"ERROR: Error thrown was %s" % e)

    with open(statefile,'w') as f:
        f.write("Reqs: %d\n" % GetMetric.reqsCount)
        f.write("Resp: %d\n" % GetMetric.respCount)
        f.write("Errs: %d\n" % GetMetric.errorCount)

    return(v)
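With that in place, the API can be exercised from Python as well as from a browser. Here my_gweb, webapp1, and load_one are placeholders following the URL shape shown earlier:

import urllib2

resp = urllib2.urlopen('http://my_gweb:8659/node/webapp1/get_metric/load_one')
print resp.read()   # the metric value returned by the GET handler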

The adjustments center on calling the new Check_Redis_GangliaRest class. You may notice I am also writing out some counts to a state file. I use this to trend how many requests, responses, and failures occur across GangliaRest per 30 seconds, which should help me identify any issues around failures or volume. I also log output to a logfile for debugging.
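A minimal sketch of how that state file could be trended; the path here is a placeholder, and any poller that diffs successive counter snapshots every 30 seconds would do:

import time

statefile = '/tmp/GangliaRest.state'   # hypothetical path to the state file

def read_counts(path):
    ''' Parse lines like "Reqs: 42" into a dict of counters '''
    counts = {}
    with open(path) as f:
        for line in f:
            key, val = line.split(':')
            counts[key.strip()] = int(val)
    return counts

prev = read_counts(statefile)
while True:
    time.sleep(30)
    cur = read_counts(statefile)
    for key in ('Reqs', 'Resp', 'Errs'):
        print "%s per 30s: %d" % (key, cur[key] - prev[key])
    prev = cur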

When GangliaRest first starts up, it sees that the cache is empty and primes it for us by running an indexing pass over our rrd tree. In the next section we will discuss the quick indexer I put together to help with performance.
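The actual indexer is the subject of the next part, but the idea is roughly this. A sketch only, assuming the standard /var/lib/ganglia/rrds layout of cluster directories containing host directories:

import os
import redis

rootDir = '/var/lib/ganglia/rrds'
r = redis.Redis(host='localhost', port=6379, db=1)

# Walk one level of cluster directories and cache every host directory
# beneath them, using the same one-day TTL as the lookup path.
for cluster in os.listdir(rootDir):
    clusterDir = os.path.join(rootDir, cluster)
    if not os.path.isdir(clusterDir):
        continue
    for host in os.listdir(clusterDir):
        hostDir = os.path.join(clusterDir, host)
        if os.path.isdir(hostDir):
            r.setex(host, hostDir, 86400)  # legacy (name, value, time) order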

[Image: ganglia_api, a graph of GangliaRest request and response rates]

The image above shows requests and responses matching at around 2 per second. We currently have about 55 metrics being polled, with many more planned.

Part III: GangliaRest Indexer
