In-memory caching using only Python #18

Merged
merged 7 commits into danwent:master

1 participant

@daveschaefer
Collaborator

Here is a module that implements Least Recently Used (LRU) caching using only Python. It can be used on servers that don't have a dedicated caching system.

In my performance tests I hit the server with 200 requests per second for 60 seconds, repeated the run 20 times, and averaged the results. Against a PostgreSQL database, running with --pycache vs. no caching showed these improvements:

100% increase in average requests served in 60 seconds (4,041 -> 8,094)
13% decrease in average request timeouts in 60 seconds (1,715 -> 1,484)
92% decrease in average response time (546ms -> 46ms)

Additionally, the server no longer became overwhelmed and unresponsive.
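For anyone curious about the API, here is a minimal sketch of how the new module is used directly (the key and value below are illustrative only, not the server's real key format):

```python
from util import pycache

pycache.set_cache_size(50 * 1024 * 1024)  # 50MB, the module's default

# store a value for one hour (expiry is given in seconds)
pycache.set('observations:example.com', '<serialized records>', 3600)

# later lookups are served from RAM; returns None if missing, expired, or evicted
data = pycache.get('observations:example.com')
```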

@daveschaefer daveschaefer was assigned
@daveschaefer daveschaefer merged commit eb98fab into danwent:master
@daveschaefer daveschaefer referenced this pull request
Closed

+caching #8

README (5 lines changed)
@@ -118,6 +118,11 @@ this type, we do not regularly test it. Currently, the threaded_scanner.py will
for SSL services, though work to have it scan for SSH services is pretty minor.
+==== MORE INFO ====
+
+See doc/advanced_notary_configuration.txt for tips on improving notary performance.
+
+
==== CONTRIBUTING ====
Please visit the github page to submit changes and suggest improvements:
doc/advanced_notary_configuration.txt (9 lines changed)
@@ -0,0 +1,9 @@
+There are several options that can make your notary run even better.
+
+
+1. Set up caching!
+
+Data caching will significantly increase your notary's performance.
+
+For best performance you may want to use a dedicated caching server such as memcached, memcachier, or redis. If you do not have access to or don't want to set up a dedicated caching server, use the built-in python caching with '--pycache'.
+
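For example, an operator could launch the notary with the built-in cache via `python notary_http.py --pycache` (50MB default) or with an explicit size such as `python notary_http.py --pycache 2GB` (invocation shown for illustration; see the notary_http.py changes below for the exact argument definition).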
notary_http.py (8 lines changed)
@@ -37,7 +37,7 @@ class NotaryHTTPServer:
Collect and share information on website certificates from around the internet.
"""
- VERSION = "3.1"
+ VERSION = "pre3.2a"
DEFAULT_WEB_PORT=8080
ENV_PORT_KEY_NAME='PORT'
STATIC_DIR = "notary_static"
@@ -66,6 +66,10 @@ def __init__(self):
help="Use memcachier to cache observation data. " + cache.Memcachier.get_help())
cachegroup.add_argument('--redis', action='store_true', default=False,
help="Use redis to cache observation data. " + cache.Redis.get_help())
+ cachegroup.add_argument('--pycache', default=False, const=cache.Pycache.CACHE_SIZE,
+ nargs='?', metavar=cache.Pycache.get_metavar(),
+ help="Use RAM to cache observation data on the local machine only.\
+ If you don't use any other type of caching, use this! " + cache.Pycache.get_help())
args = parser.parse_args()
@@ -103,6 +107,8 @@ def __init__(self):
self.cache = cache.Memcachier()
elif (args.redis):
self.cache = cache.Redis()
+ elif (args.pycache):
+ self.cache = cache.Pycache(args.pycache)
self.active_threads = 0
self.args = args
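A note on the argparse wiring above: because `--pycache` uses `nargs='?'` with both `default` and `const`, the flag itself is optional and its size argument is also optional. A standalone sketch of that behavior, using the same default of "50":

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--pycache', default=False, const="50", nargs='?')

print(parser.parse_args([]).pycache)                    # False - caching disabled
print(parser.parse_args(['--pycache']).pycache)         # '50'  - flag alone uses the 50MB default
print(parser.parse_args(['--pycache', '2GB']).pycache)  # '2GB' - explicit size string
```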
test/Network Notary Test Cases.txt (29 lines changed)
@@ -62,3 +62,32 @@ Failing Gracefully:
- For Machines, and Event Types (on startup) does it log an error, disable database metrics, and continue?
- For Metrics does it ignore the metric, log an error, and continue?
- Are metrics throttled back if the server receives many requests in a short period of time? (e.g. 200 requests per second)
+
+
+In-memory caching with pycache:
+-------------------------------
+- If the cache is below the memory limit, are new keys continually added upon request?
+- If adding a new key would use too much memory, does the cache remove an entry and then store the key?
+ - Is the least recently used entry removed?
+ - If removing one entry doesn't clear enough RAM, does the cache remove multiple entries until it has enough space?
+ - Do both the hash and the heap size go down?
+- If a requested object is bigger than total RAM allowed, do we log a warning and not store it?
+- When an existing entry is retrieved from the cache, is its 'last requested' time updated?
+
+expiry:
+- Are expired entries removed during get() calls and None returned instead?
+- Are expired entries cleaned up as they are encountered when clearing new memory?
+- Are negative expiry times rejected and cache entries not created?
+
+pycache threads:
+- Do we only create a single cache and return the proper results regardless of how many threads the server uses?
+- If multiple threads attempt to set a value for the same key is only one of them allowed to set and the rest return immediately?
+- If multiple threads attempt to set a value for *different* keys, are they all allowed to do so?
+- Is only one thread at a time allowed to adjust the current memory usage?
+
+pycache arguments:
+- Are characters other than 0-9MGB rejected, throwing an error?
+- Does it stop you from specifying MB *and* GB?
+- Is it case insensitive?
+- Does the cache have to be at least 1MB?
+
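These argument cases map directly onto the new `Pycache` constructor in util/cache.py below; a quick sketch of the expected outcomes (each call considered independently):

```python
from util.cache import Pycache

Pycache("100")   # 100MB - megabytes are assumed when no unit is given
Pycache("2gb")   # 2GB   - units are case-insensitive
Pycache("50x")   # ValueError - only the characters 0-9, M, G, B are accepted
Pycache("1M1G")  # ValueError - specify only one of MB and GB
Pycache("0")     # ValueError - the cache must be at least 1MB
```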
util/cache.py (86 lines changed)
@@ -102,13 +102,11 @@ def __init__(self):
print >> sys.stderr, "ERROR: Could not connect to memcache server: '%s'. memcache is disabled." % (str(e))
self.pool = None
-
def __del__(self):
"""Clean up resources"""
if (self.pool != None):
self.pool.relinquish()
-
def get(self, key):
"""Retrieve the value for a given key, or None if no key exists."""
if (self.pool != None):
@@ -118,7 +116,6 @@ def get(self, key):
print >> sys.stderr, "Cache does not exist! Create it first"
return None
-
def set(self, key, data, expiry=CacheBase.CACHE_EXPIRY):
"""Save the value to a given key name."""
if (self.pool != None):
@@ -210,3 +207,86 @@ def set(self, key, data, expiry=CacheBase.CACHE_EXPIRY):
self.redis.expire(key, expiry)
else:
print >> sys.stderr, "ERROR: Redis cache does not exist! Create it first"
+
+
+class Pycache(CacheBase):
+ """
+ Cache data using RAM.
+ """
+
+ CACHE_SIZE = "50" # megabytes
+
+ @classmethod
+ def get_help(cls):
+ """Tell the user how they can use this type of cache."""
+ # TODO: quantify how many observation records this can store
+ return "Size can be specified in Megabytes (M/MB) or Gigabytes (G/GB). \
+ Megabytes is assumed if no unit is given. \
+ Default size: " + cls.CACHE_SIZE + "MB."
+
+ @classmethod
+ def get_metavar(cls):
+ """
+ Return the string that should be used for argparse's metavariable
+ (i.e. the string that explains how to specify a cache size on the command line)
+ """
+ return "CACHE_SIZE_INTEGER[M|MB|G|GB]"
+
+ def __init__(self, cache_size=CACHE_SIZE):
+ """Create a cache using RAM."""
+ self.cache = None
+
+ import re
+
+ # let the user specify sizes with the characters 'MB' or 'GB'
+ if (re.search("[^0-9MGBmgb]+", cache_size) != None):
+ raise ValueError("Invalid Pycache cache size '%s': use '%s'." %
+ (str(cache_size), self.get_metavar()))
+
+ if (re.search("[Mm]", cache_size) and re.search("[Gg]", cache_size)):
+ raise ValueError("Invalid Pycache cache size '%s': " % (str(cache_size)) +
+ "specify only one of MB and GB.")
+
+ multiplier = 1024 * 1024 # convert to bytes
+
+ if (re.search("[Gg]", cache_size)):
+ multiplier *= 1024
+
+ # remove non-numeric characters
+ cache_size = cache_size.translate(None, 'MGBmgb')
+ cache_size = int(cache_size)
+
+ if (cache_size < 1):
+ raise ValueError("Invalid Pycache cache size '%s': " % (str(cache_size)) +
+ "cache must be at least 1MB.")
+
+ cache_size *= multiplier
+
+ try:
+ from util import pycache
+ self.cache = pycache
+ pycache.set_cache_size(cache_size)
+ except ImportError, e:
+ print >> sys.stderr, "ERROR: Could not import module 'pycache': '%s'." % (e)
+ self.cache = None
+ except Exception, e:
+ print >> sys.stderr, "ERROR creating cache in memory: '%s'." % (e)
+ self.cache = None
+
+ def get(self, key):
+ """Retrieve the value for a given key, or None if no key exists."""
+ if (self.cache != None):
+ return self.cache.get(key)
+ else:
+ print >> sys.stderr, "pycache get() error: cache does not exist! create it before retrieving values."
+ return None
+
+ def set(self, key, data, expiry=CacheBase.CACHE_EXPIRY):
+ """Save the value to a given key name."""
+ if (self.cache != None):
+ try:
+ self.cache.set(key, data, expiry)
+ except Exception, e:
+ print >> sys.stderr, "pycache set() error: '%s'." % (e)
+ else:
+ print >> sys.stderr, "pycache set() error: cache does not exist! create it before setting values."
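The new class keeps the same get()/set() interface as the memcached, memcachier, and redis wrappers, so server code can stay backend-agnostic. A hypothetical cache-aside pattern (query_database is made up for illustration; this is not code from the PR):

```python
def get_observations(cache, service_id):
    """Return observation data, preferring the in-memory cache."""
    cached = cache.get(service_id)
    if cached is not None:
        return cached
    data = query_database(service_id)  # hypothetical database lookup
    cache.set(service_id, data)        # stored with the default CACHE_EXPIRY
    return data
```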
util/pycache.py (257 lines changed)
@@ -0,0 +1,257 @@
+# This file is part of the Perspectives Notary Server
+#
+# Copyright (C) 2011 Dan Wendlandt
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, version 3 of the License.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+"""
+Cache and retrieve data in key-value pairs using RAM only.
+
+When the cache reaches maximum size entries are discarded in
+'least recently used' order.
+
+This module does not preemptively reserve memory from the OS;
+additional memory is only acquired as needed.
+Make sure you have enough memory to use the cache you request!
+"""
+
+# Use a module so python can ensure there is only one cache regardless of threads.
+# Note this doesn't allow inheritance; if we need that we will need to refactor.
+
+import heapq
+import itertools
+import sys
+import threading
+import time
+
+# Note: the maximum cache size applies only to stored data;
+# the internal structures used for the implementation will cause pycache
+# to use slightly more memory.
+DEFAULT_CACHE_SIZE = 50 * 1024 * 1024 # bytes
+
+
+class CacheEntry(object):
+ """Store data for a given entry in the cache."""
+
+ def __init__(self, key, data, expiry):
+ """Create new cache entry."""
+
+ if (expiry < 1):
+ raise ValueError("CacheEntry expiry values must be positive")
+
+ now = int(time.time())
+
+ self.key = key
+ self.data = data
+ self.expiry = now + expiry
+ self.memory_used = sys.getsizeof(data)
+
+ # count the key as having been requested just now, so it is not immediately removed.
+ # this is usually correct, as the caller will likely have just retrieved or calculated
+ # the data before calling us to store it.
+ # this also prevents thrashing so new entries are not rapidly added and then removed from the heap.
+ self.last_requested = now
+
+ def update_request_time(self):
+ """Update the most recent request time for this cache entry."""
+ self.last_requested = int(time.time())
+
+ def has_expired(self):
+ """Returns true if this entry has expired; false otherwise."""
+ if (self.expiry < int(time.time())):
+ return True
+ return False
+
+
+class Heap(object):
+ """Store CacheEntries in a heap.
+ Entries are stored in 'least recently used' order
+ so we know what to remove when we run out of space."""
+
+ # This is a wrapper class to allow use of the heapq module in an Object-Oriented way,
+ # and to contain the logic for our priority queue.
+ # The heap does not store the cached data; it is only used to track the 'least recently used' order
+ # so cache entries can be removed when we need space.
+
+ # This heap uses lazy deletion - entries are not deleted immediately, as we don't want
+ # to spend time traversing and re-creating the heap each time.
+ # Instead entries are marked for deletion and removed when they are encountered via popping.
+
+ # Performance Note:
+ # We could add checks to recreate the heap list if old entries are taking up too much space,
+ # but with keys expiring it should be fine for now.
+ # We could also add a check to see if the counter has grown too large, but iterators use
+ # an infinite stream, so it shouldn't be necessary.
+
+ def __init__(self):
+ """Create a new heap."""
+ self.heap = []
+ self.current_entries = {}
+ self.counter = itertools.count()
+
+ def __len__(self):
+ """Return the number of items in the heap."""
+ return len(self.heap)
+
+ def __del__(self):
+ """Delete the heap."""
+ del self.heap
+ del self.current_entries
+
+ def push(self, cache_entry):
+ """Add an entry onto the heap."""
+ # use an iterator to break ties if multiple keys are added in the same second;
+ # this ensures tuple comparison works in python 3.
+ # credit for this idea goes to the python docs -
+ # http://docs.python.org/2/library/heapq.html
+ entry_id = next(self.counter)
+
+ heap_entry = [cache_entry.last_requested, entry_id, cache_entry.key]
+ self.current_entries[cache_entry.key] = entry_id
+ heapq.heappush(self.heap, heap_entry)
+
+ def update(self, cache_entry):
+ """Update the value of a heap entry."""
+ # this is a convenience function to make it easier to understand what's happening.
+ # entries are not actually updated in-place (that takes too long);
+ # instead a new entry is created and the current one marked for lazy deletion later
+ # (the entry is 'marked' for deletion by replacing the entry_id for that key in current_entries)
+ self.push(cache_entry)
+
+ def pop(self):
+ """Remove the least recently used heap entry."""
+ while self.heap:
+ last_requested, entry_id, key = heapq.heappop(self.heap)
+ if (key in self.current_entries and (self.current_entries[key] == entry_id)):
+ del self.current_entries[key]
+ return key
+ # otherwise the element we just popped is either expired or an old junk entry;
+ # discard it and continue.
+ raise IndexError("Heap has no entries to pop")
+
+ def remove(self, cache_entry):
+ """Remove the entry from the heap."""
+ # a convenience function: entries are not removed immediately but marked for lazy deletion.
+ if cache_entry.key in self.current_entries:
+ del self.current_entries[cache_entry.key]
+ # else: don't worry - some other thread might have removed the entry just before us.
+
+
+def __free_memory(mem_needed):
+ """Remove entries from the heap and cache until we have enough free memory."""
+ global current_mem
+ global max_mem
+
+ with mem_lock:
+ while heap and (current_mem + mem_needed > max_mem):
+ key = heap.pop()
+ if key in cache:
+ # naive implementation - we don't worry about discarding a non-expired item
+ # before all expired items are gone.
+ # we just want to clear *some* memory for the new item as fast as possible.
+ # if this really hurts performance we could refactor.
+ __delete_key(key)
+ else:
+ raise KeyError("The heap key '%s' does not exist in the cache and cannot be removed." % (key))
+
+
+def __delete_key(key):
+ """Remove this entry from the cache."""
+ global current_mem
+
+ with mem_lock:
+ current_mem -= cache[key].memory_used
+ del cache[key]
+
+
+def set_cache_size(size):
+ """Set the maximum amount of RAM to use, in bytes."""
+ size = int(size)
+ if size > 0:
+ with mem_lock:
+ global max_mem
+ max_mem = size
+
+
+def set(key, data, expiry):
+ """Save the value to a given key."""
+ global current_mem
+ global max_mem
+
+ with set_lock:
+ if key in set_threads:
+ # some other thread is already updating the value for this key.
+ # don't compete or waste time calculating a possibly duplicate value
+ return
+ else:
+ set_threads[key] = True
+
+ try:
+ entry = CacheEntry(key, data, expiry)
+
+ if (entry.memory_used > max_mem):
+ print >> sys.stderr, "ERROR: cannot store data for '%s' - it's larger than the max cache size (%s bytes)\n" \
+ % (key, max_mem)
+ return
+
+ with mem_lock:
+
+ # add/replace the entry in the hash;
+ # this tracks whether we have the key at all.
+ if entry.key in cache:
+ current_mem -= cache[key].memory_used # subtract the memory we gain back
+
+ if (current_mem + entry.memory_used > max_mem):
+ __free_memory(entry.memory_used)
+
+ heap.push(entry)
+ cache[key] = entry
+ current_mem += entry.memory_used
+
+ finally:
+ del set_threads[key]
+
+
+def get(key):
+ """Retrieve the value for a given key, or None if no key exists."""
+ if key not in cache:
+ return None
+
+ if (cache[key].has_expired()):
+ heap.remove(cache[key])
+ __delete_key(key)
+ return None
+
+ cache[key].update_request_time()
+ heap.update(cache[key])
+
+ return cache[key].data
+
+
+
+# Use a dictionary to efficiently store/retrieve data
+# and a heap to maintain a 'least recently used' order.
+cache = {}
+heap = Heap()
+
+current_mem = 0 # bytes
+max_mem = DEFAULT_CACHE_SIZE
+
+
+# we don't care if we get a slightly out of date value when retrieving,
+# but prevent multiple set() calls from writing data for the same key at the same time.
+set_threads = {}
+set_lock = threading.Lock()
+
+# prevent multiple threads from altering memory counts at the same time
+mem_lock = threading.RLock()
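To make the LRU eviction order concrete, a small illustrative session (sizes chosen only for the demo; note that set_cache_size() itself accepts any positive byte count, while the 1MB minimum is enforced by the Pycache wrapper):

```python
from util import pycache

pycache.set_cache_size(1024 * 1024)  # 1MB, small enough to force evictions

pycache.set('a', 'A' * 400000, 60)   # roughly 400KB each
pycache.set('b', 'B' * 400000, 60)
pycache.get('a')                     # touching 'a' makes 'b' least recently used

# storing 'c' would exceed the limit, so 'b' is evicted while 'a' survives
pycache.set('c', 'C' * 400000, 60)
assert pycache.get('b') is None
assert pycache.get('a') is not None
```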