Clone this wiki locally
- merged to master November 15, 2013
- code complete September 22, 2013
- development started
History / Context
leveldb contains two caches per database (Riak vnode): file cache and block cache. The user establishes the size of each cache via the Options structure passed to leveldb upon database open. The two cache sizes are then fixed while the database is open. With Riak, the two cache sizes are actually fixed across all databases until Riak is restarted.
Riak is a very dynamic user of leveldb. Riak can have a highly variable number of databases open. And Riak can run for long periods of time, using hundreds to thousands of leveldb .sst table files. Both of these dynamics lead to suboptimal sizing of the two caches since sizes must conform to the worst case not the most likely, or to the mathematically estimated scenarios not the actual runtime usage.
This branch automates the sizing / resizing of the file cache and block cache during operation while maximizing total memory usage across a dynamic number databases. The variable items considered:
- varying number of open databases (vnodes) as Riak performs handoffs
- prioritized memory demands of file versus block cache since cost of a miss in file cache is much greater than cost of block cache miss
- file cache entries can become stale (not actively accessed) and therefore reduce the block cache size available to active files
- Riak has two tiers of databases, user databases (vnodes) and internal databases (active anti-entropy management, AAE), that should get different allocations of memory
- the total memory available to virtual servers can change dynamically in some VM environments and it would be helpful if leveldb could adjust without total restart
mv-flexcache branch manages memory allocation to each database, and the split of the allocation between the database's file and block caches. Each database's file and block cache are independent of other databases'. This design eliminates the need for thread synchronization across different databases. And therefore maintains current performance characteristics. A single, unified file and/or block cache would have many spinlock or mutex hits under the Riak usage pattern, slowing all accesses.
This branch adds code to keep track of the number of databases open (vnodes) and whether those databases are for user data or internal data (active anti-entropy feature). Each Open() operation adds the new database to an stl::set and informs the master cache object, FlexCache, of the total memory allocation available to all databases. The FlexCache object calls all existing database objects to inform them that the per database allocation has changed. If internal databases exist, they split a 20% allocation of all memory. User database either split 80% if internal databases exist, or 100% if no internal databases exist (AAE disabled).
Each database object DBImpl contains a master cache object DoubleCache. DoubleCache manages both the file cache and the block cache. DoubleCache gives memory priority to the file cache. But all memory not used by the file cache is available to the block cache. The block cache is guaranteed a 2Mbyte minimum size.
The Open API can fail if the new database object would receive less than 10Mbytes of shared cache.
Two new unit tests support the branch's changes: cache2_test and flexcache_test.
The Options structure contains two new option values: is_internal_db (bool) and total_leveldb_mem (uint64_t). Two old values remain but are now deprecated / unused: max_open_files (int) and block_cache (Cache *). (Actually block_cache is still actively used in some unit tests, but will likely go away once unit tests are update.)
is_internal_db is a flag used by Riak to mark Riak's internally generated databases. Internal databases get only a 20% share of the total cache memory allocation.
total_leveldb_mem is the total memory to be used across all open databases and their file/block caches. The number is internally adjusted down to also account for memory used by write buffers and log files per database. The adjustment is highly necessary given the possible large swings in database counts, example 32 becoming 64, in Riak environments and the amount of memory used by the objects.
Google's leveldb implementation did not contain a mechanism for intercommunication between database objects and/or background activities. This branch adds a global DBList() interface that keeps a list of all active databases. It also contains simple methods that will iterate the database list and calling the same function in each database.
The current design creates two std::sets: one for user database and one for internal databases. The use of two std::sets seemed intuitively obvious during the design and early coding. The final usage patterns suggest one unified set would be fine. That potential clean up is left to the future.
DBList() tracks current DBImpl objects. Therefore the list is maintained in an obvious fashion. The DBImpl constructor calls DBList()->AddDB. The DBImpl destructor calls DBList()->ReleaseDB. There is no other maintenance.
DBList()->ScanDBs is a method to walk all user or internal databases and call a DBImpl method on each database. This is written out instead of using stl::foreach so as to release the DBList spinlock before each database method call. Failing to release the spinlock could result in a deadlock scenario.
DBList() is implicitly tested via the routines in cache2_test and flexcache_test.
There is a global FlexCache object that contains the most recent Options::total_leveldb_mem setting. Each database Open() calls the global gFlexCache.SetTotalMemory(), passing Options::total_leveldb_mem. The SetTotalMemory() call triggers a call to all existing databases and tells them to adjust their DoubleCache allocation (and maybe prune some cache data). FlexCache and DBList could have been combined into a single object. But having the database list as an independent object eases future reuse of the database information (with less confusion of purpose).
gFlexCache.SetTotalMemory() is also called when a database closes. The value passed is zero. Zero is a special case that will not change the stored memory allocation, but will still trigger a call to all remaining databases to let them know their allocation has just increased (since there is one less database). There is no guarantee that the closing database's Options::total_leveldb_mem value represents the most recent setting. That is why zero is used instead.
Google's original code created two unrelated caches, the file cache and the block cache. The underlying code was the same for both caches. That original code is within util/cache.cc.
DoubleCache is parent object that manages the file and block caches interactively. The core logic is still Google's code, but it is a bit hacked up. Therefore util/cache.cc was directly copied to a new file util/cache2.cc (and cache2.h) then hacks applied to cache2.cc. All objects within util/cache2.cc have a "2" suffix in hopes of clarifying that "this is different, but similar" code to the original.
The object hierarchy is:
- ShardedLRUCache2 (file cache)
- 16 x LRUCache2 (sixteen LRUCache2 "shards")
- ShardedLRUCache2 (block cache)
- 16 x LRUCache2
- ShardedLRUCache2 (file cache)
gFlexCache (global singleton)
The cache2.cc implementation requires that each object know its parent. Each LRUCache2 knows its parent ShardedLRUCache2. Each ShardedLRUCache2 knows the parent DoubleCache. DoubleCache accesses gFlexCache directly.
The key logic changes happen in LRUCache2::Insert(). The original Google code simply tests if this current insert exceeds the cache size limit for this cache shard. If so, the code deletes one or more older cache entries within this schard to make room. The new code checks the cache size limit of all shards, i.e. the parent ShardedLRUCache2. The new code deletes from all shards in a round-robin fashion when space is required.
The file cache has a secondary purge criteria. Every hour a background thread initiates purge of file cache entries that have not been accessed in the last 4 days. This prevents little used files from taking up memory that would be better used as block cache space for other files. LRUCache2::Insert() and LRUCache2::Lookup() functions additionally populate the expire_seconds field of each cache entry. This is the value used to determine a file cache entry is stale. expire_seconds is not used for the block cache.
Note: The file cache purge is currently hard coded within util/throttle.cc as 4 days. The reasoning behind 4 days is simple. Some Riak users have usage patterns that drop dramatically over a weekend. Further a weekend could be a 3 day holiday. Worse case scenario for shorter than 4 days would be for heavily used files to be flushed over the weekend and create performance problems on the next full business day as file cache rebuilds. 4 days is therefore a likely compromise to keep readily used files open over an extended weekend.