Several weeks were spent updating leveldb (primarily db/version_set.cc) to allow multiple levels to contain overlapping table files. "Overlapping" means that the key range within one file might overlap the key range within another table file at the same level. Google's original implementation only allowed level 0 table files to overlap. Table files in all other levels are "sorted": in a sorted level, each table file holds a unique, non-overlapping range of keys relative to all other table files of that level. The benefit of overlapped table files is that they pass through compaction much more quickly and greatly reduce the amount of write amplification in early data life (rewriting of the same data).
We realized late in the release cycle that converting level 2 from sorted to overlapped could take an hour or more per server at large installations. Converting only level 1 from sorted to overlapped takes 3 to 7 minutes per server. This was a hurry-up branch to make the switch AND retune the write throttle accordingly.
There are four distinct changes made as part of this branch. They are discussed separately:
Code changes are isolated to db/dbformat.h and db/version_set.cc. db/dbformat.h contains a constant kNumOverlapLevels. This needs to be adjusted from 3 to 2. Similarly, there is a large table of level attributes at the top of db/version_set.cc. This table needs adjustment too. The table adjustments both change the m_OverlappedFiles flag and retune many level parameters because of the overlap change.
Yes, it should be possible to get rid of the kNumOverlapLevels constant now that db/version_set.cc/.h contain an IsLevelOverlapped() accessor. But that cleanup is for another day.
DBImpl::Recover() in db/db_impl.cc contains a new block of code. This code verifies that each of the overlapped levels contains an acceptable count of files. If not, the code nudges the background compaction routines and waits. The code loops after each compaction completion to see if the file count has reached the acceptance level: exits if yes, loops and waits if no.
Is this pause behavior "safe" for Riak? Jon verified that the Bitcask backend already has a similar paused-open behavior. Combine that fact with the Riak 1.3 changes that make leveldb opens appear "asynchronous" to Erlang, and the paused open behavior is believed safe; it passes all known tests.
The move to 3 overlapped levels was to improve performance: higher write ingest rate, lower write amplification, and faster lookups of recent data. The improvement came from both the increased number of overlapped levels and changes to the supporting write throttle parameters. The recent shift from 3 overlapped levels to 2 overlapped levels also required write throttle adjustments.
Three routines required tuning. Incremental tests suggest the latter two had more impact on stabilizing the new overlapped configuration:
VersionSet::Finalize(): This routine picks the next, most needed compaction for a database ("score"). It also calculates the total amount of compaction work needed to fully normalize a database ("penalty"). The penalty scores are still a black art. There are certain workloads, such as the 2i_slf_stressmix test, where the computationally justified penalties will not stabilize the ingest rate before Google's logic decides to completely stall the database. There is therefore special penalty scoring that occurs whenever overlapped levels exceed kL0_SlowdownWritesTrigger (8 files on the level). The special logic just piles on the penalties higher and higher. The "*8" parameter was derived from "what works".
The special penalty scoring got some tuning based upon the new testing of 2 overlap levels, instead of 3.
DBImpl::DoCompactionWork(): "perf top" listed pthread_rwlock_rdlock() as a heavily hit routine. This makes sense, since every compaction thread must use it once or twice per key within PrioritizedWork(), which is called from DoCompactionWork(). DoCompactionWork() was adjusted to call PrioritizedWork() only once every 100 keys instead of once per key. This appears to have greatly improved compaction speed. Maybe 1,000 would be an even better value ... but that really needs a full test cycle to validate. 100 is a safe, quick hack for late in the cycle.
DBImpl::Write(): The DBImpl object variable "mutex_" is used heavily throughout the code. A recent change had added a lock/release of mutex_ around a previously unprotected call to WriteThrottleUsec(). This branch moves the entire write throttle logic from the beginning of DBImpl::Write() to the end. That eliminates the extra lock / unlock of mutex_. It also makes it less likely for write order to be impacted by the throttle stalling one item but not the next.
A few variables and "if" statements were adjusted to clear compiler warnings about comparisons between "int" and "size_t" variables. The long-term goal is to shift all sizes to size_t, since that is the STL norm.