- merged to master - July 5, 2016
- code complete - May 3, 2016
- development started - February 8, 2016
History / Context
Customers and Basho product marketing have requested automated, background expiry of leveldb data. These feature requests divide into two distinct categories: defining object expiry and managing the expired data. The two feature categories are independent, and different customers have requested a different mix of the two.
This branch contains only the most basic expiry feature infrastructure. Future branches will contain the more Riak centric features based upon the requests. Not all of the future branches will be open sourced. The non-open source branches, or portions of branches, will exist in the private basho/leveldb_ee repository ("ee" for enterprise edition).
The discussion here presents the basic expiry code that is within the basho/leveldb open source repository.
Global expiry, a.k.a. Time To Live (TTL) expiry: leveldb tracks the last write/update of a key. This last write date is then compared to the new "expiry_minutes" option setting. If the key's last write time is older than expiry_minutes, leveldb will treat the key as if deleted. leveldb's delete treatment is immediate for user requests for the key. leveldb's physical removal of the key and its value from the .sst table files will occur some time later as part of leveldb's normal compaction processing.
Explicit expiry: leveldb tracks an explicit expiry timestamp provide by the user. leveldb treats the key as deleted once the timestamp is current. Again, leveldb physically removes the key and its value some time later during normal compaction processing.
Whole file expiry: The setting "whole_file_expiry" enables leveldb to quickly erase entire .sst table files instead of executing a key by key compaction processing when all keys in the .sst table file expire. Note: All keys within the .sst table file must be expiry type keys. Any single non-expiry key disables whole file expiry for that .sst table file.
Expiry is a modular component to leveldb. This is to allow Basho to provide different feature sets between its open source and enterprise edition products. The leveldb::Options object contains pointer to the default leveldb::Expiry object. The options controlling expiry are within the leveldb::Options->leveldb::Expiry object. The options are:
true: enable expiry features,
false: disable expiry (expired keys that leveldb has not physically removed will reappear)
zero: write time not added to keys written, existing write time keys do not expire,
number: minutes after a key is last written that it expires (logically deletes)
false: expired keys only removed via normal leveldb compaction
true: entire .sst table files deleted without key by key scan
Also, one option within leveldb::Options overrules all expiry settings:
true: all expiry options disabled
false: expiry options function as described
is_internal_db is a Riak specific option. Riak sets it true to mark leveldb databases that contain Riak's internal data. Riak's internal databases must be independent of end user expiry settings.
Basho's expiry feature is a subtle enhancement to Google's original delete mechanism. Therefore it is helpful to first detail how Google's original delete mechanism functions.
The user passes leveldb a key and value for every object written into leveldb. leveldb wraps the user's key with internal metadata, creating an "internal key". leveldb formats the internal key as follows:
- user's key
- leveldb key sequence number
- leveldb key type byte
leveldb combines the sequence number and type byte into one 8 byte value directly following the key. This allows leveldb to only have to track the size of the entire internal key. The size of the user's key is implied by simple math: internal key size minus 8 bytes.
Google's implementation only defines two types for the type byte: Value and Delete. An object that has a valid key and value is a Value type. An object that the user deleted, has only the user's key and is a Delete type. "Tombstone" is a common term used to describe a deleted object.
leveldb writes every update to a given user's key into the database (a Riak vnode), whether the update be a Value or a Delete. leveldb considers only the the most recent update to key when returning the key's value to the user via a Get operation or iterator operation. leveldb's compaction cycles gradually remove older updates (versions) of the same key. If the most recent update is a "tombstone", the tombstone stays in the database until it gradually reaches the highest level within leveldb's .sst table files. Only then may leveldb remove the tombstone without concern for an older update becoming available to Get and iterate operations.
Basho's expiry code makes the Value versus Delete distinction dynamic. The type byte has two additional definitions: WriteTime and ExplicitExpiry. Get and iterator operations now look at the most recent update to a key and make the following decisions dynamically:
|type byte||Get operation
|Value||object's value||object's value|
|Delete||Not Found error||key skipped|
|WriteTime||now < (write date + expiry_minutes)
else Not Found error
|now < (write date + expiry_minutes)
else key skipped
|ExplicitExpiry||now < expiry date
else Not Found error
|now < expiry date
else key skipped
Basho's expiry code therefore needs a timestamp as part of the key. The expiry date value is only present when the type byte is WriteTime or ExplicitExpiry. The resulting leveldb internal key is therefore:
- user's key
- leveldb key sequence number
- maybe an 8 byte timestamp
- leveldb key type byte
The timestamp is the time the key/value pair was written for WriteTime expiry. The timestamp is the exact time to expire the key/value pair for ExplicitExpiry.
Open Source versus Enterprise Edition build maintenance
The leveldb's build_detect_platform script now contains logic to detect if the Enterprise Edition directory leveldb_ee exists, i.e. leveldb/leveldb_ee. leveldb_ee is a private git submodule that only exists in the build tree when creating an Enterprise Edition build. The source files of leveldb_ee become part of the build_detect_platform's output file (build_config.mk) when the subdirectory/submodule exists. build_detect_platform substitutes the leveldb_os directory in place of leveldb_ee when building the open source version. Only one directory's files, leveldb_ee or leveldb_os, are used in any build.
The original development of leveldb expiry had code in both leveldb_ee and leveldb_os. Hence the above paragraph was important. By the time the code merged to the develop branch, the decision was to have basic expiry (as presented here) be an open source feature. The split between leveldb_ee and stubs will likely return for Riak's bucket based expiry.
Minor changes to keep up with expiry extension. BuildTable()'s KeyRetirement object now needs a third parameter in its initializer. The Options object contains the Expiry object that KeyRetirement's routines may or may not reference. BuildTable() now sets the new exp_write_low, exp_write_high, and exp_explicit_high data within FileMetaData object for newly built table.
db/c.cc and db/c_test.c
The C language interface, db/c.cc, is updated to support the optional KeyMetaData parameter for Put, Get, and iterate operations. Likewise the test program for the C interface db/c_test.c now supports and verifies the C usage of KeyMetaData.
db/db_impl.cc and db/db_impl.h
The CompactionState object has new member objects for tracking expiry metavalues (exp_write_low, exp_write_high, and exp_explicit_high). CompactionState now has a constructor to ensure the expiry values are in a useful state since there is no guarantee any particular compaction will set some or all of the values.
SanitizeOptions() provides a defense against expiry getting accidentally applied to Riak's metadata databases. The metadata databases have the Option flag "is_internal_db" set. SanitizeOptions() removes any expiry setting when that flag is seen set.
There was a dead parameter bg_compaction_scheduled_ discovered during the expiry work. It is now removed. Its removal has no material value to the expiry changes. Just house cleaning.
In DBImpl::RecoverLogFile(), the call to WriteBatchInternal::InsertInto() has an Options object as a new third parameter. This gives InsertInto() the opportunity to perform expiry related work as needed.
WriteLevel0Table() uses the new VersionEdit function AddFile2() instead of the original AddFile(). AddFile2() supports the new expiry metadata values.
BackgroundCall2() now supports routing the worker thread to either normal compaction or full file delete expiry.
(Other changes following BackgroundCall2() are of the same nature as previously discussed so not repeated here.)
DBImpl::Get() and DBImpl::Put() operation now support a new, optional KeyMetaData structure. This structure allows user routines to pass non-default expiry information (such as explicit expiry dates or write time values from previous vnodes during migrations) to and from leveldb.
class DBImpl previously used "private:" to prevent access to its internal routines. This practice lead to the creation of many TEST_ routines. The internal routines now have "protected:" status. This allows a derived class to access the internal routines for the purposes of unit tests.
db/db_iter.cc and db/db_iter.h
The database iterator object needs access to any expiry module to properly filter expired objects during an iteration. Therefore NewDbIterator() has a new parameter for the expiry object.
When DBIter has an expiry object, it will call expiry's KeyRetirementCallback() to dynamically delete any potentially expired objects.
Various API calls of db_test were updated in support of expiry changes. However there are no new tests related to expiry.
The KeyRetirement object is now expiry aware. If an expiry module exists, it is called as part of the key retirement logic. KeyRetirement is part of building level 0 files and all compaction operations. The added call to expiry's KeyRetirementCallback() enables the dynamic deletion of expiry keys when appropriate.
KeyRetirement's new dropped and expired counters are for LOG and performance counter uses.
This file is essential to the expiry changes. Its changes create an intended seamless integration of traditional Value and Delete keys with the new WriteTime and ExplicitExpiry keys. The ValueType enum is therefore now exposed for API usage via the leveldb/options.h header file, i.e. moved from this internal header file to the public header file.
IsExpiryKey() and ExtractExpiry() are two new accessor routines for identifying expiry enabled keys and extracting their expiry value.
The LookupKey object is used by the Get() API call. It passes the desired key for retrieval to the various functions that search the write buffers and .sst table files. It now will also pass an optional KeyMetaData that the search functions can populate with a key's expiry metadata. The LookupKey object is traditionally passed to the search routines as a "const" object. Therefore the metadata object must be marked mutable to allow it to be set while the rest of the object is protected.
db/memtable.cc and db/memtable.h
The memtable files govern the write buffers that initially store incoming objects from Put and Write operations. This code contains updates to support populating KeyMetaData information during Get and iterate operations. It also supports dynamic deletion of expiry enabled objects via the MemTableCallback() API from the expiry module.
Repair contains updates to various calls previously discussed. Its key new function is to populate the expiry metadata values of the FileMetaData object (used by leveldb's internal and saved manifest). The .sst table file's counters section holds the needed expiry metadata values.
db/version_edit.cc and db/version_edit_test.cc
The VersionEdit routines were modified previously to support the encoding and decoding of expiry metadata values between the file manifest and the runtime manifest. The previous work created the possibility of rolling back an expiry enabled leveldb release to the previous release without data loss. The changes in this branch simply switch the default actions from non-expiry enabled to expiry enabled. The old AddFile() routine remains in the code as a reference for the next few releases.
db/version_set.cc and db/version_set.h
The .sst table file search routines use the Saver structure to return the value of a key found during Get operations. It now has the additional Option and LookupKey fields to support expiry features. The Option structure provides access to the optional expiry module. The LookupKey object carries the KeyMetaData object that can optionally return expiry metadata values as part of the Get response.
The SaveValue() function performs the actual comparison of a potential key to the desired key. If the keys match, SaveValue() then evaluates the key's type. The code includes a call to the expiry module's KeyRetirementCallback() routine to dynamically evaluate the delete status of an expiry enabled key.
VersionSet::NeighborCompactionQuiet() previously assumed it would never be called against the highest level (config::kNumLevels - 1). That is no longer true due to a change in VersionSet::Finalize(). This change handles the new exception.
VersionSet::Finalize() previously analyzed levels 0 through config::kNumLevels -2 (0 to 5). It was looking for source levels that would compact to the next level, level+1. The highest level, level 6, was never a source. With whole file expiry, the highest level could be the source of a compaction (actually a file delete, not a compaction). The main loop is adjusted according. There are other key changes in Finalize().
Finalize now has sets a "no_move" parameter if it selects an aggressive delete compaction. This fixes a semi-unrelated bug. Aggressive delete compactions need to actually run a full key-by-key compaction to potential remove delete keys and free disk space. The recent enabling of whole file move compactions bypassed the key-by-key work making aggressive delete useless. This new flag makes sure an aggressive delete compaction is processed key-by-key. There is a side benefit to expiry. A key-by-key compaction counts expired keys that must remain in the file as tombstones as "deletes". Therefore setting up a potential aggressive delete compaction to move the keys one level closer to full removal.
Finalize will also call the expiry module when no regular or aggressive delete compactions exist for a level. The expiry module may spot whole files that leveldb can delete entirely without a key-by-key compaction scan.
VersionSet::PickCompaction() now supports two types of compaction actions: key-by-key and whole file delete. The existing code path is mildly adjusted as needed.
VersionSet::IsTrivialMove() contains the important fix to disable move operations when the intended compaction is due to aggressive delete specifically needing a key-by-key compaction.
db/write_batch.cc, db/write_batch_internal.h, db/write_batch_test.cc, and include/leveldb/write_batch.h
A WriteBatch is an alternative to a single object Put operation. A WriteBatch allows the user to group one or more Write and Delete operations into a single operation, similar to a relational database's transaction commit. The WriteBatch class contains some expiry looking functions from a previous release. Those functions were prototype in nature and never live. The new Put interface utilizes the KeyMetaData as an optional mechanism for supplying explicit expiry metadata values.
WriteBatch's internal class WriteBatch::Handler requires a value type (Value, WriteTime, ExplicitExpiry) and expiry date value. This supports moving expiry enabled keys from the batch into the write buffer (see memtable.cc). WriteBatch::Put() contains updates to build memtable keys that contain expiry metadata.
expiry_ee.cc (currently in private leveldb_ee repository)
The ExpiryModuleEE class implements the six virtual functions declared in the open source include/leveldb/expiry.h file. The callbacks exist to support dynamic delete decisions of single keys and entire files.
MemTableInserterCallback() will modify a plain Write operation into an expiry enabled WriteTime key when the proper expiry options are set. This allows the user to establish WriteTime expiry completely via settings. No API calls need to be upgraded.
KeyRetirementCallback() is the backbone of the dynamic delete logic. It evaluates the current options, key type, and time of day to decide if a key is valid or deleted.
TableBuilderCallback() watches the stream of keys during a compaction to properly create expiry metadata that becomes part of the .sst table file's counter section. The eSstCountDeleteKey counter is the only one of these used at runtime. The other three exist solely to allow repair to quickly rebuild the manifest entries for this file.
MemTableCallback() is a wrapper to KeyRetirementCallback(). It simply reformats the supplied key.
CompactionFinalizeCallback() is concerned with whole file deletes. It is called from two places. When called from VersionSet::Finalize(), it responds with whether or not there is at least one file on the given level that is eligible for full file delete. When called from DBImpl::BackgroundExpiry, it responds with every file on a level that is eligible for full file delete.
DBImpl::BackgroundExpiry() this routine takes a list of files from CompactionFinalizeCallback() then manages of applying the list to manifest. It also requests a delete of those files if they are not in use.