Skip to content

Mv aggressive delete

Matthew Von-Maszewski edited this page Apr 6, 2014 · 13 revisions

Status

  • merged to master November 25, 2013
  • code complete November 16, 2013
  • development started November 14, 2013

History / Context

The delete of key/value content has two shortcomings in Riak's leveldb. The first shortcoming is the number of database operations required to delete one key. The delete action requires one database read operation, one database write operation, and finally a second database read. The most expensive operation is typically the first read which may have to open one or more non-cached leveldb files. A typical purge of old data results in a heavy load of database operations. The second shortcoming is that the delete operation typically results in the writing of a "tombstone" record without actually releasing any disk space. The actual release occurs significantly later (days, weeks, or even months later) when the tombstone record merges with the actual data in a background compaction. So a purge of data can be disk intensive, impacting database latencies, and seldom results in a quick reduction in disk space usage.

This branch addresses only the second shortcoming. This branch monitors the number of pending "tombstone" records and forces compactions that result in release of total disk usage when the number of pending "tombstone" records exceeds a user defined threshold.

Update April 6, 2014

The original implementation described here was able to reduce leveldb to almost zero disk storage if the user deleted all keys. Great! Then there was larger scale testing. The testing used more diverse user load scenarios. The larger scale testing demonstrated that the original logic could excessively impact performance in work loads that had sporadic, but necessary, delete operations.

The most recent code available on the "develop" branch, and soon the "master" branch as Riak 2.0 ships, has a subtle adjustment. Aggressive delete does NOT apply to levels 0 and 1 of a leveldb database (vnode). Google's original logic still applies on those two levels. Aggressive delete kicks in once keys have moved to level 2 via normal compaction cycles.

The adjustment exempts as much as 4.2 gigabytes of storage per database (Riak vnode) from aggressive delete's logic.

Branch description

The logic used here is simple: keep a count of tombstones in an .sst file. Check this count against the user selected threshold (default is 1,000) whenever a database (vnode) scans for compaction candidates. Offer the .sst file as a "grooming" compaction target when its tombstone count exceeds the threshold. The "grooming" will cascade through leveldb's levels as long as the tombstone is not resolved against the actual data and the .sst table file that contains it still has more tombstones than the threshold.

Notes:

  • "Grooming" compactions only occur whenever there is a compaction thread immediately available. The "grooming" compaction is ignored if all compaction threads are already occupied. Only "non-grooming" compactions go onto a backlog queue. This is to prevent non-essential work, such as background delete resolution, from overshadowing realtime operations.

  • The tombstone counter does not exist in .sst files created prior to Riak 2.0. Anytime one of these older files is part of a Riak 2.0 compaction, the resulting files will contain the appropriate tombstone counts.

  • Setting Options::delete_threshold to zero disables this feature. The default is 1,000.

table/table_builder.cc

The TableBuiler::Add() function now watches for kTypeDeletion keys, a.k.a. tombstones. The function increments the file level counter whenever it detects one. The delete counter is one of several Basho specific counters appended to each .sst table file's metadata.

db/table_cache.cc

Added a Basho specific function TableCache::GetStatisticValue(). This function returns the requested .sst table file statistic only if the requested table file is already in the table cache. Otherwise it returns zero. This code intentionally does not retrieve the .sst table from disk since that could cause heavy disk thrashing on systems where the file cache is smaller than the space required to cache all table files.

db/version_set.cc

Version::Finalize() now includes an extra step to look for any .sst table file on the target that has more tombstones than Options::delete_threshold. This extra step only occurs if there is not a compaction already selected based upon disk space tests. Finalize() always marks a delete compaction as "grooming", i.e. optional.

Unrelated to delete: Fixed a bug in Compaction::ShouldStopBefore() that was introduced in Riak 1.4. The 1.4 change was to limit the total number of keys in an .sst table file to 75,000. The change had a side effect of disabling the previous GrandParent overlap logic. This fix corrects the code to flow to properly support both "stop" decisions: overlap and total key count.

include/leveldb/options.h & util/options.cc

Added delete_threshold to Options structure. Initialized to 1,000 as the default.

Clone this wiki locally