mv initial level fix

Matthew Von-Maszewski edited this page Aug 9, 2015 · 5 revisions

Status

  • merged to develop - August 9, 2015
  • code complete - June 17, 2015
  • development started - June 17, 2015

History / Context

Google's Version::PickLevelForMemTableOutput() routine was not beneficial to Riak's heavily random write loads. In fact, it got in the way of some early development ideas and performance research. Therefore it was disabled in Basho's version of leveldb long ago. Now Basho is examining sequential key loading in both traditional Riak handoff and new features. The routine does provide modest benefit if made aware of Basho's other leveldb additions such as multiple overlap levels and tiered storage. This branch reactivates the routine.

Basho's previous code always wrote a newly filled memory table to level-0. This branch examines the possibility of writing that file to level-2 or level-3 based upon various criteria.

Branch Description

The decision about which level to use, other than the default level-0, is split across two routines: Version::PickLevelForMemTableOutput() and DBImpl::WriteLevel0Table(). Together they evaluate four questions:

  • does the new file overlap key ranges of existing files at level-0, level-1, level-2 or level-3?
  • would its addition to level-2 or level-3 violate the m_MaxGrandParentOverlapBytes rule?
  • are compactions running against any key ranges in level-1, level-2, or level-3?
  • is the destination level, level-2 or level-3, within the "slow tier"?

The combined logic selects the highest of level-2 and level-3 for which the answer to all four questions above is "no". If neither level qualifies, it defaults to adding the file to level-0.
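The combined rule can be sketched as follows. This is a minimal illustration, not Basho's code: `LevelChecks`, `AllNo`, and `ChooseInitialLevel` are hypothetical names, the sketch assumes the four answers have already been computed separately for each candidate level, and it reads "highest" as the numerically larger level.

```cpp
// Hypothetical summary of the four answers for one candidate level.
// A value of true means the answer to that question is "yes".
struct LevelChecks {
    bool overlaps_existing_files;    // question 1: key-range overlap
    bool violates_grandparent_rule;  // question 2: m_MaxGrandParentOverlapBytes
    bool compactions_running;        // question 3: active compactions nearby
    bool in_slow_tier;               // question 4: destination is in the slow tier
};

// A level qualifies only when every answer is "no".
inline bool AllNo(const LevelChecks& c) {
    return !c.overlaps_existing_files && !c.violates_grandparent_rule &&
           !c.compactions_running && !c.in_slow_tier;
}

// Returns 3 or 2 when that level qualifies (preferring 3), otherwise 0.
inline int ChooseInitialLevel(const LevelChecks& level2,
                              const LevelChecks& level3) {
    if (AllNo(level3)) return 3;
    if (AllNo(level2)) return 2;
    return 0;  // default: keep the file in level-0
}
```

For example, if level-3 sits in the slow tier but level-2 passes every check, this sketch returns 2; if both fail, it returns 0.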

db/version_set.cc / db/version_set.h

Version::PickLevelForMemTableOutput() is rewritten to consider two of the four questions when picking the initial level for placement of a memory table:

  • does the new file overlap key ranges of existing files at level-0, level-1, level-2 or level-3?
  • would its addition to level-2 or level-3 violate the m_MaxGrandParentOverlapBytes rule?

The logic selects the highest of level-2 and level-3 for which the answer to both questions above is "no". If neither level qualifies, it defaults to adding the file to level-0.
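The two checks can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: `FileRange`, `RangesOverlap`, `OverlapsAny`, `OverlapBytes`, and `ViolatesGrandparentRule` are hypothetical names, keys are compared as plain `std::string` values rather than with leveldb's key comparator, and the limit is passed in as a parameter standing in for m_MaxGrandParentOverlapBytes.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical description of one table file: its key range and size.
struct FileRange {
    std::string smallest;      // smallest key in the file
    std::string largest;       // largest key in the file
    std::uint64_t size_bytes;  // file size on disk
};

// Two closed key ranges overlap unless one ends before the other begins.
inline bool RangesOverlap(const FileRange& a, const FileRange& b) {
    return !(a.largest < b.smallest || b.largest < a.smallest);
}

// Question 1: does the new file overlap any file already in this level?
inline bool OverlapsAny(const FileRange& new_file,
                        const std::vector<FileRange>& level_files) {
    for (const FileRange& f : level_files) {
        if (RangesOverlap(new_file, f)) return true;
    }
    return false;
}

// Total size of the files in a level whose key ranges overlap the new file.
inline std::uint64_t OverlapBytes(const FileRange& new_file,
                                  const std::vector<FileRange>& level_files) {
    std::uint64_t total = 0;
    for (const FileRange& f : level_files) {
        if (RangesOverlap(new_file, f)) total += f.size_bytes;
    }
    return total;
}

// Question 2: would placing the file here overlap too many bytes in the
// level below the destination (the "grandparent" files)?
inline bool ViolatesGrandparentRule(const FileRange& new_file,
                                    const std::vector<FileRange>& grandparents,
                                    std::uint64_t max_overlap_bytes) {
    return OverlapBytes(new_file, grandparents) > max_overlap_bytes;
}
```

A sequential load tends to produce files whose key ranges do not overlap anything already in the deeper levels, which is the case where both checks answer "no".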

VersionSet::NeighborCompactionsQuiet() is a new function. It factors an existing line of logic out of VersionSet::Finalize() so that the same rule is available to both VersionSet::Finalize() and DBImpl::WriteLevel0Table() (in db/db_impl.cc).

db/db_impl.cc

DBImpl::WriteLevel0Table() is the only user of the Version::PickLevelForMemTableOutput() routine. It answers the remaining two of four questions:

  • are compactions running against any key ranges in level-1, level-2, or level-3?
  • is the destination level, level-2 or level-3, within the "slow tier"?

Threading note: The routine holds the vnode's (leveldb database's) mutex. Code is only supposed to update the manifest and/or pick compactions while holding that mutex, so it is safe for this routine to examine the active compactions and "slide one in" to the destination level. This routine is NOT called on multiple threads simultaneously.

The code always writes the initial .sst file to level-0. Only after the file is on disk does the logic potentially move it to a different level, which is a different subdirectory, via Rename.
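The write-then-rename sequence can be sketched as below. This is an assumption-laden illustration rather than the branch's code: `LevelPath` and `WriteThenPlace` are hypothetical helpers, the `level_<N>` directory naming is invented for the example, and C++17 `std::filesystem::rename` stands in for whatever rename call the branch actually uses.

```cpp
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Hypothetical layout: <db_dir>/level_<N>/<file_name>.
inline fs::path LevelPath(const fs::path& db_dir, int level,
                          const std::string& file_name) {
    return db_dir / ("level_" + std::to_string(level)) / file_name;
}

// Writes the file into the level-0 directory first, then moves it to
// target_level with a rename if a deeper level was chosen.
// Returns the final location of the file.
inline fs::path WriteThenPlace(const fs::path& db_dir,
                               const std::string& file_name,
                               const std::string& contents,
                               int target_level) {
    const fs::path initial = LevelPath(db_dir, 0, file_name);
    fs::create_directories(initial.parent_path());
    {
        // Scope the stream so the file is flushed and closed before any move.
        std::ofstream out(initial, std::ios::binary);
        out << contents;
    }
    if (target_level == 0) return initial;  // default placement, no move

    const fs::path final_path = LevelPath(db_dir, target_level, file_name);
    fs::create_directories(final_path.parent_path());
    fs::rename(initial, final_path);  // same data, different subdirectory
    return final_path;
}
```

Because the data is fully written before the rename, the move only changes which subdirectory the file lives in; the file contents are not rewritten.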

all other .cc files

All other files relate to unit tests that required updating due to the changed placement of initial .sst table files.