
mv sequential tuning


Status

  • merged to master - June 16, 2015
  • code complete - December 24, 2014
  • development started - December 19, 2014

History / Context

Basho has invested considerable time in improving leveldb for large write volumes that hit many databases (Riak vnodes) simultaneously. Two recent customer issues pointed out that time also needed to be invested in the case where a single database (vnode) receives a large write volume. This situation commonly occurs during a Riak handoff operation.

Branch Description

The branch contains four independent changes, all related to the customer issues:

  • Create alternate throttle value for single vnode usage
  • Restore the IsTrivialMove() function so that some compactions are simply renames
  • Correct Options::mmap_size usage so that write_buffer is not forced to 20Mbytes all the time
  • Implement concept demonstrated in mv-compress-bypass branch in current code to eliminate compression attempts known to fail

db/version_set.h

The WriteThrottleUsec() function changed. Previously it only used GetThrottleWriteRate(), published by the write throttle, to determine the amount of wait time to apply to each Write() call. GetThrottleWriteRate() returns "1" whenever it believes there is sufficient write capacity across all open databases (vnodes). That rate is not useful if a single database is getting heavy traffic. The routine now looks to GetUnadjustedThrottleWriteRate() whenever a throttle is needed and GetThrottleWriteRate() returns "1".
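
A minimal sketch of the revised decision, assuming only the two throttle routines named above; the surrounding function shape is illustrative, not the branch's exact code.

```cpp
#include <cstdint>

// Declarations standing in for the throttle published by
// util/hot_threads.cc; real definitions live in the Basho sources.
uint64_t GetThrottleWriteRate();            // global, compaction-adjusted rate
uint64_t GetUnadjustedThrottleWriteRate();  // raw per-key write cost

// When a throttle is needed but the global rate reports "1" (spare
// capacity across all vnodes), fall back to the unadjusted rate so a
// single busy vnode is still throttled.
uint64_t WriteThrottleUsecSketch(bool throttle_needed)
{
    uint64_t usec = GetThrottleWriteRate();

    if (throttle_needed && 1 == usec)
        usec = GetUnadjustedThrottleWriteRate();

    return usec;
}
```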

util/hot_threads.cc & .h

These two source files changed in support of the added GetUnadjustedThrottleWriteRate(). The existing routine, GetThrottleWriteRate(), returns a value based upon both how long it takes to write a key and how much compaction work is backlogged across all open databases (vnodes). The latter part of the value is an adjustment to allow for compaction capacity that is not being used. The adjustment is not appropriate from the perspective of a single database (vnode) that is heavily impacted while others are not. The new function returns the value from only the first half of the calculation, how long it takes to write a key.
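
Roughly, the relationship between the two routines looks like the following; the member names and arithmetic are placeholders, since the real state and math live in util/hot_threads.cc.

```cpp
#include <cstdint>

struct ThrottleStateSketch
{
    uint64_t per_key_write_usec;   // measured cost of writing one key
    uint64_t backlog_adjustment;   // placeholder for the compaction
                                   // backlog term (real math differs)

    // existing routine: per-key cost combined with the global backlog term
    uint64_t GetThrottleWriteRate() const
    {
        return per_key_write_usec + backlog_adjustment;
    }

    // new routine: the first half of the calculation only
    uint64_t GetUnadjustedThrottleWriteRate() const
    {
        return per_key_write_usec;
    }
};
```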

db/version_set.cc

Compaction::IsTrivialMove() is again active. This function controls whether a compaction may be executed as a simple file rename instead of a complete reread and rewrite of the .sst table file. Some imagined error cases involving moving levels into directories and tiered storage had raised doubts about using this function. Those doubts are addressed in this branch, so the function is active again.
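
For reference, the trivial-move test in stock Google leveldb has roughly the following shape; the types below are simplified stand-ins, and Basho's version additionally has to account for per-level directories and tiered storage paths.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct FileMetaSketch { uint64_t file_size; };  // stand-in for FileMetaData

uint64_t TotalFileSize(const std::vector<FileMetaSketch>& files)
{
    uint64_t sum = 0;
    for (const FileMetaSketch& f : files)
        sum += f.file_size;
    return sum;
}

// A compaction qualifies as a "trivial move" (rename) when it has one
// input file, nothing to merge with at the next level, and limited
// overlap with the grandparent level (so the move cannot set up an
// expensive future merge).
bool IsTrivialMoveSketch(std::size_t level_inputs,
                         std::size_t next_level_inputs,
                         const std::vector<FileMetaSketch>& grandparents,
                         uint64_t max_grandparent_overlap_bytes)
{
    return 1 == level_inputs
        && 0 == next_level_inputs
        && TotalFileSize(grandparents) <= max_grandparent_overlap_bytes;
}
```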

Related to IsTrivialMove()'s activation is a simplification made to Version::GetOverlappingInputs(). Levels 0 and 1 allow the key ranges of their respective .sst table files to overlap. GetOverlappingInputs() now forces all .sst table files within an overlap level to participate in the next compaction, without regard to how their keys do or do not overlap. This is especially important for performance when sequential data never creates overlapping key space within the overlap levels.
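
A sketch of the simplified selection; the names and structure are illustrative, not the branch's exact code.

```cpp
#include <vector>

// Minimal stand-in for the file metadata; key ranges are elided because
// the simplified path ignores them.
struct FileMetaSketch {};

// For the overlapped levels (0 and 1) every file joins the compaction,
// regardless of key range; sorted levels keep the usual range-based
// selection.
void GetOverlappingInputsSketch(int level,
                                const std::vector<FileMetaSketch*>& level_files,
                                std::vector<FileMetaSketch*>* inputs)
{
    inputs->clear();

    if (level <= 1)
    {
        *inputs = level_files;  // overlap levels: take everything
        return;
    }

    // ... sorted levels: select only the files whose key ranges
    //     intersect the compaction range (unchanged by this branch)
}
```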

Compaction::CalcInputStats() calculates a new flag relating to compression. The routine examines two .sst table file statistics, the number of blocks in a file and the number of blocks that failed compression, for each input file in an upcoming compaction. The routine is biased to say the resulting compaction files will compress. Only when every input has zero compressed blocks will the calculation mark the upcoming compaction as not compressible.
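
A sketch of that bias, with hypothetical counter names standing in for the two .sst statistics:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-input counters; the real values come from the .sst
// table file statistics read by CalcInputStats().
struct InputFileStatsSketch
{
    uint64_t num_blocks;         // total blocks in the file
    uint64_t num_blocks_failed;  // blocks that failed to compress
};

// Biased toward "compressible": only when every input file compressed
// zero blocks is the upcoming compaction marked not compressible.
bool CompactionCompressibleSketch(const std::vector<InputFileStatsSketch>& inputs)
{
    for (const InputFileStatsSketch& stats : inputs)
    {
        // compressed blocks = total blocks minus failed blocks
        if (stats.num_blocks > stats.num_blocks_failed)
            return true;
    }

    return false;  // no input ever compressed a block
}
```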

db/db_impl.cc

SanitizeOptions() is corrected. The logic flow around mmap_size and limited_developer_mem was flawed; the result was that write_buffer's size was always set to mmap_size. The logic is updated to change write_buffer's size parameter only when limited_developer_mem is set.
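
A sketch of the corrected flow; the field names follow leveldb's Options, but the exact condition in the branch may differ.

```cpp
#include <cstddef>

struct OptionsSketch
{
    bool   limited_developer_mem;
    size_t write_buffer_size;
    size_t mmap_size;
};

// Before the fix, write_buffer_size was unconditionally overwritten
// with mmap_size.  Now the override applies only to memory-limited
// developer builds.
void SanitizeOptionsSketch(OptionsSketch& options)
{
    if (options.limited_developer_mem && 0 != options.mmap_size)
        options.write_buffer_size = options.mmap_size;
}
```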

DBImpl::BackgroundCompaction() is upgraded per a suggestion from Scott Fritchie. If the code fails to perform a compaction via rename, the same compaction immediately falls back to a standard compaction via file reread and write.
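
A sketch of the fallback; the two helper names below are hypothetical stand-ins for the rename path and the standard reread/write path in db_impl.cc.

```cpp
// Minimal stand-in for leveldb::Status.
struct StatusSketch
{
    bool ok_;
    bool ok() const { return ok_; }
};

StatusSketch TryCompactionAsRename();  // hypothetical: IsTrivialMove() path
StatusSketch DoCompactionWork();       // hypothetical: reread/write path

// If the rename fails, immediately retry the same compaction the
// standard way rather than waiting for another scheduling pass.
StatusSketch BackgroundCompactionSketch()
{
    StatusSketch s = TryCompactionAsRename();

    if (!s.ok())
        s = DoCompactionWork();

    return s;
}
```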

DBImpl::OpenCompactionOutputFile() now switches an output file from compression processing to non-compression processing whenever the compaction object determines that none of the related input files contain compressed data. If none of the input data would compress, it is a waste of time to attempt compression and fail again.
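
Conceptually, the choice reduces to the following; the InputsCompressible() accessor is a hypothetical stand-in for the flag computed by CalcInputStats().

```cpp
enum CompressionType { kNoCompression = 0x0, kSnappyCompression = 0x1 };

bool InputsCompressible();  // hypothetical accessor for the new flag

// Keep the configured compression only when at least one input block
// compressed; otherwise disable compression for the output file.
CompressionType PickOutputCompression()
{
    return InputsCompressible() ? kSnappyCompression : kNoCompression;
}
```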

table/table_builder.cc

TableBuilder::WriteBlock() determines the preprocessing of each block written to a .sst table file. The new automated decision to skip compression on data known not to compress is implemented via a new case statement.
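
A sketch of the switch, patterned after the WriteBlock() shape in stock leveldb; the Snappy stub is illustrative, and the kNoCompression arm is where this branch's automated bypass lands.

```cpp
#include <string>

enum CompressionType { kNoCompression = 0x0, kSnappyCompression = 0x1 };

// Stub standing in for port::Snappy_Compress: returns false when snappy
// is unavailable or the block did not shrink.
bool SnappyCompressSketch(const std::string& raw, std::string* compressed);

void WriteBlockSketch(const std::string& raw, CompressionType type,
                      std::string* block_contents)
{
    switch (type)
    {
        case kNoCompression:
            // automated bypass: inputs known not to compress are
            // written raw, with no compression attempt at all
            *block_contents = raw;
            break;

        case kSnappyCompression:
            if (!SnappyCompressSketch(raw, block_contents)
                || block_contents->size() >= raw.size())
            {
                // compression unavailable or unhelpful: store raw
                *block_contents = raw;
            }
            break;
    }
}
```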

util/options.cc

Options::total_leveldb_mem is always set by Riak. This branch changes the constant's default from zero to 2.5 Gbytes for cases where third parties use non-Basho tools and/or benchmarks with Basho's leveldb. The README documents 2.5 Gbytes as a "sweet spot" for single database operations. The intent is to give those third parties high performance out of the box, without expecting them to first read the helpful documentation.
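
Assuming total_leveldb_mem is expressed in bytes, the changed default amounts to:

```cpp
#include <cstdint>

// Sketch of the new default in util/options.cc; the branch's exact
// expression of the constant may differ.
static const uint64_t kDefaultTotalLeveldbMem =
    2560ULL * 1024 * 1024;  // 2.5 Gbytes, previously 0
```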