
mv tuning4

Matthew Von-Maszewski edited this page Jan 17, 2014 · 14 revisions

Status

  • merged to develop
  • code complete January 16, 2014
  • development started January 6, 2014

History / Context

This page discusses the tunings and fixes made to both basho/leveldb and basho/eleveldb as part of the final Riak 2.0 release preparation.

basho/eleveldb is the Erlang to leveldb API. Two bugs were found in it during integration testing. The first related to the asynchronous iterator's prefetch logic. The logic assumed that prefetch was a one-pass operation: pick a starting point, execute prefetch next repeatedly, and close the iterator. The Riak active anti-entropy (AAE) code uses the same iterator with multiple, sequential starting points. This did not work correctly. The second bug related to the use of leveldb::ReadOptions pointers being passed within the iterator logic. The ownership of the allocated ReadOptions was confusing, and the allocated memory was being released while the data was still in use.

basho/leveldb changes in the branch are all performance tunings. No known bugs are addressed. Riak 2.0's leveldb contains the hot-threads branch, which replaces all the horrible hacks to threads and locks with real code. Hot-threads worked great on fast disk arrays and required no changes to the write throttle code. Then we started testing on slow arrays. Hot-threads was too fast, and the write throttle was completely inappropriate (lots of sawtooth throughput graphs). There are still sawtooth patterns under some heavy loads, but the teeth are smaller and overall throughput greater.

Branch description

basho/eleveldb:c_src/eleveldb.cc

leveldb::ReadOptions was previously allocated on the heap, then its pointer passed around. async_get and async_iterator used leveldb::ReadOptions this way. This branch changes ReadOptions to be created on a routine's stack, or embedded within the objects using it: no more passing pointers. leveldb::ReadOptions is declared in leveldb/include/leveldb/options.h. It uses the C++ default copy constructor. Its snapshot pointer is copied shallowly and is therefore unsafe, but it is never used beyond the scope in which it is initialized. A more robust coding (in the future) would be a copy constructor that does not copy the snapshot pointer.
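The shallow-copy hazard described above can be sketched with a hypothetical stand-in struct (not the real leveldb header; names here are illustrative only):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for leveldb::ReadOptions: a plain struct whose
// compiler-generated copy constructor copies members one by one, so the
// snapshot pointer is duplicated shallowly.
struct SnapshotSketch { int dummy; };

struct ReadOptionsSketch {
    bool fill_cache = true;
    const SnapshotSketch* snapshot = nullptr;  // shallow-copied on embed
};

// Embedding a copy instead of passing a heap pointer: the copy shares
// the same snapshot pointer, so it must not outlive the snapshot.
inline ReadOptionsSketch EmbedCopy(const ReadOptionsSketch& src) {
    return src;  // default copy constructor: member-wise, shallow
}
```

This mirrors why embedding the copy is safe only while the snapshot stays in scope, and why a future custom copy constructor that nulls the snapshot pointer would be more robust.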

The PREFETCH_STOP atom and supporting code address the multi-positioning of an iterator in prefetch mode. The caller is required to issue a Prefetch_Stop action after the last Prefetch action and before the next Seek action. The Prefetch_Stop action "eats" any pending prefetch operation and prevents a subsequent one from starting, i.e. it breaks and cleans up the prior prefetch cycle.
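The required action ordering can be modeled as a tiny state machine. This is a sketch of the protocol only, not the actual eleveldb NIF code; all names are assumptions:

```cpp
#include <cassert>

// Models the iterator actions described above: a Seek while a prefetch
// cycle is still open is the bug case; PrefetchStop cancels any pending
// prefetch so the next Seek starts a clean cycle.
enum class Action { Seek, Prefetch, PrefetchStop };

class IteratorSketch {
public:
    bool Apply(Action a) {
        switch (a) {
        case Action::Seek:
            if (prefetch_active_) return false;  // must stop prefetch first
            positioned_ = true;
            return true;
        case Action::Prefetch:
            if (!positioned_) return false;      // needs a starting point
            prefetch_active_ = true;
            return true;
        case Action::PrefetchStop:
            prefetch_active_ = false;            // "eats" pending prefetch
            return true;
        }
        return false;
    }
private:
    bool positioned_ = false;
    bool prefetch_active_ = false;
};
```

The AAE pattern of multiple, sequential starting points corresponds to Seek / Prefetch… / Prefetch_Stop repeated per starting point.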

basho/eleveldb:c_src/refobjects.cc

Two functions that previously took a leveldb::ReadOptions pointer argument now take a leveldb::ReadOptions reference. Similarly, the previous delete of the m_ReadOptions pointer is removed since the object is now embedded.

basho/eleveldb:c_src/refobjects.h

All usage of leveldb::ReadOptions pointers is converted to embedded copies of leveldb::ReadOptions. Various code clean-ups result.

basho/eleveldb:c_src/workitems.cc

Minor code adjustments to switch from leveldb::ReadOptions pointers to embedded copies.

Add PREFETCH_STOP cases to iterator retrieval and iterator_refresh logic paths.

basho/eleveldb:c_src/workitems.h

All usage of leveldb::ReadOptions pointers is converted to embedded copies of leveldb::ReadOptions. Various code clean-ups result.

basho/eleveldb:test/iterators.erl

Add test cases that illustrate prefetch_next bug.

basho/leveldb:db/builder.cc

Previous leveldb branches did significant work to better manage the page cache via fadvise. A quick summary: the code tells the operating system to flush everything out of the cache except a newly created .sst table file's metadata. That leaves more page cache space for random read buffering. This works great on fast drive arrays, and especially improves performance as datasets grow beyond the size of physical RAM. Yet it worsens overall throughput on slow arrays, which must reread recent level-0 and level-1 files for compactions during heavy write loads.

This one-line change instructs the buffered write code to tell the operating system to keep the entire newly created level-0 file in the page cache.
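The underlying mechanism is the POSIX posix_fadvise(2) hint. The sketch below shows the two advice modes involved; the function name and keep_in_cache parameter are my assumptions, not leveldb's actual code path:

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Page-cache hinting as described above. A len of 0 to posix_fadvise
// means "from offset to end of file".
int AdviseTableFile(int fd, off_t metadata_offset, bool keep_in_cache) {
    if (keep_in_cache)
        // level-0 case after this change: keep the whole file cached
        return posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
    // prior behavior: drop everything before the metadata
    return posix_fadvise(fd, 0, metadata_offset, POSIX_FADV_DONTNEED);
}
```

Keeping level-0 files cached avoids the rereads on slow arrays, at the cost of less cache left for random reads.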

basho/leveldb:corruption_test.cc

A corruption test, originally written by Google, started failing when I adjusted the kL0_CompactionTrigger constant. Review of the code suggests the test was written against the wrong constant. I corrected the constant and everything got happy.

basho/leveldb:db/db_impl.cc

The first change in this file is a parallel of the change discussed for builder.cc. The builder.cc change only impacted level-0 files. The change here applies the same page cache management to level-1 and level-2 files.

The second change is specific to Riak's active anti-entropy (AAE) feature. Riak 1.3 gave AAE special consideration for its write batches when applying the write throttle. In Riak 1.4, that special consideration became unnecessary after various performance improvements to leveldb, and it reduced the performance of 2i operations, so Riak 1.4 removed it. Riak 2.0 needs AAE's special consideration again. This line applies the consideration only to internal AAE databases, not to user databases (which is where the 2i operations occur).

basho/leveldb:db/dbformat.h

Raises kL0_CompactionTrigger from Google's original 4 to 6. This constant defines when to initiate a compaction in level-0 (and also in level-1 for Basho's leveldb). level-0's compaction is triggered by file count, not by total size of all files in the level. Total size is the trigger in all other levels. The alternative to raising this constant is to raise the write_buffer sizes. But changing write_buffer sizes also increases the amount of physical RAM required by each leveldb database (Riak vnode). Such a memory increase would hurt some users with higher vnode counts and/or lower amounts of physical RAM. Therefore the constant change is a safer solution for the Riak environment.
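The two trigger styles can be sketched as follows. The constant value comes from the text; the decision functions are illustrative, not leveldb's actual logic:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int kL0_CompactionTrigger = 6;  // raised from Google's 4

// level-0 (and Basho's level-1): trigger on file COUNT
bool NeedsCompactionByCount(const std::vector<uint64_t>& file_sizes) {
    return static_cast<int>(file_sizes.size()) >= kL0_CompactionTrigger;
}

// all other levels: trigger on total BYTES in the level
bool NeedsCompactionBySize(const std::vector<uint64_t>& file_sizes,
                           uint64_t max_level_bytes) {
    uint64_t total = 0;
    for (uint64_t s : file_sizes) total += s;
    return total >= max_level_bytes;
}
```

Raising the count trigger lets more (write_buffer-sized) files accumulate per compaction without touching write_buffer itself, which is why it avoids the per-vnode RAM increase.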

The result of this change is that compactions from level-0 and level-1 now occur 50% less often, but are of 50% larger total byte size. This is an overall throughput win, especially on slower disk arrays.

basho/leveldb:db/version_set.cc

The gLevelTraits at the top of the file had to be adjusted to match the changes to level sizings resulting from the raising of kL0_CompactionTrigger. Also, level 3's capacity was enlarged based upon test observation and that had a ripple effect of requiring larger settings on each subsequent level.

VersionSet::Finalize() contains new special case code for selecting when NOT to initiate a compaction on level-1, and removes all level-2 compactions from the "grooming" status. Both changes address the same performance issue: level-2 is Basho's designated "landing level". leveldb .sst table files in levels 0 and 1 have overlapping key spaces in every file. level-2 is the first level where the overlapping files are compacted into non-overlapping key spaces. The compaction from level-1 to level-2 typically requires that every file in level-2 be recompacted. The two code changes in VersionSet::Finalize() work to minimize the number of files in level-2. The first change delays level-1's compaction slightly if level-2 is not clear. The second makes sure every level-2 compaction request is queued (grooming compactions do not queue; they execute only if a free compaction thread is available).
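The two Finalize() decisions above can be condensed into toy predicates. The function names and the "+2" delay margin are my assumptions, not leveldb's actual code:

```cpp
#include <cassert>

// Delay level-1's compaction slightly while level-2 (the landing level)
// still has files, unless level-1 has fallen too far behind.
inline bool AllowLevel1Compaction(int level1_files, int trigger,
                                  int level2_files) {
    if (level2_files > 0 && level1_files < trigger + 2) return false;
    return level1_files >= trigger;
}

// Grooming compactions run only when a thread happens to be free;
// level-2 requests now always queue.
inline bool QueueCompaction(int level, bool grooming) {
    if (level == 2) return true;
    return !grooming;
}
```

Both predicates push toward the same goal stated above: keep level-2 as empty as possible so level-1 to level-2 compactions rewrite fewer files.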

VersionSet::UpdatePenalty() has new code that reduces the penalty values assigned to level-1. Previously both overlapped levels (level-0 and level-1) had non-linear penalties applied when too many files accumulated. That was too harsh for level-1; it now gets a linear penalty like all other levels. A logging statement was added to help tie penalty values to performance characteristics during analysis.
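The shape of the change can be illustrated with toy penalty curves. The exact formulas here (squared vs. linear in the excess file count) are assumptions for illustration, not leveldb's actual arithmetic:

```cpp
#include <cassert>

// Old behavior for level-1: a non-linear penalty that grows harshly
// as files accumulate past the trigger point.
inline int PenaltyNonLinear(int excess_files) {
    return excess_files <= 0 ? 0 : excess_files * excess_files;
}

// New behavior for level-1: a linear penalty, like the sorted levels.
inline int PenaltyLinear(int excess_files) {
    return excess_files <= 0 ? 0 : excess_files;
}
```

Since the penalty multiplies into the write throttle, flattening the curve for level-1 directly smooths the sawtooth throughput described earlier.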

basho/leveldb:db/version_set.h

WriteThrottleUsec() now takes a slightly different approach to selecting write throttle values. Previously, when the hacked threading model used 5 thread blocks, the throttle would essentially turn on and off by thread block (which meant the throttle applied to two or more vnodes/databases at a time). hot-threads uses one universal thread block. The on/off approach made throughput extremely variable. The function now has 3 states:

  • this database/vnode has non-zero penalty: multiply penalty times global write rate
  • global write rate is above default value of one: use global write rate as throttle
  • vnode is caught up, global write rate exceeds disk limits: no throttle
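The three states above can be rendered as a toy selection function. The real WriteThrottleUsec() lives in version_set.h and uses more inputs; this sketch and its parameter names are assumptions:

```cpp
#include <cassert>
#include <cstdint>

// Three-state throttle selection, in the priority order listed above.
uint64_t WriteThrottleUsecSketch(uint64_t penalty,
                                 uint64_t global_write_rate_usec) {
    if (penalty > 0)                           // state 1: vnode penalized
        return penalty * global_write_rate_usec;
    if (global_write_rate_usec > 1)            // state 2: global backlog
        return global_write_rate_usec;
    return 0;                                  // state 3: no throttle
}
```

Because the states are graded rather than on/off, each vnode gets a throttle proportional to its own backlog instead of sharing a block-wide switch.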

basho/leveldb:util/env_posix.cc

SetMetadataOffset() is updated so that a value of 1, which means keep the entire file in the page cache, overrides the gFadviseWillNeed flag.
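The override boils down to a sentinel check. Everything here beyond the value 1 and the gFadviseWillNeed flag name is my assumption, not leveldb's actual code:

```cpp
#include <cassert>

// A metadata offset of 1 is a sentinel meaning "keep the whole file in
// the page cache"; it takes priority over the global flag.
inline bool KeepEntireFileInPageCache(long metadata_offset,
                                      bool fadvise_will_need) {
    if (metadata_offset == 1) return true;  // sentinel overrides the flag
    return fadvise_will_need;
}
```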

HotThreadPool sizes manually adjusted based upon test observation. Yes, Jon wants these to be configurable. Someday they will be, but not today.
