mv fadvise control

Matthew Von-Maszewski edited this page Dec 20, 2013 · 14 revisions

Status

  • merged to develop December 10, 2013
  • code complete December 4, 2013
  • development started December 3, 2013

History / Context

This branch contains code for the Riak 1.4.4 release. There are three distinct pieces:

  1. backport from 2.0 of eleveldb fix for iterator Prev and Next operations: this change is discussed here https://github.com/basho/eleveldb/issues/52

  2. backport from 2.0 of leveldb fix to Compaction::ShouldStopBefore(): a code change in 1.4.0 to limit the total number of keys in any .sst table file to 75,000 had a side effect. The side effect disabled the function's primary purpose of splitting up a new .sst when its keys overlapped too many .sst files at the next higher level. Too much overlap creates very large compactions in the future (some multi-gigabyte compactions were seen).

  3. add app.config flag "fadvise_willneed" and pass that flag through eleveldb to leveldb: leveldb 1.3 and 1.4 each incrementally improved the fadvise() logic that manages the Linux page cache. The improvements helped the page cache flush all newly compacted user data to disk more quickly, leaving more page cache space for random disk operations. However, the page cache management assumes that user servers have physical RAM that is much smaller than the database size. A user with 200 GB of RAM and a smaller database requested an option to disable the page cache management so that all user data would remain in the page cache. Setting fadvise_willneed to true in app.config on systems where physical RAM exceeds database size will improve some random read performance and reduce disk operations.

Branch description

basho/leveldb mv-fadvise-control changes

include/leveldb/env.h

Added declaration for the gFadviseWillNeed global (initialized in util/env_posix.cc).

util/env_posix.cc

Initializes new gFadviseWillNeed global. Default is "false".

The PosixMmapFile constructor changes the initial value of metadata_offset_ from 0 to 1 if gFadviseWillNeed is true. metadata_offset_ is normally used to mark where user data ends and file metadata starts in an .sst table file. User data is normally saved to disk and marked with the POSIX_FADV_DONTNEED fadvise() flag. This suggests to the operating system that the cache pages for user data should be flushed from memory. However, the file metadata is normally marked with the POSIX_FADV_WILLNEED fadvise() flag since the metadata is always read back into RAM immediately. Setting metadata_offset_ to 1 instead of 0 is a logical flag to have every cache page, user data and metadata, held in RAM as long as possible.

The SetMetaDataOffset() routine is modified to be aware of the special metadata_offset_ = 1 logic. The routine now only changes metadata_offset_ if not already set to one.
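The sentinel behavior described above can be sketched as follows. This is a simplified model, not the actual PosixMmapFile code: the names MmapFileSketch, AdviceFor, Advice, and gFadviseWillNeedSketch are hypothetical, and the sketch assumes that an offset of 0 means no metadata boundary has been recorded yet.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the gFadviseWillNeed global.
static bool gFadviseWillNeedSketch = false;

// Which fadvise() hint the sketch would choose for a region of the file.
enum class Advice { kDontNeed, kWillNeed };

// Simplified model of the metadata_offset_ bookkeeping (not the real class).
class MmapFileSketch {
 public:
  // Start at 1 (the "keep everything cached" sentinel) when the global is
  // set, otherwise at 0 (assumed here to mean "boundary not yet known").
  MmapFileSketch() : metadata_offset_(gFadviseWillNeedSketch ? 1 : 0) {}

  // Record where metadata starts, unless the sentinel is already in place.
  void SetMetaDataOffset(uint64_t offset) {
    assert(offset != 1);  // 1 is reserved as the sentinel in this sketch
    if (metadata_offset_ != 1) {
      metadata_offset_ = offset;
    }
  }

  // Pick the hint for data beginning at file_offset.
  Advice AdviceFor(uint64_t file_offset) const {
    if (metadata_offset_ == 1) {
      return Advice::kWillNeed;  // sentinel: user data and metadata both kept
    }
    if (metadata_offset_ != 0 && file_offset >= metadata_offset_) {
      return Advice::kWillNeed;  // metadata region
    }
    return Advice::kDontNeed;    // user data region
  }

  uint64_t metadata_offset() const { return metadata_offset_; }

 private:
  uint64_t metadata_offset_;
};
```

With the global left at false, a file whose metadata starts at offset 4096 would get the DONTNEED hint below 4096 and the WILLNEED hint from 4096 onward. With the global set to true before construction, the later SetMetaDataOffset() call is ignored and every offset gets the WILLNEED hint.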

include/leveldb/options.h

Adds Options.fadvise_willneed option flag along with comments.

util/options.cc

Initializes fadvise_willneed to false and updates Dump() to include this new variable.

db/db_impl.cc

The global gFadviseWillNeed is populated from Options.fadvise_willneed every time a database is opened. The global immediately impacts all open databases, not just the one being opened. With Riak's eleveldb, the only change in practice is from false to true, as defined in app.config / riak.conf. The default is false. Globals are tacky programming, but currently options do not get passed to the lower level objects needing this setting.
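A minimal sketch of this "every open overwrites the global" behavior is shown here. OptionsSketch, DbSketch, OpenSketch, and gFadviseWillNeedSketch are hypothetical names used only for illustration; the real assignment lives in db/db_impl.cc and may look different.

```cpp
// Hypothetical stand-in for leveldb::Options, with only the new flag.
struct OptionsSketch {
  bool fadvise_willneed = false;
};

// Hypothetical stand-in for the process-wide gFadviseWillNeed global.
static bool gFadviseWillNeedSketch = false;

// Hypothetical database handle; it carries no per-database copy of the flag.
struct DbSketch {
  const char* name;
};

// Each open copies the caller's option into the global, so the most recent
// open decides the behavior of every database in the process.
DbSketch OpenSketch(const char* name, const OptionsSketch& options) {
  gFadviseWillNeedSketch = options.fadvise_willneed;
  return DbSketch{name};
}
```

Opening one database with fadvise_willneed set to true and then a second one with default options would leave the global at false for both, which is why callers need to pass a consistent value on every open.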

db/version_set.cc

This file contains code changes for Compaction::ShouldStopBefore(). First, the second column of gLevelTraits, m_MaxGrandParentOverlapBytes, is updated to match the m_MaxFileSizeForLevel column. This reduces the chance of a new .sst table file being split over one grandparent (level+2) file.

Second, the ShouldStopBefore() routine is slightly adjusted. The original code had two return points. The original change in 1.4 modified the second return point, but incorrectly reset the overlap bytes to zero on every call. This effectively disabled the grandparent overlap logic. Now the code has one return point and clarified "stop" rules. Rule 1 is overlap bytes. Rule 2 is number of keys. If either rule fires, overlap bytes is set to zero in anticipation of the next call.
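The single-return, two-rule structure can be sketched as below. This is an illustration under stated assumptions, not the code in db/version_set.cc: CompactionSketch and its parameters are hypothetical, the caller passes in the grandparent bytes newly overlapped by the current key (the real routine derives this from the grandparent file list), and the exact comparison operators are guesses.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified model of the two "stop" rules (hypothetical names and limits).
class CompactionSketch {
 public:
  CompactionSketch(uint64_t max_overlap_bytes, size_t max_keys)
      : max_overlap_bytes_(max_overlap_bytes),
        max_keys_(max_keys),
        overlapped_bytes_(0) {}

  // Called once per key written to the current output file.
  // overlap_added: grandparent bytes newly overlapped by this key.
  // keys_in_output: number of keys already in the current output file.
  bool ShouldStopBefore(uint64_t overlap_added, size_t keys_in_output) {
    // Accumulate across calls; resetting here on every call is the
    // mistake that disabled Rule 1 in the earlier change.
    overlapped_bytes_ += overlap_added;

    bool stop = false;

    // Rule 1: too much overlap with the grandparent level.
    if (overlapped_bytes_ > max_overlap_bytes_) {
      stop = true;
    }

    // Rule 2: too many keys in the current output file.
    if (keys_in_output >= max_keys_) {
      stop = true;
    }

    // Reset only when a new output file is about to start.
    if (stop) {
      overlapped_bytes_ = 0;
    }
    return stop;
  }

  uint64_t overlapped_bytes() const { return overlapped_bytes_; }

 private:
  const uint64_t max_overlap_bytes_;
  const size_t max_keys_;
  uint64_t overlapped_bytes_;
};
```

For example, a CompactionSketch built with a 100-byte overlap limit and the 75,000-key limit mentioned earlier would return false for two calls adding 40 bytes each, then true on a third 40-byte call, because the running total of 120 exceeds the limit. The total is reset to zero only at that point.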

basho/eleveldb mv-fadvise-control changes

c_src/build_deps.sh

The change on line 65 is a backport fix from a community user.

c_src/eleveldb.cc

The fadvise_willneed flag from app.config and its counterpart Options.fadvise_willneed from leveldb need special handling. Riak has some internal databases that do not necessarily pass app.config parameters as part of their open call. If fadvise_willneed were only processed within async_open(), the setting might flap between true (user databases) and false (internal databases) with the last one winning. Typically the internal databases open last and would therefore negate the user's setting. Therefore the fadvise_willneed setting is determined once when Erlang loads the eleveldb NIF (see end of on_load() routine).

The fadvise_willneed setting read in on_load() is then reused in every call to async_open() to provide a consistent setting for all databases.
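The "read once at load, apply on every open" pattern can be sketched as follows. All names here (EleveldbOptionsSketch, LevelOptionsSketch, OnLoadSketch, AsyncOpenSketch) are hypothetical simplifications; the real code works with Erlang NIF terms and the actual EleveldbOptions structure.

```cpp
// Hypothetical stand-in for the process-wide EleveldbOptions structure.
struct EleveldbOptionsSketch {
  bool fadvise_willneed = false;
};

// Hypothetical stand-in for the per-database leveldb::Options.
struct LevelOptionsSketch {
  bool fadvise_willneed = false;
};

// Filled in once when the NIF library is loaded.
static EleveldbOptionsSketch gLoadOptionsSketch;

// Models the end of on_load(): capture the app.config value a single time.
void OnLoadSketch(bool fadvise_willneed_from_config) {
  gLoadOptionsSketch.fadvise_willneed = fadvise_willneed_from_config;
}

// Models async_open(): whatever the caller passed, the load-time value is
// applied so that every database opens with the same setting.
LevelOptionsSketch AsyncOpenSketch(LevelOptionsSketch requested) {
  requested.fadvise_willneed = gLoadOptionsSketch.fadvise_willneed;
  return requested;
}
```

In this model an internal database that opens with default options after OnLoadSketch(true) still ends up with fadvise_willneed set to true, so the order in which databases open no longer matters.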

The EleveldbOptions structure is from Riak 2.0. There it is the home for several options that require a global setting across all databases, just like fadvise_willneed. Hence all the code related to EleveldbOptions is backported to make the usage models consistent.

All other eleveldb changes are related to the previous mv-iterator-prev branch. Please see https://github.com/basho/eleveldb/pull/69 for discussions and testing.
