
mv tiered options

Matthew Von-Maszewski edited this page Apr 17, 2014 · 29 revisions

Status

  • merged to master
  • code complete April 7, 2014
  • development started April 6, 2014

WARNING!

  • This initial implementation does NOT automatically migrate data from a non-tiered configuration to a tiered configuration.
  • This initial implementation does NOT detect changes in tiered option parameters and migrate data from old configuration to new configuration.

The user must stop the leveldb / Riak node and manually move the leveldb .sst table files from one configuration to the other. No manual activity is required when starting a fresh database with the tiered configuration.

History / Context

Google's original leveldb implementation maintained all .sst table files in a single database directory. Riak 1.3 updated the leveldb code to place .sst table files into subdirectories that represent the "level" of the file, i.e. sst_0, sst_1, … sst_6. The Riak 1.3 update was motivated by the need to speed up database repair operations, but it had a side effect: database operators could use the subdirectories to mount alternative storage devices at each level. The operators had to create all the necessary directory links manually. This branch automates the process of using alternative storage arrays based upon levels.

The justification for two types/speeds of storage arrays is simple. leveldb is extremely write intensive in its lower levels, and the write intensity drops off as the level number increases. Similarly, current and frequently updated data tends to be in lower levels while archival data tends to be in higher levels. These characteristics make it attractive to put the high intensity lower levels on faster, more expensive storage arrays. This branch allows exactly that, while slower, less expensive storage arrays hold the higher level data to reduce costs.

Note that tiered storage is not magical. Extremely high volume, sustained write operations can fill the high speed storage arrays before leveldb has the opportunity to move data to the low speed storage arrays. In that case leveldb's write throttle slows incoming write operations so that compactions can catch up. This is no different from using a single storage array.

Configuration / Usage

This branch introduces three configuration parameters that are only used when tiered storage is desired. The parameters assume the use of only two storage arrays: a fast array (primary) and a slow array (secondary).

| leveldb option | Riak 2.0 option | Purpose |
|---|---|---|
| tiered_slow_level | leveldb.tiered | The level number where data should switch to the slow array. _0 is the default and disables this feature._ |
| tiered_fast_prefix | leveldb.tiered.path.fast | Path prefix for .sst files below tiered_slow_level |
| tiered_slow_prefix | leveldb.tiered.path.slow | Path prefix for .sst files at and above tiered_slow_level |

The full path for an .sst table file with tiered storage enabled:

| Levels | Resulting Path |
|---|---|
| 0 to tiered_slow_level-1 | tiered_fast_prefix / database_name |
| tiered_slow_level to 6 | tiered_slow_prefix / database_name |

The database_name is the name given in the DB::Open() call of leveldb. Riak 2.0 users know this as the leveldb.data_root option. In both cases database_name must be a relative path, because the fast or slow prefix is placed in front of it. A common relative path is ".".
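
The path rule in the table above can be sketched in a few lines of C++. This is only an illustration of the rule as described on this page, not the code in db/filename.cc: the `TieredSettings` struct and the `TieredDirForLevel()` function are hypothetical names invented for the example, and the per-level subdirectory (sst_0 … sst_6) is assumed to sit below the returned directory.

```cpp
#include <string>

// Hypothetical stand-in for the three tiered options (not leveldb's Options).
struct TieredSettings {
    int tiered_slow_level = 0;        // 0 disables tiered storage
    std::string tiered_fast_prefix;   // used for levels below tiered_slow_level
    std::string tiered_slow_prefix;   // used for levels at and above it
};

// Returns the directory that holds the .sst files of the given level.
// Hypothetical helper: it mirrors the table above, nothing more.
std::string TieredDirForLevel(const TieredSettings& settings,
                              const std::string& database_name,
                              int level) {
    if (settings.tiered_slow_level <= 0) {
        // Feature disabled: everything lives under the database name.
        return database_name;
    }
    const std::string& prefix = (level < settings.tiered_slow_level)
                                    ? settings.tiered_fast_prefix
                                    : settings.tiered_slow_prefix;
    return prefix + "/" + database_name;
}
```

With tiered_slow_level = 4, prefixes /mnt/fast_raid and /mnt/slow_raid, and a database name of ".", level 3 resolves to /mnt/fast_raid/. and level 4 resolves to /mnt/slow_raid/. in this sketch.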

Example Riak 2.0 configuration:

leveldb.data_root = .
leveldb.tiered = 4
leveldb.tiered.path.fast = /mnt/fast_raid
leveldb.tiered.path.slow = /mnt/slow_raid

Example leveldb api configuration:

leveldb::Options options;
leveldb::DB * db_ptr;

options.tiered_slow_level = 4;
options.tiered_fast_prefix = "/mnt/fast_raid";
options.tiered_slow_prefix = "/mnt/slow_raid";

// DB::Open() takes the Options first, then the database name.
leveldb::Status status = leveldb::DB::Open(options, ".", &db_ptr);

Note: The full path of the tiered_fast_prefix and the tiered_slow_prefix must exist. Riak intentionally does not attempt to build any missing directories in the prefixes.
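
Because missing prefix directories are not created automatically, an application using the leveldb api may want to verify both prefixes before calling DB::Open(). The following is a minimal sketch of such a check, assuming a POSIX system; `DirectoryExists()` and `TieredPrefixesExist()` are hypothetical helpers written for this example and are not part of leveldb or Riak.

```cpp
#include <string>
#include <sys/stat.h>

// Hypothetical helper: true when 'path' exists and is a directory (POSIX stat).
bool DirectoryExists(const std::string& path) {
    struct stat info;
    return ::stat(path.c_str(), &info) == 0 && S_ISDIR(info.st_mode);
}

// Hypothetical helper: both tiered prefixes must already exist,
// since nothing creates them on the caller's behalf.
bool TieredPrefixesExist(const std::string& fast_prefix,
                         const std::string& slow_prefix) {
    return DirectoryExists(fast_prefix) && DirectoryExists(slow_prefix);
}
```

A caller could run `TieredPrefixesExist(options.tiered_fast_prefix, options.tiered_slow_prefix)` and report a configuration error when it returns false, instead of discovering the problem during the open.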

Selecting the level number for leveldb.tiered / options.tiered_slow_level

The obvious goal is to get as much of your data as possible onto the faster array. How much will fit depends upon the size of your array and the number of databases (Riak vnodes). Here are the approximate sizes of one database/vnode within each level:

| Level | Level size (bytes) | Cumulative size (bytes) | Cumulative with AAE (bytes) |
|---|---|---|---|
| 0 | 377,487,360 | 377,487,360 | 754,974,720 |
| 1 | 2,264,924,160 | 2,642,411,520 | 5,284,823,040 |
| 2 | 3,082,813,440 | 5,725,224,960 | 11,450,449,920 |
| 3 | 6,442,450,944 | 12,167,675,904 | 24,335,351,808 |
| 4 | 128,849,018,880 | 141,016,694,784 | 282,033,389,568 |
| 5 | 2,476,980,377,600 | 2,617,997,072,384 | 5,235,994,144,768 |
| 6 | not limited | not tiered | not tiered |

First, determine how many databases / vnodes your server / node will host, including typical failover scenarios. The typical measure for Riak is: (ring_size) / (nodes-1).

Second, select either the third or the fourth column for the next step. Use the fourth column if you have anti_entropy = active in riak.conf. Use the third column for all other scenarios.

Finally, multiply your value from the first step by the value in the chosen Cumulative column for each row of the table above. The first row whose result exceeds your fast storage array capacity gives the Level number to use for tiered_slow_level.
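
The three steps can be written out as a short calculation. The sketch below is an illustration only: `VnodesPerNode()` and `SelectTieredSlowLevel()` are hypothetical helpers, the cumulative sizes are copied from the third column of the table, and AAE is modeled by doubling because the fourth column is exactly twice the third in every row. Rounding the vnode count up, and returning 6 when no row exceeds the capacity, are assumptions of this example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Cumulative bytes per database/vnode for levels 0..5 (third table column).
static const std::array<std::uint64_t, 6> kCumulativeBytes = {{
    377487360ULL, 2642411520ULL, 5725224960ULL,
    12167675904ULL, 141016694784ULL, 2617997072384ULL}};

// Step 1: ring_size / (nodes - 1), rounded up (rounding is an assumption).
std::uint64_t VnodesPerNode(std::uint64_t ring_size, std::uint64_t nodes) {
    if (nodes <= 1) return ring_size;  // a single node hosts every vnode
    return (ring_size + nodes - 2) / (nodes - 1);
}

// Steps 2 and 3: the first level whose scaled cumulative size exceeds the
// fast array capacity. Returns 6 when no row exceeds it (an assumption).
int SelectTieredSlowLevel(std::uint64_t vnodes,
                          std::uint64_t fast_capacity_bytes,
                          bool aae_enabled) {
    const std::uint64_t factor = aae_enabled ? 2 : 1;  // fourth column = 2 x third
    for (std::size_t level = 0; level < kCumulativeBytes.size(); ++level) {
        if (vnodes * factor * kCumulativeBytes[level] > fast_capacity_bytes) {
            return static_cast<int>(level);
        }
    }
    return 6;
}
```

For example, a ring size of 64 on 5 nodes gives 16 vnodes. With a 300,000,000,000 byte fast array, the sketch selects level 4 without AAE (16 x 12,167,675,904 still fits, 16 x 141,016,694,784 does not) and level 3 with AAE. A result of 0 means even level 0 does not fit, and 0 is also the value that disables tiered storage.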

Branch description

include/leveldb/options.h & util/options.cc

Add three new option parameters: tiered_slow_level (integer), tiered_fast_prefix (std::string), tiered_slow_prefix (std::string). tiered_slow_level defaults to zero, disabling tiered storage.

leveldb makes an internal copy of the Options structure during the DB::Open() call. When tiered_slow_level is disabled (zero), the code overwrites any user values in the two _prefix variables with the "database name" parameter from the "DB::Open()" call. When enabled, the code appends "/" and the "database name" to each _prefix variable (see MakeTieredDbname() in db/filename.cc).
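
The prefix handling described above can be illustrated with a small stand-alone sketch. It is not the MakeTieredDbname() implementation from db/filename.cc: the `TieredOptionsSketch` struct and the `ResolveTieredPrefixes()` function are hypothetical names, and only the behavior stated in the paragraph above is modeled.

```cpp
#include <string>

// Hypothetical stand-in for the three new Options members.
struct TieredOptionsSketch {
    int tiered_slow_level = 0;        // zero disables tiered storage
    std::string tiered_fast_prefix;
    std::string tiered_slow_prefix;
};

// Works on a private copy (passed by value), leaving the caller's
// options untouched, in the spirit of leveldb's internal Options copy.
TieredOptionsSketch ResolveTieredPrefixes(TieredOptionsSketch copy,
                                          const std::string& dbname) {
    if (copy.tiered_slow_level == 0) {
        // Disabled: any user supplied prefixes are overwritten.
        copy.tiered_fast_prefix = dbname;
        copy.tiered_slow_prefix = dbname;
    } else {
        // Enabled: "/" and the database name are appended to each prefix.
        copy.tiered_fast_prefix += "/" + dbname;
        copy.tiered_slow_prefix += "/" + dbname;
    }
    return copy;
}
```

After this step both prefix variables always hold a usable database path, so code that builds file names can pick one of the two by level without checking whether tiered storage is enabled.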

db/filename.h & db/filename.cc

The essential changes for tiered storage happen within db/filename.cc. Google centralized all path and file naming operations within this file. Riak 1.3 changed this file to route .sst table files into level-based directories. This branch similarly uses the level of a file to select the prefix for each .sst table file.

Functions MakeFileName2(), MakeDirName2(), TableFileName(), and MakeLevelDirectories() now take an Options parameter instead of the database name. This allows the code to dynamically select the appropriate prefix for each .sst table file. Changing the parameter to a different type, the Options struct, also forced every use of these functions in the other source files to self-identify: the code stopped compiling until each call site was updated.

MakeTieredDbname() is new for tiered storage. It defines how the three new options interact. It is a filter function that adjusts leveldb's internal copy of the user's original Options structure. It is also a filter for the user's database name. The Options struct and database name are adjusted at the time of DB::Open() and database repair.

all other source file changes

All other files contain changes that replace dbname_ with options_ in calls to the db/filename.cc routines. Also, DBImpl's constructors and Repair's constructors contain edits to pass the user's Options and database name through the MakeTieredDbname() filter.
