Skip to content


Subversion checkout URL

You can clone with
Download ZIP
branch: cd-indexrepair
Fetching contributors…

Cannot retrieve contributors at this time

611 lines (503 sloc) 47.811 kb

Riak 1.3.2 Release Notes

New Features or Major Improvements for Riak

eLevelDB Verify Compactions

In Riak 1.2 we added code to have leveldb automatically shunt corrupted blocks to the lost/BLOCKS.bad file during a compaction. This was to keep the compactions from going into an infinite loop over an issue that A) read repair and AAE could fix behind the scenes and B) took up a bunch of customer support / engineering time to help customers fix manually.

Unfortunately, we did not realize that only one of two corruption tests was actually active during a compaction. There is a CRC test that applies to all blocks, including file metadata. Compression logic has a hash test that applies only to compressed data blocks. The CRC test was not active by default. Sadly, leveldb makes limited defensive tests beyond the CRC. A corrupted disk file could readily result in a leveldb / Riak crash ... unless the bad block happened to be detected by the compression hash test.

Google's answer to this problem is the paranoid_checks option, which defaults to false. Unfortunately setting this to true activates not only the compaction CRC test but also a CRC test of the recovery log. A CRC failure in the recovery log after a crash is expected, and utilized by the existing code logic to enable automated recovery upon next start up. paranoid_checks option will actually stop the automatic recovery if set to true. This second behavior is undesired.

This branch creates a new option, verify_compactions. The background CRC test previously controlled by paranoid_checks is now controlled by this new option. The recovery log CRC check is still controlled by paranoid_checks. verify_compactions defaults to true. paranoid_checks continues to default to false.

Note: CRC calculations are typically expensive. Riak 1.3 added code to leveldb to utilize Intel hardware CRC on 64bit servers where available. Riak 1.2 added code to leveldb to create multiple, prioritized compaction threads. These two prior features work to minimize / hide the impact of the increased CRC workload during background compactions.

Erlang Scheduler Collapse

All Erlang/OTP releases prior to R16B01 are vulnerable to the Erlang computation scheduler threads going asleep too aggressively. The sleeping periods reduce power consumption and inter-thread resource contention.

This release of Riak EDS requires a patch to the Erlang/OTP virtual machine to force sleeping scheduler threads to wake up a regular intervals. The flag +sfwi 500 must also be present in the vm.args file. This value is in milliseconds and may need tuning for your application. For the Open Source Riak release, the patch (and extra "vm.args" flags) are recommended: the patch can be found at:

Overload Protection / Work Shedding

As of Riak 1.3.2, Riak now includes built-in overload protection. If a Riak node becomes overloaded, Riak will now immediately respond {error, overload} rather than perpetually enqueuing requests and making the situation worse.

Previously, Riak would always enqueue requests. As an overload situation became worse, requests would take longer and longer to service, eventually getting to the point where requests would continually timeout. In extreme scenarios, Riak nodes could become unresponsive and ultimately crash.

The new overload protection addresses these issues.

The overload protection is configurable through app.config settings. The default settings have been tested on clusters of varying sizes and request rates and should be sufficient for all users of Riak. However, for completeness, the new settings are explained below.

There are two types of overload protection in Riak, each with different settings. The first limits the number of in-flight get and put operations initiated by a node in a Riak cluster. This is configured through the riak_kv/fsm_limit setting. The default is 50000. This limit is tracked separately for get and put requests, so the default allows up to 100000 in-flight requests in total.

The second type of overload protection limits the message queue size for individual vnodes, setting an upper bound on unserviced requests on a per-vnode basis. This is configured through the riak_core/vnode_overload_threshold setting and defaults to 10000 messages.

Setting either config setting to undefined in app.config will disable overload protection. This is not recommended. Note: not configuring the options at all will use the defaults mentioned above, ie. when missing from app.config.

The overload protection provides new stats that are exposed over the /stats endpoint.

The dropped_vnode_requests_total stat counts the number of messages discarded due to the vnode overload protection.

For the get/put overload protection, there are several new stats. The stats related to gets are listed below, there are equivalent versions for puts.

The node_get_fsm_active and node_get_fsm_active_60s stats shows how many gets are currently active on the node within the last second or last minute respectively. The node_get_fsm_in_rate and node_get_fsm_out_rate track the number of requests initiated and completed within the last second. Finally, the node_get_fsm_rejected, node_get_fsm_rejected_60s, and node_get_fsm_rejected_total track the number of requests discarded due to overload in their respective time windows.

Health Check Disabled

The health check feature that shipped in Riak 1.3.0 has been disabled as of Riak 1.3.2. The new overload protection feature serves a similar purpose and is much safer. Specifically, the health check approach was able to successfully recover from overload that was caused by slow nodes, but not from overload that was caused by incoming workload spiking beyond absolute cluster capacity. In fact, in the second case, the health check approach (divert overload traffic from one node to another) would exacerbate the problem.

Issues / PR's Resolved

Known Issues

  • riak_kv/400 If the node owns fewer than 2 partitions, the following warning will appear in the logs riak_core_stat_q:log_error:123 Failed to calculate stat {riak_kv,vnode,backend,leveldb,read_block_error} with error:badarg Fixed in master

Riak 1.3.1 Release Notes

New Features or Major Improvements for Riak

2i Big Integer Encoding

For all Riak versions prior to 1.3.1, 2i range queries involving integers greater than or equal to 2147483647 (0x7fffffff) could return missing results. The cause was identified to be an issue with the encoding library sext [1], which Riak uses for indexes stored in eleveldb. Sext serializes Erlang terms to a binary while preserving sort order. For these large integers, this was not the case. Since the 2i implementation relies on this property, some range queries were affected.

The issue in sext was patched [2] and is included in Riak 1.3.1. New installations of Riak 1.3.1 will immediately take advantage of the change. However, the fix introduces an incompatibly in the encoding of big integers. Integer indexes containing values greater than or equal to 2147483647 already written to disk with Riak 1.3 and below will need to be rewritten, so that range queries over them will return the correct results.

Riak 1.3.1 includes a utility, as part of riak-admin, that will perform the reformatting of these indexes while the node is online. After the affected indexes have been reformatted on all nodes, range queries will begin returning the correct results for previously written data. The utility should be run against any riak cluster using 2i after upgrading the entire cluster to 1.3.1, regardless of whether or not large integer index values are used. It will report how many indexes were affected (rewritten). Unaffected indexes are not modified and new writes will be written in the correct format.

To reformat indexes on a Riak node run:

riak-admin reformat-indexes [<concurrency>] [<batch size>]

The concurrency option controls how many partitions are reformatted concurrently. If not provided it defaults to 2. Batch size controls how many keys are fixed at a time and it defaults to 100. A node without load could finish reformatting much faster with a higher concurrency value. Lowering the batch could lower the latency of other node operations if the node is under load during the reformatting. We recommend to use the default valuess and tweak only after testing. Output will be printed to logs once the reformatting has completed (or if it errors). If the reformatting operation errors, it should be re-executed. The operation will only attempt to reformat keys that were not fixed on the previous run.

If downgrading back to Riak 1.3 from Riak 1.3.1, indexes will need to be reformatted back to the old encoding in order for the downgraded node to run correctly. The --downgrade flag can be passed to riak-admin reformat-indexes to perform this operation:

riak-admin reformat-indexes [<concurrency>] [<batch size>] --downgrade

The concurrency and batch size parameters work in exactly the same way as in the upgrade case above.



Improved bitcask startup time

We fixed a problem that was preventing vnodes from starting concurrently. Installations using the bitcask backend should see a substantial improvement in startup times if multiple cores are available. We have observed improvements in the vicinity of an order of magnitude (~10X) on some of our own clusters.

Fix behaviour of PR/PW

For Riak releases prior to 1.3.1 the get and put options PR and PW only checked that the requested number of primaries were online when the request was handled. It did not check which vnodes actually responded. So with a PW of 2 you could easily write to one primary, one fallback, fail the second primary write and return success.

As of Riak 1.3.1, PR and PW will also wait until the required number of primaries have responded before returning the result of the operation. This means that if PR + PW > N and both requests succeed, you'll be guaranteed to have read the previous value you've written (barring other intervening writes and irretrievably lost replicas).

Note however, that writes with PW that fail may easily have done a partial write. This change is purely about strengthening the constraints you can impose on read/write success. See more information in the pull request linked below.

Issues / PR's Resolved

Riak 1.3.0 Release Notes

New Features or Major Improvements for Riak

Active Anti-Entropy

New in Riak 1.3. Riak now includes an active anti-entropy (AAE) subsystem that works to verify and repair data cross an entire Riak cluster. The AAE system periodically exchanges information between data replicas in order to determine missing or divergent data. When bad replicas are detected, AAE triggers read repair to correct the situation. AAE is entirely automatic, and provides an additional layer of protection against various data loss scenarios (eg. disk failure, restoring from an outdated backup, bit rot, etc).

AAE is implemented using hash tree exchange, which ensures that the information exchanged between data replicas is proportional to the amount of divergent data rather than the total amount of data stored in Riak. When all data is in sync (the common case), exchanges are fast and have extremely low overhead. For this reason, AAE is able to perform multiple exchanges a minute with negligible impact on a cluster.

AAE hash trees are persistent entities stored in LevelDB instances separate from normal Riak K/V data. When first starting a fresh Riak 1.3 cluster (or upgrading from an older release), Riak will generate the hash tree information by traversing over each partition's data. By default, Riak will build one hash tree per hour per node. If the traversal over a partition's data takes more than an hour, then Riak may trigger a second tree build. However, by default at most two tree builds can occur at once.

Once a hash tree is built, it is kept up-to-date in real-time as writes are sent to Riak. However, trees are periodically expired and rebuilt to protect against potential divergence between the K/V data and its corresponding hash tree. Rebuilding trees also protects against silent data corruption (eg. bit rot). By default, trees are expired and rebuilt once a week.

All of the above settings (and more) can be configured in app.config. The AAE settings are in the riak_kv section, and have comments documenting the different options.

To provide insight into AAE, Riak provides the riak-admin aae-status command. The AAE status output is broken into three parts: Exchanges, Entropy Trees, and Keys Repaired.

================================== Exchanges ==================================
Index                                              Last (ago)    All (ago)
0                                                  3.8 min       4.1 min
91343852333181432387730302044767688728495783936    3.3 min       7.8 min
182687704666362864775460604089535377456991567872   2.8 min       8.3 min
274031556999544297163190906134303066185487351808   2.3 min       6.3 min
365375409332725729550921208179070754913983135744   1.8 min       5.5 min

The Exchanges section shows information about AAE exchanges for each K/V partition. The Last column lists when the most recent exchange between a partition and one of its sibling replicas was performed. The All column shows how long it has been since a partition exchanged with all of its sibling replicas. In essence, the All column sets the upperbound on how out-of-date an individual partition can be. Specifically, a partition can not have any missing or divergent data older that the value shown in All, unless all replicas for that data are invalid.

================================ Entropy Trees ================================
Index                                              Built (ago)
0                                                  22.1 min
91343852333181432387730302044767688728495783936    22.6 min
182687704666362864775460604089535377456991567872   22.3 min
274031556999544297163190906134303066185487351808   22.9 min
365375409332725729550921208179070754913983135744   22.3 min

The Entropy Trees section shows when the hash trees for a given partition were created. A hash tree must be built before a partition can participate in an exchange. As mentioned above, trees are built once and expired (by default) once a week.

================================ Keys Repaired ================================
Index                                                Last      Mean      Max
0                                                     0         0         0
91343852333181432387730302044767688728495783936       87        21        87
182687704666362864775460604089535377456991567872      0         0         0
274031556999544297163190906134303066185487351808      0         0         0
365375409332725729550921208179070754913983135744      0         0         0

The Keys Repaired section presents information about repairs triggered by AAE, including keys repaired in the most recent exchange, and the mean and max across all exchanges.

Note: All AAE status information is in-memory and is reset across a node restart. Only tree build times are persistent (since trees themselves are persistent).

Final notes about AAE:

  1. Trees must be built before exchange can occur. Since trees are built once an hour by default, it will take up to ring_size / number_of_nodes hours before all trees are built after first starting or upgrading to 1.3, and therefore that amount of time until AAE is fully protecting all data.

  2. Tree building typically uses 100% of a CPU when possible but should have minimal impact on Riak performance. When using Bitcask for K/V data, tree building may increase the latency for list_keys, list_buckets, and Riak EE's fullsync replication strategy. Once trees are built, these issues go away (until trees are expired/rebuilt a week later).

  3. AAE may occasionally repair a small number of keys (typically 1 or 2) even in a healthy cluster without divergent or missing data. This occurs when AAE is performing an exchange at the same time incoming writes are occurring to the same nodes. For example, a write may reach node A while being concurrently in-flight to node B, yet AAE happens to run at just the right moment to see the write on A but not B, and force a repair. Since AAE just triggers reads (to trigger read repair) this behavior is entirely safe.

  4. AAE is a feature of Riak K/V and does not protect Riak Search data.

MapReduce Sink Backpressure

Riak Pipe brought inter-stage backpressure to Riak KV's MapReduce system. However, prior to Riak 1.3, that backpressure did not extend to the sink. It was assumed that the Protocol Buffers or HTTP endpoint could handle the full output rate of the pipe. With Riak 1.3, backpressure has been extended to the sink so that those endpoint processes no longer become overwhelmed. This backpressure is tunable via a soft cap on the size of the sink's buffer, and a period at which a worker should check that cap. These can be configured at the Riak console by setting application environment variables, or in the riak_kv section of app.config (defaults shown):

 %% Soft cap on the MapReduce sink's buffer,
 %% expressed as a positive integer number of messages
 %% (one message is used per MapReduce result)
 {mrc_sink_buffer, 1000},

 %% Period at which a MapReduce worker must check
 %% the sink's buffer cap, expressed as an integer
 %% number of messages to send before waiting on
 %% an clear-to-send acknowledgement
 %%   0 = wait for acknowledgement of each message
 %%   1 = wait every other message
 %%   'infinity' = never wait for acknowledgements
 {mrc_sink_sync_period, 10}

Additional IPv6 support

Riak Handoff and Protocol Buffers interfaces can now listen on IPv6 addresses (HTTP has always supported IPv6). You may specify the address using the short-hand string form, e.g. "::1" (for localhost), or as the 16-byte address in a tuple of 8 numbers, e.g. {0,0,0,0,0,0,0,1} (for localhost). IPv4 addresses may also be specified in either form (except the latter will be 4 bytes, tuple of 4 numbers). Note: This does not affect Riak node names. Refer to the `inet_dist_` settings in the Erlang documentation to enable IPv6 support for cluster membership.*

Luke Removal

The luke application was deprecated in the release of Riak 1.2. This release removes it, and all code using it.

riak getpid Added

A bug existed in how we used riak stop (listed below in Bugs Fixed) that justified a refactoring of how we got our own PID of Riak. While fixing the bug, it was thought getpid might be useful to system admins out there who don't want to rely on outside scripts to find the PID of Riak. riak getpid does what you expect, returns the PID of a running Riak or exits with 1 on failure. It is a small feature, but might save some time with ps, grep, and awk.

Riaknostic Included by Default

To encourage its use, we have now included Riaknostic in the Riak packages. Prior to 1.3, the user needed to download riaknostic separately, but now riak-admin diag will work out of the box.

Support added for SmartOS 1.8

Packages are now available for SmartOS machines based on 1.8 datasets as well as 1.6.

Health Check

New in Riak 1.3. Riak Core now includes a health check subsystem that actively monitors each node for specific conditions and disables/enables services based on those conditions.

To enable/disable all health checks a new setting has been added to the riak_core section of app.config:

%% Health Checks
%% If disabled, health checks registered by an application will
%% be ignored. NOTE: this option cannot be changed at runtime.
%% To re-enable, the setting must be changed and the node restarted.
{enable_health_checks, true},

Riak registers a health check with Riak Core to monitor the message queue lengths of KV vnodes. To configure the kv health check a new setting has been added to the riak_kv section of app.config:

%% This option configures the riak_kv health check that monitors
%% message queue lengths of riak_kv vnodes. The value is a 2-tuple,
%% {EnableThreshold, DisableThreshold}. If a riak_kv_vnode's message
%% queue length reaches DisableThreshold the riak_kv service is disabled
%% on this node. The service will not be re-enabled until the message queue
%% length drops below EnableThreshold.
{vnode_mailbox_limit, {1, 5000}}

Note: the kv health check does not apply to Riak Search or Riak Pipe vnodes.

Reset Bucket Properties

The HTTP interface now supports resetting bucket properties to their default values. Bucket properties are stored in Riak's ring structure that is gossiped around the cluster. Resetting bucket properties for buckets that are no longer used or that are using the default properties can reduce the amount of gossiped data.

Support for logging to syslog

Riak 1.3 now includes support for logging to syslog. To enable it, you can add something like this to the 'handlers' section of riak's app.config, under lager:

{lager_syslog_backend, ["riak", daemon, info]}

Which would log any messages at info or above to the daemon facility with the identity set to 'riak'. For more information see the lager_syslog documentation:

Installation Notes

For RHEL/Centos/Fedora users, the RPM tools have added a dependency on expect, so if you see a message like this:

$ sudo rpm -i riak-1.3.0rc1-1.el5.x86_64.rpm
error: Failed dependencies:
    /usr/bin/expect is needed by riak-1.3.0rc1-1.x86_64

You can fix this issue by installing the Riak RPM with yum which will resolve any dependencies automatically:

$ sudo yum -y install riak-1.3.0rc1-1.el5.x86_64.rpm
Preparing...                ########################################### [100%]
   1:expect                 ########################################### [100%]
   2:riak                   ########################################### [100%]

Issues / PR's Resolved

Jump to Line
Something went wrong with that request. Please try again.