
Riak 1.3.1 Release Notes

New Features or Major Improvements for Riak

2i Big Integer Encoding

For all Riak versions prior to 1.3.1, 2i range queries involving integers greater than or equal to 2147483647 (0x7fffffff) could omit matching results. The cause was identified as an issue in the encoding library sext [1], which Riak uses for indexes stored in eleveldb. Sext serializes Erlang terms to binaries in a way that is meant to preserve the terms' sort order, but for these large integers the ordering was not preserved. Since the 2i implementation relies on this property, some range queries were affected.

The issue in sext was patched [2] and the fix is included in Riak 1.3.1. New installations of Riak 1.3.1 will immediately take advantage of the change. However, the fix introduces an incompatibility in the encoding of big integers. Integer indexes containing values greater than or equal to 2147483647 that were already written to disk with Riak 1.3 and below will need to be rewritten, so that range queries over them return the correct results.

Riak 1.3.1 includes a utility, as part of riak-admin, that reformats these indexes while the node is online. After the affected indexes have been reformatted on all nodes, range queries will begin returning the correct results for previously written data. The utility should be run against any Riak cluster using 2i after the entire cluster has been upgraded to 1.3.1, regardless of whether or not large integer index values are used. It will report how many indexes were affected (rewritten). Unaffected indexes are not modified, and new writes will be written in the correct format.

To reformat indexes on a Riak node, run:

riak-admin reformat-indexes [<concurrency>] [<batch size>]

The concurrency option controls how many partitions are reformatted concurrently; if not provided it defaults to 2. Batch size controls how many keys are fixed at a time and defaults to 100. A node without load can finish reformatting much faster with a higher concurrency value, while lowering the batch size can reduce the latency of other node operations if the node is under load during the reformatting. We recommend using the default values and tweaking them only after testing. Output will be printed to the logs once the reformatting has completed (or if it errors). If the reformatting operation errors, it should be re-executed; the operation will only attempt to reformat keys that were not fixed on the previous run.
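For example, a lightly loaded node with spare cores could be reformatted more aggressively (the values below are illustrative, not tuned recommendations):

# Reformat 4 partitions at a time, fixing 500 keys per batch
riak-admin reformat-indexes 4 500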

If downgrading back to Riak 1.3 from Riak 1.3.1, indexes will need to be reformatted back to the old encoding in order for the downgraded node to run correctly. The --downgrade flag can be passed to riak-admin reformat-indexes to perform this operation:

riak-admin reformat-indexes [<concurrency>] [<batch size>] --downgrade

The concurrency and batch size parameters work in exactly the same way as in the upgrade case above.

[1] https://github.com/uwiger/sext

[2] https://github.com/uwiger/sext/commit/ff10beb7a791f04ad439d2c1c566251901dd6bdc

Improved bitcask startup time

We fixed a problem that was preventing vnodes from starting concurrently. Installations using the bitcask backend should see a substantial improvement in startup times if multiple cores are available. We have observed improvements in the vicinity of an order of magnitude (~10X) on some of our own clusters.

Fix behaviour of PR/PW

For Riak releases prior to 1.3.1, the get and put options PR and PW only checked that the requested number of primaries were online when the request was handled; they did not check which vnodes actually responded. So with a PW of 2 you could write to one primary and one fallback, fail the write to the second primary, and still return success.

As of Riak 1.3.1, PR and PW will also wait until the required number of primaries have responded before returning the result of the operation. This means that if PR + PW > N and both requests succeed, you are guaranteed to read the value you previously wrote (barring other intervening writes and irretrievably lost replicas).

Note, however, that a write with PW that fails may still have partially succeeded on some replicas. This change is purely about strengthening the constraints you can impose on read/write success. See the pull request linked below for more information.
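As a sketch of the strengthened guarantee over the HTTP interface, assuming the default N of 3 and a cluster where the primaries are reachable (the bucket and key names are illustrative):

# Write with PW=2: success now implies 2 primaries actually responded
curl -X PUT -H "Content-Type: text/plain" -d "hello" \
  "http://127.0.0.1:8098/buckets/test/keys/demo?pw=2"

# Read with PR=2: PR + PW = 4 > N = 3, so this read sees the write above
# (barring intervening writes and irretrievably lost replicas)
curl "http://127.0.0.1:8098/buckets/test/keys/demo?pr=2"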

Issues / PRs Resolved

Riak 1.3.0 Release Notes

New Features or Major Improvements for Riak

Active Anti-Entropy

New in Riak 1.3. Riak now includes an active anti-entropy (AAE) subsystem that works to verify and repair data across an entire Riak cluster. The AAE system periodically exchanges information between data replicas in order to determine missing or divergent data. When bad replicas are detected, AAE triggers read repair to correct the situation. AAE is entirely automatic, and provides an additional layer of protection against various data loss scenarios (e.g. disk failure, restoring from an outdated backup, bit rot, etc.).

AAE is implemented using hash tree exchange, which ensures that the information exchanged between data replicas is proportional to the amount of divergent data rather than the total amount of data stored in Riak. When all data is in sync (the common case), exchanges are fast and have extremely low overhead. For this reason, AAE is able to perform multiple exchanges a minute with negligible impact on a cluster.

AAE hash trees are persistent entities stored in LevelDB instances separate from normal Riak K/V data. When first starting a fresh Riak 1.3 cluster (or upgrading from an older release), Riak will generate the hash tree information by traversing over each partition's data. By default, Riak will build one hash tree per hour per node. If the traversal over a partition's data takes more than an hour, then Riak may trigger a second tree build. However, by default at most two tree builds can occur at once.

Once a hash tree is built, it is kept up-to-date in real-time as writes are sent to Riak. However, trees are periodically expired and rebuilt to protect against potential divergence between the K/V data and its corresponding hash tree. Rebuilding trees also protects against silent data corruption (eg. bit rot). By default, trees are expired and rebuilt once a week.

All of the above settings (and more) can be configured in app.config. The AAE settings are in the riak_kv section, and have comments documenting the different options.
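For reference, a minimal sketch of those riak_kv entries, using the defaults described above (the setting names should be checked against the comments in your shipped app.config):

{riak_kv, [
    %% Enable active anti-entropy
    {anti_entropy, {on, []}},
    %% Build at most 1 tree per hour (3600000 ms) per node
    {anti_entropy_build_limit, {1, 3600000}},
    %% At most 2 concurrent tree builds
    {anti_entropy_concurrency, 2},
    %% Expire and rebuild trees once a week (in ms)
    {anti_entropy_expire, 604800000}
]}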

To provide insight into AAE, Riak includes the riak-admin aae-status command. The AAE status output is broken into three parts: Exchanges, Entropy Trees, and Keys Repaired.

================================== Exchanges ==================================
Index                                              Last (ago)    All (ago)
-------------------------------------------------------------------------------
0                                                  3.8 min       4.1 min
91343852333181432387730302044767688728495783936    3.3 min       7.8 min
182687704666362864775460604089535377456991567872   2.8 min       8.3 min
274031556999544297163190906134303066185487351808   2.3 min       6.3 min
365375409332725729550921208179070754913983135744   1.8 min       5.5 min
<snip>

The Exchanges section shows information about AAE exchanges for each K/V partition. The Last column lists when the most recent exchange between a partition and one of its sibling replicas was performed. The All column shows how long it has been since the partition exchanged with all of its sibling replicas. In essence, the All column sets the upper bound on how out-of-date an individual partition can be: a partition cannot have any missing or divergent data older than the value shown in All, unless all replicas for that data are invalid.

================================ Entropy Trees ================================
Index                                              Built (ago)
-------------------------------------------------------------------------------
0                                                  22.1 min
91343852333181432387730302044767688728495783936    22.6 min
182687704666362864775460604089535377456991567872   22.3 min
274031556999544297163190906134303066185487351808   22.9 min
365375409332725729550921208179070754913983135744   22.3 min
<snip>

The Entropy Trees section shows when the hash trees for a given partition were built. A hash tree must be built before a partition can participate in an exchange. As mentioned above, trees are expired and rebuilt once a week by default.

================================ Keys Repaired ================================
Index                                                Last      Mean      Max
-------------------------------------------------------------------------------
0                                                     0         0         0
91343852333181432387730302044767688728495783936       87        21        87
182687704666362864775460604089535377456991567872      0         0         0
274031556999544297163190906134303066185487351808      0         0         0
365375409332725729550921208179070754913983135744      0         0         0
<snip>

The Keys Repaired section presents information about repairs triggered by AAE, including keys repaired in the most recent exchange, and the mean and max across all exchanges.

Note: All AAE status information is in-memory and is reset across a node restart. Only tree build times are persistent (since trees themselves are persistent).

Final notes about AAE:

  1. Trees must be built before exchange can occur. Since trees are built once an hour by default, it will take up to ring_size / number_of_nodes hours before all trees are built after first starting or upgrading to 1.3 (for example, about 16 hours for a 64-partition ring on 4 nodes), and therefore that amount of time until AAE is fully protecting all data.

  2. Tree building typically uses 100% of a CPU when possible but should have minimal impact on Riak performance. When using Bitcask for K/V data, tree building may increase the latency for list_keys, list_buckets, and Riak EE's fullsync replication strategy. Once trees are built, these issues go away (until trees are expired/rebuilt a week later).

  3. AAE may occasionally repair a small number of keys (typically 1 or 2) even in a healthy cluster without divergent or missing data. This occurs when AAE is performing an exchange at the same time incoming writes are occurring to the same nodes. For example, a write may reach node A while being concurrently in-flight to node B, yet AAE happens to run at just the right moment to see the write on A but not B, and force a repair. Since AAE just triggers reads (to trigger read repair) this behavior is entirely safe.

  4. AAE is a feature of Riak K/V and does not protect Riak Search data.

MapReduce Sink Backpressure

Riak Pipe brought inter-stage backpressure to Riak KV's MapReduce system. However, prior to Riak 1.3, that backpressure did not extend to the sink. It was assumed that the Protocol Buffers or HTTP endpoint could handle the full output rate of the pipe. With Riak 1.3, backpressure has been extended to the sink so that those endpoint processes no longer become overwhelmed. This backpressure is tunable via a soft cap on the size of the sink's buffer, and a period at which a worker should check that cap. These can be configured at the Riak console by setting application environment variables, or in the riak_kv section of app.config (defaults shown):

{riak_kv,
 ...
 %% Soft cap on the MapReduce sink's buffer,
 %% expressed as a positive integer number of messages
 %% (one message is used per MapReduce result)
 {mrc_sink_buffer, 1000},

 %% Period at which a MapReduce worker must check
 %% the sink's buffer cap, expressed as an integer
 %% number of messages to send before waiting on
 %% a clear-to-send acknowledgement
 %%   0 = wait for acknowledgement of each message
 %%   1 = wait every other message
 %%   'infinity' = never wait for acknowledgements
 {mrc_sink_sync_period, 10}
}.
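Since these are ordinary application environment variables, they can also be changed from an attached Riak console without editing app.config (the values here are illustrative):

%% From `riak attach`: raise the buffer cap and sync period
application:set_env(riak_kv, mrc_sink_buffer, 2000).
application:set_env(riak_kv, mrc_sink_sync_period, 20).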

Additional IPv6 support

Riak Handoff and Protocol Buffers interfaces can now listen on IPv6 addresses (HTTP has always supported IPv6). You may specify the address either as a short-hand string, e.g. "::1" (localhost), or as a tuple of 8 numbers representing the 16-byte address, e.g. {0,0,0,0,0,0,0,1} (localhost). IPv4 addresses may also be specified in either form, except that the tuple form has 4 numbers (one per byte). Note: this does not affect Riak node names. Refer to the `inet_dist_*` settings in the Erlang documentation to enable IPv6 support for cluster membership.
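As a sketch, assuming the usual listener settings (handoff_ip in the riak_core section and pb_ip in the riak_api section; verify both against your app.config), an IPv6 setup could look like:

{riak_core, [
    %% Handoff listener on IPv6 localhost, string form
    {handoff_ip, "::1"}
]},
{riak_api, [
    %% Protocol Buffers listener on IPv6 localhost, tuple form
    {pb_ip, {0,0,0,0,0,0,0,1}},
    {pb_port, 8087}
]}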

Luke Removal

The luke application was deprecated in Riak 1.2. This release removes it, along with all code that used it.

riak getpid Added

A bug in how we used riak stop (listed below in Bugs Fixed) justified refactoring how we obtain Riak's own PID. While fixing the bug, we decided getpid might be useful to system administrators who don't want to rely on outside scripts to find the PID of Riak. riak getpid does what you expect: it prints the PID of a running Riak, or exits with status 1 on failure. It is a small feature, but it might save some time with ps, grep, and awk.
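For example (the PID shown is illustrative):

$ riak getpid
4120

# Exit status is 1 when the node is down, so it scripts cleanly
riak getpid || echo "Riak is not running"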

Riaknostic Included by Default

To encourage its use, we now include Riaknostic in the Riak packages. Prior to 1.3, users needed to download Riaknostic separately, but now riak-admin diag works out of the box.

Support added for SmartOS 1.8

Packages are now available for SmartOS machines based on 1.8 datasets as well as 1.6.

Health Check

New in Riak 1.3. Riak Core now includes a health check subsystem that actively monitors each node for specific conditions and disables/enables services based on those conditions.

To enable or disable all health checks, a new setting has been added to the riak_core section of app.config:

%% Health Checks
%% If disabled, health checks registered by an application will
%% be ignored. NOTE: this option cannot be changed at runtime.
%% To re-enable, the setting must be changed and the node restarted.
{enable_health_checks, true},

Riak registers a health check with Riak Core to monitor the message queue lengths of KV vnodes. To configure the kv health check, a new setting has been added to the riak_kv section of app.config:

%% This option configures the riak_kv health check that monitors
%% message queue lengths of riak_kv vnodes. The value is a 2-tuple,
%% {EnableThreshold, DisableThreshold}. If a riak_kv_vnode's message
%% queue length reaches DisableThreshold the riak_kv service is disabled
%% on this node. The service will not be re-enabled until the message queue
%% length drops below EnableThreshold.
{vnode_mailbox_limit, {1, 5000}}

Note: the kv health check does not apply to Riak Search or Riak Pipe vnodes.

Reset Bucket Properties

The HTTP interface now supports resetting bucket properties to their default values. Bucket properties are stored in Riak's ring structure that is gossiped around the cluster. Resetting bucket properties for buckets that are no longer used or that are using the default properties can reduce the amount of gossiped data.
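As a sketch, assuming the /buckets/<name>/props resource of the HTTP interface (the bucket name is illustrative), resetting is an HTTP DELETE:

# Reset all properties of bucket "mybucket" to their defaults
curl -X DELETE "http://127.0.0.1:8098/buckets/mybucket/props"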

Support for logging to syslog

Riak 1.3 now includes support for logging to syslog. To enable it, you can add something like this to the 'handlers' section of Riak's app.config, under lager:

{lager_syslog_backend, ["riak", daemon, info]}

This logs any messages at info or above to the daemon facility, with the identity set to 'riak'. For more information, see the lager_syslog documentation:

https://github.com/basho/lager_syslog
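In context, the handlers section could look roughly like this (the console backend alongside it is just an illustrative example of an existing handler):

{lager, [
    {handlers, [
        %% Existing console handler (illustrative)
        {lager_console_backend, info},
        %% Send info and above to syslog's daemon facility as "riak"
        {lager_syslog_backend, ["riak", daemon, info]}
    ]}
]}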

Installation Notes

For RHEL/CentOS/Fedora users, the RPM tools have added a dependency on expect, so if you see a message like this:

$ sudo rpm -i riak-1.3.0rc1-1.el5.x86_64.rpm
error: Failed dependencies:
    /usr/bin/expect is needed by riak-1.3.0rc1-1.x86_64

You can fix this issue by installing the Riak RPM with yum, which will resolve any dependencies automatically:

$ sudo yum -y install riak-1.3.0rc1-1.el5.x86_64.rpm
Preparing...                ########################################### [100%]
   1:expect                 ########################################### [100%]
   2:riak                   ########################################### [100%]

Issues / PRs Resolved
