Merge branch 'bz15-clean-mapred-docs'

commit 96d0b8011521aa3b421cf273e173b93a74c49af6 (2 parents: d5d3cbb + a010092)
beerriot authored
Showing with 5 additions and 998 deletions.
  1. +0 −296 README
  2. +5 −2 README.org
  3. +0 −165 doc/basic-mapreduce.txt
  4. +0 −535 doc/js-mapreduce.org
296 README
@@ -1,296 +0,0 @@
- Welcome to Riak.
- ================
-
-Date: 2010-06-09 10:06:44 CDT
-
-
-
-Table of Contents
-=================
-1 Overview
-2 Quick Start
- 2.1 Building Riak
- 2.2 Starting Riak
- 2.3 Connecting a client to Riak
- 2.4 Clients for Other Languages
-3 Server Management
- 3.1 Configuration
- 3.2 Server Control
- 3.2.1 bin/riak
- 3.2.2 bin/riak-admin
-
-
-1 Overview
-~~~~~~~~~~~
- Riak is a distributed, decentralized data storage system.
-
- Below, you will find the "quick start" directions for setting up and
- using Riak. For more information, browse the following files:
-
- * README: this file
- * TODO: a list of improvements planned for Riak
- * LICENSE: the license under which Riak is released
- * apps/: the source tree for Riak and all its dependencies
- * doc/
- - admin.org: Riak Administration Guide
- - architecture.txt: details about the underlying design of Riak
- - basic-client.txt: slightly more detail on using Riak
- - basic-setup.txt: slightly more detail on setting up Riak
- - basic-mapreduce.txt: introduction to map/reduce on Riak
- - js-mapreduce.org: using Javascript with Riak map/reduce
- - man/riak.1.gz: manual page for the riak(1) command
- man/riak-admin.1.gz: manual page for the riak-admin(1) command
- - raw-http-howto.txt: using the Riak HTTP interface
-
-
-
-2 Quick Start
-~~~~~~~~~~~~~~
-
- This section assumes that you have a copy of the Riak source tree. To get
- started, you need to:
- 1. Build Riak
- 2. Start the Riak server
- 3. Connect a client and store/fetch data
-
-2.1 Building Riak
-==================
-
- Assuming you have a working Erlang (R14B02 or later) installation,
- building Riak should be as simple as:
-
-
- $ cd $RIAK
- $ make rel
-
-2.2 Starting Riak
-==================
-
- Once you have successfully built Riak, you can start the server with the
- following commands:
-
-
- $ cd $RIAK/rel/riak
- $ bin/riak start
-
- Now, verify that the server started up cleanly and is working:
-
- $ bin/riak-admin test
-
- Note that the $RIAK/rel/riak directory is a complete, self-contained instance
- of Riak and Erlang. It is strongly suggested that you move this directory
- outside the source tree if you plan to run a production instance.
-
-2.3 Connecting a client to Riak
-================================
-
- Now that you have a functional server, let's try storing some data in
- it. First, start up an Erlang node using the embedded version of Erlang:
-
-
- $ erts-<vsn>/bin/erl -name riaktest@127.0.0.1 -setcookie riak
-
- Eshell V5.7.4 (abort with ^G)
- (riaktest@127.0.0.1)1>
-
- Now construct the node name of the Riak server and make sure we can talk to it:
-
-
- (riaktest@127.0.0.1)4> RiakNode = 'riak@127.0.0.1'.
-
- (riaktest@127.0.0.1)2> net_adm:ping(RiakNode).
- pong
- (riaktest@127.0.0.1)2>
-
- We are now ready to start the Riak client:
-
-
- (riaktest@127.0.0.1)2> {ok, C} = riak:client_connect(RiakNode).
- {ok,{riak_client,'riak@127.0.0.1',<<4,136,81,151>>}}
-
- Let's create a shopping list for bread at /groceries/mine:
-
-
- (riaktest@127.0.0.1)6> O0 = riak_object:new(<<"groceries">>, <<"mine">>, ["bread"]).
- {r_object,<<"groceries">>,<<"mine">>,
- [{r_content,{dict,0,16,16,8,80,48,
- {[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
- {{[],[],[],[],[],[],[],[],[],[],[],[],...}}},
- ["bread"]}],
- [],
- {dict,1,16,16,8,80,48,
- {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
- {{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
- undefined}
-
- (riaktest@127.0.0.1)3> C:put(O0, 1).
-
- Now, read the list back from the Riak server and extract the value
-
-
- (riaktest@127.0.0.1)4> {ok, O1} = C:get(<<"groceries">>, <<"mine">>, 1).
- {ok,{r_object,<<"groceries">>,<<"mine">>,
- [{r_content,{dict,2,16,16,8,80,48,
- {[],[],[],[],[],[],[],[],[],[],[],[],...},
- {{[],[],[],[],[],[],
- ["X-Riak-Last-Modified",87|...],
- [],[],[],...}}},
- ["bread"]}],
- [{"20090722191020-riaktest@127.0.0.1-riakdemo@127.0.0.1-266664",
- {1,63415509105}}],
- {dict,0,16,16,8,80,48,
- {[],[],[],[],[],[],[],[],[],[],[],[],[],...},
- {{[],[],[],[],[],[],[],[],[],[],[],...}}},
- undefined}}
-
- (riaktest@127.0.0.1)5> %% extract the value
- (riaktest@127.0.0.1)5> V = riak_object:get_value(O1).
- ["bread"]
-
- Add milk to our list of groceries and write the new value to Riak:
-
-
- (riaktest@127.0.0.1)6> %% add milk to the list
- (riaktest@127.0.0.1)6> O2 = riak_object:update_value(O1, ["milk" | V]).
- {r_object,<<"groceries">>,<<"mine">>,
- [{r_content,{dict,2,16,16,8,80,48,
- {[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
- {{[],[],[],[],[],[],
- ["X-Riak-Last-Modified",87,101,100|...],
- [],[],[],[],[],...}}},
- ["bread"]}],
- [{"20090722191020-riaktest@127.0.0.1-riakdemo@127.0.0.1-266664",
- {1,63415509105}}],
- {dict,0,16,16,8,80,48,
- {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
- {{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
- ["milk","bread"]}
-
- (riaktest@127.0.0.1)7> %% store the new list
- (riaktest@127.0.0.1)7> C:put(O2, 1).
- ok
-
- Finally, see what other keys are available in the groceries bucket:
-
-
- (riaktest@127.0.0.1)8> C:list_keys(<<"groceries">>).
- {ok,[<<"mine">>]}
-
-2.4 Clients for Other Languages
-================================
-
- Client libraries are available for many languages. Rather than
- bundle them with the Riak server source code, we have given them
- each their own source repository. Currently, official Riak
- client language libraries include:
-
- + Javascript
- [http://bitbucket.org/basho/riak-javascript-client]
-
- + Python
- [http://bitbucket.org/basho/riak-python-client]
-
- + Ruby
- [http://bitbucket.org/basho/riak-ruby-client]
- [http://github.com/seancribbs/ripple/]
-
- + Java
- [http://bitbucket.org/basho/riak-java-client]
-
- + PHP
- [http://bitbucket.org/basho/riak-php-client]
-
- + Erlang
- [http://bitbucket.org/basho/riak-erlang-client]
- (using protocol buffers instead of distributed Erlang)
-
-3 Server Management
-~~~~~~~~~~~~~~~~~~~~
-
-3.1 Configuration
-==================
- Configuration for the Riak server is stored in the $RIAK/rel/riak/etc
- directory. There are two files:
- - vm.args
- This file contains the arguments that are passed to the Erlang VM
- in which Riak runs. The default settings in this file shouldn't need to be
- changed for most environments.
-
- - app.config
- This file contains the configuration for the Erlang applications
- that run on the Riak server.
-
- More information about these files is available in doc/basic-setup.txt.
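-
-  For illustration, the node name and cookie used elsewhere in this README
-  come from vm.args entries like the following (the values shown are
-  examples, not required settings):
-
-    -name riak@127.0.0.1
-    -setcookie riak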
-
-3.2 Server Control
-===================
-
-3.2.1 bin/riak
----------------
- This script is the primary interface for starting and stopping the Riak
- server.
-
- To start a daemonized (background) instance of Riak:
-
- $ bin/riak start
-
- Once a server is running in the background you can attach to the Erlang
- console via:
-
- $ bin/riak attach
-
- Alternatively, if you want to run a foreground instance of Riak, start it
- with:
-
- $ bin/riak console
-
- Stopping a foreground or background instance of Riak can be done from a
- shell prompt via:
-
- $ bin/riak stop
-
- Or if you are attached/on the Erlang console:
-
- (riak@127.0.0.1)1> q().
-
- You can check whether the server is running with:
-
- $ bin/riak ping
-
-3.2.2 bin/riak-admin
----------------------
- This script provides access to general administration of the Riak server.
- The commands below assume you are running the default configuration for
- parameters such as the Erlang cookie.
-
- To join a new Riak node to an existing cluster:
-
-
- $ bin/riak start # If a local server is not already running
- $ bin/riak-admin join <node in cluster>
-
- (Note that you must have a local node already running for this to work)
-
- To verify that the local Riak node is able to read/write data:
- $ bin/riak-admin test
-
- To back up a node or cluster, run one of the following:
- $ bin/riak-admin backup riak@X.X.X.X riak <directory/backup_file> node
- $ bin/riak-admin backup riak@X.X.X.X riak <directory/backup_file> all
-
- Restores work in two ways: if the backup file was made from a single node,
- only that node will be restored; if the backup file contains the data for a
- cluster, all nodes in the cluster will be restored.
-
- To restore from a backup file:
- $ bin/riak-admin restore riak@X.X.X.X riak <directory/backup_file>
-
- To view the status of a node:
- $ bin/riak-admin status
-
- If you change the IP address or node name, you will need to use the reip command:
- $ bin/riak-admin reip <old_nodename> <new_nodename>
-
-
-
7 README.org
@@ -17,13 +17,16 @@ Welcome to Riak.
- architecture.txt: details about the underlying design of Riak
- basic-client.txt: slightly more detail on using Riak
- basic-setup.txt: slightly more detail on setting up Riak
- - basic-mapreduce.txt: introduction to map/reduce on Riak
- - js-mapreduce.org: using Javascript with Riak map/reduce
- man/riak.1.gz: manual page for the riak(1) command
- man/riak-admin.1.gz: manual page for the riak-admin(1) command
- raw-http-howto.txt: using the Riak HTTP interface
+* Where to find more
+Below, you'll find a basic introduction to starting and using Riak as
+a key/value store. For more information about Riak's extended feature
+set, including MapReduce, Search, Secondary Indexes, various storage
+strategies, and more, please visit our wiki at http://wiki.basho.com/.
* Quick Start
165 doc/basic-mapreduce.txt
@@ -1,165 +0,0 @@
-Introduction to Map/Reduce on Riak
-------
-
-This document describes Riak's implementation of a data processing
-system based on the MapReduce[1] programming paradigm popularized by
-Google. It assumes that you have already set up Riak, and know the
-basics about dealing with Riak clients. For more information on these
-prerequisites, see riak/doc/basic-setup.txt and
-riak/doc/basic-client.txt.
-
-Quick and Dirty Example
----
-
-If you have a Riak client hanging around, you can execute Map/Reduce
-queries on it like this:
-
-1> Count = fun(G, undefined, none) ->
- [dict:from_list([{I, 1} || I <- riak_object:get_value(G)])]
- end.
-2> Merge = fun(Gcounts, none) ->
- [lists:foldl(fun(G, Acc) ->
- dict:merge(fun(_, X, Y) -> X+Y end,
- G, Acc)
- end,
- dict:new(),
- Gcounts)]
- end.
-3> {ok, [R]} = Client:mapred([{<<"groceries">>, <<"mine">>},
- {<<"groceries">>, <<"yours">>}],
- [{map, {qfun, Count}, none, false},
- {reduce, {qfun, Merge}, none, true}]).
-4> L = dict:to_list(R).
-
-If the "mine" and "yours" objects in the groceries bucket had values
-of ["bread", "cheese"], ["bread", "butter"], the sequence of commands
-above would result in L being bound to
-[{"bread", 2},{"cheese",1},{"butter",1}].
-
-
-Details
----
-
-In more detail, riak_client:mapred takes two lists as arguments.
-The first list contains bucket/key pairs, which identify the
-"starting values" of the Map/Reduce query. The second list contains
-the steps of the query.
-
-
-Map Steps
----
-
-Map steps expect as input a list of bucket/key pairs, just like the
-first argument to the riak_client:mapred function. Riak executes a
-map step by looking up values for keys in the input list and executing
-the map function referenced in the step.
-
-Map steps take the form:
-
-{map, FunTerm, Arg, Accumulate}
-
-Where:
-
- FunTerm is a reference to the function that will compute the map of
- each value. A function referenced by a FunTerm must be arity-3,
- accepting the arguments:
-
- Value: the value found at a key. This will be a Riak object
- (defined by the riak_object module) if a value was found, or the
- tuple {error, notfound} if a bucket/key was put in the input
- list, but not found in the Riak cluster.
-
- Data: An optional piece of data attached to the bucket/key tuple.
- If instead of {Bucket, Key}, {{Bucket, Key}, Data} is passed as
- input to a map step, that Data will be passed to the map
- function in this argument. Data will be the atom 'undefined' if
- the former form is used.
-
- Arg: The Arg from the map step definition. The same Arg is passed
- to every execution of the map function in this step.
-
- Functions may be referenced in two ways:
-
- {modfun, Module, Function} where Module and Function are atoms
- that name an Erlang function in a specific module
-
- {qfun, Function} where Function is a callable fun term
-
- The function must return a *list* of values. The lists returned
- by all executions of the map function for this step will be
- appended and passed to the next step.
-
- Arg: The third argument passed to the function referenced in FunTerm.
-
- Accumulate: If true, the output of this map step will be included in
- the final return of the mapred function. If false, the output will
- be discarded after the next step.
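-
-For example, a minimal map step built from an anonymous function (a sketch
-only; it assumes every input key exists, so it does not handle the
-{error, notfound} case):
-
-  GetValue = fun(Obj, _KeyData, _Arg) ->
-                 [riak_object:get_value(Obj)]
-             end.
-
-  {map, {qfun, GetValue}, none, true}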
-
-
-Reduce Steps
----
-
-A reduce step takes the form:
-
-{reduce, FunTerm, Arg, Acc}
-
-Where FunTerm, Arg, and Acc are mostly the same as their definitions
-for a map step, but the function referenced in FunTerm is of
-arity 2 instead. Its parameters are:
-
- ValueList: The list of values produced by the preceding step of the
- Map/Reduce.
-
- Arg: The Arg from the step definition.
-
-The function should again produce a list of values, but it must also
-be true that the function is commutative, associative, and
-idempotent. That is, if the input list [a,b,c,d] is valid for a given
-F, then all of the following must produce the same result:
-
- F([a,b,c,d])
- F([a,d] ++ F([c,b]))
- F([F([a]),F([c]),F([b]),F([d])])
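-
-As a concrete sketch, a set-union reduce satisfies these properties, since
-duplicates are removed no matter how the inputs are batched:
-
-  Union = fun(Values, _Arg) -> lists:usort(Values) end.
-
-  {reduce, {qfun, Union}, none, true}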
-
-
-Where does the code run?
----
-
-So, all well and good, but you could code the same abstraction in a
-couple of hours, right? Just fetch each object and run your function.
-
-Well, not so fast. This map/reduce isn't just an abstraction; it
-fully exploits data locality. That is to say, both map and reduce
-functions run on Riak nodes. Map functions are even run on the node where
-the data is already located.
-
-This means a few things to you:
-
-- If you use the {modfun, Module, Function} form of the FunTerm in the
- map/reduce step definition, that Module must be in the code path of
- the Riak node. This isn't a huge concern for libraries that ship
- with Erlang, but for any of your custom code, you'll need to make
- sure it's loadable.
-
-- If you use the {modfun, Module, Function} form of the FunTerm in the
- map/reduce step definition, you'll need to force the Riak nodes to
- reload the Module if you make a change to it.
-
- The easiest way to reload a module on a Riak node is to get a Riak
- client, then call Client:reload_all(Module).
-
-- If you need to do a Riak 'get' inside of a map or reduce function,
-  you can use riak:local_client/0 to get a Riak client instead of
-  riak:client_connect/1 (see the sketch after this list).
-
-- Your map and reduce functions are running on a Riak node, which
- means that that Riak node is spending CPU time doing something other
- than responding to 'get' and 'put' requests.
-
-- If you use the {qfun, Fun} form, your callable function and its
-  environment will be shipped to the Riak cluster and to each node on
- which it runs. This is both a benefit (in that you have the full
- power of closures) and a danger (in that you must be mindful of
- closing over very large data structures).
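-
-As referenced above, here is a sketch of a {qfun, Fun} map function that
-performs a Riak 'get' on the local node via riak:local_client/0 (the
-<<"settings">> bucket and <<"default">> key are purely illustrative):
-
-  WithSettings = fun(Obj, _KeyData, _Arg) ->
-                     %% assumes the settings object exists in the cluster
-                     {ok, C} = riak:local_client(),
-                     {ok, S} = C:get(<<"settings">>, <<"default">>, 1),
-                     [{riak_object:get_value(Obj), riak_object:get_value(S)}]
-                 end.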
-
-[1] http://labs.google.com/papers/mapreduce.html
535 doc/js-mapreduce.org
@@ -1,535 +0,0 @@
-#+SETUPFILE: "basho-doc-style.iorg"
-#+TITLE: Using Javascript with Riak Map/Reduce
-
-Riak supports writing map/reduce query functions in Javascript, as
-well as specifying query execution over HTTP. This document will
-teach you how to use these features.
-
-* Simple Example
-
- This section hits the ground running with a quick example to
- demonstrate what HTTP/Javascript map/reduce looks like in Riak.
- This example will store several chunks of text in Riak, and then
- compute word counts over the set of documents.
-
-** Load data
-
- We will use the Riak HTTP interface to store the texts we want to
- process:
-
-#+BEGIN_EXAMPLE
-$ curl -X PUT -H "content-type: text/plain" \
- http://localhost:8098/riak/alice/p1 --data-binary @-
-Alice was beginning to get very tired of sitting by her sister on the
-bank, and of having nothing to do: once or twice she had peeped into the
-book her sister was reading, but it had no pictures or conversations in
-it, 'and what is the use of a book,' thought Alice 'without pictures or
-conversation?'
-^D
-$ curl -X PUT -H "content-type: text/plain" \
- http://localhost:8098/riak/alice/p2 --data-binary @-
-So she was considering in her own mind (as well as she could, for the
-hot day made her feel very sleepy and stupid), whether the pleasure
-of making a daisy-chain would be worth the trouble of getting up and
-picking the daisies, when suddenly a White Rabbit with pink eyes ran
-close by her.
-^D
-$ curl -X PUT -H "content-type: text/plain" \
- http://localhost:8098/riak/alice/p5 --data-binary @-
-The rabbit-hole went straight on like a tunnel for some way, and then
-dipped suddenly down, so suddenly that Alice had not a moment to think
-about stopping herself before she found herself falling down a very deep
-well.
-#+END_EXAMPLE
-
-** Run query
-
- With data loaded, we can now run a query:
-
-#+BEGIN_EXAMPLE
-$ curl -X POST -H "content-type: application/json" http://localhost:8098/mapred --data @-
-{"inputs":[["alice","p1"],["alice","p2"],["alice","p5"]],"query":[{"map":{"language":"javascript","source":"function(v) { var m = v.values[0].data.toLowerCase().match('\\\\w*','g'); var r = []; for(var i in m) if (m[i] != '') { var o = {}; o[m[i]] = 1; r.push(o); } return r; }"}},{"reduce":{"language":"javascript","source":"function(v) { var r = {}; for (var i in v) { for(var w in v[i]) { if (w in r) r[w] += v[i][w]; else r[w] = v[i][w]; } } return [r]; }"}}]}
-^D
-#+END_EXAMPLE
-
- And we end up with the word counts for the three documents.
-
-#+BEGIN_EXAMPLE
-[{"the":8,"rabbit":2,"hole":1,"went":1,"straight":1,"on":2,"like":1,"a":6,"tunnel":1,"for":2,"some":1,"way":1,"and":5,"then":1,"dipped":1,"suddenly":3,"down":2,"so":2,"that":1,"alice":3,"had":3,"not":1,"moment":1,"to":3,"think":1,"about":1,"stopping":1,"herself":2,"before":1,"she":4,"found":1,"falling":1,"very":3,"deep":1,"well":2,"was":3,"considering":1,"in":2,"her":5,"own":1,"mind":1,"as":2,"could":1,"hot":1,"day":1,"made":1,"feel":1,"sleepy":1,"stupid":1,"whether":1,"pleasure":1,"of":5,"making":1,"daisy":1,"chain":1,"would":1,"be":1,"worth":1,"trouble":1,"getting":1,"up":1,"picking":1,"daisies":1,"when":1,"white":1,"with":1,"pink":1,"eyes":1,"ran":1,"close":1,"by":2,"beginning":1,"get":1,"tired":1,"sitting":1,"sister":2,"bank":1,"having":1,"nothing":1,"do":1,"once":1,"or":3,"twice":1,"peeped":1,"into":1,"book":2,"reading":1,"but":1,"it":2,"no":1,"pictures":2,"conversations":1,"what":1,"is":1,"use":1,"thought":1,"without":1,"conversation":1}]
-#+END_EXAMPLE
-
-** Explanation
-
- For more details about what each bit of syntax means, and other
- syntax options, read the following sections. As a quick
- explanation of how this example map/reduce query worked, though:
-
- 1. The objects named =p1=, =p2=, and =p5= from the =alice= bucket
- were given as inputs to the query.
-
- 2. The map function from the phase was run on each object. The
- function:
-
-#+BEGIN_SRC javascript
-function(v) {
- var m = v.values[0].data.toLowerCase().match('\\w*','g');
- var r = [];
- for(var i in m)
- if (m[i] != '') {
- var o = {};
- o[m[i]] = 1;
- r.push(o);
- }
- return r;
-}
-#+END_SRC
-
- creates a list of JSON objects, one for each (non-unique) word
- in the text. Each object has the word as its key and the
- integer 1 as the value for that key.
-
- 3. The reduce function from the phase was run on the outputs of the
- map functions. The function:
-
-#+BEGIN_SRC javascript
-function(v) {
- var r = {};
- for (var i in v) {
- for(var w in v[i]) {
- if (w in r)
- r[w] += v[i][w];
- else
- r[w] = v[i][w];
- }
- }
- return [r];
- }
-#+END_SRC
-
- looks at each JSON object in the input list. It steps through
- each key in each object and produces a new object. That new
- object has a key for every key that appears in the inputs, the
- value of that key being the sum of the values of that key across
- the input objects. It returns this new object in a list, because
- it may be run a second time on a list including that object and
- more inputs from the map phase.
-
- 4. The final output is a list with one element: a JSON object with
- a key for each word in all of the documents (unique), with the
- value of that key being the number of times the word appeared in
- the documents.
-
-* Query Syntax
-
- Map/Reduce queries are issued over HTTP via a POST to the /mapred
- resource. The body should be =application/json= of the form
- ={"inputs":[...inputs...],"query":[...query...]}=.
-
- Map/Reduce queries have a default timeout of 60000 milliseconds
- (60 seconds). The default timeout can be overridden by supplying
- a different value, in milliseconds, in the JSON document
- ={"inputs":[...inputs...],"query":[...query...],"timeout": 90000}=
-
-** Inputs
-
- The list of input objects is given as a list of 2-element lists of
- the form =[Bucket,Key]= or 3-element lists of the form
- =[Bucket,Key,KeyData]=.
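-
-  For example, mixing the two forms (the keydata string here is an
-  arbitrary illustration):
-
-:{"inputs":[["alice","p1"],["alice","p2","second paragraph"]],"query":[...query...]}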
-
- You may also pass just the name of a bucket
- (={"inputs":"mybucket",...}=), which is equivalent to passing all
- of the keys in that bucket as inputs (i.e. "a map/reduce across the
- whole bucket"). You should be aware that this triggers the
- somewhat expensive "list keys" operation, so you should use it
- sparingly.
-
-** Query
-
- The query is given as a list of phases, each phase being of the
- form ={PhaseType:{...spec...}}=. Valid =PhaseType= values are
- "map", "reduce", and "link".
-
- Every phase spec may include a =keep= field, which must have a
- boolean value: =true= means that the results of this phase should
- be included in the final result of the map/reduce, =false= means
- the results of this phase should be used only by the next phase.
- Omitting the =keep= field accepts its default value, which is
- =false= for all phases except the final phase (Riak assumes that
- you were most interested in the results of the last phase of your
- map/reduce query).
-
-*** Map
-
- Map phases must be told where to find the code for the function to
- execute, and what language that function is in.
-
- Function source can be specified directly in the query by using
- the "source" spec field. Function source can also be loaded from
- a pre-stored riak object by providing "bucket" and "key" fields in
- the spec.
-
- For example:
-
-:{"map":{"language":"javascript","source":"function(v) { return [v]; }","keep":true}}
-
- would run the Javascript function given in the spec, and include
- the results in the final output of the m/r query.
-
-:{"map":{"language":"javascript","bucket":"myjs","key":"mymap","keep":false}}
-
- would run the Javascript function declared in the content of the
- Riak object under =mymap= in the =myjs= bucket, and the results of
- the function would not be included in the final output of the m/r
- query.
-
- Map phases may also be passed static arguments by using the "arg"
- spec field.
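-
-    For example (a sketch), a phase that simply echoes its static argument
-    once per input object:
-
-:{"map":{"language":"javascript","source":"function(v, keyData, arg) { return [arg]; }","arg":"demo","keep":true}}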
-
-*** Reduce
-
- Reduce phases look exactly like map phases, but are labeled "reduce".
-
-*** Link
-
- Link phases accept =bucket= and =tag= fields that specify which
- links match the link query. The string "_" (underscore) in each
- field means "match all", while any other string means "match
- exactly this string". If either field is left out, it is
- considered to be set to "_" (match all).
-
- For example:
-
-:{"link":{"bucket":"foo","keep":false}}
-
- Would follow all links pointing to objects in the =foo= bucket,
- regardless of their tag.
-
-* Javascript Functions
-** Function Parameters
-*** Map functions
-
- Map functions are passed three parameters: the object that the map
- is being applied to, the "keydata" for that object, and the static
- argument for the phase.
-
- The object will be a JSON object of the form:
-
-#+BEGIN_EXAMPLE
-{
- "bucket":BucketAsString,
- "key":KeyAsString,
- "vclock":VclockAsString,
- "values":[
- {
- "metadata":{
- "X-Riak-VTag":VtagAsString,
- "X-riak-Last-Modified":LastModAsString,
- ...other metadata...
- },
- "data":ObjectData
- },
- ...other metadata/data values (siblings)...
- ]
-}
-#+END_EXAMPLE
-
- =object.values[0].data= is probably what you will be interested in
- most of the time, but the rest of the details of the object are
- provided for your use.
-
- The "keydata" is the third element of the item from the input
- bucket/key list (called =KeyData= in the [[Inputs]] section above), or
- "undefined" if none was provided.
-
- The static argument for the phase is the value of the =arg= field
- from the map spec in the query list.
-
- A map phase should produce a list of results. You will see errors
- if the output of your map function is not a list. Return the
- empty list if your map function chooses not to produce output.
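-
-    As an illustration (a sketch only), a map function that uses all three
-    parameters and returns a one-element list per input object:
-
-#+BEGIN_SRC javascript
-function(value, keyData, arg) {
-  // value.values[0].data holds the stored content (first sibling only)
-  var data = value.values[0].data;
-  // one result per input: the object's key, its content length, and
-  // whatever keydata and static argument were supplied
-  return [{"key": value.key,
-           "length": data.length,
-           "keydata": keyData,
-           "arg": arg}];
-}
-#+END_SRC
-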
-*** Reduce functions
-
- Reduce functions are passed two parameters: a list of inputs to
- reduce, and the static argument for the phase.
-
- The list of inputs to reduce may contain values from previous
- executions of the reduce function. It will also contain results
- produced by the preceding map or reduce phase.
-
- The static argument for the phase is the value of the =arg= field
- from the reduce spec in the query list.
-
- A reduce phase should produce a list of results. You will see
- errors if the output of your reduce function is not a list. The
- function should return an empty list if it has no other output to
- produce.
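-
-    For example (a sketch), a reduce function that sums a list of numbers;
-    because its one-element output is itself a number, re-running it on a
-    mix of earlier results and new map outputs is safe:
-
-#+BEGIN_SRC javascript
-function(values, arg) {
-  var total = 0;
-  for (var i = 0; i < values.length; i++)
-    total += values[i];
-  return [total];
-}
-#+END_SRC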
-
-*** Link functions
-
- If you are storing data through the HTTP interface, and using the
- =Link= HTTP header, you do not need to worry about writing a
- link-extraction function. Just use the predefined
- =raw_link_walker_resource:mapreduce_linkfun/3=.
-
- But, if you need to extract links from your data in some other
- manner, there are many ways to specify Javascript functions to do
- that. They all start with setting the =linkfun= bucket property.
- Through the HTTP interface:
-
-:$ curl -X PUT -H "application/json" http://localhost:8098/riak/bucket \
-:> --data "{\"props\":{\"linkfun\":{...function...}}}"
-
- The three ways to fill in the value of the =linkfun= key are:
-
- + Quoted source code, as the value of the =jsanon= key:
-
- :{"jsanon":"function(v,kd,bt) { return []; }"}
-
- + The bucket and key of an object containing the function source:
-
- :{"jsanon":{"bucket":Bucket,"key":Key}}
-
- + The name of a predefined Javascript function:
-
- :{"jsfun":FunctionName}
-
- The function has basically the same contract as a map function.
- The first argument is the object from which links should be
- extracted. The second argument is the =KeyData= for the object.
-
- The third argument is a Javascript object describing which links to
- match and return. The two fields in the object, =bucket= and
- =tag=, will have the values given in the link phase spec from the
- query.
-
- The link fun should return a list of the same form as the =inputs=
- list: 2-item bucket/key lists, or 3-item bucket/key/keydata lists.
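-
-    As a sketch only: a link function for objects whose JSON data stores
-    links under a =links= field of =[bucket, key, tag]= triples (that field
-    layout is an assumption of this example, not something Riak requires):
-
-#+BEGIN_SRC javascript
-function(value, keyData, linkSpec) {
-  // assumed layout: {"links":[["bucket","key","tag"], ...]} in the object data
-  var links = JSON.parse(value.values[0].data).links || [];
-  var matches = [];
-  for (var i = 0; i < links.length; i++) {
-    var l = links[i];
-    if ((linkSpec.bucket == "_" || linkSpec.bucket == l[0]) &&
-        (linkSpec.tag == "_" || linkSpec.tag == l[2]))
-      matches.push([l[0], l[1]]);
-  }
-  return matches;
-}
-#+END_SRC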
-
-* How Map/Reduce Queries Work
-
-** Map/Reduce Intro
-
- The main goal of Map/Reduce is to spread the processing of a query
- across many systems to take advantage of parallel processing power.
- This is generally done by dividing the query into several steps,
- dividing the dataset into several chunks, and then running those
- step/chunk pairs in separate physical hosts.
-
- One step type is called "map". Map functions take one piece of
- data as input, and produce zero or more results as output. If
- you're familiar with "mapping over a list" in functional
- programming style, you're already familiar with "map" steps in a
- map/reduce query.
-
- Another step type is called "reduce". The purpose of a "reduce"
- step is to combine the output of many "map" step evaluations, into
- one result.
-
- The common example of a map/reduce query involves a "map" step that
- takes a body of text as input, and produces a word count for that
- body of text. A reduce step then takes the word counts produced
- from many bodies of text and either sums them to provide a word
- count for the corpus, or filters them to produce a list of
- documents containing only certain counts.
-
-** Riak-specific Map/Reduce
-
-*** How Riak Spreads Processing
-
- Riak's map/reduce has an additional goal: increasing data-locality.
- When processing a large dataset, it's often much more efficient to
- take the computation to the data than it is to bring the data to
- the computation.
-
- It is Riak's solution to the data-locality problem that determines
- how Riak spreads the processing across the cluster. In the same
- way that any Riak node can coordinate a read or write by sending
- requests directly to the other nodes responsible for maintaining
- that data, any Riak node can also coordinate a map/reduce query by
- sending a map-step evaluation request directly to the node
- responsible for maintaining the input data. Map-step results are
- sent back to the coordinating node, where reduce-step processing
- can produce a unified result.
-
- Put more simply: Riak runs map-step functions right on the node
- holding the input data for those functions, and it runs reduce-step
- functions on the node coordinating the map/reduce query.
-
-*** How Riak's Map/Reduce Queries Are Specified
-
- Map/Reduce queries in Riak have two components: a list of inputs
- and a list of "steps", or "phases".
-
- Each element of the input list is a bucket-key pair. This
- bucket-key pair may also be annotated with "key-data", which will
- be passed as an argument to a map function, when evaluated on the
- object stored under that bucket-key pair.
-
- Each element of the phases list is a description of a map
- function, a reduce function, or a link function. The description
- includes where to find the code for the phase function (for map
- and reduce phases), static data passed to the function every time
- it is executed during that phase, and a flag indicating whether or
- not to include the results of that phase in the final output of
- the query.
-
- The phase list describes the chain of operations each input will
- flow through. That is, the initial inputs will be fed to the
- first phase in the list, and the output of that phase will be fed
- as input to the next phase in the list. This stream will continue
- through the final phase.
-
-*** How a Map Phase Works in Riak
-
- The input list to a map phase must be a list of (possibly
- annotated) bucket-key pairs. For each pair, Riak will send the
- request to evaluate the map function to the partition that is
- responsible for storing the data for that bucket-key. The vnode
- hosting that partition will look up the object stored under that
- bucket-key, and evaluate the map function with the object as an
- argument. The other arguments to the function will be the
- annotation, if any is included, with the bucket-key, and the
- static data for the phase, as specified in the query.
-
-*** How a Reduce Phase Works in Riak
-
- Reduce phases accept any list of data as input, and produce any
- list of data as output. They also receive a phase-static value,
- specified in the query definition.
-
- The important thing to understand is that the function defining
- the reduce phase may be evaluated multiple times, and the input of
- later evaluations will include the input of earlier evaluations.
-
- For example, a reduce phase may implement the "set-union"
- function. In that case, the first set of inputs might be
- =[1,2,2,3]=, and the output would be =[1,2,3]=. When the phase
- receives more inputs, say =[3,4,5]=, the function will be called
- with the concatenation of the two lists: =[1,2,3,3,4,5]=.
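-
-  Such a set-union reduce might be sketched like this (stringifying each
-  value to test membership is just one way to compare arbitrary JSON values):
-
-#+BEGIN_SRC javascript
-function(values, arg) {
-  var seen = {};
-  var result = [];
-  for (var i = 0; i < values.length; i++) {
-    var key = JSON.stringify(values[i]);
-    if (!(key in seen)) {
-      seen[key] = true;
-      result.push(values[i]);
-    }
-  }
-  return result;
-}
-#+END_SRC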
-
- Other systems refer to the second application of the reduce
- function as a "re-reduce". There are at least a couple of
- reduce-query implementation strategies that work with Riak's model.
-
- One strategy is to implement the phase preceding the reduce
- phase, such that its output is "the same shape" as the output of
- the reduce phase. This is how the examples in this document are
- written, and the way that we have found produces cleaner code.
-
- An alternate strategy is to make the output of a reduce phase
- recognizable, such that it can be extracted from the input list on
- subsequent applications. For example, if inputs from the
- preceding phase are numbers, outputs from the reduce phase could
- be objects or strings. This would allow the function to find the
- previous result, and apply new inputs to it.
-
-*** How a Link Phase Works in Riak
-
- Link phases find links matching patterns specified in the query
- definition. The patterns specify which buckets and tags links
- must have.
-
- "Following a link" means adding it to the output list of this
- phase. The output of this phase is often most useful as input to
- a map phase, or another reduce phase.
-
-*** Using Named Functions
-
- Riak can also use pre-defined named functions for map and reduce
- phase processing. Named functions are invoked with the following
- form:
-
-#+BEGIN_EXAMPLE
-{"map": {"language": "javascript", "name": "Riak.mapValues", "keep": true}}
-
-{"reduce": {"language": "javascript", "name": "Riak.reduceSort", "keep": true}}
-#+END_EXAMPLE
-
- The key =name= in both examples points to the name of the function
- to be used. Riak expects the function to be defined prior to the
- execution of the phase using it.
-
-**** Defining Named Functions
-
- Defining a named function for Riak is a simple process.
-
- 1. Create a Javascript source file containing the definitions for
- all the functions you would like Riak to pre-define.
- 2. Edit the =app.config= of your Riak nodes and add the line
- ={js_source_dir, <path_to_source_dir>}= to the =riak=
- configuration block. =<path_to_source_dir>= should point to
- the directory where the file created in step #1 was saved.
- 3. Start using the functions in your map/reduce jobs.
-
- When =js_source_dir= is enabled, Riak scans the directory for
- files ending in =.js=. These files are then loaded into each
- Javascript VM when it is created.
-
- NOTE: Named functions must be available on all nodes in a cluster
- for proper map/reduce results.
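-
-   For instance (all names and the path are illustrative only), a source
-   file placed in =js_source_dir= could define:
-
-#+BEGIN_SRC javascript
-// e.g. saved as <js_source_dir>/my_functions.js
-var MyLib = {
-  mapKeyOnly: function(value, keyData, arg) {
-    // return just the bucket/key pair of each input object
-    return [[value.bucket, value.key]];
-  }
-};
-#+END_SRC
-
-   A map phase could then reference it with
-   ={"map":{"language":"javascript","name":"MyLib.mapKeyOnly"}}=.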
-
-**** Why Use Named Functions?
-
- Named functions can be better than anonymous functions in certain
- situations. Since named functions live in a file they can be
- managed using source code control and deployed automatically
- using tools such as Chef or Puppet. This can be a significant
- advantage when administering large Riak clusters.
-
- More important, though, is the fact that named functions execute
- much faster than the equivalent anonymous functions. Invoking
- anonymous functions requires Riak to ensure the anonymous
- function is defined before invoking it. Named functions allow
- Riak to skip the definition check and execute the function call
- immediately.
-
- Also, since named functions do not change between invocations,
- Riak is able to cache named function call results and short
- circuit the call entirely. Currently, Riak performs this
- optimization on named functions executed during map phases only.
-
- In general, anonymous functions should be used during development
- and named functions should be used for production deployments
- where possible. This combination provides the maximum flexibility
- and performance.
-
-**** Riak-Supplied Functions
-
- Riak supplies several named functions out of the box. These
- functions are defined on a global Javascript object named =Riak=
- and should not be modified or overridden. These functions, along
- with descriptions and notes on their use are described in the
- next two sections.
-
-***** Named Map Functions
-
- + =Riak.mapValues(values, keyData, arg)=
- Extracts and returns only the values contained in a bucket and key.
-
- + =Riak.mapValuesJson(values, keyData, arg)=
- Same as =mapValues= except the values are passed through a JSON
- decoder first.
-
-***** Named Reduce Functions
-
- + =Riak.reduceSum(values, arg)=
- Returns the sum of =values=
-
- + =Riak.reduceMin(values, arg)=
- Returns the minimum value from =values=
-
- + =Riak.reduceMax(values, arg)=
- Returns the maximum value from =values=
-
- + =Riak.reduceSort(values, arg)=
- Returns the sorted version of =values=. If =arg= is the source
- to a Javascript function, it will be eval'd and used to
- control the sort via =Array.sort=.
-
- + =Riak.reduceLimit(values, arg)=
- Returns the leftmost n members of values where =arg= is used as n.
-
- + =Riak.reduceSlice(values, arg)=
- Returns a slice of the values array. =arg= must be a two
- element array containing the starting and ending positions for
- the slice.
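-
-     As a closing sketch, a query that decodes JSON number values and sums
-     them could chain two of these functions (the =mynumbers= bucket is
-     illustrative, and its objects are assumed to hold JSON numbers):
-
-#+BEGIN_EXAMPLE
-{"inputs":"mynumbers",
- "query":[{"map":{"language":"javascript","name":"Riak.mapValuesJson"}},
-          {"reduce":{"language":"javascript","name":"Riak.reduceSum","keep":true}}]}
-#+END_EXAMPLE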