Riak Search 0.13 Manual

Introduction

Riak Search is a distributed, easily-scalable, failure-tolerant, real-time, full-text search engine built around Riak Core and tightly integrated with Riak KV.

Riak Search allows you to find and retrieve your Riak objects using the objects’ values. When a Riak KV bucket has been enabled for Search integration (by installing the Search pre-commit hook), any objects stored in that bucket are also indexed seamlessly in Riak Search.

The Riak Client API can then be used to perform Search queries that return a list of bucket/key pairs matching the query. Alternatively, the query results can be used as the input to a Riak map/reduce operation. Currently the PHP, Python, Ruby, and Erlang APIs support integration with Riak Search.

Operations

Operationally, Riak Search is very similar to Riak KV. An administrator can add nodes to a cluster on the fly with simple commands to increase performance or capacity. Index and query operations can be run from any node. Multiple replicas of data are stored, allowing the cluster to continue serving full results in the face of machine failure. Partitions are handed off and replicated across clusters using the same mechanisms as Riak KV.

Indexing

At index time, Riak Search tokenizes a document into an inverted index using standard Lucene Analyzers. (For improved performance, the team re-implemented some of these in Erlang to reduce hops between Erlang and Java.) Custom analyzers can be created in either Java or Erlang. The system consults a schema (defined per-index) to determine required fields, the unique key, the default analyzer, and which analyzer should be used for each field. Field aliases (grouping multiple fields into one field) and dynamic fields (wildcard field matching) are supported.

After analyzing a document into an inverted index, the system uses a consistent hash to divide the inverted index entries (called postings) by term across the cluster. This is called term-partitioning and is a key difference from other commonly used distributed indexes. Term-partitioning was chosen because it provides higher overall query throughput with large data sets. (This can come at the expense of higher-latency queries for especially large result sets.)
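
As a purely conceptual sketch (illustrative Erlang only, not Riak Search's actual hashing code; the function name, the 64-partition count, and the use of erlang:phash2 are all assumptions for illustration), hashing on the whole {Index, Field, Term} key means every posting for a given term maps to the same partition, so a single-term query can be answered from one place:

%% Conceptual sketch only -- not Riak Search's real consistent hash.
PartitionFor = fun(Index, Field, Term) ->
    erlang:phash2({Index, Field, Term}, 64)   % 64 = illustrative partition count
end.
PartitionFor("books", "title", "zen").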

Querying

Search queries use the same syntax as Lucene, and support most Lucene operators including term searches, field searches, boolean operators, grouping, lexicographical range queries, and wildcards (at the end of a word only).

Querying has two distinct stages, planning and execution. During query planning, the system creates a directed graph of the query, grouping points on the graph in order to maximize data locality and minimize inter-node traffic. Single term queries can be executed on a single node, while range queries and fuzzy matches are executed using the minimal set of nodes that cover the query.

As the query executes, Riak Search uses a series of merge-joins, merge-intersections, and filters to generate the resulting set of matching bucket/key pairs.

Persistence

For a backing store, the Riak Search team developed merge_index. merge_index takes inspiration from the Lucene file format, Bitcask (our standard backing store for Riak KV), and SSTables (from Google’s BigTable paper). It was designed to have a simple, easily-recoverable data structure, to allow simultaneous reads and writes with no performance degradation, and to be forgiving of write bursts while taking advantage of low-write periods to perform data compactions and optimizations.

Riak Search is “Beta Software”

Note that Riak Search should be considered beta software. Please be aware that there may be bugs and issues that we have not yet uncovered, and that the next version may require a full data reload.

Installing Riak Search

Requirements

The following components are required:

  • Erlang R13B04
  • Java 1.6.x
  • Ant
  • gcc toolchain

Installation

Unzip and install Riak Search

Enter the riak_search directory and run:

make
make rel

Change to the rel/riaksearch directory

cd rel/riaksearch

Start Riak Search

If you are running on a Mac, set JAVA_HOME and increase your filehandle limit:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
ulimit -n 1024

From within the rel/riaksearch directory, start Riak:

bin/riaksearch console

Or, use bin/riaksearch start to start riak_search in the background.

Major Components

Riak Search is comprised of:

  • Riak Core - Dynamo-inspired distributed-systems framework
  • Riak KV - Distributed Key/Value store inspired by Amazon’s Dynamo.
  • Bitcask - Storage backend used by Riak KV.
  • Riak Search - Distributed index and full-text search engine.
  • MergeIndex - Storage backend used by Riak Search.
  • Qilr - Library for parsing queries into execution plans and documents into terms.
  • Riak Solr - Adds a subset of Solr HTTP interface capabilities to Riak Search.

Replication

Search data is replicated in a manner similar to Riak KV data: A search index has an n_val setting that determines how many copies of the data exist. Copies are written across different partitions located on different physical nodes.

In contrast to Riak KV:

  • Search uses timestamps, rather than vector clocks, to resolve version conflicts. This leads to fewer guarantees about your data (as depending on wall-clock time can cause problems if the clock is wrong) but was a necessary tradeoff for performance reasons.
  • Search does not use quorum values when writing (indexing) data. The data is written in a fire-and-forget manner. Search does use hinted handoff to remain write-available when a node goes offline.
  • Search does not use quorum values when reading (querying) data. Only one copy of the data is read, and the partition is chosen based on what will create the most efficient query plan overall.

Schema

Riak Search was designed to work seamlessly with Riak. As a result, it retains many of the same properties as Riak, including a schema-free design. In other words, you can start adding data to a new index without having to explicitly define the index fields.

That said, Search does provide the ability to define a custom schema. This allows you to specify required fields and custom analyzer factories, among other things.

The Default Schema

The default schema treats all fields as strings, unless you suffix your field name as follows:

  • FIELDNAME_num - Numeric field. Uses Whitespace analyzer. Values are padded to 10 characters.
  • FIELDNAME_dt - Date field. Uses Whitespace analyzer.
  • All other fields are treated as Strings and use the Standard analyzer.

The default field is named value.
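
For example (using the Erlang indexing API described later in this manual; the index name, document ID, and field names here are hypothetical), the suffixes control how each value is analyzed:

%% "copies_num" is treated as a numeric field and "published_dt" as a date
%% field; "title" falls through to the Standard analyzer as a string.
search:index_doc("books", "book1",
                 [{id, "book1"},
                  {title, "Zen and the Art of Motorcycle Maintenance"},
                  {copies_num, "42"},
                  {published_dt, "19740401"}]).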

Defining a Schema

The schema definition for an index is stored in the Riak bucket of the same name as the index, under the key _rs_schema. For example, the schema for the “books” index is stored under books/_rs_schema.

Alternatively, you can set or retrieve the schema for an index using command line tools:

# Set an index schema.
bin/search-cmd set_schema Index SchemaFile


# View the schema for an Index.
bin/search-cmd show_schema Index

Note that changes to the schema file will not affect previously indexed data. If you change field definitions, especially settings such as type, it is recommended that you re-index your documents.

Below is an example schema file. The schema is formatted as an Erlang term. Spacing does not matter, but it is important to match opening and closing brackets and braces, to include commas between all list items, and to include the final period after the last brace:

{
    schema,
    [
        {version, "1.1"},
        {default_field, "value"},
        {default_op, "or"},
        {n_val, 3},
        {analyzer_factory, "com.basho.search.analysis.DefaultAnalyzerFactory"}
    ],
    [
        {field, [
            {name, "id"},
            {type, string}
        ]},
        {field, [
            {name, "title"},
            {required, true},
            {type, string}
        ]},
        {field, [
            {name, "published"},
            {type, date}
        ]},
        {dynamic_field, [
            {name, "*_text"},
            {type, string}
        ]},
        {field, [
            {name, "tags"},
            {type, string},
            {analyzer_factory, "com.basho.search.analysis.WhitespaceAnalyzerFactory"}
        ]},
        {field, [
            {name, "count"},
            {type, integer},
            {padding_size, 10}
        ]},
        {field, [
            {name, "category"}
        ]}
    ]
}.

Schema-level properties:

The following properties are defined at a schema level:

  • version - Required. A version number, currently unused.
  • default_field - Required. Specify the default field used for searching.
  • default_op - Optional. Set to “and” or “or” to define the default boolean operator. Defaults to “or”.
  • n_val - Optional. Set the number of replicas of search data. Defaults to 3.
  • analyzer_factory - Optional. Defaults to “com.basho.search.analysis.DefaultAnalyzerFactory”.

Fields and Field-Level Properties:

Fields can be either static or dynamic. A static field is denoted with ‘field’ at the start of the field definition, whereas a dynamic field is denoted with ‘dynamic_field’ at the start of the field definition.

The difference is that a static field performs an exact match on the field name, whereas a dynamic field performs a wildcard match on the field name. The wildcard can appear anywhere within the field name, but it usually occurs at the beginning or end. (The default schema, described above, uses dynamic fields, allowing you to use field name suffixes to create fields of different data types.)

Field matching occurs in the order of appearance in the schema definition. This allows you to create a number of static fields followed by a dynamic field as a “catch all” to match the rest.
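
For example, a fields section in the same format as the schema above might list two static fields followed by a catch-all dynamic field (the field names here are illustrative, and the bare “*” catch-all is an assumption based on the wildcard matching described above):

[
    {field, [
        {name, "id"},
        {type, string}
    ]},
    {field, [
        {name, "title"},
        {type, string}
    ]},
    {dynamic_field, [
        {name, "*"},
        {type, string}
    ]}
]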

The following properties are defined at a field level, and apply to both static and dynamic fields:

  • name - Required. The name of the field. Dynamic fields can use wildcards. Note that the unique field identifying a document must be named “id”.
  • required - Optional. Boolean flag indicating whether this field is required in an incoming search document. If missing, then the document will fail validation. Defaults to false.
  • type - Optional. The type of field, either ‘string’ or ‘integer’. If ‘integer’ is specified, and no field-level analyzer_factory is defined, then the field will use the Whitespace analyzer. Defaults to ‘string’.
  • skip - Optional. When ‘true’, the field is stored, but not indexed. Defaults to ‘false’.
  • aliases - Optional. A list of aliases that should be mapped to the current field definition, effectively indexing multiple fields of different names into the same field. Defaults to an empty list.
  • analyzer_factory - Optional. Specify the analyzer factory to use when parsing the field. If not specified, defaults to the analyzer factory for the schema. (Unless the field is an integer type. See above.)
  • padding_size - Optional. Values are padded up to this size. Defaults to 0 for string types, 10 for integer types.

Indexing

Indexing a document is the act of:

  1. Reading a document.
  2. Splitting the document into one or more fields.
  3. Splitting the fields into one or more terms.
  4. Normalizing the terms in each field.
  5. Writing the {Field, Term, DocumentID} postings to an index.
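
For example, indexing a hypothetical document “doc1” whose “title” field contains the text “See Spot run” would, after analysis with the Standard analyzer, produce postings roughly like:

[{"title", "see",  "doc1"},
 {"title", "spot", "doc1"},
 {"title", "run",  "doc1"}]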

There are numerous ways to index a document in Riak Search.

Indexing via the Command Line

The easiest way to index documents stored on the filesystem is to use the search-cmd command line tool:

bin/search-cmd index <INDEX> <PATH>

Parameters:

  • <INDEX> - The name of an index.
  • <PATH> - Relative or absolute path to the files or directories to recursively index. Wildcards are permitted.

For example:

bin/search-cmd index my_index files/to/index/*.txt

The documents will be indexed into the default field defined by the Index’s schema, using the base filename plus extension as the document ID.

Deleting via the Command Line

To remove previously indexed files from the command line, use the search-cmd command line tool:

bin/search-cmd delete <INDEX> <PATH>

Parameters:

  • <INDEX> - The name of an index.
  • <PATH> - Relative or absolute path to the files or directories to recursively delete. Wildcards are permitted.

For example:

bin/search-cmd delete my_index files/to/index/*.txt

Any documents matching the base filename plus extension of the files found will be removed from the index. The actual contents of the files are ignored during this operation.

Indexing via the Erlang API

The following Erlang functions will index documents stored on the filesystem:

search:index_dir(Path).
search:index_dir(Index, Path).

Parameters:

  • Index - The name of the index. Defaults to search.
  • Path - Relative or absolute path to the files or directories to recursively index. Wildcards are permitted.

For example:

search:index_dir("my_index", "files/to/index/*.txt").

The documents will be indexed into the default field defined by the Index’s schema, using the base filename plus extension as the document ID.

Alternatively, you can provide the fields of the document to index:

search:index_doc(Index, DocId, Fields)

Parameters:

  • Index - The name of the index.
  • DocId - Document Id
  • Fields - A Key/Value list of fields to index. One of these fields must be either the atom ‘id’ or the string “id”.

For example:

search:index_doc(Index, DocId, [{title, "The Title"}, {content, "The Content"}])

Deleting via the Erlang API

The following Erlang functions will remove documents from the index:

search:delete_dir(Path).
search:delete_dir(Index, Path).

Parameters:

  • Index - The name of the index.
  • Path - Relative or absolute path to the files or directories to recursively delete. Wildcards are permitted.

For example:

search:delete_dir("my_index", "files/to/index/*.txt").

Any documents matching the base filename plus extension of the files found will be removed from the index. The actual contents of the files are ignored during this operation.

Alternatively, you can delete a document by its ID:

search:delete_doc(Index, DocID)

Parameters:

  • Index - The name of the index.
  • DocID - The document ID of the document to delete.
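
For example:

search:delete_doc("my_index", "docid1").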

Indexing via the Solr Interface

Riak Search supports a Solr-compatible interface for indexing documents via HTTP. Documents must be formatted as simple Solr XML documents, for example:

<add>
  <doc>
    <field name="id">DocID</field>
    <field name="title">Zen and the Art of Motorcycle Maintenance</field>
    <field name="author">Robert Pirsig</field>
    ...
  </doc>
  ...
</add>

Additionally, the Content-Type header must be set to ‘text/xml’.

Search currently requires that the field determining the document ID be named id, and does not support any additional attributes on the add, doc, or field elements. (In other words, things like overwrite, commitWithin, and boost are not yet supported.)

The Solr interface does NOT support the <commit /> or <optimize /> commands. All data is committed automatically in the following stages:

  • The incoming Solr XML document is parsed. If the XML is invalid, an error is returned.
  • Document fields are analyzed and broken into terms. If there are any problems, an error is returned.
  • Document terms are indexed in parallel. Their availability in future queries is determined by the storage backend.

By default, the update endpoint is located at “http://hostname:8098/solr/update?index=INDEX”.

Alternatively, the index can be included in the URL, for example “http://hostname:8098/solr/INDEX/update”.

To add data to the system with Curl:

curl -X POST -H "Content-Type: text/xml" --data-binary @tests/books.xml http://localhost:8098/solr/books/update

Alternatively, you can index Solr files on the command line:

bin/search-cmd solr my_index path/to/solrfile.xml

Deleting via the Solr Interface

Documents can also be deleted through the Solr interface via two methods, either by Document ID or by Query.

To delete documents by document ID, post the following XML to the update endpoint:

<delete>
  <id>docid1</id>
  <id>docid2</id>
  ...
</delete>

To delete documents by Query, post the following XML to the update endpoint:

<delete>
  <query>QUERY1</query>
  <query>QUERY2</query>
  ...
</delete>

Any documents that match the provided queries will be deleted.

Querying

Query Syntax

Riak Search follows the same query syntax as Lucene, detailed here:

http://lucene.apache.org/java/2_4_0/queryparsersyntax.html

Terms and Phrases

A query can be as simple as a single term (e.g., “red”) or a series of terms surrounded by quotes, called a phrase (“See spot run”). The term (or phrase) is analyzed using the default analyzer for the index.

The index schema contains a default_op setting that determines whether a phrase is treated as an AND operation or an OR operation. By default, a phrase is treated as an OR operation. In other words, a document is returned if it matches any one of the terms in the phrase.

Fields

You can specify a field to search by putting it in front of the term or phrase to search. For example:

color:red

Or:

title:"See spot run"

You can further specify an index by prefixing the field with the index name. For example:

products.color:red

Or:

books.title:"See spot run"

Wildcard Searches

Terms can include wildcards in the form of an asterisk (*) to allow prefix matching, or a question mark (?) to match a single character.

Currently, the wildcard must come at the end of the term in both cases.

For example:

  • “bus*” will match “busy”, “business”, “busted”, etc.
  • “bus?” will match “busy”, “bust”, “busk”, etc.

Fuzzy Searches

NOTE: Fuzzy Searches are not yet supported.

Fuzzy searching allows you to find terms with similar spelling. To specify a fuzzy search, use the tilde operator on a single term with an optional fuzziness argument. (If no fuzziness argument is specified, then 0.5 is used by default.)

For example:

bass~

Is equivalent to:

bass~0.5

And will match “bass” as well as “bask”, “bats”, “bars”, etc. The fuzziness argument is a number between 0.0 and 1.0. Values close to 0.0 result in more fuzziness; values close to 1.0 result in less fuzziness.

Proximity Searches

Proximity searching allows you to find terms that are within a certain number of words of each other. To specify a proximity search, use the tilde operator on a phrase.

For example:

"See spot run"~20

Will find documents that have the words “see”, “spot”, and “run” all within the same block of 20 words.

Range Searches

Range searches allow you to find documents with terms in between a specific range. Ranges are calculated lexicographically. Use square brackets to specify an inclusive range, and curly braces to specify an exclusive range.

The following example will return documents containing the words “red” and “rum”, plus any words lexicographically between them:

"field:[red TO rum]"

The following example will return documents containing words between “red” and “rum”, excluding the endpoints themselves:

"field:{red TO rum}"

Boosting a Term

A term (or phrase) can have its score boosted using the caret operator along with an integer boost factor.

In the following example, documents with the term “red” will have their score boosted:

red^5 OR blue

Boolean Operators - AND, OR, NOT

Queries can use the boolean operators AND, OR, and NOT. The boolean operators must be capitalized.

The following example returns documents containing the words “red” and “blue” but not “yellow”:

red AND blue AND NOT yellow

The required (+) operator can be used in place of “AND”, and the prohibited (-) operator can be used in place of “AND NOT”. For example, the query above can be rewritten as:

+red +blue -yellow

Grouping

Clauses in a query can be grouped using parentheses. The following query returns documents that contain the terms “red” or “blue”, but not “yellow”:

(red OR blue) AND NOT yellow

Querying via the Search Shell

The Search Shell is the easiest way to run interactive queries against Search. To start the shell, run:

bin/search-cmd shell [INDEX]

This launches an interactive console into which you can type search commands. For help, type h().

Querying via the Command Line

To run a single query from the command line, use:

bin/search-cmd search [INDEX] QUERY

For example:

bin/search-cmd search books "title:\"See spot run\""

This will display a list of Document ID values matching the query. To conduct a document search, use the search_doc command. For example:

bin/search-cmd search_doc books "title:\"See spot run\""

Querying via the Erlang Command Line

To run a query from the Erlang shell, use search:search(Query) or search:search(Index, Query). For example:

search:search("books", "author:joyce").

This will display a list of Document ID values matching the query. To conduct a document search, use search:search_doc(Query) or search:search_doc(Index, Query). For example:

search:search_doc("books", "author:joyce").

Querying via the Solr Interface

Riak Search supports a Solr-compatible interface for searching documents via HTTP. By default, the select endpoint is located at “http://hostname:8098/solr/select”.

Alternatively, the index can be included in the URL, for example “http://hostname:8098/solr/INDEX/select”.

The following parameters are supported:

  • index=INDEX - Specifies the default index name.
  • q=QUERY - Run the provided query.
  • df=FIELDNAME - Use the provided field as the default. Overrides the default_field setting in the schema file.
  • q.op=OPERATION - Allowed settings are either “and” or “or”. Overrides the default_op setting in the schema file. Default is “or”.
  • start=N - Specify the starting result of the query. Useful for paging. Default is 0.
  • rows=N - Specify the maximum number of results to return. Default is 10.
  • sort=FIELDNAME - Sort on the specified field name. Default is “none”, which causes the results to be sorted in descending order by score.
  • wt=FORMAT - Choose the format of the output. Options are “xml” and “json”. The default is “xml”.

To query data in the system with Curl:

curl "http://localhost:8098/solr/books/select?start=0&rows=10000&q=prog*"

Querying via the Riak Client API

Basic Querying

The Riak Client APIs have been updated to support querying of Riak Search. See the client documentation for more information. Currently, the Ruby, Python, PHP, and Erlang clients are supported.

The API takes a default search index as well as a search query, and returns a list of bucket/key pairs. Some clients transform this list into objects specific to that client.

Querying Integrated with Map/Reduce

The Riak Client APIs that integrate with Riak Search also support using a search query to generate inputs for a map/reduce operation. This allows you to perform powerful analysis and computation across your data based on a search query. See the client documentation for more information. Currently, the Ruby, Python, PHP, and Erlang clients are supported.

Querying via HTTP/Curl

Developers who are using a language without an official Riak API or prefer to use the pure HTTP API can still execute a search-based map/reduce operation.

The syntax is fairly simple. In the “inputs” section of your map/reduce query, use the new “modfun” specification, naming “riak_search” as your module, “mapred_search” as your function, and your index and query as the arguments.

For example, if you wanted to search the “article” bucket for objects that had the word “seven” in their “text” field, you would normally issue a Solr query like:

$ curl http://localhost:8098/solr/article/select?q=text:seven

Kicking off a map/reduce query with the same result set over HTTP would use a POST body like this:

{
 "inputs": {
            "module":"riak_search",
            "function":"mapred_search",
            "arg":["article","text:seven"]
           },
 "query":...
}

The phases in the “query” field should be exactly the same as usual. An initial map phase will be given each object matching the search for processing, but an initial link phase or reduce phase will also work.

The “arg” field of the inputs specification is always a two-element list. The first element is the name of the bucket you wish to search, and the second element is the query to search for. All syntax available in other Search interfaces is available in this query parameter.

Faceted Queries via the Solr Interface

NOTE: Facet querying through the Solr interface is not yet supported.

Faceted search allows you to generate keywords (plus counts) to display to a user to drill down into search results.

Search accepts the following faceting parameters on the Solr interface:

  • facet=BOOLEAN - If BOOLEAN is set to “true” enable faceting. If set to “false”, disable faceting. Default is “false”.
  • facet.field=FIELDNAME - Tells Search to calculate and return the counts associated with unique terms in this field. To specify multiple facet fields, include the facet.field setting multiple times in the query parameters.
  • facet.prefix=PREFIX / f.FIELD.facet.prefix=PREFIX - Limit faceting to a subset of terms on a field.
  • facet.sort=MODE / f.FIELD.facet.sort=MODE - If MODE is set to “count”, sort the facet counts by count. If set to “index”, sort the facet counts lexicographically. Defaults to “count”.
  • facet.offset=N / f.FIELDNAME.facet.offset=N - Set the offset at which to start listing facet entries. Used for paging.
  • facet.limit=N / f.FIELDNAME.facet.limit=N - Limit the number of facet entries to N. Used for paging.

Use the longer, per-field (f.FIELDNAME.…) syntax when multiple facet fields are defined.

Note that when faceting on a field, only terms that are present in the result set are listed in the facet results (in other words, you will never see a facet count entry of zero.) Faceted fields are analyzed using the analyzer associated with the field.

Query Scoring

Documents are scored using roughly the same formulas described here:

http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Similarity.html

The key difference is in how Riak Search calculates the Inverse Document Frequency. The equations described on the Similarity page require knowledge of the total number of documents in a collection. Riak Search does not maintain this information for a collection, so instead uses the count of the total number of documents associated with each term in the query.

Indexing and Querying Riak KV Data

Riak Search now supports indexing and querying of data stored in Riak KV. Out of the box, simple indexing of plain text, XML, and JSON data can be enabled in an instant.

Setting up Indexing

Riak Search indexing of KV data must be enabled on a per-KV-bucket basis. To enable indexing for a bucket, simply add the Search precommit hook to that bucket’s properties.

Adding the Search precommit hook to a bucket from the command line is easy:

$ bin/search-cmd install my_bucket_name

Any other method you would normally use to set bucket properties can also be used to enable the Search precommit hook. For example, using curl to install the hook over HTTP:

$ curl -X PUT -H "content-type:application/json" http://localhost:8098/riak/demo2 --data @-
{"props":{"precommit":[{"mod":"riak_search_kv_hook","fun":"precommit"}]}}
^D

Note, though, that you may want to read the bucket properties first, so you don’t clobber any precommit hook already in place.

With the precommit hook installed, Riak Search will index your data each time that data is written.

Datatypes

Riak Search is able to handle several standard data encodings with zero configuration. Simply set the Content-Type metadata on your objects to the appropriate mime-type. Out of the box, XML, JSON, and plain-text encodings are supported.

JSON Encoded Data

If your data is in JSON format, set your Content-Type to “application/json”, “application/x-javascript”, “text/javascript”, “text/x-javascript”, or “text/x-json”.

Specifying that your data is in JSON format will cause Riak Search to use the field names of the JSON object as index field names. Nested objects will use underscore (‘_’) as a field name separator.

For example, storing the following JSON object in a Search-enabled bucket:

{
 "name":"Alyssa P. Hacker",
 "bio":"I'm an engineer, making awesome things.",
 "favorites":{
              "book":"The Moon is a Harsh Mistress",
              "album":"Magical Mystery Tour"
             }
}

Would cause four fields to be indexed: “name”, “bio”, “favorites_book”, and “favorites_album”. You could later query this data with queries like “bio:engineer AND favorites_album:mystery”.
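
Conceptually, the JSON decoder turns that object into a flat list of field/value pairs, roughly in the 2-tuple form described under “Other Data Encodings” below (shown here only as an illustration of the field flattening):

[
 {<<"name">>, <<"Alyssa P. Hacker">>},
 {<<"bio">>, <<"I'm an engineer, making awesome things.">>},
 {<<"favorites_book">>, <<"The Moon is a Harsh Mistress">>},
 {<<"favorites_album">>, <<"Magical Mystery Tour">>}
]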

XML Encoded Data

If your data is in XML format, set your Content-Type to “application/xml” or “text/xml”.

Specifying that your data is in XML format will cause Riak Search to use tag names as index field names. Nested tags separate their names with underscores. Attributes are stored in their own fields, the names of which are created by appending an at symbol (‘@’) and the attribute name to the tag name.

For example, storing the following XML object in a Search-enabled bucket:

<?xml version="1.0"?>
<person>
   <name>Alyssa P. Hacker</name>
   <bio>I'm an engineer, making awesome things.</bio>
   <favorites>
      <item type="book">The Moon is a Harsh Mistress</item>
      <item type="album">Magical Mystery Tour</item>
   </favorites>
</person>

Would cause four fields to be indexed: “person_name”, “person_bio”, “person_favorites_item”, and “person_favorites_item@type”. The values of the “…_item” and “…_item@type” fields will be the concatenation of the two distinct elements in the object (“The Moon is a Harsh Mistress Magical Mystery Tour” and “book album”, respectively). You could later query this data with queries like “person_bio:engineer AND person_favorites_item:mystery”.

Plain-text Data

If your data is plain text, set your Content-Type to “text/plain”. The plain-text decoder is also used if no Content-Type is found.

Specifying that your data is in plain-text format will cause Riak Search to index all of the text in the object’s value under a single field, named “value”. Queries can be explicit about searching this field, as in “value:seven AND value:score”, or omit the default field name, as in “seven AND score”.

Other Data Encodings

If your data is not in JSON, XML, or plain text, or you would like field name or value extraction to behave differently, you may also write your own extractor. To tell Riak Search where to find your custom extractor for your bucket, set the bucket property ‘rs_extractfun’ to one of the following values:

  • {modfun, Module, Function}, where Module and Function name an Erlang module and function to call for extraction (more on that API later).
  • {qfun, Fun}, where Fun is a function to call for extraction.
  • {jsanon, Source}, where Source is the source of a Javascript function to call for extraction.
  • {jsanon, {Bucket, Key}}, where Bucket and Key are binaries naming an object stored in Riak, whose value is the source of a Javascript function to call for extraction.
  • {jsfun, Name}, where Name is the name of a pre-defined Javascript function, stored in a file in your js_path, to call for extraction.
  • {FunTerm, Arg}, where FunTerm is any of the above constructions, and Arg is a static argument to pass the function.
  • {struct, JsonProplist}, where JsonProplist is a list of 2-tuples, as mochijson2 would produce from decoding a JSON object (to allow setting this field over the HTTP interface). Fields that must be included are:
    • “language”, either “erlang” or “javascript”
    • If language is Erlang, the following must be included:
      • “module”, the name of the module to use for extraction
      • “function”, the name of the function in the given module
    • If language is Javascript, one of the following must be included:
      • “source”, the source of the function to run (as with the ‘jsanon’ construction above)
      • “name”, the name of a predefined function to run (as with the ‘jsfun’ construction above)
      • “bucket” and “key”, a reference to a Riak object storing the source of the function to run (as with the alternate ‘jsanon’ construction above)
    • “arg”, optional, a static argument to pass the function

    Any of the extractor forms must name a function that takes two arguments. The first argument will be a riak_object to index, and the second argument will be the static arguments given (or ‘undefined’ if none are given).

    The extractor function is expected to produce a list of field-name:value pairs. Erlang functions should do this with a list of 2-tuples, with each element being a binary, as in:

    [
     {<<"field1">>,<<"value1">>},
     {<<"field2">>,<<"value2">>}
    ]
        

    Javascript functions should return their results with an object, as in:

    {
     "field1":"value1",
     "field2":"value2"
    }
        

    The modules riak_search_kv_json_extractor, riak_search_kv_xml_extractor, and riak_search_kv_raw_extractor should be referred to for examples.
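
    As a purely illustrative sketch (not one of the modules shipped with Riak Search; the module name and the choice of field are hypothetical), an Erlang extractor that indexes the entire object value under a single “value” field might look like this:

    %% Hypothetical example module -- not part of Riak Search.
    -module(my_text_extractor).
    -export([extract/2]).

    %% First argument is the riak_object being indexed; the second is the
    %% static argument from rs_extractfun (or 'undefined' if none was given).
    extract(RiakObject, _Arg) ->
        Value = riak_object:get_value(RiakObject),
        [{<<"value">>, Value}].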

Field Types

If you read the “Other Data Encodings” section about writing your own encoder, you may have been surprised to find that all fields should be extracted as strings. The reason for this is that it’s the schema’s job to say what the types of the fields are.

If you do not specify a schema, the default will be used. The default schema indexes all fields as string values, unless they end in “_num” or “_dt”, where they will be indexed as integers or dates, respectively.

You may define your own schema for your KV indexes, in the same manner as you would define a schema for non-KV indexes. Just make sure the field names match those produced by the extractor in use.

Operations and Troubleshooting

Riak Search has all of the same operational properties as Riak. Refer to the Riak wiki (see below) for more information on running Riak in a clustered environment.

https://wiki.basho.com/display/RIAK/Home

Default Ports

By default, Search uses the following ports:

  • 8098 - Solr Interface
  • 8099 - Riak Handoff
  • 8087 - Protocol Buffers interface
  • 6095 - Analyzer Port

Be sure to take the necessary security precautions to prevent exposing these ports to the outside world.

Merge Index Settings

These settings can be found in the Riak Search app.config file under the “merge_index” section.

data_root
Set to the location where data files are written, relative to the Riak Search root directory.
buffer_rollover_size
Maximum size of the in-memory buffer before it is transformed into a segment and written to disk. Higher numbers will result in faster indexing but more memory usage.
buffer_delayed_write_size
Bytes to accumulate in the write-ahead log before flushing to disk.
buffer_delayed_write_ms
Interval at which the write-ahead log is flushed to disk.
max_compact_segments
The maximum number of segments to compact during a compaction. Smaller values will lead to quicker compactions and a more balanced number of files in each partition, at the expense of more frequent compactions and a higher likelihood of compacting the same data multiple times.
segment_query_read_ahead_size
Size of the file read-ahead buffer, in bytes, to use when looking up results in a query.
segment_compact_read_ahead_size
Size of the file read-ahead buffer, in bytes, to use when reading a segment for compaction.
segment_file_buffer_size
Amount of segment compaction data to batch, in bytes, before writing to the file handle. This should be less than or equal to segment_delayed_write_size, otherwise that setting will have no effect.
segment_delayed_write_size
Size of the delayed write buffer in bytes. Once this is exceeded, the compaction buffer is flushed to disk.
segment_delayed_write_ms
Interval at which data will be written to a file during compaction.
segment_full_read_size
Segment files below this size will be read into memory during a compaction for higher performance at the cost of more RAM usage. This setting plus max_compact_segments directly affects the maximum amount of RAM that a compaction can take.
segment_block_size
Determines the block size across which a segment will calculate offsets and lookup information. Setting this to a lower value will increase query performance, but will also lead to more RAM and disk usage.
segment_values_staging_size
Maximum number of values to hold in memory before compressing the batch and adding it to the output buffer.
segment_values_compression_threshold
Since compression is more effective with a larger number of values, this is the number of values that must be present in a batch before the system compresses the batch.
segment_values_compression_level
zlib compression level to use when compressing a batch of values.
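
For reference, the merge_index section is a standard Erlang application environment block inside app.config. The sketch below only shows the shape of such a block; the path and numeric values are illustrative placeholders, not recommended settings:

{merge_index, [
    %% Illustrative values only -- tune for your own deployment.
    {data_root, "data/merge_index"},
    {buffer_rollover_size, 1048576},
    {max_compact_segments, 20}
]}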