# Overview

[![Build Status](https://secure.travis-ci.org/basho/merge_index.png?branch=master)](http://travis-ci.org/basho/merge_index)

MergeIndex is an Erlang library for storing ordered sets on disk. It
is very similar to an SSTable (in Google's Bigtable) or an HFile (in
Hadoop).

Basho Technologies developed MergeIndex to serve as the underlying
index storage format for Riak Search and the upcoming Secondary Index
functionality in Riak.

MergeIndex has the following characteristics:

* Fast write performance; handles "spiky" situations with high write
  loads gracefully.
* Fast read performance for single lookups.
* Moderately fast read performance for range queries.
* "Immutable" data files, so the system can be stopped hard with no
  ill effects.
* Timestamp-based conflict resolution.
* Relatively low RAM usage.
* Stores data in compressed form.

And some tradeoffs:

* Range queries can cause spiky RAM usage, depending on range size.
* Needs a high number of file handles. During extremely high write
  loads for extended periods of time, this can exhaust file handles
  and ETS tables if the system is not given a chance to recover.
* High disk churn during segment compaction.

# Data Model

A MergeIndex database is a three-level hierarchy of data. (The chosen
terminology reflects merge_index's roots as a storage engine for
document data, but it can be considered a general-purpose index.)

The hierarchy is:

* **Index** - Top level. For example: `shoes`.
* **Field** - Second level. For example: `color`.
* **Term** - Third level. For example: `red`.

Underneath each term, you can store one or more values, with
associated properties and a timestamp:

* **Value** - The value to store. In indexing situations, this is
  usually a primary key or document ID. For example: `"SKU-52167"`.

* **Properties** - Additional metadata associated with the value,
  returned in lookups and range queries. Setting properties to
  `undefined` will delete a value from the database.

* **Timestamp** - A user-defined timestamp value. This is used by the
  system to resolve conflicts. The largest timestamp wins.

These six fields together form a **Posting**. For example:

    {Index, Field, Term, Value, Properties, Timestamp}
    {<<"shoes">>, <<"color">>, <<"red">>, <<"SKU-52167">>, [], 23487197}

# API

* `merge_index:start_link(DataDir)` - Open a MergeIndex
  database. Note that the database is NOT thread safe, and this is NOT
  enforced by the software. You should only have one Pid per
  directory, otherwise your data will be corrupted.

* `merge_index:index(Pid, Postings)` - Index a list of postings.

* `merge_index:lookup(Pid, Index, Field, Term)` - Returns an iterator
  that will yield all of the `{Value, Properties}` records stored
  under the provided term. The iterator should be consumed quickly, as
  the results are stored in the Erlang mailbox of the calling Pid. The
  iterator is a function that, when executed, returns either
  `{Result, NewIterator}` or `eof`.

* `merge_index:lookup_sync(Pid, Index, Field, Term)` - Returns a list
  of `{Value, Properties}` records.

* `merge_index:range(Pid, Index, Field, StartTerm, EndTerm)` - Returns
  an iterator that will yield all of the `{Value, Properties}` records
  matching the provided range, inclusive. The iterator should be
  consumed quickly, as the results are stored in the Erlang mailbox of
  the calling Pid. The iterator is a function that, when executed,
  returns either `{Result, NewIterator}` or `eof`.

* `merge_index:range_sync(Pid, Index, Field, StartTerm, EndTerm)` -
  Returns a list of the `{Value, Properties}` records matching the
  provided range, inclusive.

* `merge_index:info(Pid, Index, Field, Term)` - Get an *estimate* of
  how many results exist for a given term. This is an estimate
  because, for performance reasons, the system does not factor
  tombstones into the result. In addition, due to how bloom filters
  and signatures are used, results may be miscounted. This is mainly
  meant to be used as a way to optimize query planning, not for
  reliable counts.

* `merge_index:compact(Pid)` - Trigger a compaction, if necessary.

* `merge_index:drop(Pid)` - Delete a MergeIndex database. This will
  delete your data.

* `merge_index:stop(Pid)` - Close a MergeIndex database.

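Because the `lookup` and `range` iterators share the same
`{Result, NewIterator}` / `eof` protocol, a small generic helper can
drain one into a list. This is a sketch against that protocol only;
`iterate/1` is not part of the merge_index API:

```erlang
%% Drain an iterator that follows the {Result, NewIterator} | eof
%% protocol described above into a list. Hypothetical helper, not part
%% of merge_index itself.
iterate(Iterator) ->
    case Iterator() of
        eof -> [];
        {Result, NextIterator} -> [Result | iterate(NextIterator)]
    end.
```

Note that draining an iterator this way trades the mailbox-backed
streaming behavior for holding every result in memory at once.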
# Example Usage

The example below opens a merge_index database, indexes some dummy
postings, runs queries that return results as both lists and
iterators, deletes postings, and then drops the database.

    %% Open a merge_index database.
    application:start(merge_index),
    {ok, Pid} = merge_index:start_link("./merge_index_data"),
    Filter = fun(_,_) -> true end,

    %% Index a posting...
    merge_index:index(Pid, [{"index", "field", "term", "value1", [], 1}]),

    %% Run a query, get results back as a list...
    List1 = merge_index:lookup_sync(Pid, "index", "field", "term", Filter),
    io:format("lookup_sync1:~n~p~n", [List1]),

    %% Run a query, get results back as an iterator.
    %% The iterator returns {Result, NewIterator} or 'eof'.
    Iterator1 = merge_index:lookup(Pid, "index", "field", "term", Filter),
    {Result1, Iterator2} = Iterator1(),
    eof = Iterator2(),
    io:format("lookup:~n~p~n", [Result1]),

    %% Index multiple postings...
    merge_index:index(Pid, [
        {"index", "field", "term", "value1", [], 2},
        {"index", "field", "term", "value2", [], 2},
        {"index", "field", "term", "value3", [], 2}
    ]),

    %% Run another query...
    List2 = merge_index:lookup_sync(Pid, "index", "field", "term", Filter),
    io:format("lookup_sync2:~n~p~n", [List2]),

    %% Delete some postings by setting properties to 'undefined'...
    merge_index:index(Pid, [
        {"index", "field", "term", "value1", undefined, 3},
        {"index", "field", "term", "value3", undefined, 3}
    ]),

    %% Run another query...
    List3 = merge_index:lookup_sync(Pid, "index", "field", "term", Filter),
    io:format("lookup_sync3:~n~p~n", [List3]),

    %% Delete the database.
    merge_index:drop(Pid),

    %% Close the database.
    merge_index:stop(Pid).

# Architecture

At a high level, MergeIndex is a collection of one or more in-memory
**buffers** storing recently written data, plus one or more immutable
**segments** storing older data. As data is written, the buffers are
converted to segments, and small segments are compacted together to
form larger segments. Each buffer is backed by an append-only disk
log, ensuring that the buffer state is recoverable if the system is
shut down before the buffer is converted to a segment.

Queries involve all active buffers and segments, but avoid touching
disk as much as possible. Queries against buffers execute directly
against memory, and as a result are fast. Queries against segments
consult an in-memory offsets table with a bloom filter and signature
table to determine key existence, and then seek directly to the
correct disk position if the key is found within a given segment. If
the key is not found, there is no disk penalty.

## MI Server (`mi_server` module)

The `mi_server` module holds the coordinating logic of MergeIndex. It
keeps track of which buffers and segments exist, handles incoming
writes, manages locks on buffers and segments, and spawns new
processes to respond to queries.

During startup, `mi_server` performs the following steps:

* Delete any files that should be deleted, marked by a file of the
  same name with a ".deleted" extension.
* Load the latest buffer (determined by buffer number) as the active
  buffer.
* Convert any other buffers to segments.
* Open all segment offset files.

It then waits for incoming **index**, **lookup**, or **range**
requests.

On an **index** request, `mi_server` is passed a list of postings of
the form `{Index, Field, Term, Value, Props, Timestamp}`. As a speed
optimization, we invert the timestamp (multiply by -1). This allows a
simple ascending sort to put the latest timestamped value first
(otherwise the earliest timestamped value would be first). Iterators
used later during querying and compacting take advantage of this
ordering to filter out duplicates. Also, each posting is translated to
`{{Index, Field, Term}, Value, InvertedTimestamp, Props}`, which is
the posting format that the buffer expects. The postings are then
written to the buffer. If the index operation causes the buffer to
exceed the `buffer_rollover_size` setting, then the buffer is
converted to a segment. More details are in the following sections.

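The effect of the inverted timestamp can be seen with a plain
`lists:sort/1` over tuples in the buffer's `{Value, InvertedTimestamp,
Props}` shape (the tuples below are made-up toy data):

```erlang
%% Three writes to the same value at timestamps 1, 3, and 2, stored
%% with inverted timestamps. A plain ascending sort puts the newest
%% write (timestamp 3, inverted to -3) first, so an iterator can keep
%% the first entry per value and drop the rest as duplicates.
Postings = [{<<"v1">>, -1, old},
            {<<"v1">>, -3, newest},
            {<<"v1">>, -2, newer}],
[{<<"v1">>, -3, newest} | _] = lists:sort(Postings).
```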
On a **lookup** request, `mi_server` is passed an Index, Field, and
Term. It first puts a lock on all buffers and segments that will be
used in the query. This ensures that buffers and segments won't be
deleted before the query has completed. Next, it spawns a linked
process that creates an iterator across each buffer and segment for
the provided Index/Field/Term key, and returns the results in
ascending sorted order.

A **range** request is similar to a **lookup** request, except the
iterators return the values for a range of keys.

## Buffers (`mi_buffer` module)

A buffer consists of an in-memory Erlang ETS table plus an append-only
log file. All new data written to a MergeIndex database is first
written to the buffer. Once a buffer reaches a certain size, it is
converted to a segment.

MergeIndex opens the ETS table as a `duplicate_bag`, keyed on
`{Index, Field, Term}`. Postings are written to the buffer in a batch.

At query time, MergeIndex performs an `ets:lookup/2` to retrieve
matching postings, sorts them, and wraps them in an iterator.

Range queries work slightly differently. MergeIndex gets a list of
keys from the table, filters the keys according to what matches the
range, and then returns an iterator for each key.

Buffer contents are also stored on disk in an append-only log file,
named `buffer.<NUMBER>`. The format is simple: a 4-byte unsigned
integer followed by the `term_to_binary/1` encoded bytes for the list
of postings.

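Given that description, a buffer log can be decoded with ordinary
binary pattern matching. This is a hedged sketch, not the `mi_buffer`
code: it assumes the 4-byte length is big-endian and that each record
body is one `term_to_binary/1`-encoded list of postings, and the
function name is made up:

```erlang
%% Decode a buffer.<NUMBER> log: repeated records of a 4-byte length
%% followed by that many bytes of term_to_binary-encoded postings.
%% (Assumes big-endian lengths; illustrative only.)
read_buffer_log(Path) ->
    {ok, Bin} = file:read_file(Path),
    decode_records(Bin, []).

decode_records(<<Len:32/unsigned-big, Body:Len/binary, Rest/binary>>, Acc) ->
    decode_records(Rest, [binary_to_term(Body) | Acc]);
decode_records(<<>>, Acc) ->
    lists:append(lists:reverse(Acc)).
```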
When a buffer exceeds `buffer_rollover_size`, it is converted to a
segment. The system puts the contents of the ETS table into a list,
sorts the list, constructs an iterator over the list, and then sends
the iterator to the same process used to compact segments, described
below.

## Segments (`mi_segment` module)

A segment consists of a **data file** and an **offsets table**. It is
immutable; once written, it is read-only. (Though eventually it may be
compacted into a larger segment and deleted.)

The **data file** is a flat file with the following format: a key,
followed by a list of values, followed by another key, followed by
another list of values. Both the keys and the values are sorted in
ascending order. Conceptually, the data file is split into blocks
(approximately 32k in size by default). The offsets table contains one
entry per block.

A **key** is an `{Index, Field, Term}` tuple. To save space, if the
key has the same `Index` as the previous key, and it is not the first
key in a block, then the Index will be omitted from the tuple.
Likewise with the `Field`. The key is stored as a single bit set to
'1', followed by a 15-bit unsigned integer containing the size of the
key on disk, followed by the `term_to_binary/1` representation of the
key. The maximum on-disk key size is 32k.

A **value** is a `{Value, Timestamp, Props}` tuple. It is put in this
order to optimize sorting and comparisons during later operations. The
list of values is stored as a single bit set to '0', followed by a
31-bit unsigned integer containing the size of the list of values on
disk, followed by the `term_to_binary/1` representation of the
values. If the list of values is larger than the
`segment_values_compression_threshold`, then the values are compressed
before being written. If the list of values grows larger than the
`segment_values_staging_size`, then it is broken up into multiple
chunks. The maximum on-disk value size is theoretically 2GB.

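The flag-bit framing described above maps naturally onto Erlang bit
syntax. The following is a hedged sketch of just the framing, not the
actual `mi_segment` code: compression, Index/Field prefix omission,
and chunking are all left out, and the function names are invented:

```erlang
%% Frame a key: flag bit '1', 15-bit byte size, then the encoded key.
encode_key(Key) ->
    Bin = term_to_binary(Key),
    <<1:1, (byte_size(Bin)):15/unsigned, Bin/binary>>.

%% Frame a list of values: flag bit '0', 31-bit byte size, then the
%% encoded values (compression omitted in this sketch).
encode_values(Values) ->
    Bin = term_to_binary(Values),
    <<0:1, (byte_size(Bin)):31/unsigned, Bin/binary>>.

%% Reading back, the leading bit tells us which kind of entry follows.
decode_entry(<<1:1, Size:15/unsigned, Bin:Size/binary, Rest/binary>>) ->
    {key, binary_to_term(Bin), Rest};
decode_entry(<<0:1, Size:31/unsigned, Bin:Size/binary, Rest/binary>>) ->
    {values, binary_to_term(Bin), Rest}.
```

The 15-bit size field is what caps keys at 32k on disk, and the 31-bit
size field is what makes the theoretical value-list limit 2GB.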
The **offsets table** is an ETS `ordered_set` table with an entry for
each block in the **data file**. The entry is keyed on the *last* key
in the block (which makes lookup using `ets:next/2` possible).

Each entry is a compressed tuple containing:

* The offset of the block in the data file. (Variable-length integer.)
* A bloom filter on the keys contained in the block. (200 bytes.)
* The longest prefix shared by all keys in the block. For example, the
  prefix for the keys "business", "bust", and "busy" would be
  "bus". This is currently unused. (Size depends on keys.)
* A list of entries for each key containing two signatures, plus some
  lookup information allowing the system to skip directly to the
  proper read location during queries. Each entry is approximately 5
  bytes, but could be up to 10 bytes for very large values.
* An edit signature, constructed by comparing the current term against
  the final term in the block. The edit signature is a bitstring,
  where a bit is '0' if the corresponding bytes match, or '1' if they
  don't.
* A hash signature, which is a single-byte representation of the bytes
  of the term xor'd and rotated.
* The size of the key on disk.
* The size of the values on disk.
* The total number of values.

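The two signatures are cheap byte-level summaries of a term. The
sketch below illustrates both ideas; the actual bit layout and
rotation scheme in `mi_segment` may differ, the helper names are made
up, and `edit_signature/2` assumes equal-length terms for simplicity:

```erlang
%% Edit signature: one bit per byte position, '0' where the byte in
%% Term matches the corresponding byte in the block's final term, '1'
%% where it doesn't. (Illustrative; assumes equal-length inputs.)
edit_signature(Term, FinalTerm) ->
    << <<(case A =:= B of true -> 0; false -> 1 end):1>>
       || {A, B} <- lists:zip(binary_to_list(Term),
                              binary_to_list(FinalTerm)) >>.

%% Hash signature: fold the term's bytes into a single byte by
%% rotating the accumulator left one bit and xor'ing in each byte.
hash_signature(Term) ->
    lists:foldl(fun(Byte, Acc) ->
                    Rotated = ((Acc bsl 1) band 16#FF) bor (Acc bsr 7),
                    Rotated bxor Byte
                end, 0, binary_to_list(Term)).
```

Checking a query term's signatures against the stored ones lets the
reader reject most non-matching keys without touching disk.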
## Compaction (`mi_segment_writer` module)

When the number of segments passes a threshold, the system compacts
segments. This merges together the data files from multiple segments
to create a new, larger segment, and deletes the old segments. In the
process, duplicate or deleted values (determined by a tombstone) are
removed. The `mi_scheduler` module ensures that only one compaction
occurs at a time on a single Erlang VM, even when multiple MergeIndex
databases are opened.

The advantage of compaction is that it moves the values for a given
key closer together on disk, and reduces the number of disk seeks
necessary to find the values for a given lookup. The disadvantage of a
compaction is that it requires the system to rewrite all of the data
involved in the compaction.

When the system decides to perform a compaction, it focuses on the
smallest segments first. This ensures we get the optimal "bang for our
buck" out of compaction, doing the most to reduce file handle usage
and disk seeks during a query while touching the smallest amount of
data.

To perform the compaction, the `mi_server` module spawns a new linked
process. The process opens an iterator across each segment in the
compaction set. The data is stored in sorted order by key and value,
so the iterator simply needs to walk through the values from the
beginning to the end of the file. The
`segment_compact_read_ahead_size` setting determines how much of a
file cache we use when reading the segment. For small segments, it
might make sense to read the entire segment into memory; the
`segment_full_read_size` setting determines this threshold. In that
case, `segment_compact_read_ahead_size` is unused. The individual
iterators are grouped into a single master iterator.

The `mi_segment_writer` module reads values from the master iterator,
writing keys and values to the data file and offset information to the
offsets table. While writing, the segment-in-progress is marked with a
file of the same name with a ".deleted" extension, ensuring that if
the system crashes and restarts, it will be removed. Once finished,
the obsolete segments are marked with files with ".deleted"
extensions.

Note that this is the same process used when rolling a single buffer
into a segment.

## Locking (`mi_locks` module)

MergeIndex uses a functional locking structure to manage locks on
buffers and segments. The locks are really a form of reference
counting. At query time, the system opens iterators against all
available buffers and segments. This increments a separate lock count
for each buffer and segment. When the query ends, the system
decrements the lock count. Once a buffer rollover (or segment
compaction) makes a buffer (or segment) obsolete, the system registers
a function to call when the lock count drops to zero. This is a simple
and easy way to make sure that buffers and segments stay around as
long as necessary to answer queries, but no longer.

New queries are directed to the latest buffers and segments and don't
touch obsolete ones, so even during periods of high query load, we are
guaranteed that the locks will eventually be released and the obsolete
buffers or segments deleted.

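The refcount-with-callback idea can be sketched with a plain map. The
helper names below are hypothetical and the real `mi_locks` structure
differs, but the lifecycle is the same: acquire, register a
cleanup fun when the resource becomes obsolete, and fire it on the
final release:

```erlang
%% Acquire a lock: bump the count for Key, starting at 1.
acquire(Key, Locks) ->
    maps:update_with(Key, fun({N, F}) -> {N + 1, F} end, {1, undefined}, Locks).

%% Mark Key obsolete: register a fun to run when the count hits zero.
on_zero(Key, Fun, Locks) ->
    maps:update_with(Key, fun({N, _}) -> {N, Fun} end, Locks).

%% Release a lock: decrement, and fire the callback at zero.
release(Key, Locks) ->
    case maps:get(Key, Locks) of
        {1, Fun} when is_function(Fun, 0) -> Fun(), maps:remove(Key, Locks);
        {1, undefined} -> maps:remove(Key, Locks);
        {N, F} -> maps:put(Key, {N - 1, F}, Locks)
    end.
```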
# Configuration Settings

## Overview

MergeIndex exposes a number of dials to tweak operations and RAM
usage.

The most important MergeIndex setting in terms of memory usage is
`buffer_rollover_size`. This affects how large the buffer is allowed
to grow, in bytes, before getting converted to an on-disk segment. The
higher this number, the less frequently a MergeIndex database will
need compactions.

The second most important settings for memory usage are the
combination of `segment_full_read_size` and
`max_compact_segments`. During compaction, the system will completely
page any segments smaller than the `segment_full_read_size` value into
memory. This should generally be as large as or larger than the
`buffer_rollover_size`.

`max_compact_segments` is the maximum number of segments to compact at
one time. The higher this number, the more segments MergeIndex can
involve in each compaction. In the worst case, a compaction could take
(`segment_full_read_size` * `max_compact_segments`) bytes of RAM; with
the defaults of 5MB and 20, that is 100MB.

The rest of the settings have a much smaller impact on performance and
memory usage, and exist mainly for tweaking and special cases.

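Assuming merge_index reads these settings from its application
environment, as is conventional for OTP applications, they could be
adjusted before startup along these lines (the values below are purely
illustrative):

```erlang
%% Illustrative only: raise the rollover size and lower the compaction
%% fan-in before starting merge_index.
application:load(merge_index),
application:set_env(merge_index, buffer_rollover_size, 2 * 1024 * 1024),
application:set_env(merge_index, max_compact_segments, 10),
application:start(merge_index).
```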
## Full List of Settings

* `buffer_rollover_size` - The maximum size a buffer can reach before
  it is converted into a segment. Note that this is measured in terms
  of ETS table size, not the size the data will take on disk. Because
  of compaction, the actual segment on disk may be substantially
  smaller than this number. Default is 1MB. This setting is one of the
  biggest factors in determining how much memory MergeIndex will use,
  and how often compaction is needed. Setting this to a very small
  number (e.g. 100k) will cause buffers to roll out to disk very
  rapidly, but compaction will trigger more often. Setting this to a
  very high number (e.g. 10MB) will use more RAM, but require fewer
  compactions.
* `buffer_delayed_write_size` - The number of bytes the buffer log can
  write before being synced to disk. The smaller this number, the less
  chance of data loss during a hard kill of the system, with the
  tradeoff that touching disk is expensive. This is set to a high
  number (500k) by default, so the system mainly relies on the
  `buffer_delayed_write_ms` setting to ensure crash safety.
* `buffer_delayed_write_ms` - The number of milliseconds between
  syncing the buffer log to disk. This is set to 2 seconds by default,
  meaning that the system will lose at most 2 seconds of data.
* `max_compact_segments` - The maximum number of segments to compact
  at once. Setting this to a low number means more compactions, but
  compactions will occur more quickly. The default is 20, which is a
  very high number of segments. In practice, this limit will only be
  reached during long periods of high write throughput.
* `segment_query_read_ahead_size` - The number of bytes to allocate to
  the read-ahead buffer for reading data files during a query. Default
  is 65k. In practice, since we are reading compressed lists of keys
  from data files, it is okay to keep this number quite small.
* `segment_compact_read_ahead_size` - The number of bytes to allocate
  to the read-ahead buffer for reading data files during
  compaction. During compaction, the system is reading from multiple
  file handles, and writing to a file handle. As a result, the disk
  head is skipping around quite a bit. Setting this to a large value
  can reduce the number of file seeks, reducing disk head movement,
  and speeding up compaction. Default is 5MB.
* `segment_file_buffer_size` - The amount of data to accumulate in the
  `mi_segment_writer` process between writes to disk. This acts as a
  first level of write buffering. The following two settings combined
  act as a second level of write buffering. Default is 20MB.
* `segment_delayed_write_size` - The number of bytes to allocate to
  the write buffer for writing a data file during compaction. The
  system will flush after this many bytes are written. As mentioned
  previously, the system juggles multiple file handles during a
  compaction. Setting this to a large value can substantially reduce
  disk movement. Default is 20MB.
* `segment_delayed_write_ms` - The number of milliseconds between
  syncs while writing a data file during compaction. As mentioned
  previously, the system juggles multiple file handles during a
  compaction. Setting this to a long interval can substantially reduce
  disk movement. Default is 10000 ms (10 seconds).
* `segment_full_read_size` - During compaction, segments below this
  size are read completely into memory. This can help reduce disk
  contention during compaction, at the expense of using more
  RAM. Default is 5MB.
* `segment_block_size` - Determines the size of each block in a
  segment data file. Since there is one offsets table entry per block,
  this indirectly determines how many entries are in the offsets
  table. Each offsets table entry has an overhead of ~200 bytes, plus
  about 5 bytes per key, so fewer offset entries means less RAM
  usage. The tradeoff is that too few offset entries may cause false
  positive key lookups during a query, leading to wasted disk
  seeks. Default is 32k.
* `segment_values_staging_size` - The number of values that
  `mi_segment_writer` should accumulate before writing values to a
  segment data file. Default is 1000.
* `segment_values_compression_threshold` - Determines the point at
  which the segment writer begins compressing the staging list. If the
  list of values to write is longer than the threshold, then the data
  is compressed before being stored to disk. Compression is a tradeoff
  between time/CPU usage and space, so raising this value can reduce
  load on the CPU at the cost of writing more data to disk. Default is
  0.
* `segment_values_compression_level` - Determines the compression
  level to use when compressing data. Default is 1. Valid values are 1
  through 9.

A number of configuration settings are fuzzed:

* `buffer_rollover_size` by 25%
* `buffer_delayed_write_size` by 10%
* `buffer_delayed_write_ms` by 10%

"Fuzzed" means that the actual value is increased or decreased by a
random percentage, up to the stated amount. If you open multiple
MergeIndex databases and write to them with an evenly balanced load,
all of the buffers tend to roll over at the same time. Fuzzing spaces
out the rollovers.

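Fuzzing amounts to scaling the configured value by a random factor
inside the stated band. One way to express that, as a sketch rather
than the exact merge_index formula:

```erlang
%% Return Value adjusted by a uniformly random amount within ±Percent.
%% e.g. fuzz(1048576, 0.25) yields something in [786432, 1310720].
fuzz(Value, Percent) ->
    MaxDelta = Value * Percent,
    round(Value - MaxDelta + rand:uniform() * 2 * MaxDelta).
```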
# Troubleshooting

## Determine the Number of Open Buffers/Segments

Run the following command to check how many buffers are currently
open:

    find <PATH> -name "buffer.*" | wc -l

Run the following command to check how many segments are currently
open:

    find <PATH> -name "segment.*.data" | wc -l

Run the following command to determine whether a compaction is
currently in progress:

    find <PATH> -name "segment.*.data.deleted"

## Check Memory Usage

Run the following code in the Erlang shell to see how much space the
in-memory buffers are consuming:

    WordSize = erlang:system_info(wordsize),
    F = fun(X, Acc) ->
        case ets:info(X, name) == 'buffer' of
            true -> Acc + (ets:info(X, memory) * WordSize);
            false -> Acc
        end
    end,
    lists:foldl(F, 0, ets:all()).

Run the following code in the Erlang shell to see how much space the
segment offsets tables are consuming:

    WordSize = erlang:system_info(wordsize),
    F = fun(X, Acc) ->
        case ets:info(X, name) == 'segment_offsets' of
            true -> Acc + (ets:info(X, memory) * WordSize);
            false -> Acc
        end
    end,
    lists:foldl(F, 0, ets:all()).

# Further Reading

+ [Google's Bigtable Paper - PDF](http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en/us/papers/bigtable-osdi06.pdf)
+ [HFile Description - cloudepr.blogspot.com](http://cloudepr.blogspot.com/2009/09/hfile-block-indexed-file-format-to.html)
+ [HFile Paper - slideshare.net](http://www.slideshare.net/schubertzhang/hfile-a-blockindexed-file-format-to-store-sorted-keyvalue-pairs)