## Indexes{#indexes}

Indexes do three things:

* Define where the data comes from.
* Define how data is handled before it enters the index.
* Hold index categories.

### Types{#indexes-types}

Picky offers a choice of four index types:

* Memory: Saves its indexes in JSON on disk and loads them into memory.
* Redis: Saves its indexes in Redis.
* SQLite: Saves its indexes in rows of a SQLite DB.
* File: Saves its indexes in JSON in files.

This is how the first two look in code:

    books_memory_index = Index.new :books do
      # Configuration goes here.
    end

    books_redis_index = Index.new :books do
      backend Backends::Redis.new
      # Configuration goes here.
    end

Both save the preprocessed data from the data source in the `/index` directory, so you can check whether the data was preprocessed correctly.

Indexes are then used in a `Search` interface.

Searching over one index:

    books = Search.new books_index

Searching over multiple indexes:

    media = Search.new books_index, dvd_index, mp3_index

The resulting ids should come from the same id space to be useful – or the ids should be exclusive, such that e.g. a book id does not collide with a DVD id.

#### In-Memory / File-based{#indexes-types-memory}

The in-memory index transparently saves its indexes as JSON files that reside in the `/index` directory.

When the server is started, they are loaded into memory. As soon as the server is stopped, the indexes are no longer in memory.

Indexing regenerates the JSON index files, which can then be reloaded into memory, even in the running server (see below).
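
To make the save-and-load cycle concrete, here is a minimal plain-Ruby sketch. It does not use Picky: the `save_index` and `load_index` helpers, the hash layout and the file name are all assumptions made up for illustration, not Picky's actual index format.

```ruby
require 'json'
require 'tmpdir'

# Writes a token => ids hash to disk as JSON ("indexing").
def save_index(path, index)
  File.write(path, JSON.generate(index))
end

# Reads the JSON file back into a Ruby hash ("loading").
def load_index(path)
  JSON.parse(File.read(path))
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'books_title.json') # Made-up file name.
  save_index(path, { 'picky' => [1, 3], 'search' => [2] })

  in_memory = load_index(path) # Gone again as soon as the process exits.
  p in_memory['picky']
end
```

The file on disk survives a restart, the hash in memory does not, which is why the indexes have to be loaded again whenever the server starts.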

#### Redis{#indexes-types-redis}

The Redis index saves its indexes in the Redis server on the default port, using database 15.

When the server is started, it connects to the Redis server and uses the indexes in the key-value store.

Indexing regenerates the indexes in the Redis server – you do not have to restart the server for that.

#### SQLite{#indexes-types-sqlite}

TODO

#### File{#indexes-types-file}

TODO

### Accessing{#indexes-acessing}

If you don't have direct access to your indexes, like so

    books_index = Index.new(:books) do
      # ...
    end

    books_index.do_something_with_the_index

and for example you'd like to access the index from a rake task, you can use

    Picky::Indexes

to get *all indexes*.

To get a *single index* use

    Picky::Indexes[:index_name]

and to get a *single category*, use

    Picky::Indexes[:index_name][:category_name]

That's it.

### Configuration{#indexes-configuration}

These are all the options you can use to configure an index:

    books_index = Index.new :books do
      source { Book.order("isbn ASC") }

      indexing removes_characters: /[^a-zA-Z0-9\s\:\"\&\.\|]/i, # Default: nil
               stopwords: /\b(and|the|or|on|of|in)\b/i, # Default: nil
               splits_text_on: /[\s\/\-\_\:\"\&\/]/, # Default: /\s/
               removes_characters_after_splitting: /[\.]/, # Default: nil
               normalizes_words: [[/\$(\w+)/i, '\1 dollars']], # Default: nil
               rejects_token_if: lambda { |token| token == :blurf }, # Default: nil
               case_sensitive: true, # Default: false
               substitutes_characters_with: Picky::CharacterSubstituters::WestEuropean.new, # Default: nil
               stems_with: Lingua::Stemmer.new # Default: nil

      category :id
      category :title,
               partial: Partial::Substring.new(:from => 1),
               similarity: Similarity::DoubleMetaphone.new(2),
               qualifiers: [:t, :title, :titre]
      category :author,
               partial: Partial::Substring.new(:from => -2)
      category :year,
               partial: Partial::None.new,
               qualifiers: [:y, :year, :annee]

      result_identifier 'boooookies'
    end

Usually you don't need to configure all that.

But if your boss comes in the door and asks why X is not found… you know. And you can improve the search engine relatively *quickly and painlessly*.

More power to you.
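
To get a feel for what a few of the `indexing` options do to a piece of text, here is a plain-Ruby sketch. It is not Picky's tokenizer: the `tokenize` helper is hypothetical, it covers only four of the options, and the order in which the steps run is an assumption.

```ruby
# Hypothetical stand-in for a tokenizer; the step order is an assumption.
def tokenize(text,
             removes_characters: nil,
             stopwords: nil,
             splits_text_on: /\s/,
             case_sensitive: false)
  text = text.downcase unless case_sensitive
  text = text.gsub(removes_characters, '') if removes_characters
  text = text.gsub(stopwords, '')          if stopwords
  # Splitting can leave empty strings behind, so drop them.
  text.split(splits_text_on).reject(&:empty?)
end

tokens = tokenize('The Wizard of Oz: A Novel!',
                  removes_characters: /[^a-z0-9\s:]/i,
                  stopwords: /\b(and|the|or|on|of|in|a)\b/i,
                  splits_text_on: /[\s:]/)
p tokens # prints ["wizard", "oz", "novel"]
```

If a word is unexpectedly not found, walking the text through the steps like this usually shows which option removed or changed it.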

### Data Sources{#indexes-sources}

Data sources define where the data for an index comes from.

You define them on an *index*:

    Index.new :books do
      source Book.all # Loads the data instantly.
    end

    Index.new :books do
      source { Book.all } # Loads on indexing. Preferred.
    end

Or even on a *single category*:

    Index.new :books do
      category :title,
               source: lambda { Book.all }
    end

At the moment there are two possibilities: [Objects responding to #each](#indexes-sources-each) and [Picky classic style sources](#indexes-sources-classic).

#### Responding to #each{#indexes-sources-each}

Picky supports any data source as long as it supports `#each`.

See [under Flexible Sources](http://florianhanke.com/blog/2011/04/14/picky-two-point-two-point-oh.html) for how you can use this.

In short, the model:

    class Monkey
      attr_reader :id, :name, :color
      def initialize id, name, color
        @id, @name, @color = id, name, color
      end
    end

The data:

    monkeys = [
      Monkey.new(1, 'pete', 'red'),
      Monkey.new(2, 'joey', 'green'),
      Monkey.new(3, 'hans', 'blue')
    ]

Setting the array as a source:

    Index::Memory.new :monkeys do
      source { monkeys }
      category :name
      category :couleur, :from => :color # The couleur category will take its data from the #color method.
    end
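
An array is only the simplest case. As a sketch of what "anything that supports `#each`" can mean, the hypothetical `MonkeyFeed` class below builds its objects on the fly from raw rows. The class name and the row format are made up for this example; the only assumption about Picky is the one stated above, namely that a source needs `#each`.

```ruby
Monkey = Struct.new(:id, :name, :color)

# Hypothetical source: yields one Monkey per raw row.
class MonkeyFeed
  include Enumerable # Optional; gives us #map, #select etc. for free.

  def initialize(rows)
    @rows = rows
  end

  def each
    return enum_for(:each) unless block_given?
    @rows.each { |id, name, color| yield Monkey.new(id, name, color) }
  end
end

feed = MonkeyFeed.new([[1, 'pete', 'red'], [2, 'joey', 'green']])
p feed.map(&:name) # prints ["pete", "joey"]
```

Such an object could then be handed to an index in the same way as the array, for example with `source { feed }`.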

#### Delayed{#indexes-sources-delayed}

If you define the source directly in the index block, it will be evaluated instantly:

    Index::Memory.new :books do
      source Book.order('title ASC')
    end

This works with ActiveRecord and other similar ORMs since `Book.order` returns a proxy object that will only be evaluated when the server is indexing.

For example, this would instantly get the records, since `#all` is a kicker method:

    Index::Memory.new :books do
      source Book.all # Not the best idea.
    end

In this case, you can give the `source` method a block:

    Index::Memory.new :books do
      source { Book.all }
    end

This block will be executed as soon as the indexing is running, but not earlier.
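
The difference between the two forms is ordinary Ruby evaluation order, which the following sketch demonstrates without Picky. `FakeIndex` is a hypothetical stand-in written for this example and says nothing about how Picky implements `source` internally.

```ruby
# Hypothetical stand-in for an index definition.
class FakeIndex
  def initialize(&config)
    instance_eval(&config)
  end

  # Takes either ready-made data or a block that is stored for later.
  def source(data = nil, &block)
    @source = block || -> { data }
  end

  # Only here is a stored block actually called.
  def index
    @source.call.to_a
  end
end

fetches = 0
fetch_books = -> { fetches += 1; ['Book 1', 'Book 2'] }

FakeIndex.new { source fetch_books.call }          # Fetches while defining.
puts fetches # prints 1

lazy = FakeIndex.new { source { fetch_books.call } } # Fetches nothing yet.
puts fetches # prints 1

lazy.index                                          # Now the block runs.
puts fetches # prints 2
```

In the first form the argument is evaluated before `source` is even called, so the data is fetched while the index is being defined. In the block form nothing is fetched until `index` runs.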

#### Classic Style{#indexes-sources-classic}

The classic style uses Picky's own `Picky::Sources` to load the data into the index.

    Index.new :books do
      source Sources::CSV.new(:title, :author, file: 'app/library.csv')
    end

Use this one if you want to use a simple CSV file.

However, you could also use the built-in Ruby `CSV` class and use it as an `#each` source (see above).

    Index.new :books do
      source Sources::DB.new('SELECT id, title, author, isbn13 as isbn FROM books', file: 'app/db.yml')
    end

Use this one if you want to use a database source with very custom SQL statements. If not, we suggest you use an ORM as an `#each` source (see above).
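
As an illustration of the CSV suggestion above, here is a rough sketch of wrapping Ruby's `CSV` class in an `#each` source. The `CSVBooks` class, the `Book` struct and the column layout (an `id` column plus one column per category, with a header row) are assumptions made for this example.

```ruby
require 'csv'
require 'tempfile'

Book = Struct.new(:id, :title, :author)

# Hypothetical wrapper: turns each CSV row into an object with readers.
class CSVBooks
  def initialize(path)
    @path = path
  end

  def each
    CSV.foreach(@path, headers: true) do |row|
      yield Book.new(row['id'].to_i, row['title'], row['author'])
    end
  end
end

# A throwaway CSV file so that the sketch runs on its own.
file = Tempfile.new(['library', '.csv'])
file.write("id,title,author\n1,Dune,Frank Herbert\n2,Emma,Jane Austen\n")
file.close

CSVBooks.new(file.path).each { |book| puts "#{book.id}: #{book.title}" }
file.unlink
```

Because the file is only opened inside `#each`, nothing is read until the source is actually iterated.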

### Indexing / Tokenizing{#indexes-indexing}

See [Tokenizing](#tokenizing) for tokenizer options.

### Categories{#indexes-categories}

Categories – usually what other search engines call fields – define *categorized data*. For example, book data might have a `title`, an `author` and an `isbn`.

So you define that:

    Index.new :books do
      source { Book.order('author DESC') }

      category :title
      category :author
      category :isbn
    end

(The example assumes that a `Book` has readers for `title`, `author`, and `isbn`.)

This already works and a search will return categorized results. For example, a search for "Alan Tur" might categorize both words as `author`, but it might also at the same time categorize both as `title`. Or one as `title` and the other as `author`.

That's a great starting point. So how can I customize the categories?

#### Option partial{#indexes-categories-partial}

The partial option defines whether a word is also found when it is only *partially entered*. So, "Picky" might already be found when typing "Pic".

You define it like this:

    category :some, partial: Partial::Substring.new(from: -3)

(This is also the default.)

The option `from: 1` will make a word completely partially findable, i.e. starting from its first character.

If you don't want any partial finds to occur, use:

    category :some, partial: Partial::None.new

You can also pass in your own partial generators. See [this article](http://florianhanke.com/blog/2011/08/15/picky-30-its-all-ruby-part-1.html) to learn more.
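
The sketch below shows one way to read the `from` option. It is not Picky's generator: the `prefixes` helper is hypothetical, and it assumes that `from` is a 1-based character position, with negative values counted from the end of the word.

```ruby
# Hypothetical helper: all beginnings of a word, from position `from` on.
def prefixes(word, from:)
  return [] if word.empty?
  # A negative position counts from the end: -1 is the last character.
  shortest = from.negative? ? word.length + from + 1 : from
  shortest = shortest.clamp(1, word.length)
  (shortest..word.length).map { |length| word[0, length] }
end

p prefixes('picky', from: -3) # prints ["pic", "pick", "picky"]
p prefixes('picky', from: 1)  # prints ["p", "pi", "pic", "pick", "picky"]
```

Under that reading, a lower `from` makes words findable earlier while typing, at the cost of more entries in the index.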

#### Option weights{#indexes-categories-weights}

The weights option defines how strongly a word is weighted. By default, Picky rates a word according to the logarithm of its occurrence. This means that a word that occurs more often will be weighted slightly higher.

You define it like this:

    category :some, weights: MyWeights.new

The default is `Weights::Logarithmic.new`.

You can also pass in your own weights generators. See [this article](http://florianhanke.com/blog/2011/08/15/picky-30-its-all-ruby-part-1.html) to learn more.

If you don't want Picky to calculate weights for your indexed entries, you can use constant or dynamic weights.

With 0.0 as default weight:

    category :some, weights: Weights::Constant.new # Returns 0.0 for all results.

With 3.14 as set weight:

    category :some, weights: Weights::Constant.new(3.14) # Returns 3.14 for all results.

Or with a dynamically calculated weight:

    Weights::Dynamic.new do |str_or_sym|
      str_or_sym.length # Uses the length of the string or symbol as weight.
    end

You almost never need to use your own specific weights. More often than not, you can fiddle with boosting combinations of categories, via the `boost` method in searches.
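
To see why logarithmic weighting favors frequent words only slightly, here is a small sketch. The `logarithmic_weight` helper is hypothetical; using the natural logarithm of the number of ids and rounding to three decimals are assumptions, not a description of `Weights::Logarithmic`.

```ruby
# Hypothetical helper: weight of a token from the ids it points to.
def logarithmic_weight(ids)
  return 0.0 if ids.empty?
  Math.log(ids.size).round(3)
end

index = { 'wizard' => [1], 'the' => (1..100).to_a }
weights = index.transform_values { |ids| logarithmic_weight(ids) }

p weights # prints {"wizard"=>0.0, "the"=>4.605}
```

A word that occurs a hundred times more often ends up with a weight of about 4.6 instead of 100, which keeps very common words from dominating the ranking.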

#### Option similarity{#indexes-categories-similarity}

The similarity option defines whether a word is also found when it is typed wrong, or is _close_ to another word. So, "Picky" might already be found when typing "Pocky~". (Picky will search for similar words when you use the tilde, `~`.)

You define it like this:

    category :some, similarity: Similarity::None.new

(This is also the default.)

There are several built-in similarity options, like

    category :some, similarity: Similarity::Soundex.new
    category :this, similarity: Similarity::Metaphone.new
    category :that, similarity: Similarity::DoubleMetaphone.new

You can also pass in your own similarity generators. See [this article](http://florianhanke.com/blog/2011/08/15/picky-30-its-all-ruby-part-1.html) to learn more.

#### Option qualifier/qualifiers (categorizing){#indexes-categories-qualifiers}

Usually, when you search for `title:wizard` you will only find books with "wizard" in their title.

Maybe your client would like to be able to only enter "t:wizard". In that case you would use this option:

    category :some,
             :qualifier => :t

Or if you'd like more to match:

    category :some,
             qualifiers: [:t, :title, :titulo]

(This matches "t", "title", and also the Italian "titulo".)

Picky will warn you if the qualifiers on one index are ambiguous (Picky will assume that the last "t", for example, is the one you want to use).

This means that given:

    category :some,  :qualifier => :t
    category :other, :qualifier => :t

Picky will assume that if you enter "t:bla", you want to search in the `:other` category.

Searching in multiple categories can also be done. If you have:

    category :some,  :qualifier => :s
    category :other, :qualifier => :o

Then searching with "s,o:bla" will search for bla in both `:some` and `:other`. Neat, eh?
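
The two rules above (the last definition wins, and comma-separated qualifiers select several categories) can be sketched in a few lines of plain Ruby. The `qualifier_map` and `categorize` helpers are hypothetical and only mirror the behavior described in this section.

```ruby
# Builds a qualifier => category table in definition order, so a later
# category overwrites an earlier one that claims the same qualifier.
def qualifier_map(categories)
  categories.each_with_object({}) do |(name, qualifiers), map|
    qualifiers.each { |qualifier| map[qualifier.to_s] = name }
  end
end

# Splits "s,o:bla" into the categories to search and the remaining text.
# Returns nil for the categories if the token has no qualifier part.
def categorize(token, map)
  qualifiers, text = token.split(':', 2)
  return [nil, token] if text.nil?
  [qualifiers.split(',').map { |q| map[q] }.compact.uniq, text]
end

map = qualifier_map({ some: [:s], other: [:o] })
p categorize('s,o:bla', map) # prints [[:some, :other], "bla"]

ambiguous = qualifier_map({ some: [:t], other: [:t] })
p categorize('t:bla', ambiguous) # prints [[:other], "bla"]
```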

#### Option from{#indexes-categories-from}

Usually, the categories will take their data from the reader or field that has the same name.

Sometimes though, the model does not have the right names. Say you have an Italian book model, `Libro`, but you still want to use English category names.

    Index.new :books do
      source { Libro.order('autore DESC') }

      category :title,  :from => :titulo
      category :author, :from => :autore
      category :isbn
    end

#### Option key_format{#indexes-categories-keyformat}

You almost never use this, as the key format will usually be the same for all categories, in which case you define it on the index, [like so](#indexes-keyformat).

But if you need to, use it as you would on the index.

    Index.new :books do
      category :title,
               :key_format => :to_sym
    end

#### Option source{#indexes-categories-source}

You almost never use this, as the source will usually be the same for all categories, in which case you define it on the index, [like so](#indexes-sources).

But if you need to, use it as you would on the index.

    Index.new :books do
      category :title,
               source: some_source
    end

#### Searching{#indexes-categories-searching}

Users can use some special features when searching. They are:

* Partial: `something*` (By default, the last word is implicitly partial)
* Non-Partial: `"something"` (The quotes make the query on this word explicitly non-partial)
* Similarity: `something~` (The tilde makes this word eligible for similarity search)
* Categorized: `title:something` (Picky will only search in the category designated as title, in each index of the search)
* Multi-categorized: `title,author:something` (Picky will search in title _and_ author categories, in each index of the search)

These options can be combined (e.g. `title,author:"funky~"`): this will try to find similar words to funky (like "fonky"), but no partials of them (like "fonk"), in both title and author.

Non-partial will win over partial, if you use both, as in `"test*"`.

Also note that these options need to make it through the [tokenizing](#tokenizing), so don't remove any of `*":,`.
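
As a rough illustration of how the special characters on a single word interact, consider the following sketch. `parse_word` is a hypothetical helper, not part of Picky; it only encodes the rules listed above, including that quotes win over the asterisk.

```ruby
# Hypothetical helper: reads the modifiers off one query word.
def parse_word(word)
  exact   = word.length > 1 && word.start_with?('"') && word.end_with?('"')
  word    = word[1..-2] if exact # Strip the surrounding quotes.
  similar = word.end_with?('~')
  partial = word.end_with?('*')
  text    = word.delete_suffix('~').delete_suffix('*')
  # Non-partial (quoted) wins over partial.
  { text: text, partial: partial && !exact, similar: similar }
end

p parse_word('something*') # prints {:text=>"something", ...} with partial: true
p parse_word('"test*"')    # partial is false here, the quotes win
p parse_word('"funky~"')   # similar is true, partial is false
```

The sketch ignores the implicit partial on the last word of a query, since that depends on the word's position rather than on its modifiers.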

### Key Format (Format of the indexed Ids){#indexes-keyformat}

By default, the indexed data points to keys that are integers, or, put differently, keys that are formatted using `to_i`.

If you are indexing keys that are strings, use `to_sym` – good examples are MongoDB BSON keys or UUID keys.

The `key_format` method lets you define the format:

    Index.new :books do
      key_format :to_sym
    end

The `Picky::Sources` already set this correctly. However, if you use an `#each` source that supplies Picky with symbol ids, you should tell it what format the keys are in, e.g. `key_format :to_sym`.
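
Why the format matters becomes obvious when string keys are pushed through `to_i`. The `format_keys` helper below is hypothetical and simply calls the given method on each key, using plain Ruby only.

```ruby
# Hypothetical helper: applies a key format to a list of raw keys.
def format_keys(keys, key_format)
  keys.map { |key| key.public_send(key_format) }
end

p format_keys(%w[1 2 3], :to_i)           # prints [1, 2, 3]
p format_keys(%w[4e8f0a 4e8f0b], :to_sym) # prints [:"4e8f0a", :"4e8f0b"]
p format_keys(%w[4e8f0a 4e8f0b], :to_i)   # prints [4, 4]
```

With `to_i`, Ruby only reads the leading digits of a string, so two different hex-style keys collapse into the same integer. For such keys, `to_sym` keeps them distinct.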

### Identifying in Results{#indexes-results}

By default, an index is identified by its *name* in the results. This index is identified by `:books`:

    Index.new :books do
      # ...
    end

This index is identified by `:media` in the results:

    Index.new :books do
      # ...
      result_identifier :media
    end

You still refer to it as `:books` elsewhere, e.g. in Rake tasks: `Picky::Indexes[:books].reload`. The identifier only affects the results.

### Indexing{#indexes-indexing}

Indexing can be done programmatically, at any time, even while the server is running.

Indexing *all indexes* is done with

    Picky::Indexes.index

Indexing a *single index* can be done either with

    Picky::Indexes[:index_name].index

or

    index_instance.index

Indexing a *single category* of an index can be done either with

    Picky::Indexes[:index_name][:category_name].index

or

    category_instance.index

### Loading{#indexes-reloading}

Loading (or reloading) your indexes in a running application is possible.

Loading *all indexes* is done with

    Picky::Indexes.load

Loading a *single index* can be done either with

    Picky::Indexes[:index_name].load

or

    index_instance.load

Loading a *single category* of an index can be done either with

    Picky::Indexes[:index_name][:category_name].load

or

    category_instance.load

#### Using signals{#indexes-reloading-signals}

To communicate with your server using signals:

    books_index = Index.new(:books) do
      # ...
    end

    Signal.trap("USR1") do
      books_index.reindex
    end

This reindexes the `books_index` when you call

    kill -USR1 <server_process_id>

You can refer to the index like so if you want to define the trap somewhere else:

    Signal.trap("USR1") do
      Picky::Indexes[:books].reindex
    end

### Reindexing{#indexes-reindexing}

Reindexing your indexes is just indexing followed by reloading (see above).

Reindexing *all indexes* is done with

    Picky::Indexes.reindex

Reindexing a *single index* can be done either with

    Picky::Indexes[:index_name].reindex

or

    index_instance.reindex

Reindexing a *single category* of an index can be done either with

    Picky::Indexes[:index_name][:category_name].reindex

or

    category_instance.reindex