Categories configuration

floere edited this page Dec 16, 2011 · 14 revisions

Indexes in Picky hold categorized data. You use Categories to define how Picky searches in the data.

An index needs:

  • A Source: Where the data comes from.
  • One or more Categories: The various ways the data is accessed.

For a valid index, you will need to define the index and one or more categories. An index without categories cannot be searched. The categories link the data – the what – in the source to how the data is searched.

To define a category on an index (and its data), use the category method.

Example

In code (using Ruby 1.9 new style hashes):

books_index = Index::Memory.new(:books) do
  source   some_source  # some_source can be a Sources::DB, Sources::CSV etc.
  category :title,  partial: Partial::Substring.new(from: 1)
  category :author, similarity: Similarity::DoubleMetaphone.new(3),
                    partial: Partial::Substring.new(from: 1)
  category :isbn,   tokenizer: IsbnTokenizer.new
end

This creates a new in-memory index with Index::Memory.new, backed by some data source (which one is not important for the example). That data source provides us with a title, an author, and an isbn for each entry.

For the title category, we tell Picky to find a title word even if it is only partially matched, so “Hob” will find “Hobbit”. Since it is partial from the first character, even a single “H” will find “Hobbit”. Search for partial words using the asterisk *. The last word in a query is partially searched by default.
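To make the effect of from: concrete, here is a minimal sketch in plain Ruby that lists the word beginnings a substring-style partial would make findable for a single word. It is an illustration only, not Picky's implementation: substring_partials is a hypothetical helper, and it assumes that positive positions are 1-based prefix lengths and that negative positions count from the end of the word.

```ruby
# Hypothetical helper, not part of Picky: lists the prefixes of a token
# that a substring-style partial option would make searchable.
def substring_partials(token, from:, to: -1)
  return [] if token.empty?

  length = token.length
  # Assumption: positive positions are 1-based prefix lengths,
  # negative positions count backwards from the end of the token.
  first = from < 0 ? length + from + 1 : from
  last  = to   < 0 ? length + to + 1   : to

  # Clamp both ends so that short tokens still yield at least one prefix.
  first = [[first, 1].max, length].min
  last  = [[last, 1].max, length].min

  (first..last).map { |prefix_length| token[0, prefix_length] }
end

substring_partials("hobbit", from: 1)
# => ["h", "ho", "hob", "hobb", "hobbi", "hobbit"]
substring_partials("hobbit", from: -3)
# => ["hobb", "hobbi", "hobbit"]
```

With from: 1 every beginning of the word is findable, which is why a single “H” is enough. With from: -3 only the last three endings are, which keeps the index smaller.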

For the author category, we also want Picky to find partial matches, and also phonetically similar matches. So “Solschenyzin” will also find “Solschenizyn”. Search for similar words using the tilde ~.
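Phonetic similarity can be pictured as reducing each word to a sound code and treating words with the same code as similar. The sketch below is not Picky's code, and it swaps in a simplified Soundex for Double Metaphone because Soundex fits in a few lines. SOUNDEX_CODES and soundex are hypothetical names used only for this illustration.

```ruby
# Consonant groups of the classic Soundex scheme. Letters that are not
# listed (vowels, h, w, y) carry no code.
SOUNDEX_CODES = {
  "b" => "1", "f" => "1", "p" => "1", "v" => "1",
  "c" => "2", "g" => "2", "j" => "2", "k" => "2",
  "q" => "2", "s" => "2", "x" => "2", "z" => "2",
  "d" => "3", "t" => "3",
  "l" => "4",
  "m" => "5", "n" => "5",
  "r" => "6"
}.freeze

# Hypothetical helper, not part of Picky: returns a four character
# Soundex code such as "R163" for "Robert".
def soundex(word)
  letters = word.downcase.gsub(/[^a-z]/, "").chars
  return "" if letters.empty?

  code = letters.first.upcase
  previous = SOUNDEX_CODES[letters.first]

  letters.drop(1).each do |letter|
    digit = SOUNDEX_CODES[letter]
    if digit
      # Adjacent letters with the same code are only counted once.
      code << digit unless digit == previous
      previous = digit
    elsif letter != "h" && letter != "w"
      # Vowels separate equal codes; h and w do not.
      previous = nil
    end
  end

  # Cut or pad the code to exactly four characters.
  code[0, 4].ljust(4, "0")
end

soundex("Solschenyzin") # => "S425"
soundex("Solschenizyn") # => "S425"
```

Both spellings reduce to the same code, so a similarity search for one can find the other.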

The isbn category uses neither a similarity nor a partial search (neither makes sense on an ISBN), but a special tokenizer which defines how ISBNs are indexed. If you’re starting out with Picky, you won’t need that yet.

Options of category

category defines both how data is indexed and how data is searched. The first argument is the identifier of the category. This identifier is used in the front end, but also to categorize query text. For example, “title:hobbit” will narrow the hobbit query to categories with the identifier :title.

  • partial: Partial::None.new or Partial::Substring.new(from: starting_char, to: ending_char). Default is Partial::Substring.new(from: -3, to: -1).
  • similarity: Similarity::None.new, Similarity::DoubleMetaphone.new(similar_words_searched), Similarity::Metaphone.new(similar_words_searched), or Similarity::Soundex.new(similar_words_searched). Default is Similarity::None.new.
  • weight: Weights::Logarithmic.new, Weights::Constant.new or Weights::Dynamic.new. Default is Weights::Logarithmic.new.
  • key_format: How to format the ids/keys. If it is integers, like from a database, use :to_i, or nothing, as :to_i is the default. If it’s strings, from Redis or similar, use :to_s, or :to_sym if you prefer Symbols. Note that Symbols are not garbage collected, and will use up more permanent memory. However, this can improve speed.
  • backend: The backend to use. Default is Backends::Memory.new. Other options are: Backends::Redis.new, Backends::SQLite.new, Backends::File.new.
  • tokenizer: Give the category a specific tokenizer. Takes the same options as Picky::Index#indexing.
  • qualifiers: An array of qualifiers with which you can define which category you’d like to search, for example “title:hobbit” will search for hobbit in just title categories. Example: qualifiers: [:t, :titre, :title] (use it for example with multiple languages). Default is the name of the category.
  • qualifier: Convenience option if you just need a single qualifier, see above. Example: qualifier: :title. Default is the name of the category.
  • from: Take the data from the data category with this name. Example: You have a source Sources::CSV.new(:title, file: 'some_file.csv') but you want the category to be called differently. Then you use from: category(:similar_title, from: :title).
  • source: Use a different source than the index uses. If you think you need that, there might be a better solution to your problem. Please post to the mailing list first with your application.rb :)
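The qualifiers option can be pictured as a lookup from the text before the colon to a category. The following plain Ruby sketch shows that idea for a query word such as “title:hobbit”. It is not Picky's query parser: QUALIFIERS and categorize are hypothetical names, and treating an unknown qualifier as an ordinary, unqualified word is a choice made for this sketch only.

```ruby
# Hypothetical mapping of category identifiers to their qualifiers.
# A category without explicit qualifiers answers to its own name.
QUALIFIERS = {
  title:  [:t, :titre, :title],
  author: [:author],
  isbn:   [:isbn]
}.freeze

# Hypothetical helper, not part of Picky: splits a query token into
# [category, text]. The category is nil if the token is unqualified.
def categorize(token, qualifiers = QUALIFIERS)
  qualifier, text = token.split(":", 2)
  return [nil, token] if text.nil? # no colon, so search all categories

  category, _names = qualifiers.find do |_category, names|
    names.include?(qualifier.to_sym)
  end

  # Assumption of this sketch: an unknown qualifier leaves the token as is.
  category ? [category, text] : [nil, token]
end

categorize("title:hobbit") # => [:title, "hobbit"]
categorize("titre:hobbit") # => [:title, "hobbit"]
categorize("hobbit")       # => [nil, "hobbit"]
```

Listing several qualifiers for one category, as in the :title entry, is what lets users of different languages narrow a query to the same category.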

Advanced options

TODO

  • tokenizer