Item based retrieval engine with Bayesian Sets.


SimSearch is an item-based retrieval engine that implements Bayesian Sets. Bayesian Sets is a new framework for information retrieval in which a query consists of a set of items that are examples of some concept. The result is a set of items that attempts to capture the concept exemplified by the query.

For example, given a query of two animated movies, "Lilo & Stitch" and "Up", Bayesian Sets would return other similar animated movies such as "Toy Story". There is a nice blog post about item-based search with Bayesian Sets. Feel free to read through it.
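To make the idea concrete, here is a minimal sketch of Bayesian Sets scoring over binary features, using the Beta-Bernoulli formulation from the original Bayesian Sets work. It is not SimSearch's own code: the function names, the toy feature tags, the extra titles "Alien" and "Heat", the scale `c=2.0`, and the `eps` clamp are all assumptions made for illustration.

```python
import math

def bayesian_sets_scores(items, query_ids, c=2.0, eps=1e-6):
    """Return a log score per item for a query given as a list of item names.

    items: dict mapping item name -> set of binary features present.
    c and eps are illustrative choices, not values taken from SimSearch.
    """
    n = len(items)
    features = set().union(*items.values())
    n_query = len(query_ids)

    const = 0.0
    weight = {}
    for f in features:
        # Prior from the empirical feature mean, clamped to avoid log(0).
        mean = sum(1 for feats in items.values() if f in feats) / n
        mean = min(max(mean, eps), 1.0 - eps)
        alpha, beta = c * mean, c * (1.0 - mean)

        # Posterior parameters after observing the query items.
        hits = sum(1 for q in query_ids if f in items[q])
        alpha_t = alpha + hits
        beta_t = beta + n_query - hits

        const += (math.log(alpha + beta) - math.log(alpha + beta + n_query)
                  + math.log(beta_t) - math.log(beta))
        weight[f] = (math.log(alpha_t) - math.log(alpha)
                     - math.log(beta_t) + math.log(beta))

    # The log score is linear in the binary feature vector.
    return {name: const + sum(weight[f] for f in feats)
            for name, feats in items.items()}

def rank(items, query_ids):
    """Rank the non-query items by descending score."""
    scores = bayesian_sets_scores(items, query_ids)
    rest = [name for name in items if name not in query_ids]
    return sorted(rest, key=scores.get, reverse=True)

# Hypothetical toy catalogue; the feature tags are made up.
movies = {
    "Lilo & Stitch": {"animated", "family", "aliens"},
    "Up": {"animated", "family", "adventure"},
    "Toy Story": {"animated", "family", "toys"},
    "Alien": {"horror", "aliens", "space"},
    "Heat": {"crime", "thriller"},
}
ranked = rank(movies, ["Lilo & Stitch", "Up"])
print(ranked)
```

Because the log score is a constant plus a sum of per-feature weights, scoring every item reduces to a sparse matrix-vector product, which is what lets this approach scale to large collections.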

This module also adds the novel ability to combine full-text queries with items: a query can mix items with full-text search keywords. In this case the results match the keywords and are re-ranked by similarity to the queried items.
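The combined query can be pictured as a two-step process: filter by keywords, then order the survivors by item similarity. The sketch below assumes exactly that and is not SimSearch's API; `keyword_then_rerank`, the document names, and the placeholder scores are hypothetical.

```python
def keyword_then_rerank(texts, keywords, item_scores):
    """Keep items whose text contains every keyword, then sort by item score.

    texts: dict mapping item name -> free text.
    item_scores: dict mapping item name -> similarity to the queried items
                 (for instance, a Bayesian Sets log score).
    """
    wanted = [k.lower() for k in keywords]
    matches = [
        name for name, text in texts.items()
        if all(k in text.lower().split() for k in wanted)
    ]
    # Items without a score sort last.
    return sorted(matches,
                  key=lambda name: item_scores.get(name, float("-inf")),
                  reverse=True)

texts = {
    "doc_a": "space adventure with a robot",
    "doc_b": "animated family adventure",
    "doc_c": "animated short about toys",
}
# Placeholder scores standing in for whatever the item query produced.
item_scores = {"doc_a": -1.0, "doc_b": 2.5, "doc_c": 1.0}

result = keyword_then_rerank(texts, ["adventure"], item_scores)
print(result)
```

Here the keywords act as a hard filter and the item scores only decide the order; a real engine might blend the two signals instead.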

It is important to note that Bayesian Sets does not care about the actual feature engineering. In fact, SimSearch only implements a simple bag-of-words model. However, other feature types are possible as long as they can be binarized. The index is a set of files in .xco and .yco formats (more in the tutorial) that represent the presence of a feature value in a given item. So as long as you can create these files, SimSearch can read them and perform the matching.
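As an illustration of what "binarized" means for a bag-of-words model, the sketch below turns raw text into per-item sets of feature ids, recording only whether a word is present. It does not produce the .xco or .yco files (see the tutorial for those); the `binarize` helper and the tokenization rule are assumptions.

```python
import re

def binarize(docs):
    """Map each document to the set of feature ids of the words it contains.

    docs: dict mapping item name -> text.
    Returns (vocab, presence), where vocab maps word -> feature id and
    presence maps item name -> set of feature ids present in that item.
    """
    vocab = {}
    presence = {}
    for name, text in docs.items():
        # Word counts are discarded: only presence matters.
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        for tok in sorted(tokens):
            vocab.setdefault(tok, len(vocab))
        presence[name] = {vocab[tok] for tok in tokens}
    return vocab, presence

docs = {"d1": "Red red apple", "d2": "green apple"}
vocab, presence = binarize(docs)
print(vocab)
print(presence)
```

Any other feature type (tags, categories, bucketed numeric values) fits the same shape once each possible value gets its own feature id.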

SimSearch has been tested on datasets with millions of documents and hundreds of thousands of features. Future plans include distributed search and real-time indexing. For more information, please follow the tutorial.
