
Moving to version 0.5 with the file index, indexer and better Sphinx integration.

- added a file index which represents the binary matrix
- added indexer
- added better SimClient to combine full text with item based search queries
- added tutorial on how to use
- fixed documentation
1 parent 7527833 commit d60fa03731fb7a6700519abb24b4f98f660903ca @alexksikes committed Oct 3, 2012
@@ -1,16 +1,22 @@
Download and extract the latest tarball and install the package:
- wget http://github.com/alexksikes/SimilaritySearch/tarball/master
+ wget http://github.com/alexksikes/SimSearch/tarball/master
tar xvzf "the tar ball"
cd "the tar ball"
python setup.py install
-You will need [NumPy][1] which is used for sparse matrix multiplications.
-To combine full text search with similarity search, you will need [Sphinx][2] and
-[fSphinx][3].
+You will need [SciPy][1], which is used for sparse matrix multiplications. To combine full text search with similarity search, you will need [Sphinx][2] and
+[fSphinx][3].
-Enjoy!
+Installing fSphinx and Sphinx is pretty straightforward. On Linux (Debian), to install SciPy you may need the following libraries:
-[1]: http://numpy.scipy.org/
+sudo aptitude install libamd2.2.0 libblas3gf libc6 libgcc1 libgfortran3 liblapack3gf libumfpack5.4.0 libstdc++6 build-essential gfortran libatlas-base-dev python-all-dev
+
+Finally, you can install SciPy:
+
+pip install numpy
+pip install scipy
+
+[1]: http://www.scipy.org/
[2]: http://sphinxsearch.com/docs/manual-2.0.1.html#installation
[3]: http://github.com/alexksikes/fSphinx/
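Since SimSearch leans on SciPy's sparse matrix multiplication, a quick way to sanity check the install is a small CSR product. This snippet is just an illustration (the matrix and weights are made up), not part of SimSearch:

```python
import numpy as np
from scipy.sparse import csr_matrix

# a tiny 3x4 binary item-by-feature matrix, like the ones SimSearch indexes
X = csr_matrix(np.array([[1, 0, 1, 0],
                         [0, 1, 1, 0],
                         [1, 1, 0, 1]]))
# arbitrary per-feature weights
q = np.array([0.5, 0.25, 0.25, 0.5])

# sparse * dense -> one dense score per item
scores = X.dot(q)
print(scores)
```

If this runs without errors, the SciPy side of the installation is working.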
102 README.md
@@ -1,95 +1,17 @@
-This module is an implementation of [Bayesian Sets][1]. Bayesian Sets is a new
-framework for information retrieval in which a query consists of a set of items
-which are examples of some concept. The result is a set of items which attempts
-to capture the example concept given by the query.
+SimSearch is an item based retrieval engine which implements [Bayesian Sets][0]. Bayesian Sets is a new framework for information retrieval in which a query consists of a set of items which are examples of some concept. The result is a set of items which attempts to capture the example concept given by the query.
-For example, for the query with the two animated movies, ["Lilo & Stitch" and "Up"][2],
-Bayesian Sets would return other similar animated movies, like "Toy Story".
+For example, for the query with the two animated movies, ["Lilo & Stitch" and "Up"][1], Bayesian Sets would return other similar animated movies like "Toy Story". There is a nice [blog post][2] about item based search with Bayesian Sets. Feel free to [read][2] through it.
-This module also adds the novel ability to combine full text search with
-item based search. For example a query can be a combination of items and full text search
-keywords. In this case the results match the keywords but are re-ranked by how similar
-to the queried items.
+This module also adds the novel ability to combine full text queries with items. For example, a query can be a combination of items and full text search keywords. In this case the results match the keywords but are also re-ranked by similarity to the queried items.
-This implementation has been [tested][3] on datasets with millions of documents and
-hundreds of thousands of features. It has become an integrant part of [Cloud Mining][4].
-At the moment only features of bag of words are supported. However it is faily easy
-to change the code to make it work on other feature types.
+It is important to note that Bayesian Sets does not care about the actual [feature][3] engineering. In this respect SimSearch only implements a simple [bag of words][4] model, but other feature types are possible. In fact, the index is made of a set of files which represent the presence of a feature value in a given item. As long as you can create these files, SimSearch can read them and perform its matching.
-This module works as follow:
+SimSearch has been [tested][5] on datasets with millions of documents and hundreds of thousands of features. Future plans include distributed searching and real time indexing. For more information, please follow the [tutorial][6].
-1) First a configuration file has to be written (have a look at tools/sample_config.py).
-The most important variable holds the list of features to index. Those are indexed
-with SQL queries of the type:
-
- sql_features = ['select id as item_id, word as feature from table']
-
-Note that id and word must be aliased as item_id and feature respectively.
-
-2) Now use tools/index_features.py on the configuration file to index those features.
-
- python tools/index_features.py config.py
-
-The indexer will create a computed index named index.dat in your working directory.
-A computed index is a pickled file with all its hyper parameters already computed and
-with the matrix in CSR format.
-
-3) You can now test this index:
-
- python tools/query_index.py index.dat
-
-4) The script *query_index.py* will load the index in memory each time. In order to load it
-only once, you can serve the index with some client/server code (see client_server code).
-The index can also be loaded along side the web application. In [webpy][5] web.config
-dictionnary can be used for this purpose.
-
-This module relies and [Sphinx][6] and [fSphinx][7] to perform the full-text and item based
-search combination. A regular sphinx client is wrapped together with a computed index,
-and a function called *setup_sphinx* is called upon similarity search.
-This function resets the sphinx client if an item based query is encountered.
-
-Here is an example of a *setup_sphinx* function:
-
- # this is only used for sim_sphinx (see doc)
- def sphinx_setup(cl):
- import sphinxapi
-
- # custom sorting function for the search
- # we always make sure highly ranked items with a log score are at the top.
- cl.SetSortMode(sphinxapi.SPH_SORT_EXPR, '@weight * log_score_attr')'
-
- # custom grouping function for the facets
- group_func = 'sum(log_score_attr)'
-
- # setup sorting and ordering of each facet
- for f in cl.facets:
- # group by a custom function
- f.SetGroupFunc(group_func)
-
-Note that the log_scores are found in the Sphinx attributes *log_score_attr*. It must be set
-to 1 and declared as a float in your Sphinx configuration file:
-
- # log_score_attr must be set to 1
- sql_query = \
- select *,\
- 1 as log_score_attr,\
- from table
-
- # log_score_attr will hold the log scores after item base search
- sql_attr_float = log_score_attr
-
-There is a nice [blog post][8] about item based search with Bayesian Sets. Feel free to
-[read][8] through it.
-
-That's it for the documentation. Have fun playing with item based search and don't forget
-to leave [feedback][9].
-
-[1]: http://www.gatsby.ucl.ac.uk/~heller/bsets.pdf
-[2]: http://imdb.cloudmining.net/search?q=%28%40similar+1049413+url--c2%2Fc29a902a5426d4917c0ca2d72a769e5b--title--Up%29++%28%40similar+198781+url--0b%2F0b994b7d73e0ccfd928bd1dfb2d02ce3--title--Monsters%2C+Inc.%29
-[3]: http://imdb.cloudmining.net
-[4]: https://github.com/alexksikes/CloudMining
-[5]: http://webpy.org/
-[6]: http://sphinxsearch.com/
-[7]: https://github.com/alexksikes/fSphinx
-[8]: http://thenoisychannel.com/2010/04/04/guest-post-information-retrieval-using-a-bayesian-model-of-learning-and-generalization/
-[9]: mailto:alex.ksikes@gmail.com&subject=SimSearch
+[0]: http://www.gatsby.ucl.ac.uk/~heller/bsets.pdf
+[1]: http://imdb.cloudmining.net/search?q=%28%40similar+1049413+url--c2%2Fc29a902a5426d4917c0ca2d72a769e5b--title--Up%29++%28%40similar+198781+url--0b%2F0b994b7d73e0ccfd928bd1dfb2d02ce3--title--Monsters%2C+Inc.%29
+[2]: http://thenoisychannel.com/2010/04/04/guest-post-information-retrieval-using-a-bayesian-model-of-learning-and-generalization/
+[3]: http://en.wikipedia.org/wiki/Feature_(machine_learning)
+[4]: http://en.wikipedia.org/wiki/Bag_of_words
+[5]: http://imdb.cloudmining.net
+[6]: https://github.com/alexksikes/SimSearch/tree/master/tutorial/
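The matching described above reduces to one sparse matrix multiplication per query. As a rough sketch of the Bayesian Sets log score from the paper — not SimSearch's actual API; `score_items`, the toy matrix, and the hyperparameter choice `c=2.0` are illustrative — it can be computed like this:

```python
import numpy as np
from scipy.sparse import csr_matrix

def score_items(X, query_ids, c=2.0):
    """Bayesian Sets log scores for every item in X, given a set of
    query item ids. X is a binary (n_items x n_features) CSR matrix."""
    n_items, _ = X.shape
    mean = np.asarray(X.mean(axis=0)).ravel()         # empirical feature means
    mean = np.clip(mean, 1e-9, 1 - 1e-9)              # avoid log(0)
    alpha, beta = c * mean, c * (1 - mean)            # Bernoulli hyperparameters

    N = len(query_ids)
    s = np.asarray(X[query_ids].sum(axis=0)).ravel()  # feature counts in the query
    alpha_t, beta_t = alpha + s, beta + N - s         # posterior hyperparameters

    # log score = const + X.q -- a single sparse multiplication
    const = np.sum(np.log(alpha + beta) - np.log(alpha + beta + N)
                   + np.log(beta_t) - np.log(beta))
    q = (np.log(alpha_t) - np.log(alpha)
         - np.log(beta_t) + np.log(beta))
    return const + X.dot(q)

# toy index: 4 items, 3 features; items 0 and 1 share feature 0
X = csr_matrix(np.array([[1, 1, 0],
                         [1, 0, 0],
                         [0, 0, 1],
                         [0, 1, 1]]))
scores = score_items(X, query_ids=[0, 1])
# the two query items themselves come out on top
```

The precomputation of the hyperparameters is what the "computed index" stores, so at query time only the multiplication remains.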
25 TODO
@@ -1,15 +1,24 @@
-[ ] implement other feature types besides bag of words.
+[*] separate feature creation from computed index
+
+[ ] incremental indexing
+ - use mode 'append' but the index needs to be recomputed
+
+[ ] distributed computation of the sparse multiplication
+ - use multi-processing module
+ - have workers compute a chunk of the matrix (a sequential list of items)
+ - merge sort each worker result
+ - across machines (not just cores), we need distributed indexes as well
+
+[ ] implement other feature types besides bag of words
+- some basic image features (color histogram)
[ ] for bag of words features:
-- mulitple features in one table
-- same feature value for different features.
-- normalize the feature values.
+- multiple features in one table
+- normalize the feature values
+- database agnostic
[ ] SSCursor is better to fetch lots of rows but still has problems:
http://stackoverflow.com/questions/337479/how-to-get-a-row-by-row-mysql-resultset-in-python
-[*] ad feature value information right into the index (ComputedIndex.index_to_feat)
-
-[ ] return only a restricted set of ids
-[ ] to speed things, we could actually only perform the matrix multiplication on he reamining ids
+[ ] to speed things up, we could actually only perform the matrix multiplication on the remaining ids
(either by looping over each item or by manipulating the matrix)
@@ -1,29 +0,0 @@
-# database parameters
-db_params = dict(user='user', passwd='password', db='dbname')
-
-# list of SQL queries to fetch the features from
-sql_features = [
- 'select id as item_id, word as feature from table',
- 'select id as item_id, word as feature from table2',
- '...'
-]
-
-# path to read or save the index
-index_path = './index.dat'
-
-# maximum number of items to match
-max_items = 10000
-
-# this is only used for sim_sphinx (see doc)
-def sphinx_setup(cl):
- # import sphinxapi
-
- # custom sorting function for the search
- # cl.SetSortMode(sphinxapi.SPH_SORT_EXPR, 'log_score_attr')
-
- # custom grouping function for the facets
- group_func = 'sum(log_score_attr)'
-
- # setup sorting and ordering of each facet
- for f in cl.facets:
- f.SetGroupFunc(group_func)
@@ -7,7 +7,7 @@
'''
setup(name='SimSearch',
- version='0.2',
+ version='0.5',
description='Implementation of Bayesian Sets for fast similarity searches',
author='Alex Ksikes',
author_email='alex.ksikes@gmail.com',
@@ -1,19 +1,15 @@
#!/usr/bin/env python
-#!/usr/bin/env python
-
-"""This is an implementation of Bayesian Sets as described in:
+"""SimSearch is an item based retrieval engine which implements Bayesian Sets:
http://www.gatsby.ucl.ac.uk/~heller/bsets.pdf
http://thenoisychannel.com/2010/04/04/guest-post-information-retrieval-using-a-bayesian-model-of-learning-and-generalization/
-
"""
-__version__ = '0.2'
+__version__ = '0.5'
__author__ = 'Alex Ksikes <alex.ksikes@gmail.com>'
__license__ = 'GPL'
-import bsets
from bsets import *
from simsphinx import *
-import utils
+from indexer import *
