Moving to version 0.5 with the file index, indexer and better Sphinx integration.

- added a file index which represents the binary matrix
- added indexer
- added better SimClient to combine full text with item based search queries
- added tutorial on how to use
- fixed documentation
alexksikes committed Oct 3, 2012
1 parent 7527833 commit d60fa03
Showing 25 changed files with 1,736 additions and 764 deletions.
18 changes: 12 additions & 6 deletions INSTALL.md
@@ -1,16 +1,22 @@
Download and extract the latest tarball and install the package:

wget http://github.com/alexksikes/SimSearch/tarball/master
tar xvzf "the tar ball"
cd "the tar ball"
python setup.py install

You will need [SciPy][1], which is used for sparse matrix multiplications. To combine full text search with similarity search, you will need [Sphinx][2] and
[fSphinx][3].

Installing fSphinx and Sphinx is pretty straightforward. On Linux (Debian), you may need the following libraries in order to install SciPy:

sudo aptitude install libamd2.2.0 libblas3gf libc6 libgcc1 libgfortran3 liblapack3gf libumfpack5.4.0 libstdc++6 build-essential gfortran libatlas-base-dev python-all-dev

Finally, you can install SciPy:

pip install numpy
pip install scipy

[1]: http://www.scipy.org/
[2]: http://sphinxsearch.com/docs/manual-2.0.1.html#installation
[3]: http://github.com/alexksikes/fSphinx/
102 changes: 12 additions & 90 deletions README.md
@@ -1,95 +1,17 @@
SimSearch is an item based retrieval engine which implements [Bayesian Sets][0]. Bayesian Sets is a new framework for information retrieval in which a query consists of a set of items which are examples of some concept. The result is a set of items which attempts to capture the example concept given by the query.

For example, for the query with the two animated movies, ["Lilo & Stitch" and "Up"][1], Bayesian Sets would return other similar animated movies like "Toy Story". There is a nice [blog post][2] about item based search with Bayesian Sets. Feel free to [read][2] through it.

This module also adds the novel ability to combine full text queries with items. For example, a query can be a combination of items and full text search keywords. In this case the results match the keywords but are also re-ranked by similarity to the queried items.
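The re-ranking idea can be sketched as follows. This is a toy illustration, not the actual SimSearch or fSphinx API; the ids and log scores below are made up:

```python
# Illustrative sketch (not the real SimSearch/fSphinx API): re-rank the
# ids matched by a keyword query using log scores from an item query.

def rerank(fulltext_ids, log_scores):
    """Order keyword matches by how similar they are to the queried items."""
    return sorted(fulltext_ids,
                  key=lambda i: log_scores.get(i, float('-inf')),
                  reverse=True)

# hypothetical ids returned by the keyword search, and hypothetical
# log scores computed from the item query
matches = [101, 205, 307]
scores = {205: 4.2, 101: 1.3, 307: 3.1}
print(rerank(matches, scores))  # -> [205, 307, 101]
```

Items that match the keywords but were never scored by the item query sink to the bottom here; the real combination is done inside Sphinx via an attribute (see the tutorial).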

It is important to note that Bayesian Sets does not care about the actual [feature][3] engineering. In this respect SimSearch only implements a simple [bag of words][4] model, but other feature types are possible. In fact the index is made of a set of files which represent the presence of a feature value in a given item. As long as you can create these files, SimSearch can read them and perform its matching.
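To make the matching concrete, here is a minimal Bayesian Sets scorer over a dense 0/1 item-feature matrix, following the Ghahramani & Heller paper. SimSearch itself works on a sparse CSR matrix with precomputed hyper parameters; the toy matrix, the function name and the value `c=2` below are illustrative only:

```python
import numpy as np

def bsets_scores(X, query, c=2.0):
    """Score every item by similarity to the items listed in `query`."""
    X = np.asarray(X, dtype=float)
    N = len(query)
    mean = X.mean(axis=0).clip(1e-9, 1 - 1e-9)   # avoid log(0)
    alpha, beta = c * mean, c * (1 - mean)       # per-feature priors
    s = X[query].sum(axis=0)                     # feature counts in the query
    q = (np.log(alpha + s) - np.log(alpha)
         - np.log(beta + N - s) + np.log(beta))  # per-feature weights
    return X @ q                                 # log scores up to a constant

# items 0 and 1 share two features; item 2 shares none with them
X = [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 0, 1]]
scores = bsets_scores(X, query=[0])
print(scores)  # item 0 ranks first, item 1 second, item 2 last
```

Because the score is linear in the feature vector, ranking a whole collection is a single sparse matrix-vector multiplication, which is why SciPy and CSR matrices are used.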

SimSearch has been [tested][5] on datasets with millions of documents and hundreds of thousands of features. Future plans include distributed searching and real time indexing. For more information, please follow the [tutorial][6].

[0]: http://www.gatsby.ucl.ac.uk/~heller/bsets.pdf
[1]: http://imdb.cloudmining.net/search?q=%28%40similar+1049413+url--c2%2Fc29a902a5426d4917c0ca2d72a769e5b--title--Up%29++%28%40similar+198781+url--0b%2F0b994b7d73e0ccfd928bd1dfb2d02ce3--title--Monsters%2C+Inc.%29
[2]: http://thenoisychannel.com/2010/04/04/guest-post-information-retrieval-using-a-bayesian-model-of-learning-and-generalization/
[3]: http://en.wikipedia.org/wiki/Feature_(machine_learning)
[4]: http://en.wikipedia.org/wiki/Bag_of_words
[5]: http://imdb.cloudmining.net
[6]: https://github.com/alexksikes/SimSearch/tree/master/tutorial/
25 changes: 17 additions & 8 deletions TODO
@@ -1,15 +1,24 @@
[*] separate feature creation from computed index

[ ] incremental indexing
- use mode 'append' but the index needs to be recomputed

[ ] distributed computation of the sparse multiplication
- use multi-processing module
- have workers compute a chunk of the matrix (a sequential list of items)
- merge sort each worker result
- across machines (not just cores), we need distributed indexes as well

[ ] implement other feature types besides bag of words
- some basic image features (color histogram)

[ ] for bag of words features:
- multiple features in one table
- normalize the feature values
- database agnostic

[ ] SSCursor is better for fetching lots of rows but still has problems:
http://stackoverflow.com/questions/337479/how-to-get-a-row-by-row-mysql-resultset-in-python

[*] add feature value information right into the index (ComputedIndex.index_to_feat)

[ ] return only a restricted set of ids
[ ] to speed things up, we could actually only perform the matrix multiplication on the remaining ids
(either by looping over each item or by manipulating the matrix)
29 changes: 0 additions & 29 deletions config_example.py

This file was deleted.

2 changes: 1 addition & 1 deletion setup.py
@@ -7,7 +7,7 @@
'''

setup(name='SimSearch',
version='0.5',
description='Implementation of Bayesian Sets for fast similarity searches',
author='Alex Ksikes',
author_email='alex.ksikes@gmail.com',
10 changes: 3 additions & 7 deletions simsearch/__init__.py
@@ -1,19 +1,15 @@
#!/usr/bin/env python

"""SimSearch is an item based retrieval engine which implements Bayesian Sets:
http://www.gatsby.ucl.ac.uk/~heller/bsets.pdf
http://thenoisychannel.com/2010/04/04/guest-post-information-retrieval-using-a-bayesian-model-of-learning-and-generalization/
"""

__version__ = '0.5'
__author__ = 'Alex Ksikes <alex.ksikes@gmail.com>'
__license__ = 'GPL'

from bsets import *
from simsphinx import *
from indexer import *
