
First commit

0 parents commit 2fc259262893233dedf5c17be47584f89dd21125 @alexksikes committed Sep 6, 2011
Showing with 1,843 additions and 0 deletions.
  1. +16 −0 INSTALL.md
  2. +674 −0 LICENSE
  3. +95 −0 README.md
  4. +15 −0 TODO
  5. +29 −0 config_example.py
  6. +19 −0 setup.py
  7. +18 −0 simsearch/__init__.py
  8. +359 −0 simsearch/bsets.py
  9. +170 −0 simsearch/simsphinx.py
  10. +149 −0 simsearch/utils.py
  11. +38 −0 tests/test_argsort_best.py
  12. +11 −0 tests/test_simsphinx.py
  13. +47 −0 tools/client.py
  14. +45 −0 tools/index_features.py
  15. +64 −0 tools/query_index.py
  16. +94 −0 tools/server.py
16 INSTALL.md
@@ -0,0 +1,16 @@
+Download and extract the latest tarball and install the package:
+
+    wget http://github.com/alexksikes/SimilaritySearch/tarball/master
+    tar xvzf "the tar ball"
+    cd "the tar ball"
+    python setup.py install
+
+You will need [NumPy][1], which is used for sparse matrix multiplications.
+To combine full text search with similarity search, you will need [Sphinx][2] and
+[fSphinx][3].
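+
+Once everything above is installed, a quick sanity check is to make sure the package
+imports cleanly:
+
+    python -c "import numpy, simsearch"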
+
+Enjoy!
+
+[1]: http://numpy.scipy.org/
+[2]: http://sphinxsearch.com/docs/manual-2.0.1.html#installation
+[3]: http://github.com/alexksikes/fSphinx/
674 LICENSE

Large diffs are not rendered by default.

95 README.md
@@ -0,0 +1,95 @@
+This module is an implementation of [Bayesian Sets][1]. Bayesian Sets is a new
+framework for information retrieval in which a query consists of a set of items
+that are examples of some concept. The result is a set of items that attempts
+to capture the concept exemplified by the query.
+
+For example, for the query with the two animated movies, ["Lilo & Stitch" and "Up"][2],
+Bayesian Sets would return other similar animated movies, like "Toy Story".
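+
+As an illustration only (this is not the module's API), the ranking behind such a query
+boils down to the Bayesian Sets score from the paper. The helper below is a
+self-contained sketch of that computation over a binary bag-of-words matrix; the
+function name and the prior strength are illustrative choices, and scipy.sparse is used
+here for the CSR matrix:
+
+    import numpy as np
+    from scipy.sparse import csr_matrix
+
+    def bayesian_sets_scores(X, query_ids, prior_strength=2.0):
+        """Score every row of X against the query set; higher means more similar.
+
+        X         : (n_items, n_features) CSR matrix with 0/1 entries
+        query_ids : row indices of the query items
+        """
+        # empirical feature means define the Beta priors (alpha, beta)
+        m = np.clip(np.asarray(X.mean(axis=0)).ravel(), 1e-9, 1 - 1e-9)
+        alpha, beta = prior_strength * m, prior_strength * (1.0 - m)
+
+        # posterior counts given the N query items
+        N = len(query_ids)
+        s = np.asarray(X[query_ids].sum(axis=0)).ravel()
+        alpha_t, beta_t = alpha + s, beta + (N - s)
+
+        # per-feature weights q and additive constant c from the paper
+        q = np.log(alpha_t) - np.log(alpha) - np.log(beta_t) + np.log(beta)
+        c = np.sum(np.log(alpha + beta) - np.log(alpha + beta + N)
+                   + np.log(beta_t) - np.log(beta))
+
+        # a single sparse matrix-vector product ranks the whole collection
+        return c + X.dot(q)
+
+    # tiny usage example: items 0 and 3 form the query
+    X = csr_matrix(np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]]))
+    print(bayesian_sets_scores(X, [0, 3]))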
+
+This module also adds the novel ability to combine full text search with
+item based search. For example, a query can be a combination of items and full text
+keywords. In this case the results match the keywords but are re-ranked by how
+similar they are to the queried items.
+
+This implementation has been [tested][3] on datasets with millions of documents and
+hundreds of thousands of features. It has become an integral part of [Cloud Mining][4].
+At the moment only bag-of-words features are supported. However, it is fairly easy
+to change the code to make it work with other feature types.
+
+This module works as follows:
+
+1) First, a configuration file has to be written (have a look at config_example.py).
+The most important variable holds the list of features to index. These are indexed
+with SQL queries of the type:
+
+    sql_features = ['select id as item_id, word as feature from table']
+
+Note that id and word must be aliased as item_id and feature, respectively.
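+
+For instance, a bag-of-words feature over movie plot keywords might be declared as
+follows (the table and column names here are purely illustrative):
+
+    sql_features = [
+        'select movie_id as item_id, keyword as feature from plot_keywords'
+    ]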
+
+2) Now use tools/index_features.py on the configuration file to index those features.
+
+    python tools/index_features.py config.py
+
+The indexer will create a computed index named index.dat in your working directory.
+A computed index is a pickled file with all its hyperparameters already computed and
+with the matrix in CSR format.
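+
+Since the computed index is an ordinary pickle, it can also be loaded and inspected
+directly. A minimal sketch (nothing is assumed here about the object's attributes):
+
+    import pickle
+
+    with open('index.dat', 'rb') as f:
+        index = pickle.load(f)
+    print(type(index))  # the computed index object (see simsearch/bsets.py)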
+
+3) You can now test this index:
+
+    python tools/query_index.py index.dat
+
+4) The script *query_index.py* loads the index into memory on every run. In order to
+load it only once, you can serve the index with some client/server code (see
+tools/client.py and tools/server.py). The index can also be loaded alongside the web
+application. In [webpy][5], the web.config dictionary can be used for this purpose,
+as sketched below.
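+
+A minimal sketch of the webpy approach, assuming the index was saved as index.dat; the
+URL scheme and handler are made up for illustration, and the actual similarity query is
+left as a comment:
+
+    import pickle
+    import web
+
+    urls = (r'/similar/(\d+)', 'similar')
+    app = web.application(urls, globals())
+
+    # unpickle the computed index once, when the application starts
+    web.config.sim_index = pickle.load(open('index.dat', 'rb'))
+
+    class similar:
+        def GET(self, item_id):
+            index = web.config.sim_index  # reused on every request, never reloaded
+            # ... query the index with item_id and render the results ...
+            return 'similar items for %s' % item_id
+
+    if __name__ == '__main__':
+        app.run()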
+
+This module relies on [Sphinx][6] and [fSphinx][7] to combine full text search with
+item based search. A regular Sphinx client is wrapped together with a computed index,
+and a function called *sphinx_setup* is called upon similarity search.
+This function resets the Sphinx client whenever an item based query is encountered.
+
+Here is an example of a *sphinx_setup* function:
+
+    # this is only used for sim_sphinx (see doc)
+    def sphinx_setup(cl):
+        import sphinxapi
+
+        # custom sorting function for the search
+        # we always make sure highly ranked items with a log score are at the top
+        cl.SetSortMode(sphinxapi.SPH_SORT_EXPR, '@weight * log_score_attr')
+
+        # custom grouping function for the facets
+        group_func = 'sum(log_score_attr)'
+
+        # setup sorting and ordering of each facet
+        for f in cl.facets:
+            # group by a custom function
+            f.SetGroupFunc(group_func)
+
+Note that the log scores are held in the Sphinx attribute *log_score_attr*. It must be
+set to 1 and declared as a float attribute in your Sphinx configuration file:
+
+    # log_score_attr must be set to 1
+    sql_query = \
+        select *, \
+        1 as log_score_attr \
+        from table
+
+    # log_score_attr will hold the log scores after item based search
+    sql_attr_float = log_score_attr
+
+There is a nice [blog post][8] about item based search with Bayesian Sets. Feel free to
+[read][8] through it.
+
+That's it for the documentation. Have fun playing with item based search and don't forget
+to leave [feedback][9].
+
+[1]: http://www.gatsby.ucl.ac.uk/~heller/bsets.pdf
+[2]: http://imdb.cloudmining.net/search?q=%28%40similar+1049413+url--c2%2Fc29a902a5426d4917c0ca2d72a769e5b--title--Up%29++%28%40similar+198781+url--0b%2F0b994b7d73e0ccfd928bd1dfb2d02ce3--title--Monsters%2C+Inc.%29
+[3]: http://imdb.cloudmining.net
+[4]: https://github.com/alexksikes/CloudMining
+[5]: http://webpy.org/
+[6]: http://sphinxsearch.com/
+[7]: https://github.com/alexksikes/fSphinx
+[8]: http://thenoisychannel.com/2010/04/04/guest-post-information-retrieval-using-a-bayesian-model-of-learning-and-generalization/
+[9]: mailto:alex.ksikes@gmail.com&subject=SimSearch
15 TODO
@@ -0,0 +1,15 @@
+[ ] implement other feature types besides bag of words.
+
+[ ] for bag of words features:
+- multiple features in one table
+- same feature value for different features.
+- normalize the feature values.
+
+[ ] SSCursor is better to fetch lots of rows but still has problems:
+ http://stackoverflow.com/questions/337479/how-to-get-a-row-by-row-mysql-resultset-in-python
+
+[*] add feature value information right into the index (ComputedIndex.index_to_feat)
+
+[ ] return only a restricted set of ids
+[ ] to speed things up, we could perform the matrix multiplication only on the remaining ids
+    (either by looping over each item or by manipulating the matrix)
29 config_example.py
@@ -0,0 +1,29 @@
+# database parameters
+db_params = dict(user='user', passwd='password', db='dbname')
+
+# list of SQL queries to fetch the features from
+sql_features = [
+    'select id as item_id, word as feature from table',
+    'select id as item_id, word as feature from table2',
+    '...'
+]
+
+# path to read or save the index
+index_path = './index.dat'
+
+# maximum number of items to match
+max_items = 10000
+
+# this is only used for sim_sphinx (see doc)
+def sphinx_setup(cl):
+    # import sphinxapi
+
+    # custom sorting function for the search
+    # cl.SetSortMode(sphinxapi.SPH_SORT_EXPR, 'log_score_attr')
+
+    # custom grouping function for the facets
+    group_func = 'sum(log_score_attr)'
+
+    # setup sorting and ordering of each facet
+    for f in cl.facets:
+        f.SetGroupFunc(group_func)
19 setup.py
@@ -0,0 +1,19 @@
+#!/usr/bin/env python
+
+from distutils.core import setup
+
+long_description = '''
+Implementation of Bayesian Sets for fast similarity searches.
+'''
+
+setup(name='SimSearch',
+      version='0.2',
+      description='Implementation of Bayesian Sets for fast similarity searches',
+      author='Alex Ksikes',
+      author_email='alex.ksikes@gmail.com',
+      url='https://github.com/alexksikes/SimSearch',
+      download_url='https://github.com/alexksikes/SimSearch/zipball/0.2',
+      packages=['simsearch'],
+      long_description=long_description,
+      license='GPL'
+)
18 simsearch/__init__.py
@@ -0,0 +1,18 @@
+#!/usr/bin/env python
+
+"""This is an implementation of Bayesian Sets as described in:
+
+http://www.gatsby.ucl.ac.uk/~heller/bsets.pdf
+http://thenoisychannel.com/2010/04/04/guest-post-information-retrieval-using-a-bayesian-model-of-learning-and-generalization/
+
+"""
+
+__version__ = '0.2'
+__author__ = 'Alex Ksikes <alex.ksikes@gmail.com>'
+__license__ = 'GPL'
+
+from bsets import *
+from simsphinx import *
+import utils
