Moving to version 0.5 with the file index, indexer and better Sphinx integration.

- added a file index which represents the binary matrix
- added indexer
- added better SimClient to combine full text with item based search queries
- added tutorial on how to use
- fixed documentation
alexksikes committed Oct 3, 2012
1 parent 7527833 commit d60fa03
Showing 25 changed files with 1,736 additions and 764 deletions.
18 changes: 12 additions & 6 deletions INSTALL.md
@@ -1,16 +1,22 @@
Download and extract the latest tarball and install the package:

wget http://github.com/alexksikes/SimSearch/tarball/master
tar xvzf "the tar ball"
cd "the tar ball"
python setup.py install

You will need [SciPy][1], which is used for sparse matrix multiplications. To combine full text search with similarity search, you will need [Sphinx][2] and
[fSphinx][3].

Installing fSphinx and Sphinx is pretty straightforward. On Linux (Debian), you may need the following libraries in order to install SciPy:

sudo aptitude install libamd2.2.0 libblas3gf libc6 libgcc1 libgfortran3 liblapack3gf libumfpack5.4.0 libstdc++6 build-essential gfortran libatlas-base-dev python-all-dev

Finally, you can install SciPy:

pip install numpy
pip install scipy

[1]: http://www.scipy.org/
[2]: http://sphinxsearch.com/docs/manual-2.0.1.html#installation
[3]: http://github.com/alexksikes/fSphinx/
102 changes: 12 additions & 90 deletions README.md
@@ -1,95 +1,17 @@
SimSearch is an item based retrieval engine which implements [Bayesian Sets][0]. Bayesian Sets is a new framework for information retrieval in which a query consists of a set of items which are examples of some concept. The result is a set of items which attempts to capture the example concept given by the query.

For example, for the query with the two animated movies, ["Lilo & Stitch" and "Up"][1], Bayesian Sets would return other similar animated movies like "Toy Story". There is a nice [blog post][2] about item based search with Bayesian Sets. Feel free to [read][2] through it.

This module also adds the novel ability to combine full text queries with items. For example, a query can be a combination of items and full text search keywords. In this case the results match the keywords but are also re-ranked by similarity to the queried items.
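The re-ranking idea can be sketched as follows. This is a toy illustration, not the actual SimSearch or fSphinx API; the ids and log scores below are made up:

```python
# Illustrative sketch (not the real SimSearch/fSphinx API): re-rank the
# ids matched by a keyword query using log scores from an item query.

def rerank(fulltext_ids, log_scores):
    """Order keyword matches by how similar they are to the queried items."""
    return sorted(fulltext_ids,
                  key=lambda i: log_scores.get(i, float('-inf')),
                  reverse=True)

# hypothetical ids returned by the keyword search, and hypothetical
# log scores computed from the item query
matches = [101, 205, 307]
scores = {205: 4.2, 101: 1.3, 307: 3.1}
print(rerank(matches, scores))  # -> [205, 307, 101]
```

Items that match the keywords but were never scored by the item query sink to the bottom here; the real combination is done inside Sphinx via an attribute (see the tutorial).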

It is important to note that Bayesian Sets does not care about the actual [feature][3] engineering. In this respect SimSearch only implements a simple [bag of words][4] model, but other feature types are possible. In fact the index is made of a set of files which represent the presence of a feature value in a given item. As long as you can create these files, SimSearch can read them and perform its matching.
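To make the matching concrete, here is a minimal Bayesian Sets scorer over a dense 0/1 item-feature matrix, following the Ghahramani & Heller paper. SimSearch itself works on a sparse CSR matrix with precomputed hyper parameters; the toy matrix, the function name and the value `c=2` below are illustrative only:

```python
import numpy as np

def bsets_scores(X, query, c=2.0):
    """Score every item by similarity to the items listed in `query`."""
    X = np.asarray(X, dtype=float)
    N = len(query)
    mean = X.mean(axis=0).clip(1e-9, 1 - 1e-9)   # avoid log(0)
    alpha, beta = c * mean, c * (1 - mean)       # per-feature priors
    s = X[query].sum(axis=0)                     # feature counts in the query
    q = (np.log(alpha + s) - np.log(alpha)
         - np.log(beta + N - s) + np.log(beta))  # per-feature weights
    return X @ q                                 # log scores up to a constant

# items 0 and 1 share two features; item 2 shares none with them
X = [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 0, 1]]
scores = bsets_scores(X, query=[0])
print(scores)  # item 0 ranks first, item 1 second, item 2 last
```

Because the score is linear in the feature vector, ranking a whole collection is a single sparse matrix-vector multiplication, which is why SciPy and CSR matrices are used.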

SimSearch has been [tested][5] on datasets with millions of documents and hundreds of thousands of features. Future plans include distributed searching and real time indexing. For more information, please follow the [tutorial][6].

[0]: http://www.gatsby.ucl.ac.uk/~heller/bsets.pdf
[1]: http://imdb.cloudmining.net/search?q=%28%40similar+1049413+url--c2%2Fc29a902a5426d4917c0ca2d72a769e5b--title--Up%29++%28%40similar+198781+url--0b%2F0b994b7d73e0ccfd928bd1dfb2d02ce3--title--Monsters%2C+Inc.%29
[2]: http://thenoisychannel.com/2010/04/04/guest-post-information-retrieval-using-a-bayesian-model-of-learning-and-generalization/
[3]: http://en.wikipedia.org/wiki/Feature_(machine_learning)
[4]: http://en.wikipedia.org/wiki/Bag_of_words
[5]: http://imdb.cloudmining.net
[6]: https://github.com/alexksikes/SimSearch/tree/master/tutorial/
25 changes: 17 additions & 8 deletions TODO
@@ -1,15 +1,24 @@
[*] separate feature creation from computed index

[ ] incremental indexing
- use mode 'append' but the index needs to be recomputed

[ ] distributed computation of the sparse multiplication
- use multi-processing module
- have workers compute a chunk of the matrix (a sequential list of items)
- merge sort each worker result
- across machines (not just cores), we need distributed indexes as well

[ ] implement other feature types besides bag of words
- some basic image features (color histogram)

[ ] for bag of words features:
- multiple features in one table
- normalize the feature values
- database agnostic

[ ] SSCursor is better for fetching lots of rows but still has problems:
http://stackoverflow.com/questions/337479/how-to-get-a-row-by-row-mysql-resultset-in-python

[*] add feature value information right into the index (ComputedIndex.index_to_feat)

[ ] return only a restricted set of ids
[ ] to speed things up, we could actually only perform the matrix multiplication on the remaining ids
(either by looping over each item or by manipulating the matrix)
29 changes: 0 additions & 29 deletions config_example.py

This file was deleted.

2 changes: 1 addition & 1 deletion setup.py
@@ -7,7 +7,7 @@
'''

setup(name='SimSearch',
version='0.5',
description='Implementation of Bayesian Sets for fast similarity searches',
author='Alex Ksikes',
author_email='alex.ksikes@gmail.com',
10 changes: 3 additions & 7 deletions simsearch/__init__.py
@@ -1,19 +1,15 @@
#!/usr/bin/env python

"""SimSearch is an item based retrieval engine which implements Bayesian Sets:
http://www.gatsby.ucl.ac.uk/~heller/bsets.pdf
http://thenoisychannel.com/2010/04/04/guest-post-information-retrieval-using-a-bayesian-model-of-learning-and-generalization/
"""

__version__ = '0.5'
__author__ = 'Alex Ksikes <alex.ksikes@gmail.com>'
__license__ = 'GPL'

from bsets import *
from simsphinx import *
from indexer import *
