Commit d60fa03 (1 parent: 7527833)

Moving to version 0.5 with the file index, indexer and better Sphinx integration.

- added a file index which represents the binary matrix
- added indexer
- added better SimClient to combine full text with item based search queries
- added tutorial on how to use
- fixed documentation

Showing 25 changed files with 1,736 additions and 764 deletions.
@@ -1,16 +1,22 @@
 Download and extract the latest tarball and install the package:
 
-    wget http://github.com/alexksikes/SimilaritySearch/tarball/master
+    wget http://github.com/alexksikes/SimSearch/tarball/master
     tar xvzf "the tar ball"
     cd "the tar ball"
     python setup.py install
 
-You will need [NumPy][1] which is used for sparse matrix multiplications.
-To combine full text search with similarity search, you will need [Sphinx][2] and
+You will need [SciPy][1] which is used for sparse matrix multiplications. To combine full text search with similarity search, you will need [Sphinx][2] and
 [fSphinx][3].
 
-Enjoy!
+Installing fSphinx and Sphinx is pretty straightforward. On Linux (Debian), to install SciPy you may need the following libraries:
+
+    sudo aptitude install libamd2.2.0 libblas3gf libc6 libgcc1 libgfortran3 liblapack3gf libumfpack5.4.0 libstdc++6 build-essential gfortran libatlas-base-dev python-all-dev
+
+Finally you can install SciPy:
+
+    pip install numpy
+    pip install scipy
 
-[1]: http://numpy.scipy.org/
+[1]: http://www.scipy.org/
 [2]: http://sphinxsearch.com/docs/manual-2.0.1.html#installation
 [3]: http://github.com/alexksikes/fSphinx/
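The README's note that SciPy is needed "for sparse matrix multiplications" refers to the kind of operation sketched below. This is a minimal illustration, not code from this repository: the computed index holds a binary item-by-feature matrix in CSR format, and scoring reduces to a sparse matrix-vector product.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A toy binary item-by-feature matrix (3 items, 4 features) in CSR
# format, the representation the README says the computed index uses.
X = csr_matrix(np.array([[1, 0, 1, 0],
                         [1, 1, 0, 0],
                         [0, 0, 1, 1]]))

# Hypothetical per-feature weights standing in for the precomputed
# Bayesian Sets hyperparameters.
q = np.array([0.5, 0.1, 0.8, 0.2])

# Sparse matrix-vector product: one score per item.
scores = X.dot(q)
print(scores)  # -> [1.3 0.6 1. ]
```

This is why plain NumPy is not enough: `csr_matrix` keeps only the nonzero entries, so the product stays cheap even with millions of items and hundreds of thousands of features.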
@@ -1,95 +1,17 @@
-This module is an implementation of [Bayesian Sets][1]. Bayesian Sets is a new
-framework for information retrieval in which a query consists of a set of items
-which are examples of some concept. The result is a set of items which attempts
-to capture the example concept given by the query.
+SimSearch is an item based retrieval engine which implements [Bayesian Sets][0]. Bayesian Sets is a new framework for information retrieval in which a query consists of a set of items which are examples of some concept. The result is a set of items which attempts to capture the example concept given by the query.
 
-For example, for the query with the two animated movies, ["Lilo & Stitch" and "Up"][2],
-Bayesian Sets would return other similar animated movies, like "Toy Story".
+For example, for the query with the two animated movies, ["Lilo & Stitch" and "Up"][1], Bayesian Sets would return other similar animated movies like "Toy Story". There is a nice [blog post][2] about item based search with Bayesian Sets. Feel free to [read][2] through it.
 
-This module also adds the novel ability to combine full text search with
-item based search. For example a query can be a combination of items and full text search
-keywords. In this case the results match the keywords but are re-ranked by how similar
-they are to the queried items.
+This module also adds the novel ability to combine full text queries with items. For example a query can be a combination of items and full text search keywords. In this case the results match the keywords but are also re-ranked by similarity to the queried items.
 
-This implementation has been [tested][3] on datasets with millions of documents and
-hundreds of thousands of features. It has become an integral part of [Cloud Mining][4].
-At the moment only bag of words features are supported. However it is fairly easy
-to change the code to make it work on other feature types.
+It is important to note that Bayesian Sets does not care about the actual [feature][3] engineering. In this respect SimSearch only implements a simple [bag of words][4] model but other feature types are possible. In fact the index is made of a set of files which represent the presence of a feature value in a given item. As long as you can create these files, SimSearch can read them and perform its matching.
 
-This module works as follows:
+SimSearch has been [tested][5] on datasets with millions of documents and hundreds of thousands of features. Future plans include distributed searching and real time indexing. For more information, please follow the [tutorial][6].
 
-1) First a configuration file has to be written (have a look at tools/sample_config.py).
-The most important variable holds the list of features to index. Those are indexed
-with SQL queries of the type:
-
-    sql_features = ['select id as item_id, word as feature from table']
-
-Note that id and word must be aliased as item_id and feature respectively.
-
-2) Now use tools/index_features.py on the configuration file to index those features.
-
-    python tools/index_features.py config.py
-
-The indexer will create a computed index named index.dat in your working directory.
-A computed index is a pickled file with all its hyper parameters already computed and
-with the matrix in CSR format.
-
-3) You can now test this index:
-
-    python tools/query_index.py index.dat
-
-4) The script *query_index.py* will load the index in memory each time. In order to load it
-only once, you can serve the index with some client/server code (see client_server code).
-The index can also be loaded alongside the web application. In [webpy][5] the web.config
-dictionary can be used for this purpose.
-
-This module relies on [Sphinx][6] and [fSphinx][7] to perform the full-text and item based
-search combination. A regular sphinx client is wrapped together with a computed index,
-and a function called *setup_sphinx* is called upon similarity search.
-This function resets the sphinx client if an item based query is encountered.
-
-Here is an example of a *setup_sphinx* function:
-
-    # this is only used for sim_sphinx (see doc)
-    def sphinx_setup(cl):
-        import sphinxapi
-
-        # custom sorting function for the search
-        # we always make sure highly ranked items with a log score are at the top.
-        cl.SetSortMode(sphinxapi.SPH_SORT_EXPR, '@weight * log_score_attr')
-
-        # custom grouping function for the facets
-        group_func = 'sum(log_score_attr)'
-
-        # setup sorting and ordering of each facet
-        for f in cl.facets:
-            # group by a custom function
-            f.SetGroupFunc(group_func)
-
-Note that the log scores are found in the Sphinx attribute *log_score_attr*. It must be set
-to 1 and declared as a float in your Sphinx configuration file:
-
-    # log_score_attr must be set to 1
-    sql_query = \
-        select *, \
-        1 as log_score_attr \
-        from table
-
-    # log_score_attr will hold the log scores after item based search
-    sql_attr_float = log_score_attr
-
-There is a nice [blog post][8] about item based search with Bayesian Sets. Feel free to
-[read][8] through it.
-
-That's it for the documentation. Have fun playing with item based search and don't forget
-to leave [feedback][9].
-
-[1]: http://www.gatsby.ucl.ac.uk/~heller/bsets.pdf
-[2]: http://imdb.cloudmining.net/search?q=%28%40similar+1049413+url--c2%2Fc29a902a5426d4917c0ca2d72a769e5b--title--Up%29++%28%40similar+198781+url--0b%2F0b994b7d73e0ccfd928bd1dfb2d02ce3--title--Monsters%2C+Inc.%29
-[3]: http://imdb.cloudmining.net
-[4]: https://github.com/alexksikes/CloudMining
-[5]: http://webpy.org/
-[6]: http://sphinxsearch.com/
-[7]: https://github.com/alexksikes/fSphinx
-[8]: http://thenoisychannel.com/2010/04/04/guest-post-information-retrieval-using-a-bayesian-model-of-learning-and-generalization/
-[9]: mailto:alex.ksikes@gmail.com&subject=SimSearch
+[0]: http://www.gatsby.ucl.ac.uk/~heller/bsets.pdf
+[1]: http://imdb.cloudmining.net/search?q=%28%40similar+1049413+url--c2%2Fc29a902a5426d4917c0ca2d72a769e5b--title--Up%29++%28%40similar+198781+url--0b%2F0b994b7d73e0ccfd928bd1dfb2d02ce3--title--Monsters%2C+Inc.%29
+[2]: http://thenoisychannel.com/2010/04/04/guest-post-information-retrieval-using-a-bayesian-model-of-learning-and-generalization/
+[3]: http://en.wikipedia.org/wiki/Feature_(machine_learning)
+[4]: http://en.wikipedia.org/wiki/Bag_of_words
+[5]: http://imdb.cloudmining.net
+[6]: https://github.com/alexksikes/SimSearch/tree/master/tutorial/
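To make the README's description concrete, here is a minimal sketch of the Bayesian Sets scoring rule from the linked paper, for binary bag-of-words features. The function and variable names are illustrative only, not SimSearch's API: each candidate item gets a score that is linear in its feature vector, with per-feature weights computed from the query set and the Beta hyperparameters.

```python
import numpy as np

def bayesian_sets_weights(X_query, alpha, beta):
    """Per-feature log-weights from a query set of binary feature rows.

    X_query: (N, F) 0/1 array, one row per query item.
    alpha, beta: (F,) Beta prior hyperparameters.
    """
    N = X_query.shape[0]
    s = X_query.sum(axis=0)      # per-feature counts within the query set
    alpha_t = alpha + s          # posterior alpha
    beta_t = beta + N - s        # posterior beta
    # Weight applied to each feature of a candidate item; the additive
    # constant from the paper is dropped since it does not affect ranking.
    return np.log(alpha_t / alpha) - np.log(beta_t / beta)

# Toy example: a query of 2 items over 4 features.
X_query = np.array([[1, 0, 1, 0],
                    [1, 1, 1, 0]])
alpha = beta = np.full(4, 2.0)
q = bayesian_sets_weights(X_query, alpha, beta)

# Rank candidates by a simple dot product with the weight vector.
candidates = np.array([[1, 0, 1, 0],
                       [0, 1, 0, 1]])
scores = candidates.dot(q)
# The first candidate shares both strong query features, so it ranks higher.
```

Because the score is just a dot product, the whole collection can be ranked with one sparse matrix-vector multiplication, which is exactly why the index is kept as a binary matrix in CSR format.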
@@ -1,15 +1,24 @@
-[ ] implement other feature types besides bag of words.
+[*] separate feature creation from computed index
 
 [ ] incremental indexing
     - use mode 'append' but the index needs to be recomputed
 
 [ ] distributed computation of the sparse multiplication
     - use the multiprocessing module
     - have workers compute a chunk of the matrix (a sequential list of items)
     - merge sort each worker's result
     - across machines (not just cores), we need distributed indexes as well
 
+[ ] implement other feature types besides bag of words
+    - some basic image features (color histogram)
+
 [ ] for bag of words features:
-    - mulitple features in one table
-    - same feature value for different features.
-    - normalize the feature values.
+    - multiple features in one table
+    - same feature value for different features
+    - normalize the feature values
+    - database agnostic
 
 [ ] SSCursor is better to fetch lots of rows but still has problems:
     http://stackoverflow.com/questions/337479/how-to-get-a-row-by-row-mysql-resultset-in-python
 
 [*] add feature value information right into the index (ComputedIndex.index_to_feat)
 
 [ ] return only a restricted set of ids
-[ ] to speed things, we could actually only perform the matrix multiplication on he reamining ids
+[ ] to speed things up, we could actually only perform the matrix multiplication on the remaining ids
     (either by looping over each item or by manipulating the matrix)
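The "distributed computation of the sparse multiplication" item above could be prototyped with the standard multiprocessing module, along the lines of this hedged sketch. None of these names exist in the codebase; it only shows the shape of the idea: each worker scores a contiguous chunk of items and the parent concatenates the partial results in order.

```python
import numpy as np
from multiprocessing import Pool

# Toy stand-ins for the computed index (binary item-by-feature matrix)
# and the per-feature query weights; seeded so workers see the same data.
X = np.random.RandomState(0).randint(0, 2, size=(1000, 50)).astype(float)
q = np.random.RandomState(1).rand(50)

def score_chunk(bounds):
    # Each worker scores one contiguous slice of items.
    lo, hi = bounds
    return X[lo:hi].dot(q)

def parallel_scores(n_workers=4):
    # Split the item range into one chunk per worker, score the chunks
    # in parallel, then concatenate the partial score vectors in order.
    step = X.shape[0] // n_workers
    chunks = [(i * step, X.shape[0] if i == n_workers - 1 else (i + 1) * step)
              for i in range(n_workers)]
    with Pool(n_workers) as pool:
        parts = pool.map(score_chunk, chunks)
    return np.concatenate(parts)

if __name__ == '__main__':
    # Same result as the single-process product.
    assert np.allclose(parallel_scores(), X.dot(q))
```

Going across machines rather than cores would need the distributed indexes the TODO mentions, since each node must hold its own slice of the matrix.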
This file was deleted.
@@ -1,19 +1,15 @@
 #!/usr/bin/env python
 
-#!/usr/bin/env python
-
-"""This is an implementation of Bayesian Sets as described in:
+"""SimSearch is an item based retrieval engine which implements Bayesian Sets:
 http://www.gatsby.ucl.ac.uk/~heller/bsets.pdf
 http://thenoisychannel.com/2010/04/04/guest-post-information-retrieval-using-a-bayesian-model-of-learning-and-generalization/
 """
 
-__version__ = '0.2'
+__version__ = '0.5'
 __author__ = 'Alex Ksikes <alex.ksikes@gmail.com>'
 __license__ = 'GPL'
 
-import bsets
 from bsets import *
 from simsphinx import *
-import utils
+from indexer import *