
A couple of bug fixes

- improved documentation
- fixed sphinxapi bug with socket timeout and SetOverride
- time taken is now properly reported
- fixed issue with caching and json serialization
commit 0c1ad40d4f7cfaa638fe78680064185d1296b588 1 parent c0251c9
@alexksikes authored
5 INSTALL.md
@@ -5,8 +5,7 @@ Download and extract the latest tarball and install the package:
cd "the tar ball"
python setup.py install
-You will need [SciPy][1] which is used for sparse matrix multiplications. To combine full text search with similarity search, you will need [Sphinx][2] and
-[fSphinx][3].
+You will need [SciPy][1] for sparse matrix multiplications. To combine full text search with similarity search, you will need [Sphinx][2] and [fSphinx][3].
Installing fSphinx and Sphinx is pretty straightforward. On Linux (Debian), to install SciPy you may need the following libraries:
@@ -18,5 +17,5 @@ pip install numpy
pip install scipy
[1]: http://www.scipy.org/
-[2]: http://sphinxsearch.com/docs/manual-2.0.1.html#installation
+[2]: http://sphinxsearch.com/docs/current.html#installation
[3]: http://github.com/alexksikes/fSphinx/
6 README.md
@@ -2,11 +2,11 @@ SimSearch is an item based retrieval engine which implements [Bayesian Sets][0].
For example, for the query with the two animated movies, ["Lilo & Stitch" and "Up"][1], Bayesian Sets would return other similar animated movies like "Toy Story". There is a nice [blog post][2] about item based search with Bayesian Sets. Feel free to [read][2] through it.
-This module also adds the novel ability to combine full text queries with items. For example a query can be a combination of items and full text search keywords. In this case the results match the keywords but are also re-ranked by similary to the queried items.
+This module also adds the novel ability to combine full text queries with items. For example a query can be a combination of items and full text search keywords. In this case the results match the keywords and are re-ranked by similarity to the queried items.
-It is important to note that Bayesian Sets does not care about how the actual [feature][3] engineering. In this respect SimSearch only implements a simple [bag of words][4] model but other feature types are possible. In fact the index is made of a set of files which represent the presence of a feature value in a given item. As long as you can create these files, SimSearch can read them and perform its matching.
+It is important to note that Bayesian Sets does not care about how the actual [feature][3] engineering is done. As an example, SimSearch implements a simple [bag of words][4] model, but any other binary features are possible. In that case you will need to create the index directly. The index is a set of files in a .xco and .yco format (more in the [tutorial][6]) that represents the presence of a feature value in a given item. So as long as you can create these files, SimSearch can read them and perform the matching (a short sketch follows the links below).
-SimSearch has been [tested][5] on datasets with millions of documents and hundreds of thousands of features. Future plans include distributed searching and real time indexing. For more information, please follow the [tutorial][6] for more information.
+SimSearch has been [tested][5] on datasets with millions of documents and hundreds of thousands of features. Future plans include distributed search and real time indexing. For more information, feel free to follow the [tutorial][6].
[0]: http://www.gatsby.ucl.ac.uk/~heller/bsets.pdf
[1]: http://imdb.cloudmining.net/search?q=%28%40similar+1049413+url--c2%2Fc29a902a5426d4917c0ca2d72a769e5b--title--Up%29++%28%40similar+198781+url--0b%2F0b994b7d73e0ccfd928bd1dfb2d02ce3--title--Monsters%2C+Inc.%29
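As a rough sketch of the index format described above (assumptions: one matrix coordinate per line across the .xco/.yco pair; the real on-disk layout is whatever simsearch's FileIndex writes):

    # sketch only -- assumes .xco/.yco hold one coordinate per line
    # row i = item, column j = feature value; a pair (i, j) marks presence
    items = [1049413, 198781]           # row index is the position in .ids
    features = ['animated', 'pixar']    # column index is the line in .fts
    coords = [(0, 0), (0, 1), (1, 0)]

    with open('index.xco', 'w') as xco, open('index.yco', 'w') as yco:
        for x, y in coords:
            xco.write('%d\n' % x)
            yco.write('%d\n' % y)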
19 simsearch/bsets.py
@@ -29,7 +29,7 @@ def __init__(self, index_path):
self._compute_hyper_parameters()
index.close()
- @utils.show_time_taken
+ #@utils.show_time_taken
def _load_file_index(self, index_path):
logger.info("Loading file index ...")
return indexer.FileIndex(index_path, mode='read')
@@ -68,7 +68,7 @@ def __init__(self, computed_index):
utils.auto_assign(self, vars(computed_index))
self.computed_index = computed_index
self.time = 0
-
+
def query(self, item_ids, max_results=100):
"""Queries the given computed against the given item ids.
"""
@@ -179,14 +179,14 @@ def _compute_detailed_scores(self, item_ids, query_item_ids=None, max_terms=20):
return scores
- def _update_time_taken(self, reset=False):
+ def _update_time_taken(self):
self.time = (
- + getattr(self._make_query_vector, 'time_taken', 0)
- + getattr(self._compute_scores, 'time_taken', 0)
- + getattr(self._order_indexes_by_scores, 'time_taken', 0)
- + getattr(self._compute_detailed_scores, 'time_taken', 0)
+ + getattr(self, '_time_taken__make_query_vector', 0)
+ + getattr(self, '_time_taken__compute_scores', 0)
+ + getattr(self, '_time_taken__order_indexes_by_scores', 0)
+ + getattr(self, '_time_taken__compute_detailed_scores', 0)
)
-
+
@property
def results(self):
"""Returns the results as a ResultSet object.
@@ -196,7 +196,8 @@ def results(self):
self._update_time_taken()
def get_tuple_item_id_score(scores):
- return [(self.index_to_item_id[i], scores[i]) for i in self.ordered_indexes]
+ # index_to_item_id[i] is a numpy int; convert to native Python types for JSON serialization
+ return [(int(self.index_to_item_id[i]), float(scores[i])) for i in self.ordered_indexes]
#return ((self.index_to_item_id[i], scores[i]) for i in self.ordered_indexes)
return ResultSet(
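The int()/float() casts above matter because numpy scalars are generally not JSON serializable; a minimal illustration, assuming numpy is installed:

    import json
    import numpy as np

    ids, scores = np.array([42, 7]), np.array([0.5, 1.5])
    # json.dumps(ids[0]) typically raises TypeError (not JSON serializable),
    # so cast numpy scalars to native Python types first
    print json.dumps([(int(i), float(s)) for i, s in zip(ids, scores)])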
73 simsearch/simsphinx.py
@@ -2,11 +2,6 @@
__all__ = ['SimClient', 'QuerySimilar', 'QueryTermSimilar']
-# TODO:
-# - we need to check caching again
-# - see if some things can be simplified
-# - printing the results ...
-# - review tutorial
import re
import sys
@@ -31,12 +26,12 @@ class SimClient(object):
def __init__(self, cl=None, query_handler=None, sphinx_setup=None, **opts):
# essential options
self.Wrap(cl)
- if opts.get('index_path'):
- self.LoadIndex(opts['index_path'])
- else:
- self.query_handler = query_handler
+ self.query_handler = query_handler
self.SetSphinxSetup(sphinx_setup)
# other options
+ index_path = opts.get('index_path', '')
+ if index_path:
+ self.LoadIndex(index_path)
self.max_items = opts.get('max_items', 1000)
self.max_terms = opts.get('max_terms', 20)
self.exclude_queried = opts.get('exclude_queried', True)
@@ -61,8 +56,7 @@ def Wrap(self, cl):
def LoadIndex(self, index_path):
"""Load the similarity search index in memory.
"""
- idx = bsets.load_index(index_path)
- self.query_handler = bsets.QueryHandler(idx)
+ self.query_handler = bsets.QueryHandler(bsets.load_index(index_path))
def SetSphinxSetup(self, setup):
"""Set the setup function which will be triggered in similarity search
@@ -75,26 +69,30 @@ def SetSphinxSetup(self, setup):
"""
self.sphinx_setup = setup
- def Query(self, query):
+ def Query(self, query, index='*', comment=''):
"""If the query has item ids perform a similarity search query otherwise
perform a normal sphinx query.
"""
# parse the query which is assumed to be a string
self.query = self.query_parser.Parse(query)
+ self.time_similarity = 0
item_ids = self.query.GetItemIds()
if item_ids:
# perform similarity search on the set of query items
- results = self.DoSimQuery(item_ids)
+ log_scores = self.DoSimQuery(item_ids)
# setup the sphinx client with log scores
- self._SetupSphinxClient(item_ids, dict(results.log_scores))
+ self._SetupSphinxClient(item_ids, dict(log_scores))
# perform the Sphinx query
- hits = self.DoSphinxQuery(self.query)
+ hits = self.DoSphinxQuery(self.query, index, comment)
if item_ids:
# add the statistics to the matches
- self._AddStats(hits, results)
+ self._AddStats(hits, item_ids)
+
+ # and other statistics
+ hits['time_similarity'] = self.time_similarity
return hits
@@ -102,9 +100,12 @@ def Query(self, query):
def DoSimQuery(self, item_ids):
"""Performs the actual simlarity search query.
"""
- return self.query_handler.query(item_ids, self.max_items)
+ results = self.query_handler.query(item_ids, self.max_items)
+ self.time_similarity = results.time
+
+ return results.log_scores
- def DoSphinxQuery(self, query):
+ def DoSphinxQuery(self, query, index='*', comment=''):
"""Peforms a normal sphinx query.
"""
if isinstance(self.wrap_cl, FSphinxClient):
@@ -114,9 +115,8 @@ def DoSphinxQuery(self, query):
return self.wrap_cl.Query(query.sphinx)
def _SetupSphinxClient(self, item_ids, log_scores):
- # if the setup is in a configuration file
- if self.sphinx_setup:
- self.sphinx_setup(self.wrap_cl)
+ # this fixes a nasty bug in the sphinxapi with sockets timing out
+ self.wrap_cl._timeout = None
# override log_score_attr and exclude selected ids
self.wrap_cl.SetOverride('log_score_attr', sphinxapi.SPH_ATTR_FLOAT, log_scores)
@@ -125,19 +125,24 @@ def _SetupSphinxClient(self, item_ids, log_scores):
# only hits with non zero log scores are considered if the query is empty
if not self.query.sphinx and self.allow_empty:
- self.wrap_cl.SetFilterFloatRange('log_score_attr', 0.0, 1.0, exclude=True)
+ self.wrap_cl.SetFilterFloatRange('log_score_attr', 0.0, 1.0, exclude=True)
- def _AddStats(self, sphinx_results, sim_results):
- # add detailed scoring information
- scores = self._GetDetailedScores(sphinx_results['ids'])
+ # further setup of the wrapped sphinx client
+ if self.sphinx_setup:
+ self.sphinx_setup(self.wrap_cl)
+
+ def _AddStats(self, sphinx_results, item_ids):
+ scores = self._GetDetailedScores(sphinx_results['ids'], item_ids)
for scores, match in zip(scores, sphinx_results['matches']):
match['attrs']['@sim_scores'] = scores
- # and other statitics
- sphinx_results['time_similarity'] = sim_results.time
@CacheIO
- def _GetDetailedScores(self, ids):
- return self.query_handler.get_detailed_scores(ids, max_terms=self.max_terms)
+ def _GetDetailedScores(self, result_ids, query_item_ids=None):
+ scores = self.query_handler.get_detailed_scores(
+ result_ids, query_item_ids, max_terms=self.max_terms)
+ self.time_similarity = self.query_handler.time
+
+ return scores
def Clone(self, memo={}):
"""Creates a copy of this client.
@@ -159,14 +164,8 @@ def __deepcopy__(self, memo):
def FromConfig(cls, path):
"""Creates a client from a config file.
"""
- # if path is a module
- if hasattr(path, '__file__'):
- path = os.path.splitext(path.__file__)[0] + '.py'
-
- for d in utils.get_all_sub_dirs(path)[::-1]:
- sys.path.insert(0, d)
- cf = {'sys':sys}; execfile(path, cf, cf)
- return SimClient(**cf)
+ return FSphinxClient.FromConfig(path)
+
class QueryTermSimilar(QueryTerm):
"""This is like an fSphinx multi-field query but with the representation of
10 simsearch/utils.py
@@ -17,12 +17,12 @@ def get_basic_logger():
def show_time_taken(func):
- def new(*args, **kw):
+ def new(self, *args, **kw):
start = time.time()
- res = func(*args, **kw)
+ res = func(self, *args, **kw)
timed = time.time() - start
- setattr(new, 'time_taken', timed)
logger.info('%.2f sec.', timed)
+ setattr(self, '_time_taken_'+func.__name__, timed)
return res
return new
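The old decorator stored time_taken on the wrapper function itself, a single slot shared by every caller; binding it to self keeps timings per instance. A minimal sketch of reading it back (the class and method here are made up for illustration):

    from simsearch.utils import show_time_taken

    class Handler(object):
        @show_time_taken
        def crunch(self, n):
            return sum(xrange(n))

    h1, h2 = Handler(), Handler()
    h1.crunch(10); h2.crunch(10 ** 6)
    # each instance now carries its own timing slot
    print h1._time_taken_crunch, h2._time_taken_crunch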
@@ -133,8 +133,8 @@ def __setattr__(self, name, value):
def parse_config_file(path, **opts):
cf = {}
execfile(path, cf, cf)
- cf.update(**opts)
- return _O(cf)
+ opts.update(**cf)
+ return _O(opts)
def argsort_best(arr, best_k, reverse=False):
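Note the precedence flip in parse_config_file: values defined in the config file now override the caller's keyword arguments, which act only as fallback defaults. A quick sketch (the file contents are made up, and this assumes the _O wrapper exposes keys as attributes, as its __setattr__ suggests):

    # suppose config.py contains the single line: max_terms = 5
    cf = parse_config_file('config.py', max_terms=20, debug=False)
    print cf.max_terms   # 5 -- the file wins over the keyword argument
    print cf.debug       # False -- keywords still fill in missing values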
16 tests/test_argsort_best.py
@@ -1,15 +1,27 @@
import numpy as np
import sys
+import time
from simsearch import utils
-@utils.show_time_taken
+def show_time_taken(func):
+ def new(*args, **kw):
+ start = time.time()
+ res = func(*args, **kw)
+ timed = time.time() - start
+ utils.logger.info('%.2f sec.', timed)
+ setattr(new, 'time_taken', timed)
+ return res
+ return new
+
+
+@show_time_taken
def argsort(arr):
arr.argsort(0)
-@utils.show_time_taken
+@show_time_taken
def argsort_best(arr, best_k, reverse=False):
return utils.argsort_best(arr, best_k, reverse)
34 tutorial/README.md
@@ -1,9 +1,9 @@
In this tutorial, we will show how to use SimSearch to find similar movies. The dataset is taken from a scrape of the top 400 movies found on IMDb. We assume the current working directory to be the "tutorial" directory. All the code samples can be found in the file "./test.py".
-Loading the Data in the Database
---------------------------------
+Loading the Data
+----------------
-First thing we need is some data. The dataset in this example is the same as featured in the [fSphinx tutorial][0]. If you don't already have it, create a MySQL database called "fsphinx" with user and password "fsphinx".
+First thing we need is some data. We will be using the same dataset as the one in the [fSphinx tutorial][0]. If you don't already have the data, create a MySQL database called "fsphinx" with user and password "fsphinx".
In a MySQL shell type:
@@ -49,9 +49,9 @@ Now let's create an index and add some keywords of interest:
index.add(107048, 'weather forecasting')
index.close()
-SimSearch has created 4 files called .xco, .yxo, .ids and .fts in ./data/sim-index/. The files .xco and .yco are the x and y coordinates of the binary matrix. This matrix represents the presence of a feature for a given item id. The file .ids keep track of the item ids with respect to their index in this matrix. The .fts keep track of the feature values. The line number of the file is the actual matrix index.
+SimSearch has created 4 files called .xco, .yco, .ids and .fts in ./data/sim-index/. The files .xco and .yco are the x and y coordinates of the binary matrix. This matrix represents the presence of a feature for a given item. The file .ids keeps track of all the item ids with respect to their index in this matrix. Similarly the file .fts keeps track of the feature values. The line number of the file is the actual matrix index.
-If we'd like to build a larger index from a database, we would use an indexer. Let's build an index of all the plot keywords found on IMDb for this database.
+If we'd like to build a larger index from a database, we would use the indexer. Let's build an index with features from all the plot keywords found on this sample IMDb dataset.
# let's create our index
index = simsearch.FileIndex('./data/sim-index', mode='write')
@@ -77,7 +77,7 @@ If we'd like to build a larger index from a database, we would use an indexer.
2012-10-03 11:34:12,895 - INFO - Number of features: 13607
2012-10-03 11:34:12,895 - INFO - 1.29 sec.
-It is important to note that the bag of words iterator is just an example. The indexer can take any iterator as long as the couple (item\_id, feature\_value) is returned. The id must always be an integer and the feature_value is a unique string representation of that feature value. However please note that you do not need to use these tools. In fact if you can directly create the matrix in .xco and .yco format, SimSearch can read it and perform its magic. For example the matrix could represent user's preferences. In this case the matrix would be the coordinates (item\_id, user\_id) indicating that user_id liked item_id. In this case the items are thought to be similar if they share a set of users liking them (the "you may also like" Amazon feature ...).
+It is important to note that the bag of words iterator is just an example. The indexer can take any iterator which returns the pair (item\_id, feature\_value) for a given item. The id must be an integer and the feature_value must be a unique string representation of the feature value. However please note that you can also directly create the matrix in .xco and .yco format and then have SimSearch read it. In fact SimSearch does not care how the features are extracted. All that SimSearch does is the actual matching of items with respect to these features. For example the matrix could represent user preferences. In this case the coordinates (item\_id, user\_id) would indicate that user_id has liked item_id. The items are then thought to be similar if they share a set of users liking them (the "you may also like" Amazon feature ...).
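For instance, a hedged sketch of indexing such preference data directly with the FileIndex API shown earlier (the likes data is made up):

    # each (item_id, feature_value) pair records that a user liked an item;
    # the "feature" is just a unique string for that user
    likes = [(111161, 'user:1'), (111161, 'user:2'), (198781, 'user:1')]

    index = simsearch.FileIndex('./data/sim-index', mode='write')
    for item_id, user_id in likes:
        index.add(item_id, user_id)
    index.close()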
Querying the Index
------------------
@@ -126,7 +126,7 @@ SimSearch does not have a storage engine. Instead we have to query our database
| 55031 | Judgment at Nuremberg |
+---------+------------------------------+
-Ok obvisouly it matched itself, but why did "Prison Break" and "In the Name of the Father" matched?
+OK, obviously it matched itself, but why did "Prison Break" and "In the Name of the Father" match?
# let's get detailed scores for the movie id 455275 and 107207
scores = handler.get_detailed_scores([455275, 107207], max_terms=5)
@@ -147,14 +147,14 @@ Ok obvisouly it matched itself, but why did "Prison Break" and "In the Name of t
Of course things would be much more interesting if we could index all movies in IMDb and consider other feature types such as directors or actors or preference data.
-Note that the query handler is not thread safe. It is mearly meant to be used once and thrown away after each new query. However the computed index is and should be loaded somewhere in memory so it can be reused for subsequent queries. Also note that SimSearch is not limited to single item queries, you can just as quickly perform multiple item queries. Care to know what the movies "Lilo & Stitch" and "Up" [have in common][1]?
+Note that the query handler is not thread safe. It is merely meant to be used once and thrown away after each new query. The computed index, however, is thread safe and should be loaded in memory so it can be reused for subsequent queries. Also note that SimSearch is not limited to single item queries; you can just as quickly perform multiple item queries. Care to know what the movies "Lilo & Stitch" and "Up" [have in common][1]?
-Although this is a toy example, SimSearch has been shown to perform quite well on millions of documents each having hundreds of thousands of possible feature values. There are also future plans to implement distributed search as well as real time indexing.
+Although this is a toy example, SimSearch has been shown to perform quite well on millions of documents each having hundreds of thousands of possible feature values. There are also plans to implement distributed search and real time indexing.
-Combining Full Text Search with Similarity Search
--------------------------------------------------
+Combining Full Text Search
+--------------------------
-Ok this is rather interesting, however sometimes we'd like to combine full text with item based search. For example we'd like to search for specific keywords and order these results based on how similar they are to a given set of items. This is accomplished by using the simsphinx module. The full text search query is handled by [Sphinx][2] so a little bit of setting up is necessary first.
+OK, this is rather interesting. However, sometimes we'd like to combine full text with item based search. For example we'd like to search for specific keywords and order these results based on how similar they are to a given set of items. This is accomplished by using the simsphinx module. The full text search query is handled by [Sphinx][2] so a little bit of setting up is necessary.
First you need to install [Sphinx][2] and [fSphinx][3].
@@ -182,10 +182,10 @@ We are now ready to combine full text search with item based search.
# creating a sphinx client to handle full text search
cl = simsearch.SimClient(fsphinx.FSphinxClient(), handler, max_terms=5)
-A SimClient wraps any SphinxClient to provide it with similarity search ability.
+A SimClient wraps a SphinxClient to provide it with similarity search ability.
- # assuming searchd is running on 9315
- cl.SetServer('localhost', 9315)
+ # assuming searchd is running on 10001
+ cl.SetServer('localhost', 10001)
# telling fsphinx how to fetch the results
db = fsphinx.utils.database(dbn='mysql', **db_params)
@@ -206,7 +206,7 @@ A SimClient wraps any SphinxClient to provide it with similarity search ability.
# searching for all animation movies re-ranked by similarity to "The Shawshank Redemption"
results = cl.Query('@genres animation @similar 111161')
-On seeing the query term "@similar 111161", the client performs a similarity search and then sets the log_score_attr accordingly. Let's have a look at these results:
+On seeing the query term "@similar 111161", the client performed a similarity search and then set the log_score_attr accordingly. Let's have a look at these results:
# looking at the results with similarity search
print results
@@ -228,7 +228,7 @@ On seeing the query term "@similar 111161", the client performs a similarity sea
id=198781
title=Monsters, Inc.
-Again note that a SimClient is not thread safe. It is mearly meant to be used once or sequentially after each each request. In a web application you will need to create a new client for each new request. You can use SimClient.Clone on each new request for this purpose or you can create a new client from a config file with SimClient.FromConfig.
+Again note that a SimClient is not thread safe. It is merely meant to be used once or sequentially after each request. In a web application you will need to create a new client for each new request. You can use SimClient.Clone on each new request for this purpose or you can create a new client from a config file with SimClient.FromConfig.
That's pretty much it. I hope you'll enjoy using SimSearch and please don't forget to leave [feedback][5].
10 tutorial/config/indexer.conf
@@ -89,7 +89,7 @@ source items
index items
{
source = items
- path = data/sph_index/
+ path = data/sph-index/
#morphology = stem_en
#stopwords = data/stopwords.txt
@@ -105,14 +105,14 @@ indexer
searchd
{
- listen = localhost:9315
+ listen = localhost:10001
read_timeout = 5
max_children = 30
max_matches = 1000
seamless_rotate = 1
- log = data/sph_logs/searchd.log
- query_log = data/sph_logs/query.log
- pid_file = data/sph_logs/searchd.pid
+ log = data/sph-logs/searchd.log
+ query_log = data/sph-logs/query.log
+ pid_file = data/sph-logs/searchd.pid
}
0  tutorial/data/sph_logs/.gitignore → tutorial/data/sph-index/.gitignore
File renamed without changes
0  tutorial/data/sph_index/.gitignore → tutorial/data/sph-logs/.gitignore
File renamed without changes
4 tutorial/test.py
@@ -53,8 +53,8 @@
# creating a sphinx client to handle full text search
cl = simsearch.SimClient(fsphinx.FSphinxClient(), handler, max_terms=5)
-# assuming searchd is running on 9315
-cl.SetServer('localhost', 9315)
+# assuming searchd is running on 10001
+cl.SetServer('localhost', 10001)
# telling fsphinx how to fetch the results
db = fsphinx.utils.database(dbn='mysql', **db_params)