
Moving to version 0.5

- added a file index which represents the binary matrix
- added indexer
- added a better SimClient to combine full text with item based search queries
- added a tutorial on how to use it
- fixed documentation
1 parent 7527833 commit 8443b8e0df2b1898244c37431db60377a9912765 @alexksikes committed Oct 3, 2012
18 INSTALL.md
@@ -1,16 +1,22 @@
Download and extract the latest tarball and install the package:
- wget http://github.com/alexksikes/SimilaritySearch/tarball/master
+ wget http://github.com/alexksikes/SimSearch/tarball/master
tar xvzf "the tar ball"
cd "the tar ball"
python setup.py install
-You will need [NumPy][1] which is used for sparse matrix multiplications.
-To combine full text search with similarity search, you will need [Sphinx][2] and
-[fSphinx][3].
+You will need [SciPy][1] which is used for sparse matrix multiplications. To combine full text search with similarity search, you will need [Sphinx][2] and
+[fSphinx][3].
-Enjoy!
+Installing fSphinx and Sphinx is pretty straightforward. On Linux (Debian), to install SciPy you may need the following libraries:
-[1]: http://numpy.scipy.org/
+sudo aptitude install libamd2.2.0 libblas3gf libc6 libgcc1 libgfortran3 liblapack3gf libumfpack5.4.0 libstdc++6 build-essential gfortran libatlas-base-dev python-all-dev
+
+Finally you can install scipy:
+
+pip install numpy
+pip install scipy
+
+[1]: http://www.scipy.org/
[2]: http://sphinxsearch.com/docs/manual-2.0.1.html#installation
[3]: http://github.com/alexksikes/fSphinx/
102 README.md
@@ -1,95 +1,17 @@
-This module is an implementation of [Bayesian Sets][1]. Bayesian Sets is a new
-framework for information retrieval in which a query consists of a set of items
-which are examples of some concept. The result is a set of items which attempts
-to capture the example concept given by the query.
+SimSearch is an item based retrieval engine which implements [Bayesian Sets][0]. Bayesian Sets is a new framework for information retrieval in which a query consists of a set of items which are examples of some concept. The result is a set of items which attempts to capture the example concept given by the query.
-For example, for the query with the two animated movies, ["Lilo & Stitch" and "Up"][2],
-Bayesian Sets would return other similar animated movies, like "Toy Story".
+For example, for the query with the two animated movies, ["Lilo & Stitch" and "Up"][1], Bayesian Sets would return other similar animated movies like "Toy Story". There is a nice [blog post][2] about item based search with Bayesian Sets. Feel free to [read][2] through it.
-This module also adds the novel ability to combine full text search with
-item based search. For example a query can be a combination of items and full text search
-keywords. In this case the results match the keywords but are re-ranked by how similar
-to the queried items.
+This module also adds the novel ability to combine full text queries with items. For example, a query can be a combination of items and full text search keywords. In this case the results match the keywords but are also re-ranked by similarity to the queried items.
-This implementation has been [tested][3] on datasets with millions of documents and
-hundreds of thousands of features. It has become an integrant part of [Cloud Mining][4].
-At the moment only features of bag of words are supported. However it is faily easy
-to change the code to make it work on other feature types.
+It is important to note that Bayesian Sets does not care about the actual [feature][3] engineering. In this respect SimSearch only implements a simple [bag of words][4] model, but other feature types are possible. In fact the index is made of a set of files which represent the presence of a feature value in a given item. As long as you can create these files, SimSearch can read them and perform its matching.
-This module works as follow:
+SimSearch has been [tested][5] on datasets with millions of documents and hundreds of thousands of features. Future plans include distributed searching and real-time indexing. For more information, please follow the [tutorial][6].
-1) First a configuration file has to be written (have a look at tools/sample_config.py).
-The most important variable holds the list of features to index. Those are indexed
-with SQL queries of the type:
-
- sql_features = ['select id as item_id, word as feature from table']
-
-Note that id and word must be aliased as item_id and feature respectively.
-
-2) Now use tools/index_features.py on the configuration file to index those features.
-
- python tools/index_features.py config.py
-
-The indexer will create a computed index named index.dat in your working directory.
-A computed index is a pickled file with all its hyper parameters already computed and
-with the matrix in CSR format.
-
-3) You can now test this index:
-
- python tools/query_index.py index.dat
-
-4) The script *query_index.py* will load the index in memory each time. In order to load it
-only once, you can serve the index with some client/server code (see client_server code).
-The index can also be loaded along side the web application. In [webpy][5] web.config
-dictionnary can be used for this purpose.
-
-This module relies and [Sphinx][6] and [fSphinx][7] to perform the full-text and item based
-search combination. A regular sphinx client is wrapped together with a computed index,
-and a function called *setup_sphinx* is called upon similarity search.
-This function resets the sphinx client if an item based query is encountered.
-
-Here is an example of a *setup_sphinx* function:
-
- # this is only used for sim_sphinx (see doc)
- def sphinx_setup(cl):
- import sphinxapi
-
- # custom sorting function for the search
- # we always make sure highly ranked items with a log score are at the top.
- cl.SetSortMode(sphinxapi.SPH_SORT_EXPR, '@weight * log_score_attr')'
-
- # custom grouping function for the facets
- group_func = 'sum(log_score_attr)'
-
- # setup sorting and ordering of each facet
- for f in cl.facets:
- # group by a custom function
- f.SetGroupFunc(group_func)
-
-Note that the log_scores are found in the Sphinx attributes *log_score_attr*. It must be set
-to 1 and declared as a float in your Sphinx configuration file:
-
- # log_score_attr must be set to 1
- sql_query = \
- select *,\
- 1 as log_score_attr,\
- from table
-
- # log_score_attr will hold the log scores after item base search
- sql_attr_float = log_score_attr
-
-There is a nice [blog post][8] about item based search with Bayesian Sets. Feel free to
-[read][8] through it.
-
-That's it for the documentation. Have fun playing with item based search and don't forget
-to leave [feedback][9].
-
-[1]: http://www.gatsby.ucl.ac.uk/~heller/bsets.pdf
-[2]: http://imdb.cloudmining.net/search?q=%28%40similar+1049413+url--c2%2Fc29a902a5426d4917c0ca2d72a769e5b--title--Up%29++%28%40similar+198781+url--0b%2F0b994b7d73e0ccfd928bd1dfb2d02ce3--title--Monsters%2C+Inc.%29
-[3]: http://imdb.cloudmining.net
-[4]: https://github.com/alexksikes/CloudMining
-[5]: http://webpy.org/
-[6]: http://sphinxsearch.com/
-[7]: https://github.com/alexksikes/fSphinx
-[8]: http://thenoisychannel.com/2010/04/04/guest-post-information-retrieval-using-a-bayesian-model-of-learning-and-generalization/
-[9]: mailto:alex.ksikes@gmail.com&subject=SimSearch
+[0]: http://www.gatsby.ucl.ac.uk/~heller/bsets.pdf
+[1]: http://imdb.cloudmining.net/search?q=%28%40similar+1049413+url--c2%2Fc29a902a5426d4917c0ca2d72a769e5b--title--Up%29++%28%40similar+198781+url--0b%2F0b994b7d73e0ccfd928bd1dfb2d02ce3--title--Monsters%2C+Inc.%29
+[2]: http://thenoisychannel.com/2010/04/04/guest-post-information-retrieval-using-a-bayesian-model-of-learning-and-generalization/
+[3]: http://en.wikipedia.org/wiki/Feature_(machine_learning)
+[4]: http://en.wikipedia.org/wiki/Bag_of_words
+[5]: http://imdb.cloudmining.net
+[6]: https://github.com/alexksikes/SimSearch/tree/master/tutorial/
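As a quick illustration of the item based query flow described above, here is a minimal sketch using the Python API this commit introduces (it assumes an index has already been built under ./sim-index and that the example item ids exist in it):

    import simsearch

    # load the computed index (CSR matrix and hyper parameters) from its path
    index = simsearch.ComputedIndex('./sim-index')
    handler = simsearch.QueryHandler(index)

    # query with a set of example items; results come back ranked by log score
    results = handler.query([111161, 107048], max_results=10)
    print results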
25 TODO
@@ -1,15 +1,24 @@
-[ ] implement other feature types besides bag of words.
+[*] separate feature creation from computed index
+
+[ ] incremental indexing
+ - use mode 'append' but the index needs to be recomputed
+
+[ ] distributed computation of the sparse multiplication
+ - use multi-processing module
+ - have workers compute a chunk of the matrix (a sequential list of items)
+ - merge sort each worker result
+ - across machines (not just cores), we need distributed indexes as well
+
+[ ] implement other feature types besides bag of words
+- some basic image features (color histogram)
[ ] for bag of words features:
-- mulitple features in one table
-- same feature value for different features.
-- normalize the feature values.
+- multiple features in one table
+- normalize the feature values
+- database agnostic
[ ] SSCursor is better to fetch lots of rows but still has problems:
http://stackoverflow.com/questions/337479/how-to-get-a-row-by-row-mysql-resultset-in-python
-[*] ad feature value information right into the index (ComputedIndex.index_to_feat)
-
-[ ] return only a restricted set of ids
-[ ] to speed things, we could actually only perform the matrix multiplication on he reamining ids
+[ ] to speed things up, we could actually only perform the matrix multiplication on the remaining ids
(either by looping over each item or by manipulating the matrix)
29 config_example.py
@@ -1,29 +0,0 @@
-# database parameters
-db_params = dict(user='user', passwd='password', db='dbname')
-
-# list of SQL queries to fetch the features from
-sql_features = [
- 'select id as item_id, word as feature from table',
- 'select id as item_id, word as feature from table2',
- '...'
-]
-
-# path to read or save the index
-index_path = './index.dat'
-
-# maximum number of items to match
-max_items = 10000
-
-# this is only used for sim_sphinx (see doc)
-def sphinx_setup(cl):
- # import sphinxapi
-
- # custom sorting function for the search
- # cl.SetSortMode(sphinxapi.SPH_SORT_EXPR, 'log_score_attr')
-
- # custom grouping function for the facets
- group_func = 'sum(log_score_attr)'
-
- # setup sorting and ordering of each facet
- for f in cl.facets:
- f.SetGroupFunc(group_func)
8 setup.py
@@ -3,17 +3,17 @@
from distutils.core import setup
long_description = '''
-Implementation of Bayesian Sets for fast similarity searches.
+Item based retrieval engine with Bayesian Sets.
'''
setup(name='SimSearch',
- version='0.2',
- description='Implementation of Bayesian Sets for fast similarity searches',
+ version='0.5',
+ description='Item based retrieval engine with Bayesian Sets',
author='Alex Ksikes',
author_email='alex.ksikes@gmail.com',
url='https://github.com/alexksikes/SimSearch',
download_url='https://github.com/alexksikes/SimSearch/tarball/master',
packages=['simsearch'],
long_description=long_description,
license='GPL'
-)
+)
10 simsearch/__init__.py
@@ -1,19 +1,15 @@
#!/usr/bin/env python
-#!/usr/bin/env python
-
-"""This is an implementation of Bayesian Sets as described in:
+"""SimSearch is an item based retrieval engine which implements Bayesian Sets:
http://www.gatsby.ucl.ac.uk/~heller/bsets.pdf
http://thenoisychannel.com/2010/04/04/guest-post-information-retrieval-using-a-bayesian-model-of-learning-and-generalization/
-
"""
-__version__ = '0.2'
+__version__ = '0.5'
__author__ = 'Alex Ksikes <alex.ksikes@gmail.com>'
__license__ = 'GPL'
-import bsets
from bsets import *
from simsphinx import *
-import utils
+from indexer import *
416 simsearch/bsets.py
@@ -1,347 +1,232 @@
"""This is module is an implementation of Bayesian Sets."""
-__all__ = ['Indexer', 'ComputedIndex', 'QueryHandler', 'load_index']
+__all__ = ['ComputedIndex', 'QueryHandler', 'load_index']
-from scipy import sparse
-from MySQLdb import cursors
-import numpy
-import MySQLdb
import random
-import cPickle as pickle
+import scipy
+from scipy import sparse
+import indexer
import utils
-logging = utils.basic_logger
+from utils import logger
-class Indexer(object):
- """"This class is used to index the list of features fetch from a database.
-
- It will create a computed index which can then be queried.
+
+class ComputedIndex(utils.Serializable):
+ """"This class represents a computed index.
+
+ A computed index contains the matrix in CSR format and all hyper parameters
+ already computed.
+
+ A computed index can then be queried using a QueryHandler object or saved
+ into a file.
"""
- def __init__(self, config):
- """Builds an indexer given a configuration file.
-
- The configuration file must have the db parameters and a list of sql queries
- to fetch the features from.
- """
- # get params from config dict ...
- self.db_params = config.db_params
- self.sql_features = config.sql_features
-
- def index_dataset(self, limit):
- """Indexes the data into a computed index.
-
- For now only bag of words are implemented.
- Override _make_datasets to index other feature types.
+ def __init__(self, index_path):
+ """ Creates a computed index from the path to an index.
"""
- # connect to database
- self._init_database()
-
- # make the dataset
- logging.info('Building the dataset from sql table ...')
- self._make_dataset(limit)
- logging.info('%.2f sec.', self._make_dataset.time_taken)
-
- # compute hyperparameters
- logging.info('Computing hyper parameters ...')
+ index = self._load_file_index(index_path)
+ self._create_indexes(index.ids, index.fts)
+ self._compute_matrix_to_csr(index.xco, index.yco)
self._compute_hyper_parameters()
- logging.info('%.2f sec.', self._compute_hyper_parameters.time_taken)
-
- def _init_database(self):
- self.db_params['use_unicode'] = True
- self.db_params['cursorclass'] = cursors.SSCursor
- self.db = MySQLdb.connect(**self.db_params)
-
- @utils.time_func
- def _make_dataset(self, limit=''):
- # initialize dataset variables
- X = {}
- self.item_id_to_index = {}
- self.index_to_item_id = {}
- self.feat_to_index = {}
- self.index_to_feat = {}
-
- # go through the features in our sql table
- limit = limit and 'limit %s' % limit or ''
- for sql in self.sql_features:
- cur = self.db.cursor()
- sql = '%s %s' % (sql, limit)
- logging.info('SQL: %s', sql)
- cur.execute(sql)
+ index.close()
- for i, (id_val, feat_val) in enumerate(cur):
- if id_val not in self.item_id_to_index:
- r = len(self.item_id_to_index)
- self.item_id_to_index[id_val] = r
- self.index_to_item_id[r] = id_val
- X[r] = []
-
- if feat_val not in self.feat_to_index:
- c = len(self.feat_to_index)
- self.feat_to_index[feat_val] = c
- self.index_to_feat[c] = feat_val
-
- r = self.item_id_to_index[id_val]
- c = self.feat_to_index[feat_val]
- X[r].append(c)
-
- cur.close()
-
- # cleaning up
- cur.close()
- self.db.close()
-
- # give some simple statistics
- logging.info('Done processing the dataset.')
- self.no_items = len(self.item_id_to_index)
- self.no_features = len(self.feat_to_index)
- logging.info('Number of items: %s', self.no_items)
- logging.info('Number of features: %s', self.no_features)
-
- # make a sparse matrix from the dataset
- logging.info('Constructing sparse matrix from dataset ...')
- self.X = sparse.lil_matrix((self.no_items, self.no_features))
- for r in X.keys():
- for c in X[r]:
- self.X[r,c] = 1
+ @utils.show_time_taken
+ def _load_file_index(self, index_path):
+ logger.info("Loading file index ...")
+ return indexer.FileIndex(index_path, mode='read')
+
+ @utils.show_time_taken
+ def _create_indexes(self, ids, fts):
+ logger.info("Creating indices ...")
+ self.item_id_to_index = dict(ids)
+ self.index_to_item_id = dict((i, id) for id, i in ids.iteritems())
+ self.index_to_feat = dict((i, ft) for ft, i in fts.iteritems())
+ self.no_items = len(ids)
+ self.no_features = len(fts)
+
+ @utils.show_time_taken
+ def _compute_matrix_to_csr(self, xco, yco):
+ logger.info("Creating CSR matrix ...")
+ data = scipy.ones(len(xco))
+ self.X = sparse.csr_matrix((data, (xco, yco)))
- # and convert it to csr for matrix operations
- logging.info('Converting sparse matrix to csr ...')
- self.X = self.X.tocsr()
-
- @utils.time_func
+ @utils.show_time_taken
def _compute_hyper_parameters(self, c=2):
+ logger.info("Computing hyper parameters ...")
self.mean = self.X.mean(0)
-
self.alpha = c * self.mean
self.beta = c * (1 - self.mean)
-
self.alpha_plus_beta = self.alpha + self.beta
- self.log_alpha_plus_beta = numpy.log(self.alpha_plus_beta)
-
- self.log_alpha = numpy.log(self.alpha)
- self.log_beta = numpy.log(self.beta)
-
- def get_computed_index(self):
- """Returns a computed index.
+ self.log_alpha_plus_beta = scipy.log(self.alpha_plus_beta)
+ self.log_alpha = scipy.log(self.alpha)
+ self.log_beta = scipy.log(self.beta)
- This must be called after index_dataset.
- """
- return ComputedIndex(
- no_items = self.no_items,
- X = self.X,
- item_id_to_index = self.item_id_to_index,
- index_to_item_id = self.index_to_item_id,
- alpha = self.alpha,
- beta = self.beta,
- alpha_plus_beta = self.alpha_plus_beta,
- log_alpha = self.log_alpha,
- log_beta = self.log_beta,
- db_params = self.db_params,
- index_to_feat = self.index_to_feat)
-
- def save_index(self, path):
- """Saves the computed index into a pickled file.
-
- This must be called after index_dataset.
- """
- self.get_computed_index().dump(path)
-class ComputedIndex(utils.Serializable):
- """"This class represents a computed index which is returned by indexer.
-
- A computed index contains the matrix in CSR format and all hyper paramters
- already computed.
-
- A computed index can then be queried using a QueryHandler object or saved
- into a file.
- """
- def __init__(self, no_items, X, item_id_to_index, index_to_item_id,
- alpha, beta, alpha_plus_beta, log_alpha, log_beta, db_params, index_to_feat):
- utils.auto_assign(self, locals())
-
- def dump(self, index_path):
- """Saves this index into a file.
- """
- logging.info('Saving the index to %s ...', index_path)
- dump = super(ComputedIndex, self).dump
- dump(index_path)
- logging.info('%.2f sec.', dump.time_taken)
-
- @staticmethod
- def load(index_path):
- """Load this picked index into an object.
- """
- logging.info('Loading index from %s in memory ...', index_path)
- load = utils.Serializable.load
- index = load(index_path)
- logging.info('%.2f sec.', load.time_taken)
-
- return index
-
- def get_sample_item_ids(self):
- """Returns some sample item ids from this index.
- """
- return [self.index_to_item_id[i] for i in random.sample(xrange(self.no_items), 10)]
-
class QueryHandler(object):
"""This class is used to query a computed index.
"""
- def __init__(self, computed_index, caching=False):
+ def __init__(self, computed_index):
utils.auto_assign(self, vars(computed_index))
- self.time_taken = 0
-
+ self.computed_index = computed_index
+ self.time = 0
+
def query(self, item_ids, max_results=100):
- """Query the given computed against the item ids.
+ """Queries the given computed against the given item ids.
"""
- # check the query is valid
+ item_ids = utils.listify(item_ids)
if not self.is_valid_query(item_ids):
return self.empty_results
-
- # make query vector
- logging.info('Computing the query vector ...')
+
+ logger.info('Computing the query vector ...')
self._make_query_vector()
- logging.info('%.2f sec.', self._make_query_vector.time_taken)
-
- # compute log scores
- logging.info('Computing log scores ...')
+ logger.info('Computing log scores ...')
self._compute_scores()
- logging.info('%.2f sec.', self._compute_scores.time_taken)
-
- # sort the results by log scores
- logging.info('Get the top %s log scores ...', max_results)
+ logger.info('Get the top %s log scores ...', max_results)
self._order_indexes_by_scores(max_results)
- logging.info('%.2f sec.', self._order_indexes_by_scores.time_taken)
-
+
return self.results
-
- def get_detailed_scores(self, query_item_ids, result_ids, max_terms=20):
- # if the query vector was not computed, we need to recompute it!
- if not hasattr(self, 'q'):
- if not self.is_valid_query(query_item_ids):
- return None
- else:
- logging.info('Computing the query vector ...')
- self._make_query_vector()
- logging.info('%.2f sec.', self._make_query_vector.time_taken)
- self.time_taken += self._make_query_vector.time_taken
-
- # computing deatailed scores for the chosen items
- logging.info('Computing detailed scores ...')
- scores = self._compute_detailed_scores(result_ids, max_terms)
- logging.info('%.2f sec.', self._compute_detailed_scores.time_taken)
+
+ def get_detailed_scores(self, item_ids, query_item_ids=None, max_terms=20):
+ """Returns detailed statistics about the matched items.
+
+ This will assume the same items previously queried unless otherwise
+ specified by 'query_item_ids'.
+ """
+ item_ids = utils.listify(item_ids)
+
+ logger.info('Computing detailed scores ...')
+ scores = self._compute_detailed_scores(item_ids, query_item_ids, max_terms)
- self.time_taken += self._compute_detailed_scores.time_taken
- return utils._O(scores=scores, time=self.time_taken)
+ self._update_time_taken()
+ return scores
+
+ def get_sample_item_ids(self):
+ """Returns some sample item ids from the index.
+ """
+ return [self.index_to_item_id[i] for i in random.sample(xrange(self.no_items), 10)]
- @utils.time_func
def is_valid_query(self, item_ids):
+ """Checks whether the item ids are within the index.
+ """
self.item_ids = item_ids
self._item_ids = [id for id in item_ids if id in self.item_id_to_index]
return self._item_ids != []
-
- @utils.time_func
+
+ @utils.show_time_taken
def _make_query_vector(self):
item_ids = self._item_ids
N = len(item_ids)
-
+
sum_xi = self.X[self.item_id_to_index[item_ids[0]]]
for id in item_ids[1:]:
sum_xi = sum_xi + self.X[self.item_id_to_index[id]]
-
+
alpha_bar = self.alpha + sum_xi
beta_bar = self.beta + N - sum_xi
- log_alpha_bar = numpy.log(alpha_bar)
- log_beta_bar = numpy.log(beta_bar)
-
- self.c = (self.alpha_plus_beta - numpy.log(self.alpha_plus_beta + N) + log_beta_bar - self.log_beta).sum()
+ log_alpha_bar = scipy.log(alpha_bar)
+ log_beta_bar = scipy.log(beta_bar)
+
+ self.c = (self.alpha_plus_beta - scipy.log(self.alpha_plus_beta + N)
+ + log_beta_bar - self.log_beta).sum()
self.q = log_alpha_bar - self.log_alpha - log_beta_bar + self.log_beta
-
- @utils.time_func
+
+ @utils.show_time_taken
def _compute_scores(self):
- self.scores = self.X * self.q.transpose()
- self.scores = numpy.asarray(self.scores).flatten()
- self.log_scores = self.c + self.scores
-
- @utils.time_func
+ scores = self.X * self.q.transpose()
+ scores = scipy.asarray(scores).flatten()
+ self.log_scores = self.c + scores
+
+ @utils.show_time_taken
def _order_indexes_by_scores(self, max_results=100):
if max_results == -1:
self.ordered_indexes = xrange(len(self.log_scores))
else:
self.ordered_indexes = utils.argsort_best(self.log_scores, max_results, reverse=True)
- logging.info('Got %s indexes ...', len(self.ordered_indexes))
+ logger.info('Got %s indexes ...', len(self.ordered_indexes))
- @utils.time_func
- def _compute_detailed_scores(self, item_ids, max_terms=20):
- """Returns detailed statistics about the matched items.
- """
+ @utils.show_time_taken
+ def _compute_detailed_scores(self, item_ids, query_item_ids=None, max_terms=20):
+ # if set to None we assume previously queried items
+ if query_item_ids is None:
+ query_item_ids = self.item_ids
+
+ # if the query vector is different than previously computed
+ # or not computed at all, we need to recompute it.
+ if not hasattr(self, 'q') or query_item_ids != self.item_ids:
+ if not self.is_valid_query(query_item_ids):
+ return []
+ else:
+ logger.info('Computing the query vector ...')
+ self._make_query_vector()
+
+ # computing the score for each item
scores = []
for id in item_ids:
if id not in self.item_id_to_index:
- scores.append(dict(total_score=0, scores=[]))
+ scores.append(utils._O(total_score=0, scores=[]))
continue
-
+
xi = self.X[self.item_id_to_index[id]]
xi_ind = xi.indices
feat = (self.index_to_feat[i] for i in xi_ind)
qi = self.q.transpose()[xi_ind]
- qi = numpy.asarray(qi).flatten()
-
+ qi = scipy.asarray(qi).flatten()
+
sc = sorted(zip(feat, qi), key=lambda x: (x[1], x[0]), reverse=True)
total_score = qi.sum()
-
- scores.append(dict(total_score=total_score, scores=sc[0:max_terms]))
+ scores.append(utils._O(total_score=total_score, scores=sc[0:max_terms]))
+
return scores
-
+
def _update_time_taken(self, reset=False):
- self.time_taken = (
+ self.time = (
+ getattr(self._make_query_vector, 'time_taken', 0)
+ getattr(self._compute_scores, 'time_taken', 0)
+ getattr(self._order_indexes_by_scores, 'time_taken', 0)
+ getattr(self._compute_detailed_scores, 'time_taken', 0)
)
-
+
@property
def results(self):
"""Returns the results as a ResultSet object.
-
+
This must be called after the index has been queried.
"""
self._update_time_taken()
-
+
def get_tuple_item_id_score(scores):
return [(self.index_to_item_id[i], scores[i]) for i in self.ordered_indexes]
#return ((self.index_to_item_id[i], scores[i]) for i in self.ordered_indexes)
-
+
return ResultSet(
- time = self.time_taken,
+ time = self.time,
total_found = len(self.ordered_indexes),
query_item_ids = self.item_ids,
_query_item_ids = self._item_ids,
log_scores = get_tuple_item_id_score(self.log_scores)
)
-
+
@property
def empty_results(self):
return ResultSet.get_empty_result_set(query_item_ids=self.item_ids, _query_item_ids=self._item_ids)
-
+
+
class ResultSet(utils.Serializable):
"""This class represents the results returned by a query handler.
-
+
It holds the log scores amongst other variables.
"""
def __init__(self, time, total_found, query_item_ids, _query_item_ids, log_scores):
utils.auto_assign(self, locals())
-
+
def __str__(self):
- s = 'You look for (after cleaning up) :\n%s \n' % '\n'.join(map(str, self._query_item_ids))
- s += 'Found %s in %.2f sec. \n' % (self.total_found, self.time)
- s += 'Best results found:\n'
- for id, log_score in self.log_scores[0:10]:
- s+= 'id = %s, log score = %s\n' % (id, log_score)
+ s = 'You looked for item ids (after cleaning up): %s \n' % ', '.join(map(str, self._query_item_ids))
+ s += 'Found %s in %.2f sec. (showing top 10 here):\n' % (self.total_found, self.time)
+ s += '\n'.join('id = %s, log score = %s' % (id, log_score)
+ for id, log_score in self.log_scores[0:10])
return s
-
+
@staticmethod
def get_empty_result_set(**kwargs):
o = dict(
@@ -353,29 +238,28 @@ def get_empty_result_set(**kwargs):
)
o.update(kwargs)
return ResultSet(**o)
-
-class Searcher(object):
- """Creates a query handler from a pickled computed index.
- """
- def __call__(self, computed_index_path):
- index = ComputedIndex.load(computed_index_path)
- return QueryHandler(index)
-
-def load_index(index_path, once=None):
- """Loads a pickled computed index into a ComputedIndex object.
+
+
+def search(index_path, item_ids):
+ """Load the index and then query it against the item ids.
"""
- if once and len(once):
- return once
- return ComputedIndex.load(index_path)
+ index = ComputedIndex(index_path)
+ return QueryHandler(index).query(item_ids)
-def load_index_to(index_path, to):
- """Loads a pickled computed index into a ComputedIndex object.
+
+def load_index(index_path, pickled=False):
+ """Loads a computed index given the path to an index.
+
+ If pickled is true, load from a pickled computed index file.
"""
- if n and len(once):
- return once
- return ComputedIndex.load(index_path)
+ if pickled:
+ index = ComputedIndex.load(index_path)
+ else:
+ index = ComputedIndex(index_path)
+ return index
+
-def handle_query(item_ids, computed_index, max_results=100):
+def query_index(item_ids, computed_index, max_results=100):
"""Queries a computed index against the item ids.
"""
- return QueryHandler(computed_index).query(item_ids, max_results)
+ return QueryHandler(computed_index).query(item_ids, max_results)
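For reference, the scoring that _make_query_vector and _compute_scores implement follows the Bayesian Sets paper linked in the README. In that notation (a sketch, up to a per-query constant that does not affect the ranking), a query of N items ranks every candidate item x with binary features x_j by:

    \mathrm{score}(x) \propto \sum_j x_j \, q_j, \qquad
    q_j = \log\bar\alpha_j - \log\alpha_j - \log\bar\beta_j + \log\beta_j,

    \bar\alpha = \alpha + \sum_{i \in \mathrm{query}} x_i, \qquad
    \bar\beta = \beta + N - \sum_{i \in \mathrm{query}} x_i,

where the hyper parameters are proportional to the column means of the feature matrix, as in _compute_hyper_parameters (alpha = c * mean, beta = c * (1 - mean) with c = 2 by default).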
219 simsearch/indexer.py
@@ -0,0 +1,219 @@
+"""This is module used to create similarity search indexes.
+
+An index is made of 4 files called .xco, .yco, .ids and .fts.
+The files .xco and .yco hold the x and y coordinates of the matrix. This
+matrix represents whether a particular item id has a particular feature.
+
+The file .ids is used to keep track of the matrix indices with respect to
+the item ids. The line number is the index in the matrix for the given
+item id. In a similar way, the file .fts is used to keep track
+of the features.
+"""
+
+__all__ = ['Indexer', 'BagOfWordsIter', 'FileIndex']
+
+import os
+import scipy
+from scipy import sparse
+import codecs
+import MySQLdb
+from MySQLdb import cursors
+
+import utils
+from utils import logger
+
+
+class Indexer(object):
+ def __init__(self, index, iter_features):
+ """ An indexer takes a FileIndex object and an iterator.
+
+ The iterator must return the couple (item id, feature). The item id
+ must be an integer, whereas the feature must be a unique string
+ representing the feature (utf8 encoded or a unicode).
+ """
+ if not isinstance(index, FileIndex):
+ index = FileIndex(index, 'write')
+ self.index = index
+ self.iter_features = iter_features
+
+ @utils.show_time_taken
+ def index_data(self):
+ with self.index:
+ for id, feat in self.iter_features:
+ self.index.add(id, feat)
+ self.show_stats()
+
+ def show_stats(self):
+ logger.info('Done processing the dataset.')
+ logger.info('Number of items: %s', len(self.index.ids))
+ logger.info('Number of features: %s', len(self.index.fts))
+
+
+class BagOfWordsIter(object):
+ """ This class implements the bag of words model and is passed to Indexer
+ object.
+ """
+ def __init__(self, db_params, sql_features, limit=0):
+ """ Takes the parameters of the database (only MySQL is supported for now)
+ and a list of SQL statements to fetch the data.
+
+ The SQL statements must select 2 fields, respectively the item id
+ and the keyword.
+ """
+ self.db_params = dict(use_unicode=True, cursorclass=cursors.SSCursor)
+ self.db_params.update(db_params)
+
+ self.db = MySQLdb.connect(**self.db_params)
+ self.sql_features = sql_features
+ if limit:
+ self.sql_features = ['%s limit %s' % (sql, limit)
+ for sql in sql_features]
+
+ def __iter__(self):
+ for sql in self.sql_features:
+ c = self.db.cursor()
+ logger.info('SQL: %s', sql)
+ c.execute(sql)
+ for id, feat in c:
+ if isinstance(feat, int) or isinstance(feat, long):
+ feat = utils._unicode(feat)
+ yield id, feat
+ c.close()
+ self.db.close()
+
+
+class FileIndex(object):
+ """ This class is used to manipulate the index.
+
+ The index can be opened in 3 different modes. The mode 'write' is used
+ to create the index. It will overwrite any other existing index.
+ The mode 'read' is used to load the index in memory. Finally the mode
+ 'append' appends data to an already existing index.
+ """
+ def __init__(self, index_path, mode='read', feat_enc='utf8'):
+ self.index_path = index_path
+ self.mode = mode
+ self.ids = {}
+ self.fts = {}
+
+ self.xco = []
+ self.yco = []
+ self.X = None
+
+ if mode not in ('read', 'append', 'write'):
+ raise Exception('Incorrect mode %s, choose read, write \
+ or append' % self.mode)
+
+ if mode == 'read':
+ self._read()
+ elif mode == 'append':
+ self._read()
+ self._open_index_files('append')
+ else:
+ if not os.path.exists(index_path):
+ os.makedirs(index_path)
+ self._open_index_files('write')
+
+ def _read(self):
+ self._open_index_files(mode='read')
+ for ext in ('ids', 'fts', 'xco', 'yco'):
+ self._read_index_file(ext)
+ if self.mode == 'append':
+ self._make_coo()
+ self._close_index_files()
+
+ @utils.show_time_taken
+ def _make_coo(self):
+ logger.info('Making coordinate matrix for append ...')
+ data = scipy.ones(len(self.xco))
+ self.X = sparse.csr_matrix((data, (self.xco, self.yco)))
+
+ def add(self, id, feat):
+ """ Adds the given (id, feature) to the index.
+
+ The id must be an int and the feature must be a unique string representation
+ of the feature. The feature is expected to be unicode or utf8 encoded.
+
+ This method does not check whether (id, feature) has already been inserted
+ to the index.
+ """
+ if not self._check_input(id, feat):
+ return
+ feat = utils._unicode(feat)
+ if id not in self.ids:
+ x = len(self.ids)
+ self.ids[id] = x
+ self.fids.write('%s\n' % id)
+ if feat not in self.fts:
+ y = len(self.fts)
+ self.fts[feat] = y
+ self.ffts.write('%s\n' % feat)
+ (x, y) = (self.ids[id], self.fts[feat])
+ if not self._in_coo(x, y):
+ self.fxco.write('%s\n' % x)
+ self.fyco.write('%s\n' % y)
+
+ def close(self):
+ self._close_index_files()
+
+ def _in_coo(self, x, y):
+ in_coo = False
+ if self.mode == 'append':
+ try:
+ in_coo = bool(self.X[x, y])
+ except IndexError:
+ pass
+ return in_coo
+
+ def _check_input(self, id, feat):
+ success = False
+ if self.mode == 'read':
+ raise Exception('Can\'t write to read only index!')
+ elif id is None:
+ logger.warn('Undefined item id ... skipping.')
+ elif not isinstance(id, (int, long)):
+ raise Exception('List of ids must be integers!')
+ elif feat is None:
+ logger.warn('Undefined feature for item %s ... skipping.' % id)
+ elif not isinstance(feat, basestring):
+ logger.warn('Feature "%s" is not a string or unicode ... converting.' % feat)
+ success = True
+ else:
+ success = True
+ return success
+
+ def _open_index_files(self, mode='read'):
+ mode = dict(write='wb', append='ab', read='rb')[mode]
+ self.fxco = self._new_index_file_handle('xco', mode)
+ self.fyco = self._new_index_file_handle('yco', mode)
+ self.fids = self._new_index_file_handle('ids', mode)
+ self.ffts = self._new_index_file_handle('fts', mode)
+
+ def _close_index_files(self):
+ for f in ('fxco', 'fyco', 'fids', 'ffts'):
+ if hasattr(self, f):
+ getattr(self, f).close()
+
+ def _new_index_file_handle(self, ext, mode='rb'):
+ if ext == 'fts':
+ return codecs.open(os.path.join(self.index_path, '.'+ext), mode, encoding='utf8')
+ else:
+ return open(os.path.join(self.index_path, '.'+ext), mode)
+
+ @utils.show_time_taken
+ def _read_index_file(self, ext):
+ f = self.__dict__['f'+ext]
+ logger.info('Reading file %s ...' % f.name)
+ if ext == 'fts':
+ vals = f.read().split('\n')[:-1]
+ else:
+ vals = scipy.fromfile(f, sep='\n', dtype=scipy.int32)
+ if ext == 'fts' or ext == 'ids':
+ vals = dict((v, i) for i, v in enumerate(vals))
+ self.__dict__[ext] = vals
+
+ def __enter__(self):
+ return self
+
+ def __exit__(self, type, value, traceback):
+ self._close_index_files()
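Since the Indexer only needs an iterable of (item id, feature) couples, indexing is not tied to MySQL or to the bundled BagOfWordsIter. A minimal sketch with a made-up in-memory list of pairs (the index path and the data are placeholders):

    import simsearch

    # any iterable yielding (int item_id, string feature) couples will do
    pairs = [(1, 'comedy'), (1, 'road trip'), (2, 'comedy'), (2, 'robots')]

    index = simsearch.FileIndex('./toy-index', mode='write')
    simsearch.Indexer(index, pairs).index_data()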
222 simsearch/simsphinx.py
@@ -1,165 +1,164 @@
"""Wraps a Sphinx client with similarity search functionalities."""
-__all__ = ['SimSphinxWrap', 'QuerySimilar', 'QueryTermSimilar']
+__all__ = ['SimClient', 'QuerySimilar', 'QueryTermSimilar']
+import utils
import re
+import copy
import sphinxapi
-from fsphinx import MultiFieldQuery, QueryTerm, QueryParser, FSphinxClient, \
- queries, CacheIO
-import bsets
+from fsphinx import *
+from bsets import QueryHandler
-class SimSphinxWrap(sphinxapi.SphinxClient):
+class SimClient(FSphinxClient):
"""Creates a wrapped sphinx client together with a computed index.
+
+ The computed index is queried if a similarity search query is encountered.
- The computed index is queried if a similary search query is encoutered.
- In this case the the function sphinx_setup is called in order to reset
- the wrapped sphinx client.
-
- The log_score of each item is found in the Sphinx attribute "log_score_attr".
+ The log_score of each item is found in the Sphinx attribute "log_score_attr".
It must be set to 1 and declared as a float in your Sphinx configuration file.
"""
- def __init__(self, computed_index, cl=None, sphinx_setup=None, max_items=1000):
- self.sim = bsets.QueryHandler(computed_index)
- self.sphinx_setup = sphinx_setup
- self.max_items = max_items
- self.query_parser = QueryParser(QuerySimilar)
+ def __init__(self, query_handler=None, cl=None, **opts):
+ FSphinxClient.__init__(self)
+ # default query parser
+ self.AttachQueryParser(QueryParser(QuerySimilar))
+ # set the query handler for simsearch
+ self.SetQueryHandler(query_handler)
+ # default sorting function
+ self.SetSortMode(sphinxapi.SPH_SORT_EXPR, 'log_score_attr')
+ # initiate from an existing client
if cl:
- self.Wrap(cl)
- else:
- self.wrap_cl = None
-
- def __getattr__(self, name):
- return getattr(self.wrap_cl, name)
-
- def Wrap(self, cl):
- self.wrap_cl = cl
- if hasattr(cl, 'query_parser'):
- user_sph_map = cl.query_parser.kwargs.get('user_sph_map', {})
- else:
- user_sph_map = {}
- self.query_parser = QueryParser(QuerySimilar, user_sph_map=user_sph_map)
-
- return self
-
- def SetMaxItems(self, max_items):
- """Set the maximum number of items to match.
+ self.SetSphinxClient(cl)
+ # some default options
+ self._max_items = opts.get('max_items', 1000)
+ self._max_terms = opts.get('max_terms', 20)
+ self._exclude_queried = opts.get('exclude_queried', True)
+ self._allow_empty = opts.get('allow_empty', True)
+ if self._allow_empty:
+ QuerySimilar.ALLOW_EMPTY = True
+
+ def SetSphinxClient(self, cl):
+ """Use this method to wrap the sphinx client.
"""
- self.max_items = max_items
-
- def SetSphinxSetup(self, setup):
- """Set the setup function which will be triggered in similarity search
- on the sphinx client.
-
- This function takes a sphinx client and operates on it in order to
- change sorting mode or ranking etc ...
-
- The Sphinx attribute "log_score_attr" holds each item log score.
+ # prototype pattern, create based on existing instance
+ self.__dict__.update(copy.deepcopy(cl).__dict__)
+ if hasattr(cl, 'query_parser'):
+ if hasattr(cl.query_parser, 'user_sph_map'):
+ self.query_parser = QueryParser(
+ QuerySimilar, user_sph_map=cl.query_parser.user_sph_map)
+
+ def SetQueryHandler(self, query_handler):
+ """Sets the query handler to perform the similarity search.
"""
- self.sphinx_setup = setup
-
- def Query(self, query, index='*', comment=''):
+ self.query_handler = query_handler
+
+ def Query(self, query, index='', comment=''):
"""If the query has item ids perform a similarity search query otherwise
perform a normal sphinx query.
"""
- # parse the query which is assumed to be a string
- self.query = self.query_parser.Parse(query)
- self.time_similarity = 0
+ # first let's parse the query if possible
+ if isinstance(query, basestring):
+ query = self.query_parser.Parse(query)
+ self.query = query
+ # now let's get the item ids
item_ids = self.query.GetItemIds()
if item_ids:
# perform similarity search on the set of query items
log_scores = self.DoSimQuery(item_ids)
# setup the sphinx client with log scores
self._SetupSphinxClient(item_ids, dict(log_scores))
- # perform the Sphinx query
- hits = self.DoSphinxQuery(self.query, index, comment)
-
- if item_ids:
- # add detailed scoring information
- self._AddStats(hits, item_ids)
+ # perform the normal Sphinx query
+ hits = FSphinxClient.Query(self, query, index, comment)
- # and other statitics
- hits['time_similarity'] = self.time_similarity
-
- return hits
+ # reset filters for subsequent queries
+ self.ResetOverrides()
+ self.ResetFilters()
+ # add detailed scoring information to each match
+ self._AddStats(item_ids)
+
+ # keep expected return of SphinxClient
+ return self.hits
+
@CacheIO
def DoSimQuery(self, item_ids):
- """Performs the actual simlarity search query.
+ """Performs the actual similarity search query.
"""
- results = self.sim.query(item_ids, self.max_items)
- self.time_similarity = results.time
-
+ results = self.query_handler.query(item_ids, self._max_items)
return results.log_scores
- def DoSphinxQuery(self, query, index='*', comment=''):
- """Peforms a normal sphinx query.
- """
- if isinstance(self.wrap_cl, FSphinxClient):
- return self.wrap_cl.Query(query)
- else:
- # check we don't loose the parsed query
- return self.wrap_cl.Query(query.sphinx)
-
def _SetupSphinxClient(self, item_ids, log_scores):
- # if the setup is in a configuration file
- if self.sphinx_setup:
- self.sphinx_setup(self.wrap_cl)
-
- # override log_score_attr and exclude selected ids
- self.wrap_cl.SetOverride('log_score_attr', sphinxapi.SPH_ATTR_FLOAT, log_scores)
- self.wrap_cl.SetFilter('@id', item_ids, exclude=True)
+ # override the log_score_attr attributes with its value
+ self.SetOverride('log_score_attr', sphinxapi.SPH_ATTR_FLOAT, log_scores)
+ # exclude query item ids from results
+ if self._exclude_queried:
+ self.SetFilter('@id', item_ids, exclude=True)
+ # allow full scan on empty query but restrict to non zero log scores
+ if not self.query.sphinx and self._allow_empty:
+ self.SetFilterFloatRange('log_score_attr', 0.0, 1.0, exclude=True)
+
+ def _AddStats(self, query_item_ids):
+ scores = []
+ ids = [match['id'] for match in self.hits['matches']]
+ if ids:
+ scores = self._GetDetailedScores(ids, query_item_ids)
+ for scores, match in zip(scores, self.hits['matches']):
+ match['attrs']['@sim_scores'] = scores.scores
+ self.hits['time_similarity'] = self.query_handler.time
- # only hits with non zero log scores are considered if the query is empty
- QuerySimilar.ALLOW_EMPTY = True
- if not self.query.sphinx:
- self.wrap_cl.SetFilterFloatRange('log_score_attr', 0.0, 1.0, exclude=True)
-
- def _AddStats(self, sphinx_results, item_ids):
- scores = self._GetDetailedScores(item_ids,
- [match['id'] for match in sphinx_results['matches']])
- for scores, match in zip(scores, sphinx_results['matches']):
- match['@sim_scores'] = scores
-
@CacheIO
- def _GetDetailedScores(self, query_item_ids, result_ids, max_terms=20):
- scores = self.sim.get_detailed_scores(query_item_ids, result_ids, max_terms)
- self.time_similarity = scores.time
-
- return scores.scores
+ def _GetDetailedScores(self, ids, query_item_ids):
+ return self.query_handler.get_detailed_scores(ids, query_item_ids, max_terms=self._max_terms)
+ def Clone(self, memo={}):
+ """Creates a copy of this client.
+
+ This makes sure the whole index is not recopied.
+ """
+ return self.__deepcopy__(memo)
+
+ def __deepcopy__(self, memo):
+ cl = self.__class__()
+ attrs = utils.save_attrs(self,
+ [a for a in self.__dict__ if a not in ['query_handler']])
+ utils.load_attrs(cl, attrs)
+ if self.query_handler:
+ computed_index = self.query_handler.computed_index
+ cl.SetQueryHandler(QueryHandler(computed_index))
+ return cl
+
+
class QueryTermSimilar(QueryTerm):
- """This is like an fSphinx multi-field query but with the representation of
+ """This is like an fSphinx multi-field query but with the representation of
a query for similar items.
-
+
A query for a similar item uses the special field @similar followed by the
item id and some extra terms.
-
+
Here is an example of a query to look up for the author "Alex Ksikes" and
- the item similar to the item with id "1234". The variable "Machine Learing"
- is passed along.
-
+ the item similar to the item with id "1234". The variable "Machine Learning"
+ is passed along.
+
(@author alex ksikes) (@similar 1234--"Machine Learning")
"""
p_item_id = re.compile('\s*(\d+)(?:--)?')
p_extra = re.compile('--(.+?)(?=--|$)', re.I|re.U)
-
+
def __init__(self, status, term):
QueryTerm.__init__(self, status, 'similar', term)
self.item_id = QueryTermSimilar.p_item_id.search(term).group(1)
self.extra = QueryTermSimilar.p_extra.findall(term)
-
+
def GetExtraStr(self):
"""Returns a string representation of the extra items.
- """
+ """
return '--'.join(self.extra.items())
-
+
@property
def sphinx(self):
return ''
-
+
@property
def uniq(self):
if self.status in ('', '+'):
@@ -173,19 +172,20 @@ def uniq(self):
def __hash__(self):
return hash((self.user_field, self.item_id))
+
class QuerySimilar(MultiFieldQuery):
"""Used internally by a query term similar query.
-
- These query terms may be created from a match object or its string representation.
+
+ These query terms may be created from a match object or its string representation.
"""
@queries.ChangeQueryTerm
def AddQueryTerm(self, query_term):
if query_term.user_field == 'similar':
query_term = QueryTermSimilar(query_term.status, query_term.term)
MultiFieldQuery.AddQueryTerm(self, query_term)
-
+
def GetItemIds(self):
"""Returns the item ids of this query term.
"""
- return map(int, (qt.item_id for qt in self if qt.user_field == 'similar'
+ return map(int, (qt.item_id for qt in self if qt.user_field == 'similar'
and qt.status in ('', '+')))
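To tie the pieces together, here is a rough sketch of how a SimClient might be set up and queried (the index path, server settings and item id are placeholders; a working end to end setup also needs the Sphinx side configured as noted below):

    import simsearch

    # wrap the computed similarity index in a query handler
    handler = simsearch.QueryHandler(simsearch.ComputedIndex('./sim-index'))

    # SimClient behaves like a regular Sphinx client
    cl = simsearch.SimClient(handler, max_items=1000)
    cl.SetServer('localhost', 9312)

    # keywords are matched by Sphinx, results are re-ranked by similarity to item 111161
    hits = cl.Query('(@similar 111161) drama')

On the Sphinx side, log_score_attr must be selected as 1 and declared as a float attribute in the data source, along the lines of the snippet that used to live in the README:

    # log_score_attr must be set to 1
    sql_query = select *, 1 as log_score_attr from table

    # log_score_attr will hold the log scores after item based search
    sql_attr_float = log_score_attr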
119 simsearch/utils.py
@@ -1,39 +1,52 @@
import time
import threading
import logging
-import numpy as np
+import scipy
import random
import cPickle as pickle
+import os
+import copy
-def time_func(func):
- setattr(func, 'time_taken', 0)
+
+def get_basic_logger():
+ logging.basicConfig(
+ level=logging.INFO,
+ format='%(asctime)s - %(levelname)s - %(message)s')
+ return logging.getLogger()
+logger = get_basic_logger()
+
+
+def show_time_taken(func):
def new(*args, **kw):
start = time.time()
res = func(*args, **kw)
timed = time.time() - start
setattr(new, 'time_taken', timed)
+ logger.info('%.2f sec.', timed)
return res
return new
+
class Serializable(object):
- @time_func
+ @show_time_taken
def dump(self, path):
return pickle.dump(self, open(path, 'wb'), -1)
-
- @time_func
+
+ @show_time_taken
def dumps(self):
return pickle.dumps(self, -1)
-
+
@staticmethod
- @time_func
+ @show_time_taken
def load(path):
return pickle.load(open(path, 'rb'))
@staticmethod
- @time_func
+ @show_time_taken
def loads(ser_str):
return pickle.loads(ser_str)
+
class ThreadingMixIn:
def process_request_thread(self, request, client_address):
try:
@@ -42,19 +55,20 @@ def process_request_thread(self, request, client_address):
except:
self.handle_error(request, client_address)
self.close_request(request)
-
+
def process_request(self, request, client_address):
"""Start a new thread to process the request."""
- t = threading.Thread(target = self.process_request_thread,
- args = (request, client_address))
+ t = threading.Thread(
+ target = self.process_request_thread,
+ args = (request, client_address))
t.start()
-def get_basic_logger():
- logging.basicConfig(level=logging.DEBUG,
- format='%(asctime)s - %(levelname)s - %(message)s')
-
- return logging.getLogger()
-basic_logger = get_basic_logger()
+
+def listify(l):
+ if not isinstance(l, list):
+ l = [l]
+ return l
+
# from webpy
def auto_assign(self, locals):
@@ -75,23 +89,25 @@ def __init__(self, foo, bar, baz=1): autoassign(self, locals())
continue
setattr(self, key, value)
+
# from tornado web
def _utf8(s):
if isinstance(s, unicode):
- return s.encode("utf-8")
- assert isinstance(s, str)
+ s = s.encode("utf-8")
+ elif not isinstance(s, str):
+ s = str(s)
return s
+
# from tornado web
def _unicode(s):
if isinstance(s, str):
- try:
- return s.decode("utf-8")
- except UnicodeDecodeError:
- raise HTTPError(400, "Non-utf8 argument")
- assert isinstance(s, unicode)
+ s = s.decode("utf-8")
+ elif not isinstance(s, unicode):
+ s = str(s).decode("utf-8")
return s
-
+
+
# from tornado web
def _time_independent_equals(a, b):
if len(a) != len(b):
@@ -100,7 +116,8 @@ def _time_independent_equals(a, b):
for x, y in zip(a, b):
result |= ord(x) ^ ord(y)
return result == 0
-
+
+
# from tornado web
class _O(dict):
"""Makes a dictionary behave like an object."""
@@ -109,41 +126,61 @@ def __getattr__(self, name):
return self[name]
except KeyError:
raise AttributeError(name)
-
def __setattr__(self, name, value):
self[name] = value
-
-def parse_config_file(path, **kwargs):
+
+
+def parse_config_file(path, **opts):
cf = {}
execfile(path, cf, cf)
- cf.update(**kwargs)
-
+ cf.update(**opts)
return _O(cf)
-@time_func
+
def argsort_best(arr, best_k, reverse=False):
- """Fast computation of the best k elements in an array using a simple randomized
+ """Fast computation of the best k elements in an array using a simple randomized
algorithm.
- """
+ """
def get_best_threshold(arr, threshold=0, sample_size=1000):
if len(arr) >= sample_size:
sample = random.sample(arr, sample_size)
- new_threshold = np.mean(sample)
+ new_threshold = scipy.mean(sample)
else:
new_threshold = arr.mean()
-
+
new_arr = arr[(arr >= new_threshold).nonzero()[0]]
if len(new_arr) <= best_k:
return threshold
if new_threshold == threshold:
return threshold
else:
return get_best_threshold(new_arr, new_threshold)
-
+
threshold = get_best_threshold(arr)
best_indexes = (arr >= threshold).nonzero()[0]
-
+
if (arr[best_indexes] == threshold).all():
best_indexes = best_indexes[:best_k]
-
- return np.array(sorted(best_indexes, key=lambda i: arr[i], reverse=reverse))[:best_k]
+
+ return scipy.array(sorted(best_indexes, key=lambda i: arr[i], reverse=reverse))[:best_k]
+
+
+def get_all_sub_dirs(path):
+ paths = []
+ d = os.path.dirname(path)
+ while d not in ('', '/'):
+ paths.append(d)
+ d = os.path.dirname(d)
+ if '.' not in paths:
+ paths.append('.')
+ return paths
+
+
+def save_attrs(obj, attr_names):
+ return dict((k, copy.deepcopy(v)) for k, v in obj.__dict__.items() if k in attr_names)
+
+
+def load_attrs(obj, attrs):
+ for k, v in attrs.items():
+ if k in obj.__dict__:
+ obj.__dict__[k] = v
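The renamed show_time_taken decorator both logs the elapsed time and keeps the last measurement on the wrapped function, which is how the tools report timings. A tiny sketch (the decorated function is a made-up example):

    from simsearch import utils

    @utils.show_time_taken
    def build_something(n):
        # placeholder workload
        return sum(xrange(n))

    build_something(10 ** 6)           # logs something like "0.05 sec."
    print build_something.time_taken   # last measured duration in seconds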
33 tests/test_argsort_best.py
@@ -1,38 +1,45 @@
import numpy as np
-import random
import sys
from simsearch import utils
-@utils.time_func
+
+@utils.show_time_taken
def argsort(arr):
arr.argsort(0)
-def test(arr, k):
- best_indexes = utils.argsort_best(arr, k, reverse=True)
+
+@utils.show_time_taken
+def argsort_best(arr, best_k, reverse=False):
+ return utils.argsort_best(arr, best_k, reverse)
+
+def test(arr, k):
+ best_indexes = argsort_best(arr, k, reverse=True)
+
print 'Array = %s' % arr
print 'Best indexes = %s' % best_indexes
print 'Best elements = %s' % arr[best_indexes]
print 'Number of indexes = %s' % len(best_indexes)
print 'Best element = %s' % np.max(arr)
- print 'Took %.2f sec.' % utils.argsort_best.time_taken
-
+ print 'Took %.2f sec.' % argsort_best.time_taken
+
argsort(arr)
- print 'To be compared with full sorting takes %.2f sec.' % argsort.time_taken
+ print 'To be compared with full sorting takes %.2f sec.' % argsort.time_taken
+
def main(arr_size, k):
arr = np.array(xrange(arr_size))
test(arr, k)
-
+
arr = np.random.sample(arr_size)
test(arr, k)
-
- arr = np.ones(arr_size)
+
+ arr = np.ones(arr_size)
test(arr, k)
-
+
if __name__ == '__main__':
if len(sys.argv) != 3:
- print 'Usage: python %s size_array number_of_k_elements' % sys.argv[0]
+ print 'Usage: python %s size_array number_of_k_elements' % sys.argv[0]
else:
- main(*map(int, sys.argv[1:]))
+ main(*map(int, sys.argv[1:]))
11 tests/test_simsphinx.py
@@ -1,11 +0,0 @@
-from simsearch import simsphinx
-
-s = '(@similar 6876876--sad--asd--asd--saddas) @-genre 3423423 @-year 2009'
-q = simsphinx.QuerySimilar()
-
-q.Parse(s)
-
-print repr(q)
-print q.user
-print q.sphinx
-print q.uniq
47 tools/client.py
@@ -1,47 +0,0 @@
-# Author: Alex Ksikes
-
-import urllib
-
-from simsearch import bsets
-
-class SimilaritySearchClient:
- def __init__(self, server_port=8000):
- self.base_url = 'http://localhost:%s/?' % server_port
-
- def query(self, item_ids):
- url = self.base_url + '&'.join('similar=%s' % id for id in item_ids)
- txt = urllib.urlopen(url).read()
- return bsets.ResultSet.loads(txt)
-
-def usage():
- print 'Usage: python client.py [options]'
- print
- print 'Description:'
- print ' Query a similarity index served on the default port 8000.'
- print
- print '-p, --port <number> -> query an index served in a different port'
- print '-h, --help -> this help message'
-
-import sys, getopt
-def main():
- try:
- opts, args = getopt.getopt(sys.argv[1:], 'p:h', ['port=', 'help'])
- except getopt.GetoptError:
- usage(); sys.exit(2)
-
- port = 8000
- for o, a in opts:
- if o in ('-p', '--port'):
- port = int(a)
- elif o in ('-h', '--help'):
- usage(); sys.exit()
-
- cl = SimilaritySearchClient(port)
-
- while(True):
- print 'Enter some item ids:'
- item_ids = map(int, raw_input().split())
- print cl.query(item_ids)
-
-if __name__ == "__main__":
- main()
56 tools/index_features.py
@@ -1,45 +1,57 @@
#! /usr/bin/env python
+import sys
+import getopt
+import simsearch
+
+from simsearch import utils
+
+
+def make_index(config_path, **opts):
+ opts = utils.parse_config_file(config_path, **opts)
+ index = simsearch.FileIndex(opts.index_path, mode=opts.mode)
+ iter_feat = simsearch.BagOfWordsIter(opts.db_params, opts.sql_features, opts.get('limit', 0))
+ simsearch.Indexer(index, iter_feat).index_data()
-from simsearch import bsets, utils
-def make_index(config_path, index_name, **options):
- cf = utils.parse_config_file(config_path, **options)
-
- indexer = bsets.Indexer(cf)
- indexer.index_dataset(cf.get('limit'))
- indexer.save_index(index_name)
-
def usage():
print 'Usage: python index_features.py [options] config_path'
print
- print 'Description:'
- print ' Creates a similarity search index called "index.dat" given a config file.'
- print
- print 'Options:'
- print ' -o, --out : different index name than "./index.dat"'
+ print 'Description:'
+ print ' Creates a similarity search index given a configuration file.'
+ print ' This uses bag of words features and will create an index called'
+ print '    ./sim-index/ unless otherwise specified.'
+ print
+ print 'Options:'
+ print ' -o, --out : path to the index (default ./sim-index/)'
+ print '  -m, --mode  : "write" or "append" to the index (default write)'
print ' -l, --limit : loop only over the first "limit" number of items'
print ' -h, --help : this help message'
-import sys, getopt
+
def main():
try:
- opts, args = getopt.getopt(sys.argv[1:], 'o:l:h', ['out=', 'limit=', 'help'])
+ opts, args = getopt.getopt(sys.argv[1:],
+ 'o:m:v:l:h',
+ ['out=', 'mode=', 'verbose=', 'limit=', 'help'])
except getopt.GetoptError:
usage(); sys.exit(2)
-
- index_name, options = 'index.dat', {}
+
+ _opts = dict(index_path='sim-index', mode='write')
for o, a in opts:
if o in ('-o', '--out'):
- index_name = a
+ _opts['index_path'] = a
+ if o in ('-m', '--mode'):
+ if a in ('append', 'write'):
+ _opts['mode'] = a
elif o in ('-l', '--limit'):
- options['limit'] = int(a)
+ _opts['limit'] = int(a)
elif o in ('-h', '--help'):
usage(); sys.exit()
-
+
if len(args) < 1:
usage()
else:
- make_index(args[0], index_name, **options)
-
+ make_index(args[0], **_opts)
+
if __name__ == '__main__':
main()
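For illustration, a minimal configuration file for index_features.py could look like the sketch below; the database credentials, table and column names are placeholders, and the variables mirror what make_index and BagOfWordsIter read (db_params, sql_features, optionally limit). It would then be run with something like `python tools/index_features.py -o ./sim-index config.py`.

    # config.py -- all values below are placeholders
    db_params = dict(user='fsphinx', passwd='fsphinx', db='fsphinx')

    # each statement must select 2 fields: the item id and the feature
    sql_features = [
        'select imdb_id, plot_keyword from plot_keywords'
    ]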
42 tools/query_index.py
@@ -1,64 +1,68 @@
#! /usr/bin/env python
+import sys
+import getopt
+import simsearch
-from simsearch import bsets
def query(index_path, matching_keywords=False):
- computed_index = bsets.ComputedIndex.load(index_path)
- query_handler = bsets.QueryHandler(computed_index)
-
+ index = simsearch.ComputedIndex(index_path)
+ query_handler = simsearch.QueryHandler(index)
+
while(True):
- sample_ids = ' '.join(map(str, computed_index.get_sample_item_ids()))
+ sample_ids = ' '.join(map(str, query_handler.get_sample_item_ids()))
print '>> Enter some item ids: (try %s)' % sample_ids
-
+
item_ids = map(int, raw_input().split())
result_set = query_handler.query(item_ids, max_results=10000)
-
+
print result_set
-
+
if matching_keywords:
ids = [id for id, sc in result_set.log_scores][0:10]
show_matching_keywords(ids, query_handler)
+
def show_matching_keywords(ids, query_handler):
item_scores = query_handler.get_detailed_scores(ids)
-
+
print 'Top matching keywords (%.2f sec.):' % \
query_handler.get_detailed_scores.time_taken
-
+
for scores, id in zip(item_scores, ids):
print '*' * 80
print 'id = %s' % id
- print ' '.join('%s - %.2f' % (t, s) for t, s in scores['scores'])
+ print ' '.join('%s - %.2f' % (t, s) for t, s in scores['scores'])
print '*' * 80
-
+
+
def usage():
print 'Usage: python query_index.py index_path'
print
- print 'Description:'
+ print 'Description:'
print ' Load and then query a similarity search index.'
- print
- print 'Options:'
+ print
+ print 'Options:'
print ' -v, --verbose : also show matching keywords'
print ' -h, --help : this help message'
-import sys, getopt
+
def main():
try:
opts, args = getopt.getopt(sys.argv[1:], 'vh', ['verbose', 'help'])
except getopt.GetoptError:
usage(); sys.exit(2)
-
+
verbose = False
for o, a in opts:
if o in ('-v', '--verbose'):
verbose = True
elif o in ('-h', '--help'):
usage(); sys.exit()
-
+
if len(args) < 1:
usage()
else:
query(args[0], verbose)
-
+
if __name__ == '__main__':
main()
94 tools/server.py
@@ -1,94 +0,0 @@
-# Author: Alex Ksikes
-
-import re
-from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler
-
-from simsearch import bsets, utils
-logging = utils.basic_logger
-
-class RequestHandler(BaseHTTPRequestHandler):
- p_url = re.compile('similar=(\d+)')
-
- def _writeheaders(self):
- self.send_response(200)
- self.send_header("Content-type", "text/html; charset=utf-8")
- self.end_headers()
-
- def do_GET(self):
- cls = RequestHandler
- item_ids = map(int, cls.p_url.findall(self.path))
-
- self._writeheaders()
- if item_ids:
- results = self.do_query(item_ids)
- self.wfile.write(results.dumps())
-
- def do_query(self, item_ids):
- sever_cls = SimilaritySearchServer
- return bsets.QueryHandler(sever_cls.computed_index).query(item_ids, sever_cls.config.max_items)
-
-class SimilaritySearchServer(utils.ThreadingMixIn, HTTPServer):
- allow_reuse_address = 1
-
- def __init__(self, index_path, config=dict()):
- self.index_path = index_path
- self.server_port = config.get('server_port', 8000)
- SimilaritySearchServer.config = config
-
- server_address = ('', self.server_port)
- HTTPServer.__init__(self, server_address, RequestHandler)
-
- def load_index(self):
- cls = SimilaritySearchServer
- cls.computed_index = bsets.ComputedIndex.load(self.index_path)
-
- def serve_forever(self):
- logging.info('Listening on port %s', self.server_port)
- HTTPServer.serve_forever(self)
- # or? ThreadingMixIn.serve_forever()
-
-def run_server(index_path, config):
- server = SimilaritySearchServer(index_path, config)
- server.load_index()
- server.serve_forever()
-
-def usage():
- print 'Usage: python server.py [options] index_path'
- print
- print 'Description:'
- print ' Serve a given similarity search index.'
- print
- print '-c, --config <path> -> use the given configuration file'
- print '-p, --port <number> -> serve on port (default of 8000)'
- print
- print '-h, --help -> this help message'
-
-
-import sys, getopt
-def main():
- try:
- opts, args = getopt.getopt(sys.argv[1:], 'c:p:h', ['config=', 'port=', 'help'])
- except getopt.GetoptError:
- usage(); sys.exit(2)
-
- config_path = ''
- options = dict(server_port=8000, max_items=10000)
- for o, a in opts:
- if o in ('-c', '--config'):
- config_path = a
- elif o in ('-p', '--port'):
- options['server_port'] = int(a)
- elif o in ('-h', '--help'):
- usage(); sys.exit()
-
- if config_path:
- cf = utils.parse_config_file(config_path, **options)
- else:
- cf = utils._O(options)
- if len(args) < 1:
- usage()
- else:
- run_server(args[0], cf)
-
-if __name__ == "__main__":
- main()
View
240 tutorial/README.md
@@ -0,0 +1,240 @@
+In this tutorial, we will show how to use SimSearch to find similar movies. The dataset is taken from a scrape of the top 400 movies found on IMDb. We assume the current working directory to be the "tutorial" directory. All the code samples can be found in the file "./test.py".
+
+Loading the Data in the Database
+--------------------------------
+
+The first thing we need is some data. The dataset in this example is the same as the one featured in the [fSphinx tutorial][0]. If you don't already have it, create a MySQL database called "fsphinx" with user and password "fsphinx".
+
+In a MySQL shell type:
+
+ create database fsphinx character set utf8;
+ create user 'fsphinx'@'localhost' identified by 'fsphinx';
+ grant ALL on fsphinx.* to 'fsphinx'@'localhost';
+
+Now let's load the data into this database:
+
+ mysql -u fsphinx -D fsphinx -p < ./sql/imdb_top400.data.sql
+
+Creating the Index
+------------------
+
+In this toy example we will consider two movies to be similar if they share "specific" plot keywords. Let's first have a quick look at our movies. In a mysql shell type:
+
+ use fsphinx;
+ select imdb_id, title from titles limit 5;
+
+ +---------+--------------------------+
+ | imdb_id | title |
+ +---------+--------------------------+
+ | 111161 | The Shawshank Redemption |
+ | 61811 | In the Heat of the Night |
+ | 369702 | Mar adentro |
+ | 56172 | Lawrence of Arabia |
+ | 107048 | Groundhog Day |
+ +---------+--------------------------+
+
+Now let's create an index and add some keywords of interest:
+
+ import simsearch
+ from pprint import pprint
+
+ # creating the index in './data/sim-index/'
+ index = simsearch.FileIndex('./data/sim-index', mode='write')
+
+    # adding some features for the item ids 111161 and 107048
+ index.add(111161, 'prison')
+ index.add(111161, 'murder')
+ index.add(111161, 'shawshank')
+ index.add(107048, 'weatherman')
+ index.add(107048, 'weather forecasting')
+ index.close()
+
+SimSearch has created 4 files called .xco, .yco, .ids and .fts in ./data/sim-index/. The files .xco and .yco hold the x and y coordinates of the binary matrix, which represents the presence of a feature value for a given item id. The file .ids keeps track of the item ids with respect to their index in this matrix, and the file .fts keeps track of the feature values; in both files the line number is the matrix index.
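+
+To make this concrete (purely illustrative, the exact on-disk layout may differ), the two items indexed above conceptually give a 2 x 5 binary matrix:
+
+    # .ids (row index -> item id)    # .fts (column index -> feature value)
+    111161                           prison
+    107048                           murder
+                                     shawshank
+                                     weatherman
+                                     weather forecasting
+
+    # .xco / .yco hold the coordinates of the 1s in this matrix:
+    # (0,0) (0,1) (0,2) (1,3) (1,4)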
+
+If we'd like to build a larger index from a database, we would use an indexer. Let's build an index of all the plot keywords found on IMDb for this database.
+
+ # let's create our index
+ index = simsearch.FileIndex('./data/sim-index', mode='write')
+
+ # our database parameters
+ db_params = {'user':'fsphinx', 'passwd':'fsphinx', 'db':'fsphinx'}
+
+ # an iterator to provide the indexer with (id, feature value)
+ bag_of_words_iter = simsearch.BagOfWordsIter(
+ db_params = db_params,
+ sql_features = ['select imdb_id, plot_keyword from plot_keywords']
+ )
+
+    # create the index provisioned by our iterator
+ indexer = simsearch.Indexer(index, bag_of_words_iter)
+
+ # and finally index all the items in our database
+ indexer.index_data()
+
+ 2012-10-03 11:34:11,600 - INFO - SQL: select imdb_id, plot_keyword from plot_keywords
+ 2012-10-03 11:34:12,894 - INFO - Done processing the dataset.
+ 2012-10-03 11:34:12,894 - INFO - Number of items: 424
+ 2012-10-03 11:34:12,895 - INFO - Number of features: 13607
+ 2012-10-03 11:34:12,895 - INFO - 1.29 sec.
+
+It is important to note that the bag of words iterator is just an example. The indexer can take any iterator as long as it returns pairs of (item\_id, feature\_value). The id must always be an integer and the feature\_value a unique string representation of that feature value. Also note that you do not need to use these tools at all: if you can directly create the matrix in .xco and .yco format, SimSearch can read it and perform its magic. For example the matrix could represent user preferences, with coordinates (item\_id, user\_id) indicating that user\_id liked item\_id. Items are then considered similar if they share a set of users liking them (the "you may also like" feature popularized by Amazon).
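+
+As a sketch of what a custom iterator could look like (the data and names below are made up for illustration; it assumes, as stated above, that the indexer accepts any iterator of (item\_id, feature\_value) pairs):
+
+    # hypothetical preference data: (item_id, user_id) meaning "user_id liked item_id"
+    preferences = [(111161, 42), (107048, 42), (107048, 7)]
+
+    def preference_iter(prefs):
+        for item_id, user_id in prefs:
+            # the id must be an integer, the feature value a unique string
+            yield int(item_id), 'user_%s' % user_id
+
+    # index the preference matrix just like the bag of words above
+    index = simsearch.FileIndex('./data/pref-index', mode='write')
+    indexer = simsearch.Indexer(index, preference_iter(preferences))
+    indexer.index_data()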
+
+Querying the Index
+------------------
+
+Now we are ready to query this index and understand why things match. At its core SimSearch performs a sparse matrix multiplication. For efficiency the matrix is converted into [CSR][4] format and loaded in memory. This computed index is then queried using a QueryHandler object.
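+
+This is not SimSearch's own code, but a minimal SciPy sketch of the underlying idea (with made up numbers): the (item, feature) coordinates become a CSR matrix, and scoring every item against a query is a single sparse multiplication.
+
+    import numpy as np
+    from scipy.sparse import csr_matrix
+
+    xco = np.array([0, 0, 0, 1, 1])   # item (row) indexes
+    yco = np.array([0, 1, 2, 3, 4])   # feature (column) indexes
+    X = csr_matrix((np.ones(len(xco)), (xco, yco)), shape=(2, 5))
+
+    # per-feature query weights; the dot product scores all items at once
+    weights = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
+    print X.dot(weights)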
+
+ # let's create a computed index from our file index
+ index = simsearch.ComputedIndex('./data/sim-index/')
+
+ # and a query handler to query it
+ handler = simsearch.QueryHandler(index)
+
+ # now let's see what is similar to "The Shawshank Redemption" (item id 111161)
+ results = handler.query(111161)
+ print results
+
+ You looked for item ids (after cleaning up): 111161
+ Found 100 in 0.00 sec. (showing top 10 here):
+ id = 111161, log score = 18087.2975693
+ id = 455275, log score = 17787.5833743
+ id = 107207, log score = 17784.619186
+ id = 367279, log score = 17782.0579555
+ id = 804503, log score = 17780.7218639
+ id = 795176, log score = 17779.8914104
+ id = 290978, log score = 17777.6663835
+ id = 51808, log score = 17777.0082114
+ id = 861739, log score = 17776.2298019
+ id = 55031, log score = 17776.1551032
+
+SimSearch does not have a storage engine. Instead we have to query our database to see what these movies are:
+
+    select imdb_id, title from titles where imdb_id in (111161,455275,107207,367279,804503,795176,290978,51808,861739,55031) order by field(imdb_id, 111161,455275,107207,367279,804503,795176,290978,51808,861739,55031);
+
+ +---------+------------------------------+
+ | imdb_id | title |
+ +---------+------------------------------+
+ | 111161 | The Shawshank Redemption |
+ | 455275 | Prison Break |
+ | 107207 | In the Name of the Father |
+ | 367279 | Arrested Development |
+ | 804503 | Mad Men |
+ | 795176 | Planet Earth |
+ | 290978 | The Office |
+ | 51808 | Kakushi-toride no san-akunin |
+ | 861739 | Tropa de Elite |
+ | 55031 | Judgment at Nuremberg |
+ +---------+------------------------------+
+
+Obviously it matched itself, but why did "Prison Break" and "In the Name of the Father" match?
+
+    # let's get detailed scores for the movie ids 455275 and 107207
+ scores = handler.get_detailed_scores([455275, 107207], max_terms=5)
+ pprint(scores)
+
+ [{'scores': [(u'Prison Break', 3.9889840465642745),
+ (u'Prison Escape', 3.4431615807611875),
+ (u'Prison Guard', 3.3141860046725258),
+ (u'Jail', 1.906534983820483),
+ (u'Prison', 1.8838747581358608)],
+ 'total_score': 7.2857111578648492},
+ {'scores': [(u'Wrongful Imprisonment', 3.5927355935610334),
+ (u'False Accusation', 2.6005086594980238),
+ (u'Courtroom', 2.2857779746776647),
+ (u'Prison', 1.8838747581358608),
+ (u'Political Conflict', -0.4062528198464137)],
+ 'total_score': 4.3215228336074638}]
+
+Of course things would be much more interesting if we could index all movies in IMDb and consider other feature types such as directors or actors or preference data.
+
+Note that the query handler is not thread safe. It is merely meant to be used once and thrown away after each query. The computed index, on the other hand, is thread safe and should be kept in memory so it can be reused for subsequent queries. Also note that SimSearch is not limited to single item queries; you can just as quickly query with several item ids at once. Care to know what the movies "Lilo & Stitch" and "Up" [have in common][1]?
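+
+For instance, a two item query only needs a fresh query handler and a list of ids (here the ids of "The Shawshank Redemption" and "Groundhog Day" from the titles listed earlier):
+
+    # a new handler for this query, reusing the computed index already in memory
+    handler = simsearch.QueryHandler(index)
+    results = handler.query([111161, 107048])
+    print results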
+
+Although this is a toy example, SimSearch has been shown to perform quite well on millions of documents, each with hundreds of thousands of possible feature values. There are also plans to implement distributed search as well as real time indexing.
+
+Combining Full Text Search with Similarity Search
+-------------------------------------------------
+
+This is already quite interesting, but sometimes we'd like to combine full text search with item based search. For example we'd like to search for specific keywords and order the results by how similar they are to a given set of items. This is accomplished with the simsphinx module. The full text query is handled by [Sphinx][2], so a little bit of setting up is necessary first.
+
+First you need to install [Sphinx][2] and [fSphinx][3].
+
+After you have installed Sphinx, let it index the data (assuming the Sphinx indexer is in /usr/local/sphinx/):
+
+ /usr/local/sphinx/bin/indexer -c ./config/indexer.conf --all
+
+And now let searchd serve the index:
+
+ /usr/local/sphinx/bin/searchd -c ./config/indexer.conf
+
+Note that the "indexer.conf" must have an attribute called "log_score_attr" set to 1 and declared as a float.
+
+ # log_score_attr must be set to 1
+    sql_query = \
+        select *, \
+        1 as log_score_attr \
+        from table
+
+ # log_score_attr will hold the scores of the matching items
+ sql_attr_float = log_score_attr
+
+We are now ready to combine full text search with item based search.
+
+    import fsphinx, sphinxapi  # used below for DBFetch and the sort/match modes
+
+    # creating a sphinx client to handle full text search
+    cl = simsearch.SimClient(handler)
+
+A SimClient is really an FSphinxClient, which itself is a SphinxClient.
+
+ # assuming searchd is running on 9315
+ cl.SetServer('localhost', 9315)
+
+ # telling fsphinx how to fetch the results
+ db = fsphinx.utils.database(dbn='mysql', **db_params)
+
+ cl.AttachDBFetch(fsphinx.DBFetch(db, sql='''
+ select imdb_id as id, title
+ from titles
+ where imdb_id in ($id)
+ order by field(imdb_id, $id)'''
+ ))
+
+ # order the results solely by similarity using the log_score_attr
+ cl.SetSortMode(sphinxapi.SPH_SORT_EXPR, 'log_score_attr')
+
+ # enable us to search within fields
+ cl.SetMatchMode(sphinxapi.SPH_MATCH_EXTENDED2)
+
+ # searching for all animation movies re-ranked by similarity to "The Shawshank Redemption"
+ results = cl.Query('@genres animation @similar 111161')
+
+On seeing the query term "@similar 111161", the client performs a similarity search and then sets the log_score_attr accordingly. Let's have a look at these results:
+
+ # looking at the results with similarity search
+ print results
+
+ matches: (25/25 documents in 0.000 sec.)
+ 1. document=112691, weight=1618
+ ...
+ @sim_scores=[(u'Wrongful Imprisonment', 3.5927355935610334), (u'Prison Escape', 3.4431615807611875), (u'Prison', 1.8838747581358608), (u'Window Washer', -0.4062528198464137), (u'Sheep Rustling', -0.4062528198464137)], release_date_attr=829119600, genre_attr=[3, 5, 6, 9, 19], log_score_attr=17772.2988281, nb_votes_attr=16397
+ id=112691
+ title=Wallace and Gromit in A Close Shave
+ 2. document=417299, weight=1586
+ ...
+ @sim_scores=[(u'Redemption', 1.8838747581358608), (u'Friendship', 0.9769153536905899), (u'Tribe', -0.4062528198464137), (u'Psychic Child', -0.4062528198464137), (u'Flying Animal', -0.4062528198464137)], release_date_attr=1108972800, genre_attr=[2, 3, 9, 10], log_score_attr=17771.71875, nb_votes_attr=10432
+ id=417299
+ title=Avatar: The Last Airbender
+ 3. document=198781, weight=1618
+ ...
+ @sim_scores=[(u'Redemption', 1.8838747581358608), (u'Friend', 1.5656352897757075), (u'Friendship', 0.9769153536905899), (u'Pig Latin', -0.4062528198464137), (u'Hazmat Suit', -0.4062528198464137)], release_date_attr=1016611200, genre_attr=[2, 3, 5, 9, 10], log_score_attr=17766.1152344, nb_votes_attr=99627
+ id=198781
+ title=Monsters, Inc.
+
+Again note that a SimClient is not thread safe. It is merely meant to be used once or sequentially, one request after another. In a web application you will need to create a new client for each new request. You can use SimClient.Clone on each new request for this purpose, or you can create a new client from a config file with SimClient.FromConfig.
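+
+A minimal per-request sketch (only the method names come from the paragraph above; their exact signatures and the config path are assumptions):
+
+    def handle_request(query):
+        # derive a fresh client from the one configured above, for this request only ...
+        req_cl = cl.Clone()
+        # ... or build one from a configuration file instead (hypothetical path):
+        # req_cl = simsearch.SimClient.FromConfig('./config/client.conf')
+        return req_cl.Query(query)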
+
+That's pretty much it. I hope you'll enjoy using SimSearch and please don't forget to leave [feedback][5].
+
+[0]: https://github.com/alexksikes/fSphinx/blob/master/tutorial/
+[1]: http://imdb.cloudmining.net/search?q=%28%40similar+1049413+url--c2%2Fc29a902a5426d4917c0ca2d72a769e5b--title--Up%29++%28%40similar+198781+url--0b%2F0b994b7d73e0ccfd928bd1dfb2d02ce3--title--Monsters%2C+Inc.%29
+[2]: http://sphinxsearch.com
+[3]: https://github.com/alexksikes/fSphinx
+[4]: http://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_.28CSR_or_CRS.29
+[5]: https://mail.google.com/mail/?view=cm&fs=1&tf=1&to=alex.ksikes@gmail.com&su=SimSearch
View
118 tutorial/config/indexer.conf
@@ -0,0 +1,118 @@
+source items
+{
+ type = mysql
+ sql_host = localhost
+ sql_user = fsphinx
+ sql_pass = fsphinx
+ sql_db = fsphinx
+ sql_port = 3306
+
+ sql_query_pre = set character_set_results = utf8
+ # make sure we don't chop off fields with multiple values
+ sql_query_pre = set group_concat_max_len = 50000
+
+ # search in all fields including facets
+ sql_query = \
+ select \
+ imdb_id, \
+ filename, \
+ title, \
+ year, \
+ plot, \
+ also_known_as, \
+ imdb_id as id, \
+ certification, \
+ \
+ year as year_attr, \
+ user_rating as user_rating_attr, \
+ nb_votes as nb_votes_attr, \
+ unix_timestamp(release_date) as release_date_attr, \
+ runtime as runtime_attr, \
+ \
+ 1 as log_score_attr, \
+ \
+ (select group_concat(distinct genre) from genres as g where g.imdb_id = t.imdb_id) as genres, \
+ (select group_concat(distinct director_name) from directors as d where d.imdb_id = t.imdb_id) as directors, \
+ (select group_concat(distinct actor_name) from casts as c where c.imdb_id = t.imdb_id) as actors, \
+ (select group_concat(distinct plot_keyword) from plot_keywords as p where p.imdb_id = t.imdb_id) as plot_keywords\
+ from titles as t
+
+ # sort by year, user_ratings * nb_votes, release_date, runtime
+ sql_attr_float = user_rating_attr
+ sql_attr_uint = nb_votes_attr
+ sql_attr_timestamp = release_date_attr
+ sql_attr_uint = runtime_attr
+
+ # for similarity search
+ sql_attr_float = log_score_attr
+
+ # facets are year, directors, actors, genres, keywords
+ sql_attr_uint = year_attr
+ sql_attr_multi = \
+ uint genre_attr from query; \
+ select g.imdb_id, t.id from genres as g, genre_terms as t where g.genre = t.genre
+ sql_attr_multi = \
+ uint director_attr from query; \
+ select imdb_id, imdb_director_id from directors
+ sql_attr_multi = \
+ uint actor_attr from query; \
+ select imdb_id, imdb_actor_id from casts
+ sql_attr_multi = \
+ uint plot_keyword_attr from query; \
+ select p.imdb_id, t.id from plot_keywords as p, plot_keyword_terms as t where p.plot_keyword = t.plot_keyword
+
+ sql_query_info = \
+ select \
+ imdb_id, \
+ filename, \
+ title, \
+ year, \
+ plot, \
+ also_known_as, \
+ imdb_id as id, \
+ certification, \
+ \
+ year as year_attr, \
+ user_rating as user_rating_attr, \
+ nb_votes as nb_votes_attr, \
+ unix_timestamp(release_date) as release_date_attr, \
+ runtime as runtime_attr, \
+ \