Approximate Nearest Neighbor Search for Sparse Data in Python! This library is well suited to finding nearest neighbors in sparse, high dimensional spaces (like text documents).
Out of the box, PySparNN supports Cosine Distance (i.e. 1 - cosine_similarity).
- Designed to be efficient on sparse data (memory & cpu).
- Implemented leveraging existing python libraries (scipy & numpy).
- Easily extended with other metrics: Manhattan, Euclidean, Jaccard, etc.
- Supports incremental insertion of elements.
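As a rough illustration of the default metric, the sketch below computes cosine distance (1 - cosine_similarity) between two sparse row vectors using only scipy and numpy. The helper name `cosine_distance` is made up for this example and is not part of the PySparNN API; returning 1.0 for all-zero vectors is likewise just a convention chosen here.

```python
import numpy as np
from scipy.sparse import csr_matrix


def cosine_distance(a, b):
    """Return 1 - cosine_similarity for two 1 x n sparse row vectors.

    Hypothetical helper for illustration; not a PySparNN function.
    """
    dot = a.multiply(b).sum()
    norm_a = np.sqrt(a.multiply(a).sum())
    norm_b = np.sqrt(b.multiply(b).sum())
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: an all-zero vector is maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


a = csr_matrix([[1.0, 0.0, 2.0, 0.0]])
b = csr_matrix([[2.0, 0.0, 4.0, 0.0]])  # same direction as a
c = csr_matrix([[0.0, 3.0, 0.0, 0.0]])  # shares no non-zero column with a

dist_ab = cosine_distance(a, b)
dist_ac = cosine_distance(a, c)
```

Vectors pointing in the same direction have a distance near 0, while vectors with no overlapping non-zero entries have a distance of 1.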
The most comparable library to PySparNN is scikit-learn's LSHForest module. As of this writing, PySparNN is ~4x faster on the 20newsgroups dataset (represented as sparse vectors). More robust benchmarking on sparse data is desired. Comparisons are available for the 20newsgroups dataset and for the larger Enron email dataset.
    import pysparnn.cluster_index as ci

    import numpy as np
    from scipy.sparse import csr_matrix

    features = np.random.binomial(1, 0.01, size=(1000, 20000))
    features = csr_matrix(features)

    # build the search index!
    data_to_return = range(1000)
    cp = ci.MultiClusterIndex(features, data_to_return)

    cp.search(features[:5], k=1, return_distance=False)
    >> [[0], [1], [2], [3], [4]]
    import pysparnn.cluster_index as ci

    from sklearn.feature_extraction.text import TfidfVectorizer

    data = [
        'hello world',
        'oh hello there',
        'Play it',
        'Play it again Sam',
    ]

    tv = TfidfVectorizer()
    tv.fit(data)

    features_vec = tv.transform(data)

    # build the search index!
    cp = ci.MultiClusterIndex(features_vec, data)

    # search the index with a sparse matrix
    search_data = [
        'oh there',
        'Play it again Frank'
    ]

    search_features_vec = tv.transform(search_data)

    cp.search(search_features_vec, k=1, k_clusters=2, return_distance=False)
    >> [['oh hello there'], ['Play it again Sam']]
PySparNN requires numpy and scipy. Tested with numpy 1.11.2 and scipy 0.18.1.
    # clone pysparnn
    cd pysparnn
    pip install -r requirements.txt
    python setup.py install
How PySparNN works
Searching for a document in a collection of D documents is naively O(D) (assuming documents are constant sized).
However, we can create a tree structure where the first level is O(sqrt(D)) and each of the leaves is also O(sqrt(D)) on average.
We randomly pick sqrt(D) candidate items to be in the top level. Then each document in the full list of D documents is assigned to the closest candidate in the top level.
This breaks up one O(D) search into two O(sqrt(D)) searches, which is much faster when D is large!
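The two-level idea described above can be sketched in a few lines of scipy and numpy. This is a simplified illustration, not PySparNN's actual implementation: the class `TwoLevelIndex`, the helper `normalize_rows`, and their parameters are hypothetical names, and the sketch assumes cosine similarity on row-normalized vectors.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags


def normalize_rows(X):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0  # leave all-zero rows untouched
    return csr_matrix(diags(1.0 / norms) @ X)


class TwoLevelIndex:
    """Hypothetical two-level index: sqrt(D) random candidates, then leaves."""

    def __init__(self, features, records, seed=0):
        self.X = normalize_rows(csr_matrix(features, dtype=float))
        self.records = list(records)
        n = self.X.shape[0]
        n_top = max(1, int(np.sqrt(n)))
        rng = np.random.default_rng(seed)
        # Randomly pick about sqrt(D) documents as top-level candidates.
        top_ids = rng.choice(n, size=n_top, replace=False)
        self.top = self.X[top_ids]
        # Assign every document to its most similar candidate.
        assign = (self.X @ self.top.T).toarray().argmax(axis=1)
        self.leaves = [np.flatnonzero(assign == c) for c in range(n_top)]

    def search(self, query, k=1, k_clusters=1):
        q = normalize_rows(csr_matrix(query, dtype=float))
        top_sims = (q @ self.top.T).toarray()
        results = []
        for i in range(q.shape[0]):
            # First search: the k_clusters closest top-level candidates.
            clusters = np.argsort(-top_sims[i])[:k_clusters]
            cand = np.concatenate([self.leaves[c] for c in clusters])
            if cand.size == 0:
                results.append([])
                continue
            # Second search: only the documents inside those leaves.
            sims = (q[i] @ self.X[cand].T).toarray().ravel()
            best = cand[np.argsort(-sims)[:k]]
            results.append([self.records[j] for j in best])
        return results


# Small random sparse data set with real-valued entries (avoids exact ties).
rng = np.random.default_rng(0)
dense = rng.random((200, 500)) * (rng.random((200, 500)) < 0.05)
features = csr_matrix(dense)

index = TwoLevelIndex(features, range(200))
results = index.search(features[:5], k=1, k_clusters=2)
```

Searching with documents that are already in the index should return each document itself, since a query follows the same path to its leaf that the document took when it was assigned. Raising `k_clusters` trades speed for recall by scanning more leaves.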
This generalizes to h levels. The runtime becomes O(h * h_root(D)), where h_root(D) is the h-th root of D.
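To get a feel for the formula, the snippet below evaluates h * D^(1/h) for a few values of h. It is only back-of-the-envelope arithmetic under the idealized assumption of perfectly balanced levels; the function name `comparisons_per_query` is made up for this example, and constant factors are ignored.

```python
def comparisons_per_query(num_docs, levels):
    """Idealized comparison count: `levels` searches over num_docs**(1/levels) items."""
    return levels * num_docs ** (1.0 / levels)


# Estimated comparisons for one million documents at different tree depths.
estimates = {h: round(comparisons_per_query(1_000_000, h)) for h in (1, 2, 3, 4)}
for h, cost in estimates.items():
    print(h, cost)
```

For one million documents, a flat scan touches every document, two levels need roughly 2,000 comparisons, and three levels roughly 300.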
See the CONTRIBUTING file for how to help out.
PySparNN is BSD-licensed. We also provide an additional patent grant.