<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#PySparNN" data-toc-modified-id="PySparNN-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>PySparNN</a></span><ul class="toc-item"><li><span><a href="#PysParNN-also-works-with-scipy-coo-matrices" data-toc-modified-id="PysParNN-also-works-with-scipy-coo-matrices-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>PysParNN also works with scipy coo matrices</a></span></li><li><span><a href="#'Performant'-example" data-toc-modified-id="'Performant'-example-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>'Performant' example</a></span></li><li><span><a href="#Insert-elements" data-toc-modified-id="Insert-elements-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Insert elements</a></span></li><li><span><a href="#Important-notes:" data-toc-modified-id="Important-notes:-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Important notes:</a></span></li></ul></li></ul></div>

# PySparNN


- Git: https://github.com/facebookresearch/pysparnn

- Summary: Approximate Nearest Neighbor Search for Sparse Data in Python




In [16]:
import pysparnn.cluster_index as ci

import numpy as np
import scipy
from scipy.sparse import csr_matrix

features = np.random.binomial(1, 0.01, size=(1000, 20000))
csr_features = csr_matrix(features)
csr_features

<1000x20000 sparse matrix of type '<class 'numpy.int64'>'
	with 199873 stored elements in Compressed Sparse Row format>

In [17]:
# build the search index!
data_to_return = range(1000)
cp = ci.MultiClusterIndex(csr_features, data_to_return)
cp

<pysparnn.cluster_index.MultiClusterIndex at 0x1296b1730>

In [18]:
cp.search(csr_features[:5], k=1, return_distance=False)

[[0.0], [1.0], [2.0], [3.0], [4.0]]
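
To sanity-check an approximate index, it can help to compare its results against exact search. The following is a minimal, dependency-free sketch of brute-force nearest-neighbor search over sparse vectors stored as `{column: value}` dicts. It assumes cosine distance (`1 - cosine_similarity`), which is believed to be PySparNN's default. The names `cosine_distance` and `brute_force_search` are illustrative and not part of PySparNN.

```python
import math
import random


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two sparse dict vectors."""
    # Iterate over the smaller dict so the dot product touches fewer entries.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(value * b.get(col, 0.0) for col, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen here: an empty vector is maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_search(index_vectors, records, query, k=1):
    """Return the k (distance, record) pairs closest to `query` by exhaustive scan."""
    scored = [(cosine_distance(query, vec), rec)
              for vec, rec in zip(index_vectors, records)]
    return sorted(scored, key=lambda pair: pair[0])[:k]


# Small random binary data set, a scaled-down analogue of the example above.
rng = random.Random(0)
n_rows, n_cols = 50, 2000
vectors = [{col: 1.0 for col in rng.sample(range(n_cols), 20)} for _ in range(n_rows)]
records = list(range(n_rows))

# Each indexed row should be its own nearest neighbor.
nearest = [brute_force_search(vectors, records, vectors[i], k=1)[0][1] for i in range(5)]
print(nearest)
```

An exhaustive scan like this costs one distance computation per indexed row, which is the cost an approximate index is meant to avoid on large collections.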

## PysParNN also works with scipy coo matrices

In [19]:
# build the search index!
data_to_return = range(1000)
coo_features = scipy.sparse.coo_matrix(features)
cp = ci.MultiClusterIndex(coo_features, data_to_return)
cp.search(csr_features[:5], k=1, return_distance=False)

[[0.0], [1.0], [2.0], [3.0], [4.0]]

## 'Performant' example

In [99]:
import pysparnn.cluster_index as ci

from sklearn.feature_extraction.text import TfidfVectorizer

data = [
    'hello world',
    'oh hello there',
    'Play it',
    'Play it again Sam',
]   

keys = range(len(data))

tv = TfidfVectorizer()
tv.fit(data)

features_vec = tv.transform(data)

# build the search index!
cp = ci.MultiClusterIndex(features_vec, data)


In [21]:
# search the index with a sparse matrix
search_data = [
    'oh there',
    'Play it again Frank'
]

search_features_vec = tv.transform(search_data)
cp.search(search_features_vec, k=5, k_clusters=2, return_distance=False)

[['oh hello there', 'hello world', 'Play it', 'Play it again Sam'],
 ['Play it again Sam', 'Play it', 'hello world', 'oh hello there']]

In Jina, we would index integer keys rather than the raw documents, so that a search returns keys that can be mapped back to documents:

In [22]:
keys = list(range(len(data)))

In [205]:
cp = ci.MultiClusterIndex(features_vec, keys)
cp.search(search_features_vec[0], k=5, k_clusters=2, return_distance=True)

[[(0.12656138024977548, 1.0), (1.0, 0.0), (1.0, 2.0), (1.0, 3.0)]]
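
The distances in this output can be reproduced by hand. The sketch below assumes scikit-learn's default `TfidfVectorizer` settings (lowercased tokens and smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`) and assumes the index uses cosine distance, `1 - cosine_similarity`. The whitespace tokenizer is a simplification that happens to agree with the vectorizer on this tiny corpus. Under those assumptions, the query 'oh there' lies at a distance of about 0.1266 from 'oh hello there' and at distance 1.0 from every document that shares no terms with it.

```python
import math
from collections import Counter

docs = ['hello world', 'oh hello there', 'Play it', 'Play it again Sam']


def tokenize(text):
    # Simplified stand-in for the vectorizer's tokenizer: lowercase, split on spaces.
    return text.lower().split()


n_docs = len(docs)
# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))
# Smoothed inverse document frequency (assumed TfidfVectorizer default).
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf(text):
    """Sparse TF-IDF vector as a {term: weight} dict; unknown terms are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}


def cosine_distance(a, b):
    """Cosine distance between two non-empty sparse dict vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


query = tfidf('oh there')
distances = [cosine_distance(query, tfidf(doc)) for doc in docs]
print([round(d, 4) for d in distances])  # [1.0, 0.1266, 1.0, 1.0]
```

The L2 normalization that `TfidfVectorizer` applies by default is omitted because scaling a vector does not change its cosine distance to another vector.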

## Insert elements

In [24]:
cp = ci.MultiClusterIndex(features_vec, data)

In [25]:
record = "Hello Play it again"
record_feat = tv.transform([record])

In [26]:
cp.insert(feature=record_feat, record=record)

In [27]:
cp.search(record_feat, k=2)

[[('0.0', 'Hello Play it again'),
  ('0.26407460986755116', 'Play it again Sam')]]

## Important notes: 

In [185]:
cp = ci.MultiClusterIndex(features_vec, data)
record = "Hello Play it again"
record_feat = tv.transform([record])
cp.insert(feature=record_feat, record=record)

Note one subtlety: `insert` does not check whether a record is already in the index, so the same element can be inserted twice without any warning.

- If we run `cp.insert(feature=record_feat, record=record)` a second time
- and then `cp.search(record_feat, k=3)`, the record still appears only once in the results:

In [177]:
cp.insert(feature=record_feat, record=record)

In [179]:
cp.search(record_feat, k=3)

[[('0.0', 'Hello Play it again'),
  ('0.26407460986755116', 'Play it again Sam')]]
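
If duplicate inserts are a concern, one option is to track inserted records outside the index. Below is a minimal sketch that is not part of PySparNN: the hypothetical `DedupInserter` wraps any object exposing `insert(feature=..., record=...)`, as used above, and skips records it has already seen. `FakeIndex` exists only so the example runs without PySparNN, and records are assumed to be hashable.

```python
class DedupInserter:
    """Wrap an index and skip records that were already inserted."""

    def __init__(self, index, known_records=()):
        self._index = index
        # Records must be hashable (strings, ints, tuples, ...).
        self._seen = set(known_records)

    def insert(self, feature, record):
        """Insert `record` once; return True if inserted, False if duplicate."""
        if record in self._seen:
            return False
        self._index.insert(feature=feature, record=record)
        self._seen.add(record)
        return True


class FakeIndex:
    """Stand-in for an index with an `insert` method (illustration only)."""

    def __init__(self):
        self.inserted = []

    def insert(self, feature, record):
        self.inserted.append(record)


fake = FakeIndex()
guarded = DedupInserter(fake, known_records=['Play it again Sam'])

first = guarded.insert(feature=None, record='Hello Play it again')
second = guarded.insert(feature=None, record='Hello Play it again')
print(first, second, fake.inserted)
```

With a real index, something like `DedupInserter(cp, known_records=data)` would wrap the `cp` built above. This only guards against repeated records, not against distinct records that share identical feature vectors.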

## Using PySparNN in Jina

In [194]:
from jina.executors.indexers.vector import BaseVectorIndexer


class PysparnnIndexer(BaseVectorIndexer):
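    """
    :class:`PysparnnIndexer` performs approximate nearest neighbor search
    over sparse vectors using PySparNN.

    Vectors are staged in a dict via `add`/`update`/`delete`. The search
    index is built on the first `query` (or by `_build_advanced_index`);
    after that, the staged data can no longer be modified.

    :param k_clusters: passed to `MultiClusterIndex.search` as `k_clusters`
    :param num_indexes: passed to `MultiClusterIndex.search` as `num_indexes`
    """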

    def __init__(self, k_clusters=2, num_indexes=None, *args, **kwargs):
        super().__init__(*args, **kwargs)

        self.k_clusters = k_clusters
        self.num_indexes = num_indexes

    def post_init(self):
        self.index = {}
        self.mci = None

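    def _build_advanced_index(self):
        """Stack the staged vectors and build the PySparNN search index."""
        import scipy.sparse
        import pysparnn.cluster_index as ci

        keys = []
        indexed_vectors = []
        # dicts preserve insertion order, so keys[i] matches row i of the stack
        for key, vector in self.index.items():
            keys.append(key)
            indexed_vectors.append(vector)

        self.mci = ci.MultiClusterIndex(scipy.sparse.vstack(indexed_vectors), keys)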

    def query(self, vectors, top_k, *args, **kwargs):
        # build the search index lazily if it does not exist yet
        if self.mci is None:
            self._build_advanced_index()

        print('build advanced index done')
        index_distance_pairs = self.mci.search(vectors,
                                               k=top_k,
                                               k_clusters=self.k_clusters,
                                               num_indexes=self.num_indexes,
                                               return_distance=True)
        # search returns one list of (distance, key) pairs per query row;
        # only the first query row is unpacked here
        distances, indices = zip(*index_distance_pairs[0])

        return indices, distances

    def add(self, keys, vectors, *args, **kwargs):
        if self.mci is not None:
            raise RuntimeError('Cannot add vectors after the search index has been built')
        for key, vector in zip(keys, vectors):
            self.index[key] = vector

    def update(self, keys, vectors, *args, **kwargs) -> None:
        if self.mci is not None:
            raise RuntimeError('Cannot update vectors after the search index has been built')
        for key, vector in zip(keys, vectors):
            self.index[key] = vector

    def delete(self, keys, *args, **kwargs) -> None:
        if self.mci is not None:
            raise RuntimeError('Cannot delete vectors after the search index has been built')
        for key in keys:
            del self.index[key]


In [195]:
indexer = PysparnnIndexer()

PysparnnIndexer@3355[I]:post_init may take some time...
PysparnnIndexer@3355[I]:post_init may take some time takes 0 seconds (0.00s)


In [196]:
indexer.post_init()

In [197]:
for index in range(len(data)):
    indexer.add(keys=[index], vectors=[features_vec[index]])

In [198]:
indexer._build_advanced_index()

In [200]:
indices, distances = indexer.query(search_features_vec[0], top_k=4)

build advanced index done


In [202]:
indices

(1.0, 0.0, 2.0, 3.0)

In [203]:
distances

(0.12656138024977548, 1.0, 1.0, 1.0)
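
Note that `query` unpacks only the results for the first query vector (`index_distance_pairs[0]`). As the earlier outputs show, `search` with `return_distance=True` returns one list of `(distance, record)` pairs per query row. A hedged sketch of unpacking every row follows, using a hypothetical helper `split_results` and a hand-written sample in that shape:

```python
def split_results(results):
    """Split per-query lists of (distance, record) pairs into parallel lists.

    Returns (indices, distances), each holding one list per query row.
    """
    indices, distances = [], []
    for pairs in results:
        # An empty result list for a row yields two empty lists.
        indices.append([record for _, record in pairs])
        distances.append([distance for distance, _ in pairs])
    return indices, distances


# Sample in the shape shown earlier; the second row is made up for illustration.
sample = [
    [(0.12656138024977548, 1.0), (1.0, 0.0), (1.0, 2.0), (1.0, 3.0)],
    [(0.25, 3.0), (0.5, 2.0)],
]
indices, distances = split_results(sample)
print(indices)
print(distances)
```

Building the lists row by row, rather than with `zip(*pairs)`, also avoids an unpacking error when a query row has no results.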

### Understanding `_build_advanced_index`

In [132]:
indexer = PysparnnIndexer()
indexer.post_init()

PysparnnIndexer@3355[I]:post_init may take some time...
PysparnnIndexer@3355[I]:post_init may take some time takes 0 seconds (0.00s)


In [133]:
for index in range(len(data)):
    indexer.add(keys=[index], vectors=[features_vec[index]])

In [134]:
indexer.index

{0: <1x8 sparse matrix of type '<class 'numpy.float64'>'
 	with 2 stored elements in Compressed Sparse Row format>,
 1: <1x8 sparse matrix of type '<class 'numpy.float64'>'
 	with 3 stored elements in Compressed Sparse Row format>,
 2: <1x8 sparse matrix of type '<class 'numpy.float64'>'
 	with 2 stored elements in Compressed Sparse Row format>,
 3: <1x8 sparse matrix of type '<class 'numpy.float64'>'
 	with 4 stored elements in Compressed Sparse Row format>}

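`_build_advanced_index` plays the role of `fit`: it collects the staged keys and vectors and builds the `MultiClusterIndex` from them. The following cells reproduce its steps by hand.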

In [135]:
keys = []
indexed_vectors = []
import pysparnn.cluster_index as ci
for key, vector in indexer.index.items():
    keys.append(key)
    indexed_vectors.append(vector)

In [136]:
indexed_vectors

[<1x8 sparse matrix of type '<class 'numpy.float64'>'
 	with 2 stored elements in Compressed Sparse Row format>,
 <1x8 sparse matrix of type '<class 'numpy.float64'>'
 	with 3 stored elements in Compressed Sparse Row format>,
 <1x8 sparse matrix of type '<class 'numpy.float64'>'
 	with 2 stored elements in Compressed Sparse Row format>,
 <1x8 sparse matrix of type '<class 'numpy.float64'>'
 	with 4 stored elements in Compressed Sparse Row format>]

In [138]:
aux = ci.MultiClusterIndex(scipy.sparse.vstack(indexed_vectors), keys)

In [139]:
keys

[0, 1, 2, 3]
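
The alignment between `keys` and the rows of the stacked matrix relies on Python dicts preserving insertion order (guaranteed since Python 3.7), so the i-th key always corresponds to the i-th stacked vector. A small illustration, with plain lists standing in for the sparse rows (the values are made up):

```python
# Plain lists stand in for the 1x8 sparse rows stored in `indexer.index`.
index = {0: [0.0, 0.7], 1: [0.5, 0.5], 2: [0.9, 0.0], 3: [0.3, 0.3]}

# Splitting the items in one pass keeps each key paired with its vector.
keys, vectors = zip(*index.items())

# Row i of the stacked data belongs to keys[i].
for row, key in enumerate(keys):
    assert vectors[row] is index[key]

print(list(keys))
```

The `zip(*...)` idiom fails on an empty dict, so the explicit loop used above is the safer choice when the index may be empty.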