<a href="https://colab.research.google.com/github/Yash-Kamtekar/Approximate-nearest-neighbor/blob/main/Approximate_nearest_neighbor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

importing all the necessary libraries

In [32]:
import numpy as np
import pickle
import pandas as pd

Importing the lightfm library to import the dataset.
First we need to install the library.

In [33]:
pip install lightfm



In [34]:
from lightfm import LightFM
from lightfm.datasets import fetch_movielens
from lightfm.evaluation import precision_at_k
from lightfm.evaluation import auc_score

importing the movielens dataset and getting the train and test data.

In [35]:
movie_lens = fetch_movielens()

train = movie_lens['train']
test = movie_lens['test']

There are 2 models that lightfm uses and we will use both to see which one is better.

1. let us train the model using Bayesian Personalised Ranking (bpr) and look at its accuracy.

In [25]:
model = LightFM(learning_rate=0.05, loss='bpr')
model.fit(train, epochs=10)

bpr_precision_train = precision_at_k(model, train, k=10).mean()
bpr_precision_test = precision_at_k(model, test, k=10, train_interactions=train).mean()

bpr_auc_train = auc_score(model, train).mean()
bpr_auc_test = auc_score(model, test, train_interactions=train).mean()

print('Precision: train %.2f, test %.2f.' % (bpr_precision_train, bpr_precision_test))
print('AUC: train %.2f, test %.2f.' % (bpr_auc_train, bpr_auc_test))

Precision: train 0.60, test 0.20.
AUC: train 0.90, test 0.88.


2. Now, let us train the model using Weighted Approximate-Rank Pairwise (warp) and look at its accuracy.

In [48]:
model = LightFM(learning_rate=0.05, loss='warp')
model.fit_partial(train, epochs=10)

warp_precision_train = precision_at_k(model, train, k=10).mean()
warp_precision_test = precision_at_k(model, test, k=10, train_interactions=train).mean()

warp_auc_train = auc_score(model, train).mean()
warp_auc_test = auc_score(model, test, train_interactions=train).mean()

print('Precision: train %.2f, test %.2f.' % (warp_precision_train, warp_precision_test))
print('AUC: train %.2f, test %.2f.' % (warp_auc_train, warp_auc_test))

Precision: train 0.60, test 0.22.
AUC: train 0.93, test 0.93.


we can clearly get slightly higher precision in warp than bpr.

In [49]:
item_vectors = movie_lens['item_features'] * model.item_embeddings
item_vectors

array([[ 0.2405493 ,  0.02754697,  0.6042577 , ...,  0.757712  ,
         0.06605501,  0.5899007 ],
       [ 0.08846308,  0.28421962,  0.55485255, ...,  0.33963293,
         0.44817957,  0.38973722],
       [-0.16057768,  0.571251  ,  0.13750643, ...,  0.41625515,
         0.09895659,  0.23993613],
       ...,
       [-0.49230114,  0.22580701, -0.510393  , ..., -0.31490573,
        -0.58756775, -0.48926398],
       [-0.36014026,  0.19565772, -0.26121157, ..., -0.3110757 ,
        -0.3958605 , -0.47949326],
       [-0.45961455,  0.09483821, -0.3228401 , ..., -0.21541038,
        -0.26523745, -0.50422144]], dtype=float32)

let us store this data in a variable.
and also save it in a pickle file.

In [50]:
with open('movie_lens.pickle', 'wb') as f:
    pickle.dump({"name": movie_lens['item_feature_labels'], "vector": item_vectors}, f)

data = ({"name": movie_lens['item_feature_labels'], "vector": item_vectors})
data

{'name': array(['Toy Story (1995)', 'GoldenEye (1995)', 'Four Rooms (1995)', ...,
        'Sliding Doors (1998)', 'You So Crazy (1994)',
        'Scream of Stone (Schrei aus Stein) (1991)'], dtype=object),
 'vector': array([[ 0.2405493 ,  0.02754697,  0.6042577 , ...,  0.757712  ,
          0.06605501,  0.5899007 ],
        [ 0.08846308,  0.28421962,  0.55485255, ...,  0.33963293,
          0.44817957,  0.38973722],
        [-0.16057768,  0.571251  ,  0.13750643, ...,  0.41625515,
          0.09895659,  0.23993613],
        ...,
        [-0.49230114,  0.22580701, -0.510393  , ..., -0.31490573,
         -0.58756775, -0.48926398],
        [-0.36014026,  0.19565772, -0.26121157, ..., -0.3110757 ,
         -0.3958605 , -0.47949326],
        [-0.45961455,  0.09483821, -0.3228401 , ..., -0.21541038,
         -0.26523745, -0.50422144]], dtype=float32)}

# **Locality Sensitive Hashing**

lets install faiss and import it.

In [51]:
!pip install faiss-gpu
import faiss



Creating index class

In [52]:
class LSHIndex():
    def __init__(self, vectors, labels):
        self.dimension = vectors.shape[1]
        self.vectors = vectors.astype('float32')
        self.labels = labels    
   
    def build(self, num_bits=8):
        self.index = faiss.IndexLSH(self.dimension, num_bits)
        self.index.add(self.vectors)
          
    def query(self, vectors, k=10):
        distances, indices = self.index.search(vectors, k) 
        return [self.labels[i] for i in indices[0]]

index = LSHIndex(data["vector"], data["name"])
index.build()

In [53]:
movie_vector, movie_name = data['vector'][90:91], data['name'][90]
simlar_movie_questions = '\n* '.join(index.query(movie_vector))
print(f"The most similar movies to {movie_name} are:\n* {simlar_movie_questions}")

The most similar movies to Nightmare Before Christmas, The (1993) are:
* What's Eating Gilbert Grape (1993)
* While You Were Sleeping (1995)
* Die Hard (1988)
* Nightmare Before Christmas, The (1993)
* Fish Called Wanda, A (1988)
* Groundhog Day (1993)
* Cinderella (1950)
* Sound of Music, The (1965)
* Searching for Bobby Fischer (1993)
* Mr. Holland's Opus (1995)


# **Exhaustive Search**

Creating index class

In [59]:
class ExhaustiveIndex():
    def __init__(self, vectors, labels):
        self.dimension = vectors.shape[1]
        self.vectors = vectors.astype('float32')
        self.labels = labels    
   
    def build(self):
        self.index = faiss.IndexFlatL2(self.dimension,)
        self.index.add(self.vectors)
        
    def query(self, vectors, k=10):
        distances, indices = self.index.search(vectors, k) 
        return [self.labels[i] for i in indices[0]]


index = ExhaustiveIndex(data["vector"], data["name"])
index.build()

In [60]:
movie_vector, movie_name = data['vector'][90:91], data['name'][90]
simlar_movie_questions = '\n* '.join(index.query(movie_vector))
print(f"The most similar movie to {movie_name} are:\n* {simlar_movie_questions}")

The most similar movie to Nightmare Before Christmas, The (1993) are:
* Nightmare Before Christmas, The (1993)
* Beauty and the Beast (1991)
* Cinderella (1950)
* Pink Floyd - The Wall (1982)
* Aladdin (1992)
* Princess Bride, The (1987)
* Sword in the Stone, The (1963)
* Monty Python's Life of Brian (1979)
* Interview with the Vampire (1994)
* Braveheart (1995)


# **Product Quantization**

Creating index class

In [61]:
class IVPQIndex():
    def __init__(self, vectors, labels):
        self.dimention = vectors.shape[1]
        self.vectors = vectors.astype('float32')
        self.labels = labels


    def build(self, number_of_partition=8, search_in_x_partitions=2, subvector_size=8):
        quantizer = faiss.IndexFlatL2(self.dimention)
        self.index = faiss.IndexIVFPQ(quantizer, self.dimention, number_of_partition, search_in_x_partitions, subvector_size)
        self.index.train(self.vectors)
        self.index.add(self.vectors)


    def query(self, vectors, k=10):
        distances, indices = self.index.search(vectors, k) 
        return [self.labels[i] for i in indices[0]]


index = IVPQIndex(data["vector"], data["name"])
index.build()

In [67]:
movie_vector, movie_name = data['vector'][90:91], data['name'][90]
simlar_movie_questions = '\n* '.join(index.query(movie_vector))
print(f"The most similar movie to {movie_name} are:\n* {simlar_movie_questions}")

The most similar movie to Nightmare Before Christmas, The (1993) are:
* Nightmare Before Christmas, The (1993)
* Beauty and the Beast (1991)
* Cinderella (1950)
* Braveheart (1995)
* Monty Python's Life of Brian (1979)
* Princess Bride, The (1987)
* Aladdin (1992)
* Abyss, The (1989)
* Stand by Me (1986)
* Evil Dead II (1987)


# **Trees and Graph**

lets install annoy and import it.

In [70]:
!pip install annoy
import annoy

Collecting annoy
  Downloading annoy-1.17.0.tar.gz (646 kB)
[?25l[K     |▌                               | 10 kB 19.0 MB/s eta 0:00:01[K     |█                               | 20 kB 26.1 MB/s eta 0:00:01[K     |█▌                              | 30 kB 12.4 MB/s eta 0:00:01[K     |██                              | 40 kB 9.9 MB/s eta 0:00:01[K     |██▌                             | 51 kB 5.4 MB/s eta 0:00:01[K     |███                             | 61 kB 5.9 MB/s eta 0:00:01[K     |███▌                            | 71 kB 5.8 MB/s eta 0:00:01[K     |████                            | 81 kB 6.5 MB/s eta 0:00:01[K     |████▋                           | 92 kB 5.4 MB/s eta 0:00:01[K     |█████                           | 102 kB 5.2 MB/s eta 0:00:01[K     |█████▋                          | 112 kB 5.2 MB/s eta 0:00:01[K     |██████                          | 122 kB 5.2 MB/s eta 0:00:01[K     |██████▋                         | 133 kB 5.2 MB/s eta 0:00:01[K     |███████

Creating index class

In [71]:
class AnnoyIndex():
    def __init__(self, vectors, labels):
        self.dimention = vectors.shape[1]
        self.vectors = vectors.astype('float32')
        self.labels = labels


    def build(self, number_of_trees=5):
        self.index = annoy.AnnoyIndex(self.dimention)
        for i, vec in enumerate(self.vectors):
            self.index.add_item(i, vec.tolist())
        self.index.build(number_of_trees)

    def query(self, vector, k=10):
        indices = self.index.get_nns_by_vector(vector.tolist(), k)
        return [self.labels[i] for i in indices]


index = AnnoyIndex(data["vector"], data["name"])
index.build()

  if __name__ == '__main__':


In [73]:
movie_vector, movie_name = data['vector'][90], data['name'][90]
similar_movie_questions = '\n* '.join(index.query(movie_vector))
print(f"The most similar movie to {movie_name} are:\n* {simlar_movie_questions}")

The most similar movie to Nightmare Before Christmas, The (1993) are:
* Nightmare Before Christmas, The (1993)
* Beauty and the Beast (1991)
* Cinderella (1950)
* Braveheart (1995)
* Monty Python's Life of Brian (1979)
* Princess Bride, The (1987)
* Aladdin (1992)
* Abyss, The (1989)
* Stand by Me (1986)
* Evil Dead II (1987)


# **Hierarchical Navigable Small World Algorithm**

lets install nmslib and import it.

In [77]:
!pip install nmslib
import nmslib



Creating index class

In [78]:
class HNSWIndex():
    def __init__(self, vectors, labels):
        self.dimention = vectors.shape[1]
        self.vectors = vectors.astype('float32')
        self.labels = labels


    def build(self):
        self.index = nmslib.init(method='hnsw', space='cosinesimil')
        self.index.addDataPointBatch(self.vectors)
        self.index.createIndex({'post': 2})

    def query(self, vector, k=10):
        indices = self.index.knnQuery(vector, k=k)
        return [self.labels[i] for i in indices[0]]


index = HNSWIndex(data["vector"], data["name"])
index.build()

In [79]:
movie_vector, movie_name = data['vector'][90], data['name'][90]
simlar_movie_questions = '\n* '.join(index.query(movie_vector))
print(f"The most similar stack to {movie_name} are:\n* {simlar_movie_questions}")

The most similar stack to Nightmare Before Christmas, The (1993) are:
* Nightmare Before Christmas, The (1993)
* Beauty and the Beast (1991)
* Cinderella (1950)
* Pink Floyd - The Wall (1982)
* Sword in the Stone, The (1963)
* Aladdin (1992)
* Interview with the Vampire (1994)
* Princess Bride, The (1987)
* Monty Python's Life of Brian (1979)
* Stand by Me (1986)
