# Project: Movies recommendation
## Part-1 Check out the index/algorithms for recall

#### Objective:

* Gain general understanding of the algorithms
* Change the parameters and observe impact on behavior of various algorithms
* Select the algorithm

https://github.com/facebookresearch/faiss/wiki/Getting-started

https://cheatsheet.md/python-cheatsheet/faiss-python-api

#### Supported Distance metric

https://faiss.ai/cpp_api/file/MetricType_8h.html#_CPPv4N5faiss10MetricTypeE
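
The two metrics used most often, `METRIC_L2` and `METRIC_INNER_PRODUCT`, rank neighbors identically when every vector has unit length, because then the squared L2 distance equals 2 minus twice the inner product. The sketch below checks that identity on random data; it only assumes unit-length vectors and makes no claim about how this dataset's embeddings are normalized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random vectors, scaled to unit length (an assumption of this sketch)
a = rng.standard_normal(8)
b = rng.standard_normal(8)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

sq_l2 = np.sum((a - b) ** 2)   # squared Euclidean distance
inner = np.dot(a, b)           # inner product (cosine similarity for unit vectors)

# For unit vectors: ||a - b||^2 = 2 - 2 * <a, b>
print(sq_l2, 2 - 2 * inner)
```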


#### Notebook hangs?
This can happen on machines with limited resources.

* Try restarting the local notebook server
        - jupyter notebook stop 
        - jupyter notebook
* If that doesn't help, use *Google Colab*
* Upload this notebook
* Run the cell below to install the required packages

In [None]:
## Needed for notebook on Google Colab
# !pip install faiss-gpu datasets

In [1]:
import faiss
from datasets import load_dataset
import pandas as pd
import numpy as np

## 1. Load dataset acloudfan/embedded_movies_small

In [2]:
movies_dataset_name = 'acloudfan/embedded_movies_small'

movies_dataset = load_dataset(movies_dataset_name)

# This will hold the data for movies, will be cross referenced for details
movies_dataset_train = movies_dataset['train']
# Embeddings need to be in numpy array with dtype=float32
movies_dataset_train_np = np.array(movies_dataset_train['plot_embedding']).astype(np.float32)

# This will hold the details for test dataset
movies_dataset_test = movies_dataset['test']
movies_dataset_test_np = np.array(movies_dataset_test['plot_embedding']).astype(np.float32)

In [3]:
# Check the embedding dimension
embeddings_dimension = len(movies_dataset_test_np[0])

embeddings_dimension

1536

## 2. Utility methods 

Helper functions that print the recommended movies in a readable form.

In [4]:
# Utility method to print the movie information
def print_movie(movie):
    print('title = ', movie['title'])
    print('genres = ', movie['genres'])
    print('fullplot = ', movie['fullplot'])

# Utility method to run search and print results
# Returns the indexes of the matching movies in the train split
def query_embeddings(faiss_index, k, test_index):
    query_embedding = movies_dataset_test_np[test_index]
    query_embedding = np.expand_dims(query_embedding, axis=0)

    result_indexes = []
    
    print('Query Result')
    print('-----------------')
    print_movie(movies_dataset_test[test_index])

    distances, movie_indexes = faiss_index.search(query_embedding, k)

    print(distances)
    for i, movie_index in enumerate(movie_indexes[0]):
        result_indexes.append(movie_index)
        print(i,'--',movie_index,'-- Distance = ', distances[0][i],'---')
        print_movie(movies_dataset_train[movie_index.item()])

    return result_indexes
        

## 3. Setup FlatL2 index (Baseline)

#### Baseline metrics = Exact Search
Brute-force k-nearest-neighbor (KNN) search using the Euclidean (L2) distance.

##### FAISS API Reference
https://faiss.ai/cpp_api/file/IndexFlat_8h.html#_CPPv4N5faiss9IndexFlatE
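
To make the baseline concrete, here is a minimal NumPy sketch of exact search: compute the distance from the query to every stored vector and keep the `k` smallest. The function name `exact_knn` and the toy data are made up for illustration. Like `IndexFlatL2`, the sketch reports squared L2 distances.

```python
import numpy as np

def exact_knn(database, query, k):
    """Return (squared L2 distances, row indexes) of the k nearest rows."""
    diffs = database - query
    # Row-wise dot product of diffs with itself = squared L2 distance
    sq_dists = np.einsum('ij,ij->i', diffs, diffs)
    nearest = np.argsort(sq_dists)[:k]
    return sq_dists[nearest], nearest

# Toy data, for illustration only
database = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]], dtype=np.float32)
query = np.array([0.9, 0.1], dtype=np.float32)

distances, indexes = exact_knn(database, query, k=2)
print(indexes, distances)
```

Every query scans all stored vectors, which is why this index gives exact results and serves as the reference for the approximate indexes that follow.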

In [5]:
# Create the index
flatl2_index = faiss.IndexFlatL2(embeddings_dimension)

# Add the training embeddings to the index
flatl2_index.add(movies_dataset_train_np)

# Check if index needs training
flatl2_index.is_trained

True

In [6]:
%%time

# Change this index to try out different movies (~400 rows)
test_movie_index = 14

# Change the value of k as needed
k = 2

# Test for a few movies
baseline_result_indexes = query_embeddings(flatl2_index,k,test_movie_index)

print('-----Query Movie----')
# The query movie comes from the test split (see query_embeddings)
print(movies_dataset_test[test_movie_index]['fullplot'])
print('--------------------')
print("Baseline indexes = ", baseline_result_indexes)

Query Result
-----------------
title =  The Accidental Spy
genres =  ['Action', 'Comedy', 'Thriller']
fullplot =  This action movie unfolds with the story of Bei, a salesman at a workout equipment store, who harbors dreams of adventures. It all starts when on one normal dull day, Bei follows his instincts to trail two suspicious looking men into an alley. When he realizes that these men are robbing a jewelry store, he jumps into action to foil their plans. Soon after Bei meets Liu, a private investigator who convinces Bei that he may be the long-lost son of a rich Korean businessman. In no time, Bei is on his way to fulfill his dreams of adventure and fortune travelling to Korea and even exotic Turkey. As Bei is drawn deeper into the game of cat and mouse, he realizes he has become the key to locating a lung cancer virus. With an assortment of characters fighting him along the way, will Bei succeed in finding the virus himself?
[[0.32972398 0.3340392 ]]
0 -- 917 -- Distance =  0.329723

## 4. Setup IVFFlat

C++ class reference.

https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVFFlat.html
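
As a rough illustration of the inverted-file idea, the NumPy sketch below splits the vectors into cells around centroids and searches only the `nprobe` cells closest to the query. It is a simplification under stated assumptions: the centroids are randomly picked data points rather than the k-means centroids FAISS trains, and the names `sq_l2` and `ivf_search` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 16)).astype(np.float32)
query = rng.standard_normal(16).astype(np.float32)

nlist = 10  # number of cells

# Simplification: random data points stand in for trained k-means centroids
centroids = data[rng.choice(len(data), size=nlist, replace=False)]

def sq_l2(rows, vector):
    """Squared L2 distance between each row and a single vector."""
    return ((rows - vector) ** 2).sum(axis=1)

# Assign every vector to the cell of its nearest centroid
assignments = np.array([np.argmin(sq_l2(centroids, v)) for v in data])

def ivf_search(query, k, nprobe):
    # Pick the nprobe cells whose centroids are closest to the query
    probe_cells = np.argsort(sq_l2(centroids, query))[:nprobe]
    # Only vectors stored in those cells are compared with the query
    candidates = np.flatnonzero(np.isin(assignments, probe_cells))
    order = np.argsort(sq_l2(data[candidates], query))[:k]
    return candidates[order]

exact = np.argsort(sq_l2(data, query))[:5]
print('exact     :', exact)
print('nprobe=1  :', ivf_search(query, 5, nprobe=1))
print('nprobe=all:', ivf_search(query, 5, nprobe=nlist))
```

Probing every cell reproduces the exact result; probing fewer cells is faster but can miss neighbors that fall in unprobed cells.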

In [None]:
# Number of cells (clusters) to create
# Changing this number changes the performance & recall
nlist_ivfflat = 200

# Coarse quantizer: the index IVF uses to assign each vector to its nearest cell centroid
quantizer = faiss.IndexFlatL2(embeddings_dimension)

# Index creation
index_ivfflat = faiss.IndexIVFFlat(quantizer, embeddings_dimension, nlist_ivfflat)

# An IVF index must be trained before vectors can be added
print("Is trained: ", index_ivfflat.is_trained)

### Train the index

In [None]:
%%time

# Train the index (learns the cell centroids), then add the embeddings
index_ivfflat.train(movies_dataset_train_np)
index_ivfflat.add(movies_dataset_train_np)

print("Is trained: ", index_ivfflat.is_trained)

### Run test query

In [None]:
%%time

# Set the number of cells to be searched
nprobes = 1

index_ivfflat.nprobe = nprobes

# Test for a few movies
result_indexes = query_embeddings(index_ivfflat,k,test_movie_index)

print('----------')
print('Baseline indexes = ', baseline_result_indexes)
print('Query indexes = ', result_indexes)
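
Comparing the two printed lists by eye gets tedious once `k` grows. One option is a small helper that reports the fraction of baseline neighbors the approximate index also returned. The name `recall_at_k` is hypothetical, and the ids in the example call are made up.

```python
def recall_at_k(baseline_indexes, result_indexes):
    """Fraction of baseline (exact) neighbors that also appear in the results."""
    baseline = {int(i) for i in baseline_indexes}
    # FAISS pads missing results with -1, so negative ids are ignored
    results = {int(i) for i in result_indexes if int(i) >= 0}
    if not baseline:
        return 0.0
    return len(baseline & results) / len(baseline)

# Made-up ids, for illustration only
print(recall_at_k([3, 8, 5], [8, 3, 9]))
```

In this notebook the call would be `recall_at_k(baseline_result_indexes, result_indexes)` after any of the test queries.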

## 5. LSH

* Change the nbits to see the effect on build time

https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexLSH.html
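
A minimal sketch of the hashing idea, assuming the common sign-of-random-projection scheme: each vector is projected onto `nbits` random directions, the signs become a binary code, and codes are compared by Hamming distance. The internals of `IndexLSH` may differ in detail, and the names `hash_codes` and `lsh_search` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, nbits = 32, 64
data = rng.standard_normal((200, d))

# One random direction per bit of the hash code
planes = rng.standard_normal((d, nbits))

def hash_codes(vectors):
    """Binary code per vector: True where the projection is positive."""
    return (vectors @ planes) > 0

codes = hash_codes(data)

def lsh_search(query, k):
    q_code = hash_codes(query[None, :])[0]
    # Hamming distance = number of differing bits
    hamming = (codes != q_code).sum(axis=1)
    return np.argsort(hamming, kind='stable')[:k], hamming

ids, hamming = lsh_search(data[7], 3)
print(ids, hamming[ids])
```

More bits give finer buckets and better recall at the cost of larger codes and longer build time, which is the trade-off `nbits` controls.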

In [None]:
%%time

# Bucket resolution = size of the hash code in bits
# Change nbits to see the difference in the index build time
nbits = 5 * embeddings_dimension

# initialize index and add vectors
index_lsh = faiss.IndexLSH(embeddings_dimension, nbits)

print("is trained : ", index_lsh.is_trained)

index_lsh.train(movies_dataset_train_np)
index_lsh.add(movies_dataset_train_np)


### Run test query

In [None]:
%%time

# Test for a few movies
result_indexes = query_embeddings(index_lsh,k,test_movie_index)

print('----------')
print('Baseline indexes = ', baseline_result_indexes)
print('Query indexes = ', result_indexes)

## 6. Product Quantization

https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexPQ.html#

IndexPQ(int d, size_t M, size_t nbits, MetricType metric = METRIC_L2)

- d – dimensionality of the input vectors
- M – number of subquantizers
- nbits – number of bits per subvector index

In [None]:
# Number of subspaces (one sub-quantizer per subspace)
m = 128

# Number of bits per sub-quantizer index; each sub-quantizer has K = 2**nbits centroids
nbits = 3

# Embedding dimension must be divisible by the number of subspaces
assert embeddings_dimension % m == 0

# Create the index
index_pq = faiss.IndexPQ(embeddings_dimension, m, nbits)

print("is_trained :", index_pq.is_trained)
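
To see what these settings mean for storage, the arithmetic below works out the size of one encoded vector for the values used above (d = 1536, m = 128, nbits = 3) and compares it with the raw float32 vector. It is plain arithmetic and does not call FAISS.

```python
d = 1536     # embedding dimension reported earlier in the notebook
m = 128      # number of subspaces / sub-quantizers
nbits = 3    # bits per sub-vector index

sub_vector_dim = d // m                # dimensions handled by each sub-quantizer
centroids_per_subspace = 2 ** nbits    # codebook size per sub-quantizer

raw_bytes = d * 4                      # float32 = 4 bytes per dimension
code_bits = m * nbits                  # one nbits-wide index per subspace
code_bytes = (code_bits + 7) // 8      # rounded up to whole bytes

print('sub-vector dimension :', sub_vector_dim)
print('centroids / subspace :', centroids_per_subspace)
print('raw vector bytes     :', raw_bytes)
print('PQ code bytes        :', code_bytes)
print('compression ratio    :', raw_bytes / code_bytes)
```

Each 12-dimensional sub-vector is replaced by the id of one of only 8 centroids, so the codes are compact but lossy; raising `nbits` or `m` trades memory for accuracy.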


### Train index

In [None]:
%%time

index_pq.train(movies_dataset_train_np)
index_pq.add(movies_dataset_train_np)

print("is_trained :", index_pq.is_trained)
print("ntotal : ", index_pq.ntotal)

### Run test query

In [None]:
%%time

# Test for a few movies
result_indexes = query_embeddings(index_pq,k,test_movie_index)

print('----------')
print('Baseline indexes = ', baseline_result_indexes)
print('Query indexes = ', result_indexes)

## 7. IVF + PQ

https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVFPQ.html

IndexIVFPQ(Index *quantizer, size_t d, size_t nlist, size_t M, size_t nbits_per_idx, MetricType metric = METRIC_L2)

In [None]:
# Number of cells; changing this number changes the performance & recall
nlist_ivfpq = 100

# Number of subspaces
number_subspaces = 8  

# Number of bits per sub-vector index (2**bits centroids per subspace)
number_bits_per_centroid = 4

quantizer = faiss.IndexFlatL2(embeddings_dimension)  # we keep the same L2 distance flat index
index_ivfpq = faiss.IndexIVFPQ(quantizer, embeddings_dimension, nlist_ivfpq, number_subspaces, number_bits_per_centroid) 

print("Is trained: ", index_ivfpq.is_trained)

### Train the index

In [None]:
%%time

# Train the index (cell centroids and PQ codebooks), then add the embeddings
index_ivfpq.train(movies_dataset_train_np)
index_ivfpq.add(movies_dataset_train_np)

print("Is trained: ", index_ivfpq.is_trained)
print("ntotal : ", index_ivfpq.ntotal)

### Run test query

In [None]:
%%time


# Test for a few movies
result_indexes = query_embeddings(index_ivfpq,k,test_movie_index)

print('----------')
print('Baseline indexes = ', baseline_result_indexes)
print('Query indexes = ', result_indexes)

## 8. HNSW with Flat Quantizer

https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexHNSWFlat.html

#### IndexHNSWFlat.hnsw

https://faiss.ai/cpp_api/file/HNSW_8h.html#_CPPv4N5faiss4HNSWE

https://github.com/facebookresearch/faiss/blob/main/benchs/bench_hnsw.py

In [None]:
# Number of neighbors (bidirectional links) per node in the graph
m = 32

index_hnsw_flat = faiss.IndexHNSWFlat(embeddings_dimension, m)

# Size of the candidate list explored while building the graph
efConstruction = 64

# Size of the candidate list explored during search
efSearch = 32

# efConstruction must be set before vectors are added
index_hnsw_flat.hnsw.efConstruction = efConstruction
index_hnsw_flat.hnsw.efSearch = efSearch

print("Is trained: ", index_hnsw_flat.is_trained)

index_hnsw_flat.add(movies_dataset_train_np)

### Run test query

In [None]:
%%time

# Test for a few movies
result_indexes = query_embeddings(index_hnsw_flat,k,test_movie_index)

print('----------')
print('Baseline indexes = ', baseline_result_indexes)
print('Query indexes = ', result_indexes)

## 9. HNSW with Scalar Quantizer

https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexHNSWSQ.html
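
The scalar quantizer is the reason this index needs a `train` call. A simplified sketch of 8-bit scalar quantization follows: learn a value range per dimension from the training data, then store each component as one byte. The per-dimension min/max scheme and the names `encode` and `decode` are assumptions made for illustration; the exact training procedure behind `QT_8bit` may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.standard_normal((1000, 8)).astype(np.float32)

# "Training": learn the value range of every dimension
vmin = train.min(axis=0)
vmax = train.max(axis=0)
scale = (vmax - vmin) / 255.0   # width of one quantization step per dimension

def encode(vectors):
    """Map each float component to one of 256 levels (1 byte)."""
    levels = np.round((vectors - vmin) / scale)
    return np.clip(levels, 0, 255).astype(np.uint8)

def decode(codes):
    """Approximate reconstruction of the original floats."""
    return codes.astype(np.float32) * scale + vmin

codes = encode(train)
recon = decode(codes)
max_error = np.abs(recon - train).max()

print('bytes before:', train.nbytes, 'after:', codes.nbytes)
print('max reconstruction error:', max_error)
```

Storage drops by a factor of four compared with float32, and the reconstruction error per component stays within half a quantization step.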

In [None]:
%%time

# M = 16 neighbors per node in the HNSW graph
# QT_8bit stores each vector component in 1 byte (1536 bytes per vector here)

index_hnsw_sq = faiss.IndexHNSWSQ(embeddings_dimension, faiss.ScalarQuantizer.QT_8bit, 16)

index_hnsw_sq.train(movies_dataset_train_np)
index_hnsw_sq.add(movies_dataset_train_np)

### Run test query

In [None]:
%%time

# Test for a few movies
result_indexes = query_embeddings(index_hnsw_sq,k,test_movie_index)

print('----------')
print('Baseline indexes = ', baseline_result_indexes)
print('Query indexes = ', result_indexes)