## Step 3: Create FAISS indexes

In this notebook, we use FAISS to build the indexes for our search engine.  
The notebook produces two outputs:
- *apparel_15to25_review_cosine.faissindex*, the index for the development dataset
- *apparel_10to14_review_cosine.faissindex*, the index for the evaluation dataset

In [1]:
# install faiss-cpu when a new environment is provisioned
#!pip install faiss-cpu
#conda install -c pytorch faiss-cpu=1.7.4 mkl=2021

In [2]:
import faiss
import pandas as pd
import numpy as np

## Load the dataframe pickle files 


In [3]:
df_development = pd.read_pickle('../resources/data/apparel_15to25_embedding.pkl')
df_evaluation = pd.read_pickle('../resources/data/apparel_10to14_embedding.pkl')

## Create the FAISS index
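We'll create an IndexFlatIP index. IP stands for "inner product". IndexFlatIP stores the vectors as-is and runs an exact (brute-force) search that ranks results by inner product. When the vectors are L2-normalized, the inner product equals the cosine similarity, which is why the embeddings are normalized before they are added to the index.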

References:  
https://www.pinecone.io/learn/faiss-tutorial/  
https://github.com/facebookresearch/faiss/wiki/Getting-started  
https://ai.plainenglish.io/speeding-up-similarity-search-in-recommender-systems-using-faiss-basics-part-i-ec1b5e92c92d
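
The claim that the inner product of normalized vectors equals cosine similarity can be checked with a minimal NumPy sketch. The two random vectors below are stand-ins for real embeddings (an assumption made for illustration); only the dimension 1536 matches the data used in this notebook.

```python
import numpy as np

# Two random vectors standing in for 1536-dimensional review embeddings
# (illustrative data, not taken from the dataset).
rng = np.random.default_rng(0)
a = rng.normal(size=1536).astype("float32")
b = rng.normal(size=1536).astype("float32")

# Cosine similarity computed directly from its definition.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# L2-normalize each vector, then take the plain inner product.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner_product = np.dot(a_unit, b_unit)

# The two values agree up to float32 rounding error.
print(cosine, inner_product)
```

Because the two values match, an inner-product index fed with normalized vectors ranks results by cosine similarity.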


### Create the index for the development dataset

In [4]:
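# initialize the IndexFlatIP with the embedding dimension
# .iloc[0] selects the first row by position, so it works even if the index labels do not start at 0
index_dev = faiss.IndexFlatIP(len(df_development['embedding'].iloc[0]))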
index_dev.is_trained

True

In [5]:
# create the embeddings array
# faiss requires this array to be float32
embeddings_dev = np.array(df_development['embedding'].to_list(), dtype='float32')
embeddings_dev.shape

(89103, 1536)

In [6]:
# Normalize the embeddings (normalize_L2 works in place) and add them to the index
faiss.normalize_L2(embeddings_dev)
index_dev.add(embeddings_dev)
index_dev.ntotal

89103

In [7]:
# Save the index
# uncomment the line below to save the index
#faiss.write_index(index_dev, '../resources/binary/apparel_15to25_review_cosine.faissindex')

### Create the index for the evaluation dataset

In [8]:
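# initialize the IndexFlatIP with the embedding dimension
# .iloc[0] selects the first row by position, so it works even if the index labels do not start at 0
index_eva = faiss.IndexFlatIP(len(df_evaluation['embedding'].iloc[0]))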
index_eva.is_trained

True

In [9]:
# create the embeddings array
# faiss requires this array to be float32
embeddings_eva = np.array(df_evaluation['embedding'].to_list(), dtype='float32')
embeddings_eva.shape

(88918, 1536)

In [10]:
# Normalize the embeddings (normalize_L2 works in place) and add them to the index
faiss.normalize_L2(embeddings_eva)
index_eva.add(embeddings_eva)
index_eva.ntotal

88918

In [11]:
# Save the index
# uncomment the line below to save the index
#faiss.write_index(index_eva, '../resources/binary/apparel_10to14_review_cosine.faissindex')