Please use the local ./Dockerfile environment.

In [None]:
# Install faiss-cpu when a new environment is provisioned
#!pip install faiss-cpu

In [None]:
# Install openai when a new environment is provisioned
#!pip install openai

In [1]:
import faiss
import openai
import pandas as pd
import numpy as np

In [2]:
pd.__version__

'1.5.3'

### Load the dataframe pickle

This dataframe was created from the amazon_reviews_us_Apparel_v1_00.tsv.gz dataset with the following filtering criteria:
1. Year 2015
2. Products with 10 - 14 reviews
3. review_length > 10

The embeddings were created with OpenAI, model="text-embedding-ada-002".

In [4]:
df_apparel = pd.read_pickle('../resources/data/apparel_10to14_embedding.pkl')

In [5]:
df_apparel.head(3)

Unnamed: 0,product_id,product_title,product_category,star_rating,review_id,review_headline,review_body,review_length,review_count,text,embedding
0,B00001QHXX,Richard Nixon Mask,Apparel,3.0,R2HBUQ97RV5JVR,Only get this if you want to freak people out....,I got this mask for a company party and it fre...,43,10,Richard Nixon Mask. I got this mask for a comp...,"[-0.03131880611181259, 0.0008081794367171824, ..."
1,B00001QHXX,Richard Nixon Mask,Apparel,4.0,RHCH92YNAS282,Four Stars,"Nice mask, will have to do some fitting work. ...",18,10,"Richard Nixon Mask. Nice mask, will have to do...","[-0.029398588463664055, 0.00031407203641720116..."
2,B00001QHXX,Richard Nixon Mask,Apparel,5.0,R1OHYB07D0WE35,Bargain,"Even though its a bit large, I can't help but ...",28,10,Richard Nixon Mask. Even though its a bit larg...,"[-0.022034794092178345, -0.008237365633249283,..."


### Create the FAISS index
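We'll create an IndexFlatIP index. This is an exact (brute-force) inner-product index; because the embeddings are L2-normalized before they are added, the inner product equals cosine similarity.  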

References:  
https://www.pinecone.io/learn/faiss-tutorial/  
https://github.com/facebookresearch/faiss/wiki/Getting-started  
https://ai.plainenglish.io/speeding-up-similarity-search-in-recommender-systems-using-faiss-basics-part-i-ec1b5e92c92d
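
The cells below build an IndexFlatIP and L2-normalize the embeddings before adding them. Why that combination behaves as a cosine-similarity search can be checked with a small NumPy sketch. The random vectors and the helper `l2_normalize` are illustrative assumptions, not part of the notebook's data or of faiss.

```python
import numpy as np

# Toy data standing in for the review embeddings (assumption: random values).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 8)).astype("float32")
query = rng.normal(size=8).astype("float32")


def l2_normalize(x):
    """Scale each vector (along the last axis) to unit L2 length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Cosine similarity computed directly from its definition.
cosine = (vectors @ query) / (
    np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
)

# Plain inner product of the normalized vectors.
inner = l2_normalize(vectors) @ l2_normalize(query)

# The two agree up to float32 rounding.
assert np.allclose(cosine, inner, atol=1e-5)
```

With an inner-product index, higher scores mean closer matches, which is the opposite ordering from the distances returned by an L2 index.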


In [6]:
# initialize the IndexFlatIP with the embedding dimension 
index = faiss.IndexFlatIP(len(df_apparel['embedding'][0]))
index.is_trained

True

In [7]:
# Create the embeddings array
# faiss requires this array to be float32
embeddings = np.array(df_apparel['embedding'].to_list(), dtype='float32')
embeddings.shape

(88918, 1536)

In [8]:
embeddings[0]

array([-0.03131881,  0.00080818,  0.01593621, ..., -0.00513083,
       -0.00486061, -0.01712253], dtype=float32)

In [9]:
# Normalize the embeddings in place and add them to the index
faiss.normalize_L2(embeddings)
index.add(embeddings)
index.ntotal

88918

In [10]:
embeddings[0]

array([-0.0313188 ,  0.00080818,  0.01593621, ..., -0.00513083,
       -0.00486061, -0.01712253], dtype=float32)
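
The first row looks almost identical before and after normalization, which suggests the stored embeddings were already close to unit length. A quick way to check this is to look at the row norms. The sketch below uses synthetic unit-length vectors as a stand-in (an assumption); on the real data the same norm check would be applied to the embeddings array before normalizing it.

```python
import numpy as np

# Synthetic stand-in for embeddings that are already unit length (assumption).
rng = np.random.default_rng(0)
sample = rng.normal(size=(4, 16)).astype("float32")
sample /= np.linalg.norm(sample, axis=1, keepdims=True)

# Row norms before any further normalization.
norms_before = np.linalg.norm(sample, axis=1)

# Normalizing again changes the values only at float32 rounding level.
renormalized = sample / norms_before[:, None]
max_change = float(np.abs(renormalized - sample).max())

print(norms_before)
print(max_change)
```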

In [11]:
# Save the index 
faiss.write_index(index, '../resources/binary/apparel_10to14_review_cosine.faissindex')