<a href="https://colab.research.google.com/github/datatom4891/GenAI-Experiements/blob/main/FAISS_Euclidean_Distance_Similarity_Vector_Index.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#!pip install sentence_transformers transformers faiss-cpu

In [2]:
import os
from google.colab import drive

In [3]:
import pandas as pd
import numpy as np
import faiss as fs
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

### Mount Google Drive to load news catcher dataset

In [4]:
drive.mount('/content/gdrive')
dataset_folder = os.path.join('/content/gdrive','MyDrive','Datasets')  # absolute path, independent of the working directory
news_catcher_csv = os.path.join(dataset_folder,'labelled_newscatcher_dataset.csv')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


## Read news catcher dataset into a dataframe

In [5]:
df = pd.read_csv(news_catcher_csv, sep=';')
df.head()

,topic,link,domain,published_date,title,lang
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could...,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-met...,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how ...,en


### Distribution of news article titles by topic

In [6]:
df['topic'].value_counts()

TECHNOLOGY       15000
HEALTH           15000
WORLD            15000
ENTERTAINMENT    15000
SPORTS           15000
BUSINESS         15000
NATION           15000
SCIENCE           3774
Name: topic, dtype: int64

## Create a subset of the full sample by topic: SPORTS

In [7]:
df_sports = df[df['topic']=='SPORTS'].copy().reset_index(drop=True)
df_sports['id'] = df_sports.index + 1
df_sports = df_sports.set_index('id')
df_sports.head()

id,topic,link,domain,published_date,title,lang
1,SPORTS,https://www.goal.com/en-za/news/bafana-bafana-...,goal.com,2020-08-16 18:24:00,Bafana Bafana international Tau scores on debu...,en
2,SPORTS,https://www.skysports.com/football/news/11787/...,skysports.com,2020-08-08 16:02:16,Leigh Griffiths 'reminded of his responsibilit...,en
3,SPORTS,https://www.rangersnews.uk/news/will-club-figu...,rangersnews.uk,2020-08-17 16:31:07,Will club figure put money where his mouth is ...,en
4,SPORTS,https://www.pensionplanpuppets.com/2020/8/10/2...,pensionplanpuppets.com,2020-08-10 17:00:00,Who has played their last game as a Maple Leaf?,en
5,SPORTS,https://www.teamtalk.com/news/super-agent-kia-...,teamtalk.com,2020-08-05 17:31:05,Super-agent determined to engineer moves to Ar...,en


In [8]:
id_list = df_sports.index.values.tolist()
vector_model_embedding = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
title_embeddings = vector_model_embedding.encode(df_sports['title'].tolist())

print("Tensor Dimensions: {} | Tensor Shape: {}".format(title_embeddings.ndim, title_embeddings.shape))

Tensor Dimensions: 2 | Tensor Shape: (15000, 384)


# **Create Vector Indexes using FAISS**

## **IndexFlat L2**

A flat index performs a brute-force (exhaustive) search: it compares the query vector with every document vector in the index, using Euclidean (L2) distance as the measure of closeness. The results are exact, but search time grows linearly with the number of vectors in the index. A flat index needs no training step; FAISS simply stores the vectors as they are.
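
To make the exhaustive comparison concrete, here is a minimal NumPy sketch of the same idea. It is not FAISS's implementation; the function name `flat_l2_search` and the toy vectors are made up for illustration. It reports squared L2 distances, which is also what FAISS returns for L2 indexes.

```python
import numpy as np

def flat_l2_search(index_vectors, query_vectors, top_k):
    """Exhaustive nearest-neighbour search using squared L2 distance."""
    # Pairwise differences via broadcasting: shape (n_queries, n_vectors, dim).
    diffs = query_vectors[:, None, :] - index_vectors[None, :, :]
    squared_distances = (diffs ** 2).sum(axis=2)
    # Sort each row ascending: a smaller distance means a closer match.
    order = np.argsort(squared_distances, axis=1)[:, :top_k]
    return np.take_along_axis(squared_distances, order, axis=1), order

# Toy data standing in for title_embeddings (4 vectors of dimension 2).
toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype="float32")
toy_query = np.array([[0.9, 0.1]], dtype="float32")

distances, positions = flat_l2_search(toy_vectors, toy_query, top_k=2)
print(positions)  # row positions of the two closest vectors: [[1 0]]
print(distances)  # their squared distances, roughly [[0.02 0.82]]
```

Every query touches every stored vector, which is why the cost scales with the size of the index.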

I am going to create a flat index with IDs, using the id column that I made the index of the df_sports dataframe. The ID acts like a foreign key that connects the results of queries against the vector index to rows in df_sports. This lets me add columns from the dataframe as additional information to supplement the results returned by the index.

**Exact Vector Index without IDs**

Create an instance of IndexFlatL2 by passing it the dimension (length) of the vectors you want to index. This creates an empty vector index to which you can add your vector embeddings. Without explicit IDs, each vector is identified by its sequential position, starting at 0.



```
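# The index dimension must match the length of each embedding vector
embedding_dimension = title_embeddings.shape[1]
vector_index = fs.IndexFlatL2(embedding_dimension)
vector_index.add(title_embeddings)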
```





**Exact Vector Index with IDs**

To create a Vector Index with IDs of your choosing:


1.   First create an empty instance of IndexFlatL2.
2.   Then wrap the empty IndexFlatL2 instance in an IndexIDMap, which creates an empty index that accommodates a custom ID for each vector.
3.   Add the text embeddings and their corresponding list of IDs to the vector index created in step 2, using add_with_ids.


```
empty_vector_index = fs.IndexFlatL2(title_embeddings.shape[1])
populated_vector_index = fs.IndexIDMap(empty_vector_index)
# FAISS stores IDs as 64-bit integers
populated_vector_index.add_with_ids(title_embeddings, np.array(id_list, dtype='int64'))
```
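
Conceptually, an ID map keeps an array of external IDs next to the stored vectors and translates the row positions found by the inner index into those IDs. The NumPy sketch below only illustrates that translation; the arrays `external_ids` and `positions_from_search` are hypothetical stand-ins, not objects produced by FAISS.

```python
import numpy as np

# Hypothetical external IDs, one per stored vector (1-based, like the id index of df_sports).
external_ids = np.array([1, 2, 3, 4], dtype="int64")

# Row positions as a plain flat index would return them (0-based).
positions_from_search = np.array([[2, 0]])

# Translate positions into external IDs with a simple array lookup.
ids_from_search = external_ids[positions_from_search]
print(ids_from_search)  # [[3 1]]
```

Because the returned values are the dataframe's own IDs, they can be passed straight to `df_sports.loc[...]` without any offset arithmetic.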





In [9]:
empty_vector_index = fs.IndexFlatL2(title_embeddings.shape[1])
populated_vector_index = fs.IndexIDMap(empty_vector_index)
populated_vector_index.add_with_ids(title_embeddings, id_list)

A query string has to be encoded before it can be searched against the vector index. Use the same model that transformed the indexed text into vector embeddings to encode the query string from natural language into a vector embedding. Otherwise the query and document vectors do not lie in the same vector space, and the distances between them are meaningless.
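
FAISS search expects a 2-D float32 array of shape (number of queries, embedding dimension). The helper below is a small sketch for checking a query before searching; `prepare_query` is a hypothetical name, and the zero vector stands in for a real encoded query.

```python
import numpy as np

def prepare_query(vector, expected_dim):
    """Return a C-contiguous float32 array of shape (n_queries, expected_dim)."""
    query = np.ascontiguousarray(vector, dtype="float32")
    if query.ndim == 1:
        query = query.reshape(1, -1)  # a single vector becomes a batch of one
    if query.ndim != 2 or query.shape[1] != expected_dim:
        raise ValueError(f"expected shape (n, {expected_dim}), got {query.shape}")
    return query

# Placeholder for an encoded query; 384 matches the embedding size printed earlier.
single = prepare_query(np.zeros(384), expected_dim=384)
print(single.shape, single.dtype)  # (1, 384) float32
```

Encoding a list with one string, as done in this notebook, already yields a 2-D array, so the check mainly guards against passing a bare string's 1-D embedding or an embedding from a different model.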

In [10]:
top_k = 5
query_string = ["manchester united player transactions"]
query_vector = vector_model_embedding.encode(query_string)

In [11]:
%%time

top_k_distance_scores, top_k_indices = populated_vector_index.search(query_vector, top_k)

print(top_k_distance_scores)
print(top_k_indices)

[[0.5696254  0.61809546 0.67908823 0.6896554  0.7441226 ]]
[[ 4703 12781  8195  8815   693]]
CPU times: user 7.06 ms, sys: 1 ms, total: 8.06 ms
Wall time: 5.85 ms


In [12]:
results = df_sports.loc[top_k_indices[0]]
# IndexFlatL2 returns squared L2 distances, so a lower score means a closer match
results['similarity_score'] = top_k_distance_scores[0]
results

id,topic,link,domain,published_date,title,lang,similarity_score
4703,SPORTS,https://www.mirror.co.uk/sport/football/news/p...,mirror.co.uk,2020-08-10 08:11:58,Premier League transfers: Completed deals at M...,en,0.569625
12781,SPORTS,https://www.skysports.com/football/news/11095/...,skysports.com,2020-08-06 07:22:43,Manchester United transfer news and rumours,en,0.618095
8195,SPORTS,https://www.manchestereveningnews.co.uk/sport/...,manchestereveningnews.co.uk,2020-08-15 19:45:00,Man United 'to sell five players this summer' ...,en,0.679088
8815,SPORTS,https://www.skysports.com/football/news/11679/...,skysports.com,2020-08-15 14:59:52,Manchester City transfer news and rumours,en,0.689655
693,SPORTS,https://www.dailymail.co.uk/sport/football/art...,dailymail.co.uk,2020-08-17 18:04:31,Manchester United 'will have to sell to buy fu...,en,0.744123
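
The scores in this table are distances, so smaller is better. If the embeddings have unit length, a squared L2 distance d converts to cosine similarity as 1 - d/2. The sketch below checks that identity on random unit vectors. Whether the embeddings here are unit length is an assumption worth verifying with `np.linalg.norm(title_embeddings, axis=1)`, and `squared_l2_to_cosine` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384)
b = rng.normal(size=384)
a /= np.linalg.norm(a)  # scale both vectors to unit length
b /= np.linalg.norm(b)

squared_l2 = np.sum((a - b) ** 2)
cosine = float(a @ b)
# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
assert np.isclose(squared_l2, 2.0 - 2.0 * cosine)

def squared_l2_to_cosine(squared_distance):
    """Valid only when both vectors have unit length."""
    return 1.0 - squared_distance / 2.0

print(round(squared_l2_to_cosine(0.5696254), 3))  # 0.715 for the top result's score
```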


In [13]:
for idx in results.index:
  similarity_score = round(results.loc[idx, 'similarity_score'], 2)
  article_title = results.loc[idx, 'title']

  print(article_title)
  print(similarity_score)
  print("-------------------------------------------------------")

Premier League transfers: Completed deals at Man Utd, Chelsea, Man City and more
0.57
-------------------------------------------------------
Manchester United transfer news and rumours
0.62
-------------------------------------------------------
Man United 'to sell five players this summer' and more rumours
0.68
-------------------------------------------------------
Manchester City transfer news and rumours
0.69
-------------------------------------------------------
Manchester United 'will have to sell to buy further players if they sign Jadon Sancho'
0.74
-------------------------------------------------------


**Approximate Vector Index with IDs**

In [14]:
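n_cells = 50  # number of partitions (clusters) the vectors are grouped into
# The quantizer is a flat index that assigns each vector to its nearest cell centroid
quantizing_index = fs.IndexFlatL2(title_embeddings.shape[1])
empty_partioned_index = fs.IndexIVFFlat(quantizing_index, title_embeddings.shape[1], n_cells)

# Wrap the index so it accepts custom IDs, then learn the cell centroids before adding vectors
populated_partioned_index = fs.IndexIDMap(empty_partioned_index)
populated_partioned_index.train(title_embeddings)
populated_partioned_index.add_with_ids(title_embeddings, id_list)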

In [15]:
%%time

top_k_distance_scores2, top_k_indices2 = populated_partioned_index.search(query_vector, top_k)

print(top_k_distance_scores2)
print(top_k_indices2)

[[0.5696254  0.61809546 0.67908823 0.6896554  0.7441226 ]]
[[ 4703 12781  8195  8815   693]]
CPU times: user 6.16 ms, sys: 0 ns, total: 6.16 ms
Wall time: 5.38 ms
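
For this query the partitioned index returns the same five results as the exact index, but that is not guaranteed in general: an inverted-file index scans only the cells closest to the query (one cell by default in FAISS, controlled by the `nprobe` attribute of the IVF index). The NumPy sketch below imitates the idea with plain k-means on random data. It is a simplified illustration rather than FAISS's code, and all function names in it are hypothetical.

```python
import numpy as np

def nearest_centroids(vectors, centroids, n):
    """Indices of the n closest centroids for each vector (squared L2)."""
    d = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.argsort(d, axis=1)[:, :n]

def train_centroids(vectors, n_cells, n_iterations=10, seed=0):
    """Plain k-means (Lloyd's algorithm) to pick one centroid per cell."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=n_cells, replace=False)].copy()
    for _ in range(n_iterations):
        assignments = nearest_centroids(vectors, centroids, 1)[:, 0]
        for cell in range(n_cells):
            members = vectors[assignments == cell]
            if len(members) > 0:  # keep the old centroid if a cell is empty
                centroids[cell] = members.mean(axis=0)
    return centroids

def ivf_search(query, vectors, assignments, centroids, top_k, n_probe):
    """Scan only the vectors stored in the n_probe cells nearest to the query."""
    probed = nearest_centroids(query[None, :], centroids, n_probe)[0]
    candidates = np.flatnonzero(np.isin(assignments, probed))
    d = ((vectors[candidates] - query) ** 2).sum(axis=1)
    order = np.argsort(d)[:top_k]
    return d[order], candidates[order]

rng = np.random.default_rng(42)
vectors = rng.normal(size=(500, 8)).astype("float32")  # random stand-in for embeddings
centroids = train_centroids(vectors, n_cells=10)
assignments = nearest_centroids(vectors, centroids, 1)[:, 0]
query = vectors[0] + 0.01  # a query very close to stored vector 0

# Probing every cell is equivalent to the exhaustive search.
exact_d, exact_ids = ivf_search(query, vectors, assignments, centroids, top_k=5, n_probe=10)
# Probing one cell is faster but may miss neighbours that live in other cells.
approx_d, approx_ids = ivf_search(query, vectors, assignments, centroids, top_k=5, n_probe=1)
print(exact_ids, approx_ids)
```

Raising the number of probed cells trades speed for recall. In this notebook the IVF index is wrapped in an IndexIDMap, so the setting would go on the inner index, for example `empty_partioned_index.nprobe = 5`.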


In [16]:
results2 = df_sports.loc[top_k_indices2[0]]
results2['similarity_score'] = top_k_distance_scores2[0]

for idx in results2.index:
  similarity_score = round(results2.loc[idx, 'similarity_score'], 2)
  article_title = results2.loc[idx, 'title']

  print(article_title)
  print(similarity_score)
  print("-------------------------------------------------------")

Premier League transfers: Completed deals at Man Utd, Chelsea, Man City and more
0.57
-------------------------------------------------------
Manchester United transfer news and rumours
0.62
-------------------------------------------------------
Man United 'to sell five players this summer' and more rumours
0.68
-------------------------------------------------------
Manchester City transfer news and rumours
0.69
-------------------------------------------------------
Manchester United 'will have to sell to buy further players if they sign Jadon Sancho'
0.74
-------------------------------------------------------


In [17]:
results2

id,topic,link,domain,published_date,title,lang,similarity_score
4703,SPORTS,https://www.mirror.co.uk/sport/football/news/p...,mirror.co.uk,2020-08-10 08:11:58,Premier League transfers: Completed deals at M...,en,0.569625
12781,SPORTS,https://www.skysports.com/football/news/11095/...,skysports.com,2020-08-06 07:22:43,Manchester United transfer news and rumours,en,0.618095
8195,SPORTS,https://www.manchestereveningnews.co.uk/sport/...,manchestereveningnews.co.uk,2020-08-15 19:45:00,Man United 'to sell five players this summer' ...,en,0.679088
8815,SPORTS,https://www.skysports.com/football/news/11679/...,skysports.com,2020-08-15 14:59:52,Manchester City transfer news and rumours,en,0.689655
693,SPORTS,https://www.dailymail.co.uk/sport/football/art...,dailymail.co.uk,2020-08-17 18:04:31,Manchester United 'will have to sell to buy fu...,en,0.744123
