In [None]:
!pip install faiss-cpu
!pip install sentence-transformers

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', 100)

In [2]:
df = pd.read_csv("data.csv")
df.shape

(8, 2)

In [3]:
df

Unnamed: 0,text,category
0,Meditation and yoga can improve mental health,Health
1,"Fruits, whole grains and vegetables helps control blood pressure",Health
2,These are the latest fashion trends for this week,Fashion
3,Vibrant color jeans for male are becoming a trend,Fashion
4,The concert starts at 7 PM tonight,Event
5,Navaratri dandiya program at Expo center in Mumbai this october,Event
6,Exciting vacation destinations for your next trip,Travel
7,Maldives and Srilanka are gaining popularity in terms of low budget vacation places,Travel


In [4]:
from sentence_transformers import SentenceTransformer



<b>Encoding Text to Vectors</b>

In [5]:
encoder = SentenceTransformer("all-mpnet-base-v2")
vectors = encoder.encode(df.text)

<p><code>vectors</code> now holds 8 embeddings, one per row of <code>df</code>, each with 768 dimensions.</p>

In [6]:
vectors.shape

(8, 768)

<b>Building FAISS Index for vectors</b>

In [10]:
dim = vectors.shape[1]
dim

768

In [21]:
import faiss

"""
IndexFlatL2 is an exact (brute-force) index: it stores the vectors as-is
and compares the query against every one of them.
Similarity is measured with the L2 (Euclidean) distance between the query and each stored vector.
FAISS reports the squared distance, so values are non-negative and smaller means more similar.
"""
index = faiss.IndexFlatL2(dim)  # create an empty index for vectors of size dim (768)
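To make concrete what a flat L2 index computes, here is a minimal NumPy sketch of an exhaustive search that ranks stored vectors by squared Euclidean distance to the query. The function name `flat_l2_search` and the toy data are hypothetical and only for illustration; this is not FAISS's implementation.

```python
import numpy as np

def flat_l2_search(stored, queries, k):
    """Brute-force k-nearest-neighbour search by squared L2 distance (illustrative only)."""
    # Pairwise differences: shape (n_queries, n_stored, dim)
    diff = queries[:, None, :] - stored[None, :, :]
    # Squared Euclidean distance for every (query, stored) pair
    sq_dist = np.sum(diff ** 2, axis=2)
    # Positions of the k smallest distances per query, nearest first
    idx = np.argsort(sq_dist, axis=1)[:, :k]
    return np.take_along_axis(sq_dist, idx, axis=1), idx

# Toy 2-dimensional data (hypothetical, not the notebook's embeddings)
stored = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]], dtype=np.float32)
queries = np.array([[0.9, 0.1]], dtype=np.float32)

toy_distances, toy_ids = flat_l2_search(stored, queries, k=2)
print(toy_ids)        # positions of the two nearest stored vectors
print(toy_distances)  # their squared L2 distances
```

The return values mirror the `(distances, ids)` pair that `index.search` produces: one row per query, nearest match first.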

<b>Add the source vectors to the index</b>

In [25]:
index.add(vectors)  # add our 8 vectors; a flat index stores them unchanged and scans all of them at search time
index

<faiss.swigfaiss.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x0000020F7ADA6280> >

<b>Encode the search text using the same encoder</b>

In [28]:
search_query = "I want to be healthy"  # our search text
vec = encoder.encode(search_query) #encoding search query into vector
vec.shape

(768,)

In [29]:
import numpy as np
svec = np.array(vec).reshape(1, -1)  # search expects a 2D array of shape (n_queries, dim)
svec.shape

(1, 768)
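FAISS's Python `search` takes a 2-D `float32` array with one row per query. The sketch below shows one defensive way to get a 1-D embedding into that shape. The helper name `as_query_batch` and the toy vector are hypothetical. The dtype conversion is only a guard; it does not imply that the encoder's output needs it.

```python
import numpy as np

def as_query_batch(vec):
    """Return vec as a C-contiguous float32 array of shape (1, dim). Hypothetical helper."""
    arr = np.ascontiguousarray(vec, dtype=np.float32)
    return arr.reshape(1, -1)

toy_vec = np.arange(4, dtype=np.float64)  # stand-in for a 1-D embedding
batch = as_query_batch(toy_vec)
print(batch.shape, batch.dtype)
```

Several queries can be searched in one call by stacking them as rows of the same array.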

<b>Searching for similar vector in the FAISS index created</b>

In [32]:
distances, I = index.search(svec, k=2) #we want 2 similar vectors, hence k=2.

In [34]:
distances

array([[1.3456718, 1.4885883]], dtype=float32)
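These values are squared Euclidean distances, so the smaller number is the closer match. If the embeddings are unit-length (an assumption here, since the notebook does not check it), squared L2 distance and cosine similarity are tied by d² = 2 − 2·cos, which means both give the same ranking. A minimal sketch with toy vectors, where `l2_normalize` is a hypothetical helper:

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit Euclidean length (illustrative helper)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / norms

# Toy vectors, not the notebook's embeddings
a = l2_normalize(np.array([[3.0, 4.0]]))
b = l2_normalize(np.array([[1.0, 0.0]]))

sq_l2 = float(np.sum((a - b) ** 2))  # squared Euclidean distance
cosine = float(np.sum(a * b))        # cosine similarity of unit vectors
print(sq_l2, 2.0 - 2.0 * cosine)     # the two values agree
```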

In [39]:
I  # I holds the positions of the nearest sentences in the index, nearest first: here rows 1 and 0.

array([[1, 0]], dtype=int64)

In [40]:
df.loc[I[0]]  # look up the sentences at the returned positions (rows 1 and 0)

Unnamed: 0,text,category
1,"Fruits, whole grains and vegetables helps control blood pressure",Health
0,Meditation and yoga can improve mental health,Health
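Rather than typing the row numbers by hand, the result arrays can be used directly. The sketch below rebuilds the two matching rows and reuses the `distances` and `I` values printed above. `toy_df` and `results` are hypothetical names. It selects with `iloc` because the ids returned by a flat index are positions in insertion order, which match the DataFrame labels only while the index is the default 0..n-1 range.

```python
import numpy as np
import pandas as pd

# Two rows copied from the notebook's DataFrame
toy_df = pd.DataFrame({
    "text": [
        "Meditation and yoga can improve mental health",
        "Fruits, whole grains and vegetables helps control blood pressure",
    ],
    "category": ["Health", "Health"],
})

# Values copied from the search output shown above
distances = np.array([[1.3456718, 1.4885883]], dtype=np.float32)
I = np.array([[1, 0]], dtype=np.int64)

# I[0] holds the ids for the first (and only) query, nearest first
results = toy_df.iloc[I[0]].copy()
results["distance"] = distances[0]  # attach each row's distance to the query
print(results)
```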


Note that the top result shares no keywords with the search query, so this is not a keyword search.
<br>Instead, it is a semantic search: the sentences and the query are converted to vectors, and FAISS (Facebook AI Similarity Search) returns the stored vectors closest to the query vector, measured here by L2 distance.

In [38]:
search_query

'I want to be healthy'