# <p style="text-align: center;">  Semantic Search - Proof of Concept </p>
#### <p style="text-align: right;"> Andrei-Cristian Rad, DataScotus </p>

This notebook presents a proof of concept for a semantic search engine. 

The notebook lets you test the search feature by querying a document database: given an input query, it returns the documents most similar to it.

# Setup

### Install Dependencies 

Install the Python dependencies required by this notebook.

In [1]:
!pip install --disable-pip-version-check -q pandas
!pip install --disable-pip-version-check -q numpy
!pip install --disable-pip-version-check -q nltk
!pip install --disable-pip-version-check -q sentence-transformers

### Load Data

Load the CSV data from the source file into a Pandas dataframe.

In [2]:
import pandas as pd

data_path = '../data/articles.csv'
df = pd.read_csv(data_path)
df.head(5)

Unnamed: 0,id,date,title,full_body
0,11,2020-12-05 20:28:00,PS5 GPU : The PS5 Pro could potentially have a...,It's only been 2 weeks since the worldwide lau...
1,12,2020-12-04 14:48:00,AMD Smart Access Memory Enabled on Intel Z490 ...,Several motherboard makers have started adding...
2,13,2020-12-02 16:00:00,Snapdragon 888 fully unveiled: the first with ...,"After yesterday's preview, Qualcomm fully unve..."
3,14,2020-11-20 22:16:00,AMD Will Bring Smart Access Memory Support to ...,"When AMD announced its Smart Access Memory, it..."
4,15,2020-11-25 19:32:00,The best mini gaming PC: the top small PC buil...,PCs don't have to be enormous machines and bui...


The data provided contains 4 columns, described by the following properties:

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1753 entries, 0 to 1752
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         1753 non-null   int64 
 1   date       1753 non-null   object
 2   title      1752 non-null   object
 3   full_body  1753 non-null   object
dtypes: int64(1), object(3)
memory usage: 54.9+ KB


For text similarity, the only column of interest is the body of the text, so all the other columns can be dropped.

In [4]:
# Rename the full_body column to document for more clarity
df.rename(columns={'full_body': 'document'}, inplace=True)

# Drop the columns that will not be used for similarity
df.drop(columns=['id', 'date', 'title'], inplace=True)

In [5]:
df.head(5)

Unnamed: 0,document
0,It's only been 2 weeks since the worldwide lau...
1,Several motherboard makers have started adding...
2,"After yesterday's preview, Qualcomm fully unve..."
3,"When AMD announced its Smart Access Memory, it..."
4,PCs don't have to be enormous machines and bui...


In [6]:
df.shape

(1753, 1)

## Preprocess Data

For the given text data, the preprocessing consists of the following steps:

* **Duplicates removal** - When displaying results, we do not want to get the same document more than once in the top results.
* **Stop-word and punctuation removal** - Taking out very common words (e.g. *the*, *and*) and punctuation has little effect on the meaning of the text but speeds up processing.

### Drop duplicates

In [7]:
df.drop_duplicates(keep='first', inplace=True)
df.shape

(1641, 1)

112 duplicate documents removed.

In [8]:
# The model used later does not require case normalization, but lowercasing improves results for some models.
# df['document'] = df['document'].apply(lambda doc : doc.lower()) 
df.head(5)

Unnamed: 0,document
0,It's only been 2 weeks since the worldwide lau...
1,Several motherboard makers have started adding...
2,"After yesterday's preview, Qualcomm fully unve..."
3,"When AMD announced its Smart Access Memory, it..."
4,PCs don't have to be enormous machines and bui...
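
The definition of `preprocess`, which the next cell applies, is not shown in this notebook. As a rough stand-in, the sketch below implements the stop-word and punctuation removal described in the preprocessing steps. The small `STOP_WORDS` set is a hypothetical placeholder, not the list actually used; since the notebook installs `nltk`, its English stop-word corpus would be a natural substitute. Case-sensitive matching is also an assumption, made because the sample output keeps capitalised words such as "Its".

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
# The nltk English stop-word corpus would be a natural replacement.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is",
              "it", "its", "only", "been", "have", "has"}

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(doc: str) -> str:
    """Remove punctuation, then drop tokens that are stop words."""
    no_punctuation = doc.translate(_PUNCTUATION_TABLE)
    # Case-sensitive comparison (an assumption): "Its" is kept, "its" is dropped.
    tokens = [tok for tok in no_punctuation.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


print(preprocess("It's only been 2 weeks since the worldwide launch of the PS5."))
# Its 2 weeks since worldwide launch PS5
```

Splitting on whitespace after stripping punctuation keeps the sketch dependency-free; a tokenizer-based approach would handle contractions and hyphenated words differently.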


In [11]:
clean_df = df.copy()

clean_df['document'] = clean_df['document'].apply(lambda doc: preprocess(doc))
clean_df.head(5)

Unnamed: 0,document
0,Its 2 weeks since worldwide launch PS5 However...
1,Several motherboard makers started adding supp...
2,After yesterdays preview Qualcomm fully unveil...
3,When AMD announced Smart Access Memory sounded...
4,PCs enormous machines building compact PC easi...


## Save Preprocessed Data

In [12]:
# index=False avoids writing the row index as an extra unnamed column
clean_df.to_csv('../data/articles_clean.csv', sep=',', index=False)

# Model Selection

The model used for this demonstration is part of the **sentence transformers** family of models. These models encode the meaning of a whole sentence or document in a fixed-size dense vector of floats (called the document embedding), which in turn can be compared to other such embeddings to calculate a similarity score.

## Motivation 

Transformer/BERT-based embedding methods have the advantage of adaptability, in the sense that the embedding of a word is influenced by the surrounding words not only during pre-training (as for word2vec/doc2vec models), but also at inference time.

The advantage of these models over simpler ones such as TF-IDF is that they work with the semantics of the words, and can find similarities between documents even if they do not contain *exactly* the terms in the query. This is especially important in domain-specific cases, where many different terms refer to the same concept or abstraction.

On the other hand, these models are usually large and slow to run. To mitigate that, this demonstration uses a **distilroberta-base** model (loaded below as `stsb-distilroberta-base-v2`). Compared to the best-performing model provided by this API, its performance is 2% lower, but it needs 30% less processing time.

A description of all the available pretrained models can be found [here](https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0).
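
To make the exact-term limitation of lexical methods concrete, here is a small illustrative sketch. It is not part of the original pipeline: `bag_of_words_cosine` is a hypothetical helper that compares raw term counts, a simplified stand-in for TF-IDF. A query and a document that share no terms score zero, even though they are about the same thing.

```python
import math
from collections import Counter


def bag_of_words_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between the raw term-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(counts_a[term] * counts_b[term] for term in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = "graphics card"
document = "AMD announced a new GPU"
print(bag_of_words_cosine(query, document))  # 0.0, no shared terms
```

An embedding model is expected to place "graphics card" and "GPU" close together, which is the behaviour this notebook relies on.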

In [13]:
clean_df = pd.read_csv('../data/articles_clean.csv', sep=',')

Import the SentenceTransformer API, create the SentenceTransformer object and embed the documents.

In [14]:
from sentence_transformers import SentenceTransformer

embedding_net = SentenceTransformer('stsb-distilroberta-base-v2')
# Pass a plain list of strings rather than a pandas Series
document_embeddings = embedding_net.encode(clean_df['document'].tolist())

In [15]:
document_embeddings.shape

(1641, 768)

Each of the 1641 documents has now been transformed into a 768-dimensional vector which encapsulates its meaning.

### Similarity Metric

Considering we are working with dense embedding vectors, the most appropriate metric is the cosine similarity. 

To see which of the documents our query is most similar to, we apply a vector-based similarity metric which calculates the "angle" between the embedding vectors.

The smaller the angle between two embedding vectors, the more similar the corresponding documents are.
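
For reference, this is the standard definition of cosine similarity between two vectors $v_1$ and $v_2$ separated by an angle $\theta$:

$$
\cos\theta = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}
$$

As a quick worked example, for $v_1 = (1, 0)$ and $v_2 = (1, 1)$:

$$
\cos\theta = \frac{1 \cdot 1 + 0 \cdot 1}{1 \cdot \sqrt{2}} = \frac{1}{\sqrt{2}} \approx 0.707
$$

which corresponds to an angle of 45°. Vectors pointing in the same direction score 1, orthogonal vectors score 0, and opposite vectors score -1.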

In [16]:
import numpy as np

from typing import Union

def cosine_similarity(v1: Union[np.ndarray, list, float], 
                      v2: Union[np.ndarray, list, float]) -> float:

    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))


def get_similar_documents(search_query: str, 
                          n_documents: int = 5) -> pd.Series:
    
    # Encode the query using the same network, in order to obtain an embedding.
    query_embedding = embedding_net.encode(search_query)

    # Compute the cosine similarities between the query embedding and the embedding of every document in the database.
    cosine_similarities = [cosine_similarity(query_embedding, document_embedding) for document_embedding in document_embeddings]

    # Index each embedding (to retrieve based on the index after sorting)
    indexed_similarities = [(k, v) for k, v in enumerate(cosine_similarities)]

    # Sort the tuples by the similarity value, but in reverse order (most similar first)
    sorted_indexed_similarities = sorted(indexed_similarities, key=lambda x: x[1], reverse=True)
    
    # Take the indices of the first n_documents most similar documents and return the corresponding rows in the dataframe
    indices = list(map(lambda x: x[0], sorted_indexed_similarities))[:n_documents]
    
    # Return the original documents, not the processed ones
    return df.iloc[indices]
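
The per-document Python loop above is easy to follow, but the same ranking can be computed with a single matrix-vector product. The sketch below is one possible vectorized variant, not the notebook's own implementation; `top_k_by_cosine` is a hypothetical name, and it takes the embeddings as arguments instead of reading the notebook's global variables.

```python
import numpy as np


def top_k_by_cosine(query_embedding: np.ndarray,
                    document_embeddings: np.ndarray,
                    k: int = 5) -> np.ndarray:
    """Return the row indices of the k embeddings most similar to the query."""
    # Euclidean norm of every document row and of the query.
    doc_norms = np.linalg.norm(document_embeddings, axis=1)
    query_norm = np.linalg.norm(query_embedding)
    # One matrix-vector product yields every dot product at once.
    similarities = (document_embeddings @ query_embedding) / (doc_norms * query_norm)
    # argsort sorts ascending, so negate to get the most similar first.
    return np.argsort(-similarities)[:k]


# Toy 2-dimensional "embeddings" to show the ordering.
toy_documents = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([1.0, 0.0])
print(top_k_by_cosine(toy_query, toy_documents, k=2))  # [0 2]
```

In this notebook it could be used as `df.iloc[top_k_by_cosine(embedding_net.encode(search_query), document_embeddings)]`, assuming the rows of `df` and `document_embeddings` are in the same order.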

## Proof-of-Concept Demo

To perform a search, replace the text after *search_query =* and run the cell.

The first cell returns the 5 most similar documents as a dataframe, while the second cell prints the most similar document for verification.

Although the search results may not contain the exact words in the query, they contain semantically similar terms, which brings their embedding vectors closer to that of the query.

In [17]:
search_query = "AMD Big Navi"

most_similar_documents = get_similar_documents(search_query)

most_similar_documents

Unnamed: 0,document
8,AMD has announced the long-awaited Big Navi gr...
78,"AMD's RDNA2 GPUs land this morning, in the for..."
1240,Fresh rumors are spreading about AMD's next-ge...
79,"GIGABYTE TECHNOLOGY Co. Ltd, a leading manufac..."
9,AMD has announced its much anticipated Big Nav...


In [23]:
most_similar_documents.iloc[0].document

'AMD has announced the long-awaited Big Navi graphics card. The Radeon RX6800 costs $ 579 and the top-end 6900 XT costs $ 999, about 50% lower than the similarly performing GeForce RTX 3090. As with competing RTX 30 series components, there are a few things you need to know about new GPUs before deciding which GPU to buy.  Big Navi improves performance at 1440p compared to 4K  The Radeon RX 6000 graphics card supports 4K parts, but typically works more consistently at 1440p compared to its rival GeForce RTX 30 series products. It\'s primarily because NVIDIA\'s GPU has a wider bus (320 bits for the RTX 3080, 384 bits for the 3090) and GDDR6X memory, both of which offer significantly more bandwidth than the Navi 2x part. Because. Bandwidth is more important at high resolutions because large texture assets are always buffered with the VRAM buffer.  The Radeon RX 6800 XT is a sweet spot and could be the next bestseller  All three GPUs are decent products, but in terms of price / performanc

One possible reason for which document no. 9