# Semantic Search Index
Source: most of the code is copied or adapted from The Fuzzy Scientist's [LLMs Mastery: Complete Guide to Transformers & Generative AI](https://udemy.com/course/llms-mastery-complete-guide-to-transformers-generative-ai) course on Udemy. The course covers far more than what is presented here; follow it for the full context and a complete explanation of the concepts.

This notebook builds a simple search engine that retrieves article summaries from a dataset using natural-language queries.

## Import Dataset

In [15]:
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util
import torch

In [2]:
dataset = load_dataset('multi_news', split='test')

In [3]:
df = dataset.to_pandas().sample(2000, random_state=42)
df.head()

| index | document | summary |
|---|---|---|
| 4830 | Tweet with a location \n \n You can add locati... | – Denis Finley has taken to Twitter to call Po... |
| 1255 | CNN host Piers Morgan just called to discuss h... | – CNN's Piers Morgan thinks gun-rights propone... |
| 80 | White House communications director Anthony Sc... | – New White House communications director Anth... |
| 3044 | CLOSE Scientists say they've found archaeologi... | – Scientists say they have the first physical ... |
| 4486 | Click image above to view graphic \n \n Althou... | – Scientists are calling it a breakthrough and... |


## Model preparation
We use a sentence-transformer model, which is trained to produce a single vector for a whole sentence (or short passage) rather than one vector per word.
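
To make "a single vector per sentence" concrete, here is a minimal NumPy sketch of one common way to collapse word-level vectors into one sentence vector: a masked average followed by L2 normalization. It assumes the model pools its token vectors this way, which is my understanding of how `all-MiniLM-L6-v2` is configured but is not stated in this notebook. The function name `mean_pool` and the toy numbers are made up for illustration; the real model works with 384-dimensional vectors.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Collapse per-token vectors (seq_len, dim) into one unit-length vector (dim,)."""
    # Turn the 0/1 mask into a column so it broadcasts over the embedding dimension
    mask = attention_mask[:, None].astype(token_embeddings.dtype)
    # Sum only the real tokens; padding rows are multiplied by zero
    summed = (token_embeddings * mask).sum(axis=0)
    # Divide by the number of real tokens (guard against an all-padding input)
    count = max(mask.sum(), 1.0)
    sentence_vector = summed / count
    # L2-normalize so that dot products between vectors equal cosine similarities
    return sentence_vector / np.linalg.norm(sentence_vector)


# Toy input (assumption): 4 token positions, 3-dimensional embeddings, last one is padding
token_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [9.0, 9.0, 9.0],  # padding position, ignored thanks to the mask
])
attention_mask = np.array([1, 1, 1, 0])

sentence_vector = mean_pool(token_embeddings, attention_mask)
print(sentence_vector.shape)  # (3,)
```

Whatever the number of tokens, the output always has the same size, which is what makes summaries and queries of different lengths directly comparable.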

In [6]:
# Instantiate the model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode the summaries from the dataset: one embedding (row) per summary.
# model.encode returns a 2-D NumPy array that util.cos_sim accepts directly.
passage_embeddings = model.encode(df['summary'].to_list(), show_progress_bar=True)

## Query system
The goal is to retrieve the top 3 summaries most similar to a query provided by the user.

With all the summaries encoded, we encode the query with the exact same model, so that the query vector lives in the same vector space and has the same dimensionality as the summary embeddings. This vector can then be compared with each summary embedding to find the summaries whose meaning is closest to the query.
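
The comparison between two embeddings is usually done with cosine similarity, i.e. the cosine of the angle between the vectors. The sketch below computes it by hand with NumPy on made-up 2-D vectors; `cosine_similarity`, `toy_query` and `toy_passages` are illustrative names, not part of the notebook.

```python
import numpy as np


def cosine_similarity(query_vec: np.ndarray, passage_matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of passage_matrix."""
    # Scale the query and every passage row to unit length
    query_unit = query_vec / np.linalg.norm(query_vec)
    passage_units = passage_matrix / np.linalg.norm(passage_matrix, axis=1, keepdims=True)
    # The dot product of unit vectors is the cosine of the angle between them
    return passage_units @ query_unit


toy_query = np.array([1.0, 0.0])
toy_passages = np.array([
    [2.0, 0.0],   # same direction as the query
    [0.0, 3.0],   # orthogonal to the query
    [-1.0, 0.0],  # opposite direction
    [1.0, 1.0],   # 45 degrees away
])

scores = cosine_similarity(toy_query, toy_passages)
print(np.round(scores, 3))
```

The four scores are 1, 0, -1 and about 0.707: the score depends only on direction, not on vector length, and a higher value means a closer match.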

In [45]:
# Define the search query
query = "Find me some articles about technology and artifical intelligence"

In [46]:
# Number of characters to display for the summary extract
n_char = 100

# Number of items (summaries) to return
k = 3

In [99]:
# Retrieve the embeddings of the query sentence to compare it with the summaries
query_embeddings = model.encode(query)

# Compute the cosine similarity between the query and each of the
# article summaries in the dataset. The result is a 1 x N tensor (a row vector),
# where N is the number of summaries in the DataFrame sampled from the dataset.
similarities = util.cos_sim(query_embeddings, passage_embeddings)

# torch.topk: return the k largest values together with their indices
# .flatten(): collapse the 1 x N similarity tensor into a 1-D tensor
# .indices: keep only the indices (alternatively, use .values to retrieve the scores)
top_indices = torch.topk(similarities.flatten(), k).indices

# Retrieve the top passages from the DataFrame
# Use a list comprehension to get the first n_char characters for each summary
# x.item() is used to retrieve the value stored in the element x of the tensor
top_relevant_passages = [df.iloc[x.item()]['summary'][:n_char] + '...' for x in top_indices]
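
For readers who want to see the top-k step in isolation, here is a small NumPy equivalent of what `torch.topk(...).indices` is used for above. It is a sketch on made-up scores, not the notebook's method; `top_k_indices` and `toy_scores` are hypothetical names.

```python
import numpy as np


def top_k_indices(scores: np.ndarray, k: int) -> np.ndarray:
    """Return the positions of the k largest scores, best match first."""
    # Never ask for more results than there are scores
    k = min(k, scores.shape[0])
    # Sorting the negated scores in ascending order ranks the largest scores first
    return np.argsort(-scores, kind="stable")[:k]


toy_scores = np.array([0.12, 0.80, 0.33, 0.95, 0.41])
best = top_k_indices(toy_scores, k=3)
print(best.tolist())  # [3, 1, 4]
```

The returned values are positions in the score vector, not DataFrame index labels. Because the DataFrame was sampled, its labels (4830, 1255, ...) no longer match row positions, which is why the lookup has to go through `df.iloc` rather than `df.loc`.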

In [59]:
# Present the results in a coherent way
print(f'Your query: "{query}"')
print()
print('I found the following results for you:')
print("--------------")
for summary in top_relevant_passages:
    # summary[2:] skips the leading "– " (dash and space) at the start of the summary
    print(f'* "{summary[2:]}"')

Your query: "Find me some articles about technology and artifical intelligence"

I found the following results for you:
--------------
* "If brain size relative to body size determines IQ, the venerable shrew would be the smartest creat..."
* "Are you a "digital native" or a "digital immigrant," and does it make a difference? Research recen..."
* "Using methods borrowed from Google, a group of researchers has analyzed all Wikipedia pages and de..."


In [101]:
# Number of words to keep from each summary
n_words = 12

# Retrieve the top passages from the DataFrame
# List comprehension:
# - retrieve the summary column for each position in top_indices
# - split the string on spaces
# - drop the first element (which is only a dash '–') and keep the next n_words elements
# - join the words back together with spaces " "
# - add '...' at the end of the string
top_relevant_passages = [" ".join(df.iloc[x.item()]['summary'].split(' ')[1:n_words+1]) + '...' for x in top_indices]

# Present the results in a coherent way
print(f'Your query: "{query}"')
print()
print('I found the following results for you:')
print("--------------")
for summary in top_relevant_passages:
    print(f'* "{summary}"')

Your query: "Find me some articles about technology and artifical intelligence"

I found the following results for you:
--------------
* "If brain size relative to body size determines IQ, the venerable shrew..."
* "Are you a "digital native" or a "digital immigrant," and does it..."
* "Using methods borrowed from Google, a group of researchers has analyzed all..."
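
The word-based truncation can also be wrapped in a small helper, which keeps the list comprehension readable. The sketch below is one possible variant, not the notebook's code: `make_snippet` is a hypothetical name, it splits on any whitespace rather than only on single spaces, it drops the leading dash only when one is present, and it appends `...` only when the text was actually shortened. The sample strings are made up.

```python
def make_snippet(summary: str, n_words: int = 12) -> str:
    """Return the first n_words words of a summary, without the leading dash."""
    # split() with no argument also handles newlines and repeated spaces
    words = summary.split()
    # Drop the leading dash only if the summary really starts with one
    if words and words[0] in {"–", "-"}:
        words = words[1:]
    snippet = " ".join(words[:n_words])
    # Add an ellipsis only when some words were cut off
    return snippet + "..." if len(words) > n_words else snippet


print(make_snippet("– Alpha beta gamma delta epsilon zeta eta", n_words=3))  # Alpha beta gamma...
print(make_snippet("– Alpha beta", n_words=3))  # Alpha beta
```

With such a helper, the result list could be built as `[make_snippet(df.iloc[x.item()]['summary'], n_words) for x in top_indices]`.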
