# Building a Semantic Search Engine to Search for Queries with Transformers

# Semantic Search
Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also find synonyms.


## Background
The idea behind semantic search is to embed all entries in your corpus, which can be sentences, paragraphs, or documents, into a vector space.

At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic overlap with the query.

![SemanticSearch](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png)
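
To make the idea of "closest embedding" concrete, here is a minimal pure-Python sketch of cosine similarity on toy vectors. The three-dimensional vectors and the names `toy_corpus` and `toy_query` are made up for illustration; real sentence embeddings come from a model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" (assumption: hand-picked values)
toy_corpus = {
    "How do I invest in stocks?": [0.9, 0.1, 0.0],
    "Which fish live in salt water?": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]  # stands in for an embedded investing question

# The best match is the corpus entry whose vector points in the most similar direction
best_match = max(toy_corpus, key=lambda s: cosine_similarity(toy_query, toy_corpus[s]))
print(best_match)
```

The query vector points in nearly the same direction as the first entry, so that entry is returned even though the two share no exact wording in this toy setup.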


## Python

For small corpora (up to about 100k entries), we can compute the cosine similarity between the query and every entry in the corpus.

In this simple setup, we first compute the embeddings for the corpus as well as for our query.

We then use the [util.pytorch_cos_sim()](../../../docs/usage/semantic_textual_similarity.md) function to compute the cosine similarity between the query and all corpus entries.

For large corpora, sorting all scores would take too much time. Hence, we can use [torch.topk](https://pytorch.org/docs/stable/generated/torch.topk.html) to only get the top k entries.
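
The difference between a full sort and a top-k selection can be sketched in plain Python. This is only an illustration of the idea using `heapq.nlargest` from the standard library as a stand-in; it is not how `torch.topk` is implemented, and `top_k_indices` and `toy_scores` are hypothetical names with made-up values.

```python
import heapq

def top_k_indices(scores, k):
    """Return (index, score) pairs for the k highest scores, best first."""
    # nlargest keeps only k candidates at a time instead of sorting every score
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])

toy_scores = [0.12, 0.87, 0.45, 0.91, 0.30]  # made-up similarity scores
top2 = top_k_indices(toy_scores, k=2)
print(top2)  # [(3, 0.91), (1, 0.87)]
```

The returned indices play the same role as the indices from `torch.topk`: they point back into the corpus list so the matching sentences can be looked up.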

[Reference](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/semantic-search)


## Objective

For today's objective, we will create a corpus of around 50,000 question titles asked on Quora, taken from an open dataset. Your task is to compute sentence embeddings and then retrieve the top 5 similar questions from the corpus for the example queries below.

Use [Sentence Transformers](https://github.com/UKPLab/sentence-transformers), which provides a scalable way to generate document embeddings using transformers.



## Load Dependencies

In [3]:
#!pip install transformers

In [4]:
#!pip install -U sentence-transformers

In [5]:
import transformers

In [6]:
import pandas as pd
import numpy as np

## Download and Load Corpus of Questions

In [7]:
!wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv

--2024-06-15 14:09:13--  http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
Resolving qim.fs.quoracdn.net (qim.fs.quoracdn.net)... 162.159.152.17, 162.159.153.247
Connecting to qim.fs.quoracdn.net (qim.fs.quoracdn.net)|162.159.152.17|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://qim.fs.quoracdn.net/quora_duplicate_questions.tsv [following]
--2024-06-15 14:09:13--  https://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
Connecting to qim.fs.quoracdn.net (qim.fs.quoracdn.net)|162.159.152.17|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58176133 (55M) [text/tab-separated-values]
Saving to: ‘quora_duplicate_questions.tsv.1’


2024-06-15 14:09:14 (56.6 MB/s) - ‘quora_duplicate_questions.tsv.1’ saved [58176133/58176133]



In [8]:
df = pd.read_csv('quora_duplicate_questions.tsv', sep='\t').head(25000)
df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [9]:
corpus = df['question1'].tolist() + df['question2'].tolist()

In [10]:
len(corpus)

50000

In [11]:
corpus[0]

'What is the step by step guide to invest in share market in india?'
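
Because the corpus concatenates `question1` and `question2`, the same question text can appear more than once (the smartphone query later in this notebook returns an identical hit twice). If that is unwanted, one possible approach is to drop exact repeats before embedding. The sketch below uses a toy list and a hypothetical helper name, `deduplicate`; it is not part of the original notebook, and applying it to the real corpus would change the corpus size reported above.

```python
def deduplicate(sentences):
    """Drop exact repeats while keeping first-seen order."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(sentences))

# Toy stand-in for the real corpus (assumption: values chosen for illustration)
toy_corpus = [
    "What are the best smartphones?",
    "What is data science",
    "What are the best smartphones?",
]
unique_corpus = deduplicate(toy_corpus)
print(len(toy_corpus), "->", len(unique_corpus))  # 3 -> 2
```

This only removes exact string matches; near-duplicates that differ in punctuation or casing would still remain.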

## Use Sentence Transformers and Generate Corpus Embeddings

__Hint:__ You can use this tutorial as a reference

[Semantic Search Tutorial](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/semantic_search.py)

Also use the __`roberta-large-nli-stsb-mean-tokens`__ model to generate document embeddings

In [12]:
import torch

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("roberta-large-nli-stsb-mean-tokens")



In [15]:
# Use "convert_to_tensor=True" to keep the tensors on GPU (if available)
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

In [16]:
corpus_embeddings.shape

torch.Size([50000, 1024])
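
The shape above means one 1024-dimensional vector per corpus sentence. As a rough back-of-the-envelope check of why this still fits in memory for a small corpus, the sketch below estimates the size of the embedding matrix, assuming each value is stored as a 4-byte float32 (an assumption; the actual dtype can be checked with `corpus_embeddings.dtype`).

```python
num_sentences, dim = 50000, 1024   # taken from the shape printed above
bytes_per_value = 4                # assumption: float32 storage

size_mb = num_sentences * dim * bytes_per_value / 1e6
print(f"{size_mb:.1f} MB")  # 204.8 MB
```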

## Create a function to return top K similar sentences for a given query

In [17]:
def return_similar_sentences(query, model_embedder, corpus_embeddings, top_k):
    # Find the top_k corpus sentences closest to the query, based on cosine similarity
    query_embedding = model_embedder.encode(query, convert_to_tensor=True)

    # similarity() uses cosine similarity by default; torch.topk keeps the top_k highest scores
    similarity_scores = model_embedder.similarity(query_embedding, corpus_embeddings)[0]
    scores, indices = torch.topk(similarity_scores, k=top_k)

    print("\nQuery:", query)
    print("Top {} most similar sentences in corpus:".format(top_k))

    # Note: `corpus` is the global list built above; its order matches corpus_embeddings
    for score, idx in zip(scores, indices):
        print(corpus[idx], "(Score: {:.4f})".format(score))

    # Alternatively, util.semantic_search performs cosine similarity + top-k in one call
    # (requires `from sentence_transformers import util`):
    # hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)
    # hits = hits[0]      # Get the hits for the first query
    # for hit in hits:
    #     print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))


## Perform Semantic Search on Sample Questions to get Similar Queries from the Corpus

In [20]:
s = 'What is the step by step guide to invest'
return_similar_sentences(query=s,
                         model_embedder=embedder,
                         corpus_embeddings=corpus_embeddings,
                         top_k=5)


Query: What is the step by step guide to invest
Top 5 most similar sentences in corpus:
What is the step by step guide to invest in share market? (Score: 0.8431)
What are the best investment strategy for beginners? (Score: 0.7725)
What are the ways to get an investment for startup? (Score: 0.7692)
How do I invest in stock market? (Score: 0.7558)
How much money will I need to start investing in stock market? (Score: 0.7447)


In [22]:
s = 'What is Data Science?'
return_similar_sentences(query=s,
                         model_embedder=embedder,
                         corpus_embeddings=corpus_embeddings,
                         top_k=5)


Query: What is Data Science?
Top 5 most similar sentences in corpus:
What is data science (Score: 0.9840)
What is actually a data science? (Score: 0.9609)
What does a data scientist do? (Score: 0.8919)
What is big data science? (Score: 0.8633)
What is the difference between data science and data analysis? (Score: 0.7723)


In [23]:
s = 'What is natural language processing?'
return_similar_sentences(query=s,
                         model_embedder=embedder,
                         corpus_embeddings=corpus_embeddings,
                         top_k=5)


Query: What is natural language processing?
Top 5 most similar sentences in corpus:
How does natural language processing work? (Score: 0.9242)
Which are the best schools for studying natural language processing? (Score: 0.6843)
What is the english word for "अंत्योदय"? (Score: 0.6685)
What are natural numbers? (Score: 0.6590)
Who owns Natural Factors? (Score: 0.6589)


In [25]:
s = 'Best Harry Potter Movie?'
return_similar_sentences(query=s,
                         model_embedder=embedder,
                         corpus_embeddings=corpus_embeddings,
                         top_k=5)


Query: Best Harry Potter Movie?
Top 5 most similar sentences in corpus:
Which Harry Potter movie is the best? (Score: 0.9560)
Which is the best Harry Potter movie? (Score: 0.9456)
Which is your favourite Harry Potter movie and why? (Score: 0.8769)
Where were the Harry Potter movies shot? (Score: 0.8664)
Where was Harry Potter filmed? (Score: 0.8336)


In [26]:
s = 'What is the best smartphone?'
return_similar_sentences(query=s,
                         model_embedder=embedder,
                         corpus_embeddings=corpus_embeddings,
                         top_k=5)


Query: What is the best smartphone?
Top 5 most similar sentences in corpus:
What are the best smartphones? (Score: 0.9829)
What are the best smartphones? (Score: 0.9829)
What is the best smartphone to date? (Score: 0.9759)
What are the best Smartphones tech gadgets? (Score: 0.9262)
Which is the best smartphone to buy now? (Score: 0.9253)


In [27]:
s = 'What is the best starter pokemon?'
return_similar_sentences(query=s,
                         model_embedder=embedder,
                         corpus_embeddings=corpus_embeddings,
                         top_k=5)


Query: What is the best starter pokemon?
Top 5 most similar sentences in corpus:
How do you choose the right starter pokemon in any game? (Score: 0.8680)
What is the best Pokemon GO hack? (Score: 0.7935)
Which set of starter Pokemon would you choose considering all generations and why? (Score: 0.7794)
What are the best Pokemon hacks? (Score: 0.7574)
Which Pokemon evolve with Shiny Stones? (Score: 0.7405)


In [28]:
s = 'Batman or Superman?'
return_similar_sentences(query=s,
                         model_embedder=embedder,
                         corpus_embeddings=corpus_embeddings,
                         top_k=5)


Query: Batman or Superman?
Top 5 most similar sentences in corpus:
Why does Batman kill in Batman v Superman? (Score: 0.7654)
What does Batman do? (Score: 0.7581)
Is Batman insane? (Score: 0.7382)
Superheroes: Who would win in a fight between Batman and the Flash? (Score: 0.7381)
Who would win Batman vs Batman? (Score: 0.7156)
