
<center><a target="_blank" href="https://learning.constructor.org/"><img src="https://drive.google.com/uc?id=1wxkbM60NlBlkbGK1JqUypKL24RrTiiYk" width="200" style="background:none; border:none; box-shadow:none;" /></a> </center>

_____

<center>Constructor Learning, 2023</center>

# Building a Semantic Search Engine to Search for Queries with Transformers

# Semantic Search
Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also match documents that use synonyms of the query terms.


## Background
The idea behind semantic search is to embed all entries in your corpus, which can be sentences, paragraphs, or documents, into a vector space.

At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic overlap with the query.

![SemanticSearch](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png)
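
To make the vector-space idea concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration (real sentence embeddings come from a model and have hundreds of dimensions), and `cosine_similarity` is a hypothetical helper, not part of any library used in this notebook.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up "embeddings" standing in for the output of an embedding model.
corpus_vectors = {
    "How do I learn Python?": [0.9, 0.1, 0.0],
    "What is the capital of France?": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stands in for "Best way to study Python?"

# The closest corpus entry is the one with the highest cosine similarity.
best = max(
    corpus_vectors,
    key=lambda sentence: cosine_similarity(query_vector, corpus_vectors[sentence]),
)
print(best)
```

The query shares no exact wording with the winning entry beyond "Python"; the match comes from the vectors pointing in a similar direction.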


## Python

For small corpora (up to about 100k entries), we can compute the cosine similarity between the query and all entries in the corpus.

To do so, we compute the embeddings for the corpus as well as for our query.

We then use the [util.pytorch_cos_sim()](../../../docs/usage/semantic_textual_similarity.md) function to compute the cosine similarity between the query and all corpus entries.

For large corpora, sorting all scores would take too much time. Hence, we can use [torch.topk](https://pytorch.org/docs/stable/generated/torch.topk.html) to only get the top k entries.
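
The idea of selecting only the top k scores instead of sorting all of them can be illustrated without tensors. The sketch below uses Python's standard `heapq.nlargest` on a made-up list of scores; `torch.topk` plays the same role for a tensor of cosine scores. The scores and the `top_k_indices` helper are hypothetical.

```python
import heapq


def top_k_indices(scores, k):
    """Return (index, score) pairs for the k largest scores, best first."""
    # nlargest tracks only k candidates at a time instead of sorting every score.
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


scores = [0.12, 0.87, 0.45, 0.91, 0.30]  # made-up cosine scores, one per corpus entry
print(top_k_indices(scores, k=2))
```

The returned indices point back into the corpus, so the matching entries can be looked up directly.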

[Reference](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/semantic-search)


## Objective

In this exercise, we will create a corpus of around 50,000 question titles asked on Quora, taken from an open dataset. Your task is to compute sentence embeddings and then retrieve the top 5 most similar questions from the corpus for the example queries below.

Use the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library, which provides a scalable way to generate document embeddings using transformers.



## Load Dependencies

In [None]:
!pip install transformers

In [None]:
!pip install -U sentence-transformers

In [None]:
import transformers

In [None]:
import pandas as pd
import numpy as np

## Download and Load Corpus of Questions

In [None]:
!wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv

In [None]:
# Load the first 25,000 question pairs from the tab-separated file.
df = pd.read_csv('quora_duplicate_questions.tsv', sep='\t').head(25000)
df.head()

In [None]:
# Pool both question columns into one list (2 x 25,000 = 50,000 entries).
corpus = df['question1'].tolist() + df['question2'].tolist()

In [None]:
len(corpus)

## Use Sentence Transformers and Generate Corpus Embeddings

__Hint:__ You can use this tutorial as a reference:

[Semantic Search Tutorial](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/semantic_search.py)

Use the __`roberta-large-nli-stsb-mean-tokens`__ model to generate the document embeddings.

In [None]:
corpus_embeddings = ??

In [None]:
corpus_embeddings.shape
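
One possible way to fill in the blank above is sketched below. It is not the official solution. `embed_corpus` is a hypothetical helper, the batch size of 64 is an arbitrary choice, and the sketch assumes `model_embedder` is a `SentenceTransformer` instance, whose `encode` method accepts the keyword arguments used here.

```python
def embed_corpus(model_embedder, sentences, batch_size=64):
    """Encode a list of sentences into one embedding per sentence."""
    return model_embedder.encode(
        sentences,
        batch_size=batch_size,    # arbitrary; lower it if memory runs out
        show_progress_bar=True,   # encoding a large corpus can take a while
        convert_to_tensor=True,   # return a torch tensor instead of a NumPy array
    )


# Usage in the notebook (the model is downloaded on first use):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('roberta-large-nli-stsb-mean-tokens')
# corpus_embeddings = embed_corpus(model, corpus)
```

With `convert_to_tensor=True`, the result is a single 2-D tensor with one row per corpus entry, so `corpus_embeddings.shape` reports the number of entries and the embedding dimension.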

## Create a function to return top K similar sentences for a given query

In [None]:


def return_similar_sentences(query, model_embedder, corpus_embeddings, top_k):
  <FILL THIS UP>
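
A possible implementation is sketched below, under several assumptions. It expects `corpus_embeddings` to be a 2-D torch tensor and `model_embedder.encode` to accept `convert_to_tensor=True`. To keep the sketch dependent on torch only, it uses `torch.nn.functional.cosine_similarity` in place of `util.pytorch_cos_sim`; both produce cosine scores. The optional `corpus_sentences` parameter is a hypothetical addition to the signature above, so that the function can return sentence text without relying on a global variable.

```python
import torch
import torch.nn.functional as F


def return_similar_sentences(query, model_embedder, corpus_embeddings, top_k,
                             corpus_sentences=None):
    """Return the top_k most similar corpus entries as (entry, score) pairs.

    corpus_sentences is a hypothetical extra parameter. When it is omitted,
    the corpus row index is returned in place of the sentence text.
    """
    # Embed the query with the same model that produced corpus_embeddings.
    query_embedding = model_embedder.encode(query, convert_to_tensor=True)
    query_embedding = query_embedding.to(corpus_embeddings.device)

    # Cosine similarity between the query and every corpus row, shape (n,).
    cos_scores = F.cosine_similarity(
        query_embedding.unsqueeze(0), corpus_embeddings, dim=1
    )

    # torch.topk returns the k largest scores and their indices, best first.
    k = min(top_k, corpus_embeddings.shape[0])
    top = torch.topk(cos_scores, k=k)

    results = []
    for score, idx in zip(top.values.tolist(), top.indices.tolist()):
        entry = corpus_sentences[idx] if corpus_sentences is not None else idx
        results.append((entry, score))
    return results
```

Called as `return_similar_sentences(s, model, corpus_embeddings, 5, corpus_sentences=corpus)`, this would return the five closest questions together with their cosine scores.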


In [None]:
df.head(5)

## Perform Semantic Search on Sample Questions to get Similar Queries from the Corpus

In [None]:
s = 'What is the step by step guide to invest'
return_similar_sentences(query=s,
                         model_embedder=?,
                         corpus_embeddings=?,
                         top_k=5)

In [None]:
s = 'What is Data Science?'
return_similar_sentences(query=s,
                         model_embedder=?,
                         corpus_embeddings=?,
                         top_k=5)

In [None]:
s = 'What is natural language processing?'
return_similar_sentences(query=s,
                         model_embedder=?,
                         corpus_embeddings=?,
                         top_k=5)

In [None]:
s = 'Best Harry Potter Movie?'
return_similar_sentences(query=s,
                         model_embedder=?,
                         corpus_embeddings=?,
                         top_k=5)

In [None]:
s = 'What is the best smartphone?'
return_similar_sentences(query=s,
                         model_embedder=?,
                         corpus_embeddings=?,
                         top_k=5)

In [None]:
s = 'What is the best starter pokemon?'
return_similar_sentences(query=s,
                         model_embedder=?,
                         corpus_embeddings=?,
                         top_k=5)

In [None]:
s = 'Batman or Superman?'
return_similar_sentences(query=s,
                         model_embedder=?,
                         corpus_embeddings=?,
                         top_k=5)