<a href="https://colab.research.google.com/github/daniel-hain/workshop_london_nlp_2023/blob/main/notebooks/workshop_sbert_similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Installing the sentence-transformers library to work with SBERT
!pip install -qU transformers sentence-transformers

In [None]:
# standard stuff
import pandas as pd
import seaborn as sns

# Stuff we will need later
import os
import csv
import time

# Semantic Similarity


## Introduction

Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also find synonyms.

Rather than simply matching keywords to pages, this type of search tries to capture what the user is actually asking, which makes the results more complete. The idea behind semantic search is to embed all entries in your corpus, which can be sentences, paragraphs, or documents, into a vector space.

At search time, the query is embedded into the same vector space and the closest embedding from your corpus is found. These entries should have a high semantic overlap with the query.
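
To make the idea concrete, here is a minimal sketch of nearest-neighbour lookup in a vector space. The three-dimensional vectors are made up for illustration (real models produce hundreds of dimensions), and `cosine_scores` is a hypothetical helper, not part of any library.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings"; a real model would produce these.
toy_corpus = ["how to learn python", "best pasta recipes", "python tutorials online"]
toy_corpus_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.8, 0.3, 0.1],
])
toy_query_vector = np.array([1.0, 0.2, 0.0])  # stands in for an embedded query


def cosine_scores(query, matrix):
    """Cosine similarity between one query vector and every row of matrix."""
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ query


query_scores = cosine_scores(toy_query_vector, toy_corpus_vectors)
closest = int(np.argmax(query_scores))  # index of the closest corpus entry
print(toy_corpus[closest], round(float(query_scores[closest]), 3))
```

The corpus is embedded once; at search time only the query needs to be embedded and compared.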

## Types of search: Symmetric vs. Asymmetric Semantic Search
A critical distinction for your setup is symmetric vs. asymmetric semantic search:

For **symmetric** semantic search your query and the entries in your corpus are of about the same length and have the same amount of content. An example would be searching for similar questions: Your query could for example be “How to learn Python online?” and you want to find an entry like “How to learn Python on the web?”. For symmetric tasks, you could potentially flip the query and the entries in your corpus.

For **asymmetric** semantic search, you usually have a short query (like a question or some keywords) and you want to find a longer paragraph answering the query. An example would be a query like “What is Python” and you want to find the paragraph “Python is an interpreted, high-level and general-purpose programming language. Python’s design philosophy …”. For asymmetric tasks, flipping the query and the entries in your corpus usually does not make sense.

# Toy example

Here is a simple example with a handful of sentences. Imagine our task is to calculate the semantic similarity between them:

In [None]:
sentences = ["purple is the best city in the forest",
             "there is an art to getting your way and throwing bananas on to the street is not it",
             "it is not often you find soggy bananas on the street",
             "green should have smelled more tranquil but somehow it just tasted rotten",
             "joyce enjoyed eating pancakes with ketchup",
             "as the asteroid hurtled toward earth becky was upset her dentist appointment had been canceled",
             "to get your way you must not bombard the road with yellow fruit" ]

## Cross- vs Bi-Encoder

![](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/Bi_vs_Cross-Encoder.png)

* **Bi-Encoders** produce a sentence embedding for a given sentence. We pass sentences A and B independently through BERT, which results in the sentence embeddings u and v. These sentence embeddings can then be compared using cosine similarity.
* In contrast, for a **Cross-Encoder**, we pass both sentences simultaneously to the Transformer network. It then produces an output value between 0 and 1 indicating the similarity of the input sentence pair.
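
One practical consequence of this difference is the number of passes through the network needed to compare every sentence with every other. The sketch below only counts passes, assuming each unordered pair is scored once by the cross-encoder; both function names are hypothetical.

```python
def bi_encoder_passes(n_sentences: int) -> int:
    """Each sentence is encoded once; pairs are then compared with cheap vector math."""
    return n_sentences


def cross_encoder_passes(n_sentences: int) -> int:
    """Every unordered pair of sentences needs its own pass through the network."""
    return n_sentences * (n_sentences - 1) // 2


for n in (7, 1_000, 100_000):
    print(n, bi_encoder_passes(n), cross_encoder_passes(n))
```

The quadratic growth is why bi-encoders are the usual choice for searching a large corpus.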

## BERT (manual mean pooling)

Let's take a look at how we can use transformer models (like BERT) to create sentence vectors for calculating similarity, using the example sentences defined above.

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch

In [None]:
if not torch.cuda.is_available():
  print("Warning: No GPU detected. Processing will be slow. Please add a GPU to this notebook")

Initialize our HF transformer model and tokenizer - using a pretrained BERT model.

In [None]:
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')

Tokenize all of our sentences.

In [None]:
tokens = tokenizer(sentences,
                   max_length=128,
                   truncation=True,
                   padding='max_length',
                   return_tensors='pt')

In [None]:
tokens.keys()

In [None]:
tokens['input_ids'][1]

Process our tokenized tensors through the model.

In [None]:
outputs = model(**tokens)
outputs.keys()

Here we can see the output of the final hidden layer, *last_hidden_state*.

In [None]:
embeddings = outputs.last_hidden_state
embeddings[0]

In [None]:
embeddings[0].shape

Here we have our vectors of length *768*, but we see that these are not *sentence vectors* because we have a vector representation for each token in our sequence (128 in total). We need to perform a mean pooling operation to create the sentence vector.

The first thing we do is multiply each value in our `embeddings` tensor by its respective `attention_mask` value. The `attention_mask` contains **1s** where we have 'real tokens' (i.e. not padding tokens), and **0s** elsewhere, so this operation lets us ignore non-real tokens.

In [None]:
mask = tokens['attention_mask'].unsqueeze(-1).expand(embeddings.size()).float()
mask.shape

In [None]:
mask[0]

Now we have a masking array with the same shape as our output `embeddings` - we multiply the two together to apply the masking operation on our outputs.

In [None]:
masked_embeddings = embeddings * mask
masked_embeddings[0]

Sum the remaining embeddings along axis 1 (the token axis) to get a total for each of the 768 embedding dimensions.

In [None]:
summed = torch.sum(masked_embeddings, 1)
summed.shape

Next, we count the number of values that should be given attention in each position of the tensor (+1 for real tokens, +0 for non-real).

In [None]:
counted = torch.clamp(mask.sum(1), min=1e-9)
counted.shape

Finally, we get our mean-pooled values as the `summed` embeddings divided by the number of values that should be given attention, `counted`.

In [None]:
mean_pooled = summed / counted
mean_pooled.shape
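
The masking, summing, and dividing steps above can be collected into one helper. The sketch below is an illustration in NumPy rather than PyTorch, run on a made-up toy array; `mean_pool` is a hypothetical name, and it relies on broadcasting instead of expanding the mask to the full embedding shape.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padding positions.

    token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len).
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                    # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # (batch, 1), avoids division by zero
    return summed / counts


# Toy input: 1 sentence, 3 token positions (the last one is padding), hidden size 2.
toy_embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_mask = np.array([[1, 1, 0]])
toy_pooled = mean_pool(toy_embeddings, toy_mask)
print(toy_pooled)
```

The padding vector `[100, 100]` does not affect the result, which is the point of the mask.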

Now that we have our sentence vectors, we can calculate the cosine similarity between each pair.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [None]:
# convert to numpy array from torch tensor
mean_pooled = mean_pooled.detach().numpy()

# calculate similarities (will store in array)
scores = np.zeros((mean_pooled.shape[0], mean_pooled.shape[0]))
for i in range(mean_pooled.shape[0]):
    scores[i, :] = cosine_similarity(
        [mean_pooled[i]],
        mean_pooled
    )[0]
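
The row-by-row loop above is easy to read, but the same matrix can be computed in one step: cosine similarity is the dot product of length-normalised vectors. The sketch below shows this in plain NumPy on made-up vectors; `cosine_similarity_matrix` is a hypothetical helper. scikit-learn's `cosine_similarity` also returns all pairwise scores when called with the whole matrix as its only argument.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return the (n, n) matrix of cosine similarities between the rows of vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return unit @ unit.T


toy_vectors = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
toy_matrix = cosine_similarity_matrix(toy_vectors)
print(np.round(toy_matrix, 3))
```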

In [None]:
scores

We can visualize these scores:

In [None]:
sns.heatmap(scores, annot=True)

## SBERT: sentence-transformers (bi-encoders)

The `sentence-transformers` library allows us to compress all of the above into just a few lines of code.

In [None]:
from sentence_transformers import SentenceTransformer, util
import torch

In [None]:
model = SentenceTransformer('bert-base-nli-mean-tokens')

We encode the sentences (producing our mean-pooled sentence embeddings) like so:

In [None]:
sentence_embeddings = model.encode(sentences)

And calculate the cosine similarity just like before.

In [None]:
# calculate similarities (will store in array)
scores = np.zeros((sentence_embeddings.shape[0], sentence_embeddings.shape[0]))
for i in range(sentence_embeddings.shape[0]):
    scores[i, :] = cosine_similarity(
        [sentence_embeddings[i]],
        sentence_embeddings
    )[0]

In [None]:
sns.heatmap(scores, annot=True)

We can also write a few lines of code to find the sentence pairs that are most similar to each other.

In [None]:
#Find the pairs with the highest cosine similarity scores
pairs = []
for i in range(len(scores)-1):
    for j in range(i+1, len(scores)):
        pairs.append({'index': [i, j], 'score': scores[i][j]})

In [None]:
#Sort scores in decreasing order
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)

In [None]:
for pair in pairs[0:10]:
    i, j = pair['index']
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], pair['score']))
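
As an alternative to the nested loops, the upper triangle of the score matrix can be extracted directly. This is a sketch under the assumption that the scores are held in a square NumPy array; `top_pairs` and the toy matrix are made up for illustration.

```python
import numpy as np


def top_pairs(score_matrix, k=3):
    """Return the k highest-scoring (i, j, score) tuples with i < j."""
    rows, cols = np.triu_indices(score_matrix.shape[0], k=1)  # strictly above the diagonal
    pair_scores = score_matrix[rows, cols]
    order = np.argsort(-pair_scores)[:k]                       # descending by score
    return [(int(rows[n]), int(cols[n]), float(pair_scores[n])) for n in order]


toy_pair_scores = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.5],
    [0.9, 0.5, 1.0],
])
print(top_pairs(toy_pair_scores, k=2))
```

Skipping the diagonal and the lower triangle avoids self-matches and duplicate pairs, just as the `range(i+1, ...)` loop does.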

Or do some semantic search:

In [None]:
# Query sentences:
queries = ['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah chases prey across a field.']

In [None]:
# Find the closest top_k sentences of the corpus for each query sentence based on cosine similarity
top_k = min(2, len(sentences))
for query in queries:
    query_embedding = model.encode(query)

    # We use cosine similarity and torch.topk to find the top_k highest scores
    cos_scores = util.cos_sim(query_embedding, sentence_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop {} most similar sentences in corpus:".format(top_k))

    for score, idx in zip(top_results[0], top_results[1]):
        print(sentences[idx], "(Score: {:.4f})".format(score))

# Applications

## Semantic Search using SBERT on Quora Questions dataset

* We use the Quora Duplicate Questions dataset, which contains about 500k questions (we only use about 100k):
https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs
* The main task the dataset is used for is identifying duplicate questions.
* As the embedding model, we use the SBERT model 'quora-distilbert-multilingual',
which is aligned for 100 languages. That is, you can type in a question in various languages and it will return the closest questions in the corpus (questions in the corpus are mainly in English).


In [None]:
model = SentenceTransformer('quora-distilbert-multilingual')

In [None]:
# Set parameters for download
url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
dataset_path = "quora_duplicate_questions.tsv"
max_corpus_size = 100000

In [None]:
# Download the dataset if it is not already present
if not os.path.exists(dataset_path):
    print("Download dataset")
    util.http_get(url, dataset_path)

In [None]:
# Get all unique sentences from the file
corpus_sentences = set()
with open(dataset_path, encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        corpus_sentences.add(row['question1'])
        if len(corpus_sentences) >= max_corpus_size:
            break

        corpus_sentences.add(row['question2'])
        if len(corpus_sentences) >= max_corpus_size:
            break
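
One caveat with collecting the questions in a `set`: sets do not keep insertion order, so the order of the list built from it (and therefore each question's position in the corpus) may differ between runs. If a stable order matters, a dictionary can serve as an ordered set. The sketch below is one possible variant, with a hypothetical function name and made-up rows that only mimic the `question1` and `question2` columns.

```python
import csv
import io


def unique_questions(tsv_file, max_size):
    """Collect up to max_size distinct questions, keeping first-seen order."""
    seen = {}  # dict keys keep insertion order (Python 3.7+)
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        for column in ("question1", "question2"):
            seen.setdefault(row[column], None)
            if len(seen) >= max_size:
                return list(seen)
    return list(seen)


# Tiny in-memory stand-in for the TSV file (made-up rows, same column names).
toy_tsv = io.StringIO(
    "id\tquestion1\tquestion2\n"
    "0\tHow do I learn Python?\tHow can I learn Python?\n"
    "1\tHow do I learn Python?\tWhat is NLP?\n"
)
print(unique_questions(toy_tsv, max_size=10))
```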

In [None]:
# Embed the sentences
corpus_sentences = list(corpus_sentences)
print("Encode the corpus. This might take a while")
corpus_embeddings = model.encode(corpus_sentences, show_progress_bar=True, convert_to_tensor=True)

In [None]:
# Report how many sentences were embedded
print("Corpus loaded with {} sentences / embeddings".format(len(corpus_sentences)))

In [None]:
# Function that searches the corpus and prints the results
def search(inp_question):
    start_time = time.time()
    question_embedding = model.encode(inp_question, convert_to_tensor=True)
    hits = util.semantic_search(question_embedding, corpus_embeddings)
    end_time = time.time()
    hits = hits[0]  #Get the hits for the first query

    print("Input question:", inp_question)
    print("Results (after {:.3f} seconds):".format(end_time-start_time))
    for hit in hits[0:5]:
        print("\t{:.3f}\t{}".format(hit['score'], corpus_sentences[hit['corpus_id']]))
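
The `search` function relies on the structure of the result: one list per query, each holding dictionaries with a `corpus_id` and a `score`, sorted by decreasing score. The sketch below imitates that structure in NumPy on made-up vectors to show what is being indexed; `toy_semantic_search` is a hypothetical stand-in and not the library's implementation.

```python
import numpy as np


def toy_semantic_search(query_vectors, corpus_vectors, top_k=5):
    """Return, per query, a list of {'corpus_id', 'score'} dicts sorted by cosine score."""
    q = query_vectors / np.linalg.norm(query_vectors, axis=1, keepdims=True)
    c = corpus_vectors / np.linalg.norm(corpus_vectors, axis=1, keepdims=True)
    all_scores = q @ c.T                                   # (n_queries, n_corpus)
    results = []
    for row in all_scores:
        best = np.argsort(-row)[:top_k]                    # indices of the highest scores
        results.append([{"corpus_id": int(i), "score": float(row[i])} for i in best])
    return results


toy_corpus_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query_matrix = np.array([[2.0, 0.1]])
toy_hits = toy_semantic_search(toy_query_matrix, toy_corpus_matrix, top_k=2)
print(toy_hits[0])  # hits for the first (and only) query
```

This is why the notebook takes `hits[0]` before looping: with a single query, the outer list has exactly one element.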

In [None]:
search("How do i write a really good data science project?")

In [None]:
#German: How can I learn Python online?
search("Wie kann ich online python lernen?")

In [None]:
#Chinese: How can I learn Python online?
search("如何在线学习python")

In [None]:
# Danish: What should I do at the weekend?
search("hvad skal jeg laver om weekenden")

In [None]:
# French: How can I learn data science really fast?
search("comment puis-je apprendre la science des données très rapidement?")

## Patent search

* Intellectual property search and retrieval has many corporate applications.
* It also has many applications in our research on technology mapping and forecasting.
Check our application for [patent classification](https://github.com/AI-Growth-Lab/PatentSBERTa)
* Also, see an earlier W2V application in [this paper](https://doi.org/10.1016/j.techfore.2022.121559)

In [None]:
!pip install datasets -q

In [None]:
from datasets import load_dataset
import datasets

In [None]:
# Load our patent dataset sample
patent_dataset = datasets.load_dataset("AI-Growth-Lab/patents_claims_1.5m_traim_test", split="test[:5000]")

In [None]:
model = SentenceTransformer('AI-Growth-Lab/PatentSBERTa')

In [None]:
patent_dataset = pd.DataFrame(patent_dataset)

In [None]:
patent_dataset.text.head()

In [None]:
# Pass a plain list of strings rather than a pandas Series
embeddings = model.encode(patent_dataset.text.tolist(), convert_to_tensor=True, show_progress_bar=True)

* Now everything is embedded. Let's try to retrieve it.

In [None]:
# Function that searches the corpus and prints the results
def semantic_search(inp_question, n = 5):
    start_time = time.time()
    question_embedding = model.encode(inp_question, convert_to_tensor=True)
    hits = util.semantic_search(question_embedding, embeddings)
    end_time = time.time()
    hits = hits[0]  #Get the hits for the first query

    print("Input question:", inp_question)
    print("Results (after {:.3f} seconds):".format(end_time-start_time))
    for hit in hits[0:n]:
        print("\t{:.3f}\t{}".format(hit['score'], patent_dataset.text[hit['corpus_id']]))

In [None]:
# Query sentences:
queries = ['an apparatus that connects databases.']

In [None]:
semantic_search(queries)

## Literature search (Your turn)

Now it is your turn! We could also use this workflow for a search in academic literature.

That's your task now, do the following:

1. Define a not too big corpus of literature (<5k)
2. Download the metadata including abstracts on OpenAlex. (mostly copy and paste from the previous notebook)
3. Using an appropriate transformer model, create embeddings for all abstracts.
4. Create a simple semantic search application.
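
A hint for step 2: OpenAlex delivers abstracts as an `abstract_inverted_index` field (a mapping from each word to the positions where it occurs) rather than as plain text, so the text has to be rebuilt before embedding. The sketch below shows one way to do that; the function name and the toy index are made up for illustration.

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild plain text from a {word: [positions]} mapping."""
    if not inverted_index:
        return ""  # some works have no abstract
    positioned = [(pos, word) for word, positions in inverted_index.items() for pos in positions]
    return " ".join(word for _, word in sorted(positioned))


# Made-up example in the same shape as an `abstract_inverted_index` field.
toy_index = {"search": [1, 4], "Semantic": [0], "beats": [2], "keyword": [3]}
print(abstract_from_inverted_index(toy_index))
```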


