<a href="https://colab.research.google.com/github/StrategicalIT/PipedPiperAI/blob/main/Lab03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LAB3: Sentence Embeddings with SBERT
In this lab we explore sentence embeddings using [Sentence Transformers, a.k.a. "SBERT"](https://sbert.net/).

This tool was created in 2019 and is still actively used and developed. Its [GitHub repo](https://github.com/UKPLab/sentence-transformers/) has 16K stars and 200 contributors. As its name indicates, it leverages transformers, an architecture built on the "attention" mechanism and introduced by Google in 2017. You can choose from a "wide" selection of over 5,000 :) pre-trained Sentence Transformers models available on 🤗 Hugging Face. A common way of deciding which model to use is to check the [Massive Text Embeddings Benchmark (MTEB) leaderboard](https://huggingface.co/spaces/mteb/leaderboard).


## Install dependencies

The first step is to install the necessary libraries. In this case we will install the [sentence transformers](https://pypi.org/project/sentence-transformers/) Python library, a popular collection of tools and models for computing embeddings of sentences, paragraphs and images. The models are based on transformer networks such as BERT, RoBERTa and XLM-RoBERTa.

In [None]:
!pip install sentence-transformers

Let's import a few things we need

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

## Load the model and create embeddings

We are going to load a pre-trained model by downloading it from Hugging Face. This one is about 440 MB, so it should come down in a minute or less.

In [None]:
# Downloads the model on first use; 'all-MiniLM-L6-v2' is a smaller alternative to try
model = SentenceTransformer('bert-base-nli-mean-tokens')

Let's create a small corpus with a few sentences and use the model to encode them.

In [None]:
sentences = [
    "I ate dinner.",
    "We had a three-course meal.",
    "Brad came to dinner with us.",
    "He loves fish tacos.",
    "In the end, we all felt like we ate too much.",
    "We all agreed; it was a magnificent evening.",
]
# encode() returns one embedding vector per sentence
sentence_embeddings = model.encode(sentences)

The sentences have now been encoded and we can examine them. Notice below the size of the vectors this model creates. Different models use different numbers of dimensions.

The variable "sentence_embeddings" we have just created is a two-dimensional NumPy array with one row per sentence, so what you are looking at is the embedding for the first sentence, i.e. the one with index=0. As you can see, it also includes negative values.

In [None]:
print('The size of a vector is', len(sentence_embeddings[0]))
print('This is the embedding vector for the first sentence', sentence_embeddings[0])

## Calculate similarities

We can use the "similarity" method to compare a query sentence to all the embeddings in our corpus. But first we need to convert our query into a vector.

In [None]:
query = ["I had pizza and pasta for dinner"]
query_embedding = model.encode(query)
similarities = model.similarity(query_embedding, sentence_embeddings)
# similarities has one row (the query) and one column per corpus sentence
for sentence, score in zip(sentences, similarities[0]):
    print(f'{sentence:<50} : {score.item():.4f}')


The closer the similarity score is to 1, the closer the query is semantically to that sentence. Does the result make sense?
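
To see where a score like this comes from, here is a minimal sketch of cosine similarity, which is what the "similarity" method computes when the model is left on its default setting (an assumption here; a model can be configured to use a different function). The vectors below are made-up three-dimensional toys, not real embeddings, and `cosine_similarity` is a hypothetical helper written for illustration rather than the library's own code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings", invented for illustration only
query_vec = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # points the same way as query_vec
orthogonal = [0.0, 0.0, 3.0]       # shares no direction with query_vec
opposite = [-1.0, -2.0, 0.0]       # points the opposite way

print(cosine_similarity(query_vec, same_direction))  # close to 1
print(cosine_similarity(query_vec, orthogonal))      # 0
print(cosine_similarity(query_vec, opposite))        # close to -1
```

The score depends only on the direction of the vectors, not on their length, which is why `same_direction` scores about 1 even though it is twice as long as `query_vec`.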


## Ideas to explore further

Try changing the queries and observe the results.
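
When trying several queries, it can help to see the best matches first. The sketch below sorts sentences by score and keeps the top few. `rank_sentences` is a hypothetical helper, and the sentences and scores are placeholders invented for illustration, not model output.

```python
def rank_sentences(sentences, scores, top_k=3):
    """Return the top_k (sentence, score) pairs, highest score first."""
    pairs = sorted(zip(sentences, scores), key=lambda pair: pair[1], reverse=True)
    return pairs[:top_k]

# Placeholder data: these scores are made up, not produced by the model
toy_sentences = ["sentence A", "sentence B", "sentence C", "sentence D"]
toy_scores = [0.12, 0.81, 0.47, 0.65]

for sentence, score in rank_sentences(toy_sentences, toy_scores, top_k=2):
    print(f"{sentence:<12} : {score:.2f}")
```

In the notebook, something like `rank_sentences(sentences, similarities[0].tolist())` should give the same ranking for the real corpus, assuming `similarities` is the tensor computed in the cell above.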

### End of Lab 3