## This notebook is for testing and getting familiar with embedding models

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
from transformers import AutoTokenizer, AutoModel
import torch

### Load a sentence transformer embedding model

In [None]:
embedder = SentenceTransformer("msmarco-distilbert-base-v4")
embedder

Notice that the embedder above consists of a transformer layer followed by a pooling layer.

### Try out the embedding model

In [None]:
wikipedia_text = "The European Union (EU) is a supranational political and economic union of 27 member states that are located primarily in Europe"
embeddings = embedder.encode(wikipedia_text)
embeddings

Two texts can be compared using various metrics. Cosine similarity (the inner product of the length-normalized vectors) is a popular one. Try it out.

In [None]:
wikipedia_text_2 = "The EU has often been described as a sui generis political entity (without precedent or comparison) combining the characteristics of both a federation and a confederation."
embeddings_2 = embedder.encode(wikipedia_text_2)

# Calculate the similarity between the two embeddings
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

cosine(embeddings, embeddings_2)
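
Cosine similarity equals the inner product of the two vectors once each has been scaled to unit length, while the raw inner product also depends on the vector lengths. The sketch below illustrates this with small made-up vectors (not real embeddings). `normalize` is a hypothetical helper introduced only for this example.

```python
import numpy as np

def cosine(u, v):
    # Same definition as in the cell above
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def normalize(v):
    # Hypothetical helper: scale a vector to unit (L2) length
    return v / np.linalg.norm(v)

# Toy vectors standing in for embeddings
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 0.0, 1.0])

cos_uv = cosine(u, v)
dot_unit = np.dot(normalize(u), normalize(v))  # inner product of unit vectors
dot_raw = np.dot(u, v)                         # raw inner product, depends on vector length

print(cos_uv, dot_unit, dot_raw)
```

If embeddings are normalized ahead of time, a plain dot product gives the same values as cosine similarity.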


In [None]:
text_about_python = "Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant indentation."
embeddings_python = embedder.encode(text_about_python)

cosine(embeddings, embeddings_python)

### TODO

Calculate the similarity of some other text samples. Do the similarities match your intuition?

## Calculate the similarity by constructing the pooling "by hand"

This example is the same as the one in the Hugging Face model card.
* The tokenizer and the pre-trained model are loaded
* The input text is tokenized
* The tokenized input is fed into the model
* The model output (the hidden states of all tokens) is pooled with the `mean_pooling` function to produce the embeddings.

This is how the SentenceTransformer above works behind the scenes.
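
To make the pooling step concrete, here is a minimal NumPy sketch of masked mean pooling on made-up numbers. It mirrors the idea of `mean_pooling` (average the token vectors while ignoring padding) but is not the model card's code. The array values and the name `masked_mean` are assumptions chosen for illustration.

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    # token_embeddings: (batch, tokens, dim)
    # attention_mask:   (batch, tokens), 1 = real token, 0 = padding
    mask = attention_mask[..., None].astype(float)   # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

# One "sentence" with three token slots; the last slot is padding
tokens = np.array([[[1.0, 2.0],
                    [3.0, 4.0],
                    [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean(tokens, mask)   # averages only the first two tokens
naive = tokens.mean(axis=1)          # also averages the padding vector
print(pooled)  # [[2. 3.]]
print(naive)
```

The naive mean is pulled toward the padding vector, which is why the attention mask is needed when sentences in a batch are padded to a common length.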


In [None]:

def create_embeddings(sentences, model_name):

    # Mean pooling: take the attention mask into account for correct averaging
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


    # Load model from HuggingFace Hub
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Tokenize sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform pooling. In this case, mean pooling.
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    return sentence_embeddings



### First calculate the similarity between the sentences using the msmarco-distilbert-base-v4 model

In [None]:
sentence_embeddings = create_embeddings([wikipedia_text, wikipedia_text_2, text_about_python], 'sentence-transformers/msmarco-distilbert-base-v4')

print("Similarity between wiki sentences:", cosine(sentence_embeddings[0], sentence_embeddings[1]))
print("Similarity between wiki and python sentences:", cosine(sentence_embeddings[0], sentence_embeddings[2]))


### Do the same thing with the distilbert base model

In [None]:
sentence_embeddings = create_embeddings([wikipedia_text, wikipedia_text_2, text_about_python], 'distilbert/distilbert-base-uncased')

print("Similarity between wiki sentences:", cosine(sentence_embeddings[0], sentence_embeddings[1]))
print("Similarity between wiki and python sentences:", cosine(sentence_embeddings[0], sentence_embeddings[2]))


### TODO

Could you use the embedding model created from `distilbert/distilbert-base-uncased`, which has not been fine-tuned for sentence similarity, for creating embeddings?

## TODO (you can also continue with these if you have time)

Advanced: Evaluate the model accuracy by using, for instance, the dataset `mteb/stsbenchmark-sts`.
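
STS-style benchmarks are commonly scored with the Spearman rank correlation between the model's cosine similarities and the human-annotated similarity scores. The sketch below shows only the metric, on made-up numbers. Loading the dataset and encoding the sentence pairs is left out, and `spearman` is a simplified hypothetical helper that assumes no tied values (for real data, `scipy.stats.spearmanr` handles ties).

```python
import numpy as np

def spearman(x, y):
    # Rank each array (0 = smallest), then take the Pearson correlation of the ranks.
    # Simplification: tied values are not given averaged ranks.
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]

# Made-up values: gold similarity scores and model cosine similarities for five sentence pairs
gold_scores = np.array([4.8, 0.5, 3.2, 2.0, 1.1])
model_cosines = np.array([0.91, 0.10, 0.55, 0.60, 0.25])

score = spearman(gold_scores, model_cosines)
print(score)
```

A value close to 1 means the model orders the sentence pairs in nearly the same way as the human annotators, even if the absolute similarity values differ.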

Advanced: Fine-tune your embedding model with the same dataset and see if the accuracy improves.