# Embedding space
Here we embed the articles so that we can find similar ones based purely on their semantic content.

In [1]:
import numpy as np
import torch
from transformers import BertTokenizer, BertModel
from sklearn import metrics


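Here we build the article dictionary from plain-text (.txt) files. Each article is stored as a list of text pieces: the lines of the file or, if the file consists of a single line, the pieces obtained by splitting that line at periods. If you only have PDFs, see the notebook on OCR, which explains both how to apply OCR to images and how to extract embedded OCR text from PDFs.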

In [12]:
with open("data/embedding_data/grs-002_1984_76__216_d.txt","r") as fin:
    input_text_1 = fin.readlines()
    if len(input_text_1) == 1:
        input_text_1 = input_text_1[0].split(".")

with open("data/embedding_data/grs-002_1984_76__218_d.txt","r") as fin:
    input_text_2 = fin.readlines()
    if len(input_text_2) == 1:
        input_text_2 = input_text_2[0].split(".")

with open("data/embedding_data/grs-002_1984_76__219_d.txt","r") as fin:
    input_text_4 = fin.readlines()
    if len(input_text_4) == 1:
        input_text_4 = input_text_4[0].split(".")

with open("data/embedding_data/grs-002_1984_76__277_d.txt","r") as fin:
    input_text_3 = fin.readlines()
    if len(input_text_3) == 1:
        input_text_3 = input_text_3[0].split(".")

# Create a dictionary entry for each article
article_dict = {"article1": {"text": input_text_1},
                "article2": {"text": input_text_2},
                "article3": {"text": input_text_3},
                "article4": {"text": input_text_4}}

## BERT
This part is more for your understanding than anything else. BERT is not ideal for embedding large amounts of text.


Here we load the tokenizer and the model, and put the model into evaluation mode.

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased')
model = BertModel.from_pretrained('bert-base-german-cased',
                                  output_hidden_states = True)
model.eval()

This part checks whether a GPU is available and, if so, moves the model to it. That makes the embedding much faster.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

We go through each article and embed it sentence by sentence. To be exact, BERT produces one vector per token, and we average these token vectors over each sentence. For four articles this takes roughly 1 minute on CPU.

In [None]:
for article_i in article_dict:
    bert_emb = []
    for sentence in article_dict[article_i]["text"]:
        # Tokenize one sentence at a time; inputs longer than the model limit are truncated
        bert_tokens = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True).to(device)

        # No gradients are needed for inference
        with torch.no_grad():
            outputs = model(bert_tokens["input_ids"], bert_tokens["attention_mask"])
            # One vector per token: shape (1, number of tokens, 768)
            state = outputs["last_hidden_state"]
            # Mean over the tokens, weighted by the attention mask: shape (1, 768)
            emb = (state*bert_tokens["attention_mask"][:,:,None]).sum(dim=1) / bert_tokens["attention_mask"][:,:,None].sum(dim=1)
            bert_emb.append(emb)
    article_dict[article_i]["bert_embedding"] = bert_emb
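
To make the pooling step concrete, here is a minimal sketch of the same masked average on toy numbers, written with NumPy instead of PyTorch. The helper name `masked_mean` and the array values are made up for illustration; only the arithmetic mirrors the cell above.

```python
import numpy as np

def masked_mean(state, attention_mask):
    """Average token vectors over the token axis, ignoring padded positions.

    state: (batch, tokens, hidden); attention_mask: (batch, tokens) of 0/1.
    """
    mask = attention_mask[:, :, None].astype(state.dtype)  # (batch, tokens, 1)
    return (state * mask).sum(axis=1) / mask.sum(axis=1)

# Toy input: one sentence, three token slots, hidden size 2.
# The third slot is padding and must not influence the average.
state = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])

pooled = masked_mean(state, attention_mask)
print(pooled)  # [[2. 3.]]
```

In the loop above each sentence is tokenized on its own, so nothing is padded and the mask is all ones. The mask only starts to matter once several sentences of different lengths are batched together.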

We average the sentence embeddings of each article into a single vector and stack these vectors into a matrix of embeddings.

In [40]:
V = []
for article_i in article_dict.keys():
    # Mean of all sentence embeddings of the article: shape (1, 768)
    mean = (sum(article_dict[article_i]["bert_embedding"]) / len(article_dict[article_i]["bert_embedding"]))
    V.append(mean)
# Stack into an array of shape (number of articles, 1, 768)
V = torch.stack(V).cpu().detach().numpy()

We compute the pairwise distance matrix between all articles (Euclidean distance, the default metric of `pairwise_distances`).

In [41]:
distance_matrix = metrics.pairwise_distances(V[:,0,:])

# The distance of each article to itself is set to infinity
np.fill_diagonal(distance_matrix, np.inf)

In [42]:
distance_matrix = distance_matrix.round(4)

In [43]:
distance_matrix

array([[   inf, 2.6734, 2.8058, 2.3782],
       [2.6734,    inf, 3.0038, 2.4659],
       [2.8058, 3.0038,    inf, 2.7493],
       [2.3782, 2.4659, 2.7493,    inf]], dtype=float32)
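
One way to read this matrix is to pick, for every article, the article with the smallest distance. The sketch below does that with the values printed above, which are copied by hand; the `names` list is only there for readable output.

```python
import numpy as np

# Distance matrix copied from the output above (diagonal already set to inf).
distance_matrix = np.array([
    [np.inf, 2.6734, 2.8058, 2.3782],
    [2.6734, np.inf, 3.0038, 2.4659],
    [2.8058, 3.0038, np.inf, 2.7493],
    [2.3782, 2.4659, 2.7493, np.inf],
], dtype=np.float32)

names = ["article1", "article2", "article3", "article4"]

# Because the diagonal is inf, argmin never returns the article itself.
nearest = distance_matrix.argmin(axis=1)
for name, j in zip(names, nearest):
    print(f"{name} -> {names[j]}")
```

This is also the reason for setting the diagonal to infinity: without it, every article would be its own nearest neighbour at distance 0.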

The distances are okay-ish, but they don't really show how similar article1 and article4 are: the pair merely looks slightly closer than article4 is to the other two articles. Can we do better? Yes. So far we've averaged the embeddings twice, first over each sentence and then over each article, and a lot of information gets lost that way.

## SentenceTransformer
The [SentenceTransformer](https://sbert.net/) is a fine-tuned version of BERT, for which the authors of the paper trained a so-called siamese network in order to have a model specifically trained for similarity. Recall that with plain BERT, similarity is more a by-product of representing text as vectors than something the model was trained for.

Siamese networks train the vector embeddings using a positive and a negative example: training jointly minimizes the distance from our vector to the positive example and maximizes its distance to the negative example. Let's see how they do!
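
The objective described above is commonly written as a triplet loss. The following is a minimal NumPy sketch of that idea, not the training code used by the SentenceTransformer authors; the function name `triplet_loss`, the margin of 1.0 and the toy vectors are assumptions for illustration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on Euclidean distances.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return float(max(d_pos - d_neg + margin, 0.0))

anchor = np.array([0.0, 0.0])
positive = np.array([0.0, 1.0])        # distance 1 from the anchor
far_negative = np.array([3.0, 4.0])    # distance 5 from the anchor
near_negative = np.array([0.0, 1.5])   # distance 1.5 from the anchor

print(triplet_loss(anchor, positive, far_negative))   # 0.0
print(triplet_loss(anchor, positive, near_negative))  # 0.5
```

A negative that is already far away contributes nothing, while a negative that sits almost as close as the positive is penalized, which pushes the two apart during training.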

In [5]:
from sentence_transformers import SentenceTransformer
#https://sbert.net/docs/sentence_transformer/pretrained_models.html#semantic-similarity-models


In [None]:
sentences = [
    " ".join(article_dict["article1"]["text"]),
    " ".join(article_dict["article2"]["text"]),
    " ".join(article_dict["article3"]["text"]),
    " ".join(article_dict["article4"]["text"])
]

In [13]:
multilingual_models = ["distiluse-base-multilingual-cased-v1", "distiluse-base-multilingual-cased-v2",
                       "paraphrase-multilingual-MiniLM-L12-v2", "paraphrase-multilingual-mpnet-base-v2"]
for model_title in multilingual_models:
    model = SentenceTransformer("sentence-transformers/"+model_title)

    # One embedding per article, then all pairwise similarities
    embeddings = model.encode(sentences)
    similarities = model.similarity(embeddings, embeddings)
    print(model_title)
    print(similarities)

distiluse-base-multilingual-cased-v1
tensor([[1.0000, 0.3065, 0.1706, 0.5132],
        [0.3065, 1.0000, 0.1807, 0.3198],
        [0.1706, 0.1807, 1.0000, 0.2846],
        [0.5132, 0.3198, 0.2846, 1.0000]])
distiluse-base-multilingual-cased-v2
tensor([[1.0000, 0.3283, 0.1343, 0.4413],
        [0.3283, 1.0000, 0.2912, 0.3344],
        [0.1343, 0.2912, 1.0000, 0.2602],
        [0.4413, 0.3344, 0.2602, 1.0000]])
paraphrase-multilingual-MiniLM-L12-v2
tensor([[1.0000, 0.6386, 0.3326, 0.6615],
        [0.6386, 1.0000, 0.3894, 0.5164],
        [0.3326, 0.3894, 1.0000, 0.4497],
        [0.6615, 0.5164, 0.4497, 1.0000]])
paraphrase-multilingual-mpnet-base-v2
tensor([[1.0000, 0.6006, 0.4287, 0.7956],
        [0.6006, 1.0000, 0.3881, 0.5237],
        [0.4287, 0.3881, 1.0000, 0.4295],
        [0.7956, 0.5237, 0.4295, 1.0000]])
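
To our knowledge, `model.similarity` computes cosine similarity unless the model is configured otherwise. Assuming that, the sketch below reproduces the computation in NumPy on toy vectors; the helper name `cosine_similarity_matrix` and the vectors are made up for illustration.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms   # scale every row to length 1
    return unit @ unit.T        # dot products of unit vectors

toy = np.array([
    [1.0, 0.0],   # reference direction
    [2.0, 0.0],   # same direction, different length
    [0.0, 3.0],   # orthogonal direction
])

sim = cosine_similarity_matrix(toy)
print(sim)
```

The first two rows point in the same direction and score 1 despite their different lengths, while the third row is orthogonal to them and scores 0. Unlike the distance matrix in the BERT section, higher values here mean more similar, and every article has similarity 1 with itself.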


We chose multilingual models since the text is in German. As you can see, the four models give four different results. Why is that? The explanation has to do with what each model was trained for and with the trade-off between speed and quality.

[On their website](https://sbert.net/docs/sentence_transformer/pretrained_models.html#original-models) the authors show the performance of each model on sentence embedding and semantic search tasks, as well as its speed. The model that seems to agree with human judgement the most, namely "paraphrase-multilingual-mpnet-base-v2", is the largest one and has the best average performance, yet it is still fairly fast, all things considered. Which model works best for your use case should be evaluated carefully, as you usually compute the vector embeddings only once.