# Vector Embeddings
You might remember from the topic modeling notebook that I kept mentioning how, these days, you would actually turn to vector embeddings for most of what topic modeling used to be used for. Here we are.

Vector embeddings make textual or image information machine-readable, but they have an additional use: they encode semantic meaning. As John Rupert Firth once said, "You shall know a word by the company it keeps", and word embeddings do exactly that. Moreover, since they are vectors, you can apply vector operations to them to see what the training data (the company our word keeps) says about the meaning of the word itself.
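
To make the idea of "vector operations" a bit more concrete, here is a minimal sketch with hand-made three-dimensional vectors. The numbers, the `toy_vectors` dictionary and the `cosine_similarity` helper are all made up for illustration; real embeddings are learned from data and have hundreds of dimensions.

```python
import numpy as np

# Hypothetical toy "embeddings"; the dimensions loosely stand for
# (royalty, male, female). Real embeddings are learned, not hand-written.
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vector arithmetic: king - man + woman
query = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]

# Rank all words by how similar they are to the query vector
ranking = sorted(
    toy_vectors,
    key=lambda word: cosine_similarity(query, toy_vectors[word]),
    reverse=True,
)
print(ranking[0])
```

With these toy numbers the closest word to the query is "queen". That is by construction here, but it shows the kind of question you can ask of an embedding space.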

## Embedding space
Here we embed E-Periodica articles in order to be able to find similar ones simply based on semantic content.

In [None]:
import numpy as np
import torch
from transformers import BertTokenizer, BertModel
from sklearn import metrics

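Here you create the articles dictionary based on .txt files. Each file is read line by line; if the whole text sits on a single line, it is split at periods instead to get rough sentences. If you only have PDFs, check out the notebook on OCR, which explains not only how to apply OCR to images but also how to extract embedded OCR text from PDFs.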

In [None]:
with open("data/embedding_data/grs-001_1921_13__298_d.txt","r") as fin:
    input_text_1 = fin.readlines()
    if len(input_text_1) == 1:
        input_text_1 = input_text_1[0].split(".")

with open("data/embedding_data/grs-001_1921_13__393_d.txt","r") as fin:
    input_text_2 = fin.readlines()
    if len(input_text_2) == 1:
        input_text_2 = input_text_2[0].split(".")

with open("data/embedding_data/grs-001_1922_14__563_d.txt","r") as fin:
    input_text_4 = fin.readlines()
    if len(input_text_4) == 1:
        input_text_4 = input_text_4[0].split(".")

with open("data/embedding_data/grs-001_1923_15__447_d.txt","r") as fin:
    input_text_3 = fin.readlines()
    if len(input_text_3) == 1:
        input_text_3 = input_text_3[0].split(".")

#create dictionary for each article
article_dict = {"article1": {"text": input_text_1},
                "article2": {"text": input_text_2},
                "article3": {"text": input_text_3},
                "article4": {"text": input_text_4}}

## BERT
This part is more for your understanding than anything else. BERT is not ideal for embedding large amounts of text.

Here we load the tokenizer and the model.

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased')
model = BertModel.from_pretrained('bert-base-german-cased',
                                  output_hidden_states = True)
model.eval()

This part checks whether a GPU is available and, if so, moves the model to it. That makes the embedding much faster.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

We go through each article and embed it sentence by sentence. To be exact, we embed each token and then average the token embeddings over the sentence. For four articles this takes roughly 1 minute on CPU.

In [None]:
for article_i in article_dict:
    bert_emb = []
    for sentence in article_dict[article_i]["text"]:
        bert_tokens = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True).to(device)

        with torch.no_grad():
            outputs = model(bert_tokens["input_ids"], bert_tokens["attention_mask"])
            state = outputs["last_hidden_state"]
            emb = (state*bert_tokens["attention_mask"][:,:,None]).sum(dim=1) / bert_tokens["attention_mask"][:,:,None].sum(dim=1)
            bert_emb.append(emb)
    article_dict[article_i]["bert_embedding"] = bert_emb
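
The averaging line in the cell above is a masked mean: padded positions are zeroed out and the sum is divided by the number of real tokens. Here is a small sketch of the same formula on a toy NumPy array instead of real BERT output. The array values are made up, and the last position plays the role of padding.

```python
import numpy as np

# Toy stand-in for BERT's last hidden state: 1 sentence, 4 token positions, 2 dimensions.
# The last position is padding and holds an arbitrary value that must not count.
state = np.array([[[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0],
                   [100.0, 100.0]]])       # shape (1, 4, 2)
attention_mask = np.array([[1, 1, 1, 0]])  # 1 = real token, 0 = padding

# Same formula as in the cell above: zero out padded positions, sum over tokens,
# then divide by the number of real tokens.
mask = attention_mask[:, :, None]          # shape (1, 4, 1), broadcasts over the dimensions
sentence_emb = (state * mask).sum(axis=1) / mask.sum(axis=1)

print(sentence_emb)        # mean of the three real tokens
print(state.mean(axis=1))  # naive mean, skewed by the padding
```

Since the loop above feeds one sentence at a time, there is no padding and the mask is all ones. The masking starts to matter as soon as you batch several sentences of different lengths.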

We average the sentence embeddings of each article and stack the resulting article vectors into one array.

In [None]:
V = []
for article_i in article_dict:
    # Mean over all sentence embeddings of the article, shape (1, 768)
    sentence_embs = article_dict[article_i]["bert_embedding"]
    V.append(sum(sentence_embs) / len(sentence_embs))
# Stack to shape (n_articles, 1, 768) and convert to a NumPy array
V = torch.stack(V).cpu().detach().numpy()

We compute the pairwise distance between all articles. `pairwise_distances` uses the Euclidean distance by default.

In [None]:
distance_matrix = metrics.pairwise_distances(V[:,0,:])

#the distance of the article_i to itself is set to infinite
for i in range(len(distance_matrix)):
    distance_matrix[i][i] = np.inf
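
Setting the diagonal to infinity pays off once you look up the nearest neighbor of each article. Below is a minimal sketch with made-up two-dimensional vectors (`V_toy` is hypothetical, not the real embeddings); it computes the Euclidean distances by hand so you can see what goes into the matrix.

```python
import numpy as np

# Hypothetical 2-dimensional article vectors, made up for illustration
V_toy = np.array([[0.0, 0.0],
                  [3.0, 4.0],
                  [0.5, 0.0],
                  [3.0, 5.0]])

# Pairwise Euclidean distances via broadcasting: entry [i, j] = ||V_toy[i] - V_toy[j]||
diff = V_toy[:, None, :] - V_toy[None, :, :]
dist = np.linalg.norm(diff, axis=-1)

# Without this step every article would be its own nearest neighbor (distance 0)
np.fill_diagonal(dist, np.inf)

# Index of the closest other article for each row
nearest = dist.argmin(axis=1)
for i, j in enumerate(nearest):
    print(f"article{i + 1} is closest to article{j + 1} (distance {dist[i, j]:.2f})")
```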

In [None]:
distance_matrix = distance_matrix.round(4)

In [None]:
distance_matrix

The distances are okay-ish, but they don't really show how similar article 1 and article 4 are. It just looks like they are slightly more similar to each other than article 4 is to the other two articles.

**Can we do better?** Yes. So far we've averaged the embeddings twice: once over each sentence and then again over each article! A lot of information gets lost there.
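
To see how much averaging can hide, consider this deliberately extreme sketch. The two "articles" and their sentence vectors are made up for illustration.

```python
import numpy as np

# Hypothetical sentence vectors for two articles (made-up numbers)
article_a = np.array([[ 1.0, 0.0],
                      [-1.0, 0.0]])
article_b = np.array([[0.0,  2.0],
                      [0.0, -2.0]])

mean_a = article_a.mean(axis=0)
mean_b = article_b.mean(axis=0)

# The sentences of A and B point in completely different directions,
# yet the article averages coincide
print(mean_a, mean_b)
print(np.linalg.norm(mean_a - mean_b))
```

Both averages end up at the same point, so the distance between the two articles is zero even though none of their sentences look alike. Real data is rarely this extreme, but the effect is the same in kind.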

## SentenceTransformer
The [SentenceTransformer](https://sbert.net/) builds on Sentence-BERT, a fine-tuned version of BERT for which the authors of the paper trained a so-called Siamese network in order to have a model specifically trained for similarity. Recall that with plain BERT, similarity is more of a by-product of representing text as vectors than something the model was trained for.

One way to train such a network is with a positive (similar) and a negative (dissimilar) example: training jointly minimizes the distance from our vector to the positive example and maximizes the distance from our vector to the negative example. Let's see how they do!
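
The positive/negative idea is often written as a triplet loss. The sketch below is one common formulation, not necessarily the exact objective used for any of the pretrained models; the `triplet_loss` helper, the `margin` value and the vectors are assumptions for illustration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther away than the positive."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

anchor = np.array([0.0, 0.0])
positive = np.array([0.0, 1.0])
near_negative = np.array([1.0, 0.0])   # as close to the anchor as the positive
far_negative = np.array([3.0, 4.0])    # much farther away than the positive

print(triplet_loss(anchor, positive, near_negative))  # still penalized
print(triplet_loss(anchor, positive, far_negative))   # already satisfied
```

During training, the loss is only non-zero for triplets where the negative example is not yet far enough away, so the model keeps pushing those apart.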

In [None]:
from sentence_transformers import SentenceTransformer
#https://sbert.net/docs/sentence_transformer/pretrained_models.html#semantic-similarity-models


In [None]:
sentences = [
    " ".join(article_dict["article1"]["text"]),
    " ".join(article_dict["article2"]["text"]),
    " ".join(article_dict["article3"]["text"]),
    " ".join(article_dict["article4"]["text"])
]

In [None]:
multilingual_models = ["distiluse-base-multilingual-cased-v1", "distiluse-base-multilingual-cased-v2",
                       "paraphrase-multilingual-MiniLM-L12-v2", "paraphrase-multilingual-mpnet-base-v2"]
for model_title in multilingual_models:
    model = SentenceTransformer("sentence-transformers/"+model_title)

    # Embed each article as one text, then compare every article with every other one
    embeddings = model.encode(sentences)
    similarities = model.similarity(embeddings, embeddings)
    print(model_title)
    print(similarities)
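
Note that we switched from distances (smaller means more similar) to similarities (larger means more similar). To my knowledge, `model.similarity` computes cosine similarity unless the model is configured otherwise. Here is a minimal NumPy sketch of a cosine similarity matrix, with made-up embeddings:

```python
import numpy as np

# Hypothetical embeddings for three texts (made-up numbers)
emb = np.array([[1.0, 0.0],
                [2.0, 0.0],
                [0.0, 3.0]])

# Normalize each row to unit length; the dot products are then cosine similarities
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
cos_sim = unit @ unit.T

print(cos_sim.round(2))
```

The first two vectors have different lengths but point in the same direction, so their cosine similarity is 1. The third one is orthogonal to both and gets 0.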

We chose multilingual models since the text is in German. As you can see, the four models give four different results. Why is that? The explanation has to do with what they were trained for, and with the trade-off between speed and quality.

[On their website](https://sbert.net/docs/sentence_transformer/pretrained_models.html#original-models) the authors show the performance of each model on sentence embedding and semantic search tasks, as well as its speed. The model that seems to agree with human judgment the most, namely "paraphrase-multilingual-mpnet-base-v2", is the largest one and has the best average performance, yet it is still fairly fast, all things considered. Which model works best for your use case should be evaluated carefully, as you usually compute the vector embeddings only once.