## Working with a small language, a case for fine-tuning

Natural language processing has moved quickly, especially since OpenAI introduced ChatGPT. A lot of resources are being spent on making LLMs better, and as they improve, related tools such as embedding models improve with them. However, I live in a small country (Denmark) with a small language spoken by around 6 million people. Even if we are generous and lump Danish, Norwegian and Swedish together, we only reach around 22 million speakers.

If you are a big company or organisation planning to train the next open-source model, you will probably choose a bigger language like French, German or Spanish. And even if you choose to train a multilingual model, there is a good chance that Danish will make up only a small fraction of the training data.

Previously I've written about embeddings and how they can be a tool for you as a data engineer (Teaching a machine to read, how LLMs comprehend text). However, if you cannot find a good embedding model for your data, they won't help you much, so we need to solve this problem.

As I see it, there are two ways to handle this. The first is to translate your documents into a bigger language like English. This approach has several drawbacks: translation can be resource intensive and, more importantly, we lose a lot of information in the process.

The second approach is to fine-tune a model on our own data. I personally believe this to be the much better approach. First, it can increase performance considerably. Second, we are not just making the model better at Danish text in general; we are making it better at *our* Danish text. Say we have a lot of contracts we want to embed: the language in them differs from the language in code documentation or clinical records. Given the same resources, a fine-tuned model will perform better than a generalised one, because more of its training has been spent on the type of data that matters to us.

In the next part I will go through the steps needed to get to a fine-tuned model.

### Data Preparation

One of the most important steps when fine-tuning a model is gathering data. But before we start gathering, we need to think about what we are building.

Let's say you are building an application where you send in a list of symptoms and retrieve the 10 most likely diagnoses. In that case we want a model that can match the embedding of the symptom list with the embeddings of the diagnoses. If, on the other hand, we wanted a model that retrieves the applicable laws for a user's question, we would need to train it to match questions to the applicable laws.

In this example we will create a model that takes a subject (Crime, Unemployment, Climate Change) and matches it with public speeches made by the Danish Prime Minister Mette Frederiksen.
I have a folder with 153 speeches by the Prime Minister. Because a single speech can cover many different subjects, I need to break each one into chunks. The naive way of doing this is to split the text into pieces with a fixed number of characters (let's say 1,000) with some overlap between chunks. This can work really well.
However, I would encourage you to look at your data to see if there are any natural breaks you can use for chunking. For instance, if we were working with laws, we might choose to put each article into its own chunk.
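
A minimal sketch of the naive character-based approach could look like this. The function name `chunk_by_characters` and the overlap of 100 characters are illustrative assumptions, not values used for the speeches:

```python
def chunk_by_characters(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves for each chunk
    return [text[start : start + chunk_size] for start in range(0, len(text), step)]


# Illustrative input: 2,500 characters of filler text
chunks = chunk_by_characters("a" * 2500)
print([len(chunk) for chunk in chunks])
```

Each chunk starts 900 characters after the previous one, so the last 100 characters of a chunk are repeated at the start of the next.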

When I looked through my data, I noticed that longer speeches use lines like `* * * *` or `- - - -` as separators between sections, so I could use those lines as delimiters. For speeches without them, I used the naive approach, but chunked by lines instead of characters.

The following code is what I used to chunk my data.

In [8]:
### First let's get our dependencies imported
import glob
import regex as re
from tqdm import tqdm
import ollama
import uuid
import json
from typing import List
from collections import defaultdict
import pandas as pd
import psycopg as pg

from datasets import load_dataset, DatasetDict, Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers
from sentence_transformers.evaluation import TripletEvaluator
from peft import LoraConfig, TaskType
from sklearn.model_selection import train_test_split

In [9]:
total = 0
chunk_delimiter = "<chunk>"
files = glob.glob("*", root_dir="taler")
speech_chunks = {}
with tqdm(total=len(files), desc="Processing Files") as fpbar:
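    for file_name in files:
        speech_name = file_name.replace(".txt", "")
        # The speeches are Danish text; the files are assumed to be UTF-8
        with open(f"taler/{file_name}", "r", encoding="utf-8") as f:
            lines = f.readlines()
            total += len(lines)
        context_splits = []

        # A non-blank line without any letters (e.g. "* * * *") is a section
        # separator. If the speech has one, we split on separators instead.
        char_split = False
        for line in lines:
            if re.search("[a-zA-Z]", line) is None and "\n" != line:
                char_split = True
                break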

        if char_split:
            context = []
            for line in lines:
                if re.search("[a-zA-Z]", line) is None and "\n" != line:
                    context_splits.append(" ".join(context))
                    context = []
                else:
                    context.append(line)
            if context != []:
                context_splits.append(" ".join(context))
        else:
            chunk_size = 10  # group size
            overlap = 2  # overlap size
            context_splits = [
                " ".join(lines[i : i + chunk_size])
                for i in range(0, len(lines), chunk_size - overlap)
            ]

        speech_chunks[speech_name] = context_splits
        fpbar.update(1)

Processing Files: 0it [00:00, ?it/s]


We now have a Python object which looks something like this:

```Python
{
    "speech-1": ["chunk 1", "chunk 2", "chunk 3"],
    "speech-2": ["chunk 1", "chunk 2", "chunk 3"],
    ...
    "speech-n": ["chunk 1", "chunk 2", "chunk 3"],
}
```

One might think that we can just feed this into a magic AI model box and get a really neat embedding model out of it. Unfortunately, that is not the case. As mentioned before, we want a model where a user can input a subject and retrieve the right chunk. This means we need not just the chunks but also the user input, in this case different subjects. Since we do not have any user input, we need to generate it.

This could be done manually by going through the chunks and carefully annotating each one. That would definitely give the best result, but it would take far too much time. Instead we will generate synthetic data with an LLM. Here we use phi4 from Microsoft, running in Ollama on my local machine. I chose this model because I believe it offers a good trade-off between compute and quality, but do test different models for yourself. The approach is to feed the model a chunk of text and ask it to generate the subjects.

The prompt I use is shown below. The parts after ## are not in the prompt; they are English translations for readers who do not speak Danish:

```
Kontekst er nedenfor. ## Context below

---------------------
{context_str}
---------------------

Givet den givne kontekst og ingen anden viden. ## Given the context and no other knowledge
Genere op til 5 emner som kan beskrive konteksten,. ## Generate up to 5 subjects which can describe the context
Hvis der ikke er emner som let kan beskrive konteksten, besvar med <|NAN|> ## If there are no subjects which easily describe the context reply with <|NAN|>

Du må kun svarer med emnerne formattet skal være: Emne 1|Emne 2|...|Emne n| ## You are only allowed to reply with subjects in the following format: Subject 1| Subject 2|...|Subject n|
```

Let's try it out with a toy example.


In [10]:
PROMPT_TEMPLATE = """\
Kontekst er nedenfor.

---------------------
{context_str}
---------------------

Givet den givne kontekst og ingen anden viden.
Genere op til 5 emner som kan beskrive konteksten,. 
Hvis der ikke er emner som let kan beskrive konteksten, besvar med <|NAN|>

Du må kun svarer med emnerne formattet skal være: Emne 1|Emne 2|...|Emne n|
"""

In [11]:
context = speech_chunks[
    "mette-frederiksens-aabningstale-ved-folketingets-aabningsdebat"
][1]
prompt = PROMPT_TEMPLATE.format(context_str=context)
res = ollama.chat("phi4", messages=[{"role": "user", "content": prompt}])
print(res.message.content)


In our case the output looks great. If the data doesn't look the way you expect it to, you can experiment with a different prompt or maybe even a different model.
For an overview of the models available through Ollama, please visit the [ollama site](https://ollama.com/search).
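
Since the prompt asks for a pipe-separated reply, the answer has to be split into individual subjects before it can be used. The sketch below shows one way to do that. `parse_subjects` is a hypothetical helper (the loop further down splits the reply inline instead), and the sample replies are made up for illustration:

```python
NO_SUBJECT_MARKER = "<|NAN|>"


def parse_subjects(reply: str) -> list[str]:
    """Turn 'Emne 1|Emne 2|...|' into a list of clean subject strings."""
    # Check the marker first: it contains pipes itself and would split badly
    if NO_SUBJECT_MARKER in reply:
        return []
    subjects = (part.strip() for part in reply.split("|"))
    # The trailing pipe in the format leaves an empty last element; drop it
    return [subject for subject in subjects if subject]


print(parse_subjects("Klima|Forsvar| Sundhed |"))
print(parse_subjects("<|NAN|>"))
```

Filtering out empty strings at this stage means fewer junk subjects have to be cleaned up afterwards.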

In [None]:
dataset = {
    "speech": {},
    "queries": {},
    "corpus": {},
    "relevant_docs": {},
    "related_speech": {},
}

## This part is just to be able to track how far we are
total = 0
for speech, chunks in speech_chunks.items():
    total += len(chunks)

with tqdm(total=total, desc="Generating Queries") as pbar:
    for speech, chunks in speech_chunks.items():
        speech_id = str(uuid.uuid4())
        dataset["speech"][speech_id] = speech
        for chunk in chunks:
            content_id = str(uuid.uuid4())
            dataset["corpus"][content_id] = chunk
            dataset["related_speech"][content_id] = speech_id
            prompt = PROMPT_TEMPLATE.format(context_str=chunk)
            res = ollama.chat("phi4", messages=[{"role": "user", "content": prompt}])
            reply = res.message.content
            if "<|NAN|>" in reply:
                pbar.update(1)
                continue
            for query in reply.split("|"):
                query_id = str(uuid.uuid4())
                dataset["queries"][query_id] = query
                dataset["relevant_docs"][query_id] = [content_id]
            pbar.update(1)

with open("../data/base_data.json", "w") as f:
    f.write(json.dumps(dataset))

Generating Queries:  94%|█████████▍| 1256/1333 [2:21:07<08:39,  6.74s/it] 


This will take a while. On my MacBook Pro with an M1 Max processor and 32 GB of RAM it takes approximately 2.5 hours. If you can get access to a machine with a bigger GPU, like an A100, you can speed this up significantly. You can also use the data I have generated, which is located in data/base_data.json.

The code uses a relevant-document setup: each query (subject) and each chunk gets a UUID, and a separate structure maps each query to its relevant chunks. I've also added a related-speech object for looking up which speech a given chunk was taken from. This will come in handy later.

The data is not yet in the best shape, though. The same subject may appear multiple times, which we need to clean up, and we need to think about how to organise the data for training our model. Let's start with the cleaning.

In [None]:
try:
    data = dataset
except Exception as e:
    with open("../data/base_data.json", "r") as f:
        data = json.loads(f.read())

# Strip redundant whitespace from the subjects and
# remove subjects which are not actual subjects
queries = defaultdict(list)
for key, value in data["queries"].items():
    subject = value.strip()
    # Skip empty strings (left over from the trailing "|") and separator lines
    if subject == "" or "--" in subject:
        continue
    queries[subject].append(key)

# We go through the chunks and remove those that do not
# contain any information. This part is very dependent on
# your data, so look through it and implement the rules
# that make sense for you

corpus = defaultdict(list)
pop_keys = []
for key, value in data["corpus"].items():
    value = " ".join(value.replace("Tale\n \n \n \n \n \n \n ", "").split())
    if value.isspace() or len(value) <= 60:
        pop_keys.append(key)
    data["corpus"][key] = value
    corpus[value].append(key)

for key in pop_keys:
    data["corpus"].pop(key)
    data["related_speech"].pop(key)


# We create a new query structure, which removes duplicate queries
# and remap the relevant documents to the new query ids
relevant_docs = defaultdict(list)
new_queries = {}
for query in queries.keys():
    new_id = str(uuid.uuid4())  # avoid shadowing the built-in id()
    new_queries[new_id] = query
    query_ids = queries[query]
    for query_id in query_ids:
        for doc_id in data["relevant_docs"][query_id]:
            if doc_id not in pop_keys:
                relevant_docs[new_id].append(doc_id)

clean_data = {}
clean_data["query"] = new_queries
clean_data["relevant_docs"] = relevant_docs
clean_data["related_speech"] = data["related_speech"]
clean_data["corpus"] = data["corpus"]

with open("../data/cleaned_base_data.json", "w") as out:
    json.dump(clean_data, out)

Once we have the clean data, we need to think about how we want to train our model. Training requires a loss function, which measures how wrong the model's output is: a high loss penalises the model for a wrong decision, and training adjusts the weights to lower it. For the type of fine-tuning we want to do, it is very common to organise the data into triplets. A triplet has an anchor (in our case the subject or query), a positive (a piece of context that is related to the anchor) and a negative (a piece of context that is unrelated).

An example could be the following:

**Anchor**: Defence Spending

**Positive**: The new aircraft carrier is over budget, but will bring much needed capabilities.

**Negative**: It's cold today, but the sun is out, making it nice if you wear the right layers.

However, the astute student will notice that our data only has positives, so we need to add negatives to our dataset. There are several ways of doing this. One is to do the same as before and let an LLM generate negatives. Another is to find a completely unrelated text corpus. If you are working with clinical records, for example, you could download an annual report from Goldman Sachs and use that text as your negatives. I chose the second approach: I downloaded a file with customer reviews and picked a random review as the negative for each pair.

You can see what I've done in the following piece of code:


In [None]:
with open("../data/cleaned_base_data.json", "r") as f:
    data = json.loads(f.read())

negatives: pd.Series = pd.read_csv("../data/negatives.csv")["review_text"]

total = 0
for id, relevant_docs in data["relevant_docs"].items():
    total += len(relevant_docs)

triplets = []
with tqdm(total=total, desc="creating positive negative pairs") as pbar:
    for query_id, doc_ids in data["relevant_docs"].items():
        anchor = data["query"][query_id]
        for id in doc_ids:
            triplets.append(
                {
                    "anchor": anchor,
                    "positive": data["corpus"][id],
                    "negative": negatives.sample().values[0],
                }
            )
            pbar.update(1)

creating positive negative pairs: 100%|██████████| 5998/5998 [00:02<00:00, 2086.87it/s]


Now that we have our triplet data, the last thing is to split it into train, test and validation sets.
We use the train data to fine-tune our model, the test data to estimate the model's performance during training, and the validation data to ensure that we did not just find a model that happens to fit our test data. (Many texts use the names "validation" and "test" the other way around; I stick to the naming above throughout.)

In [13]:
train_triplet, val_triplet = train_test_split(pd.DataFrame(triplets), test_size=0.2)
train_triplet, test_triplet = train_test_split(
    pd.DataFrame(train_triplet), test_size=0.2
)

train_triplet.to_json("../data/triplet_data_train.json")
test_triplet.to_json("../data/triplet_data_test.json")
val_triplet.to_json("../data/triplet_data_val.json")

# Wrap the splits in an actual DatasetDict so the type matches the name
dataset = DatasetDict(
    {
        "train": Dataset.from_pandas(train_triplet, preserve_index=False),
        "test": Dataset.from_pandas(test_triplet, preserve_index=False),
        "validation": Dataset.from_pandas(val_triplet, preserve_index=False),
    }
)
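
Applying a 20 % split twice does not give an 80/10/10 split. As a rough check of the resulting proportions, the sketch below redoes the arithmetic for the 5,998 triplets, assuming that scikit-learn rounds the held-out share up to a whole number of rows:

```python
import math

total_triplets = 5998  # number of triplets reported by the progress bar above

# First split: 20 % of everything is held out as validation
validation_size = math.ceil(0.2 * total_triplets)
remaining = total_triplets - validation_size

# Second split: 20 % of what is left becomes the test set
test_size = math.ceil(0.2 * remaining)
train_size = remaining - test_size

for name, size in [("train", train_size), ("test", test_size), ("validation", validation_size)]:
    print(f"{name}: {size} rows ({size / total_triplets:.0%})")
```

Under that assumption, roughly 64 % of the triplets are used for training, 16 % for testing and 20 % for validation.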

### Baseline and Evaluation Strategy

Our data is now ready, but before we start fine-tuning we should consider how we are going to decide whether the fine-tuned model is better than the baseline.

To do this we go through the following steps:
1. Download a baseline model
2. Find an evaluation method
3. Evaluate the baseline model
4. Fine-tune the model
5. Evaluate the fine-tuned model

The first thing we will do is download the baseline model from Hugging Face using the Sentence Transformers library. I have chosen the [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) model because it has good baseline performance and is small enough to train on my laptop.

In [20]:
model_name = "intfloat/multilingual-e5-small"
model = SentenceTransformer(model_name)

# Test the model
emb = model.encode("Hello World")
len(emb)

384

Now that we've downloaded our baseline model, let's spend some time thinking about how best to evaluate it.

A good place to start is here: [SentenceTransformer Evaluator Classes](https://sbert.net/docs/sentence_transformer/training_overview.html#evaluator).

Going through the list, several are interesting, but the one that fits our data best is the [Triplet Evaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.TripletEvaluator).

The Triplet Evaluator compares the similarity between the anchor and the positive with the similarity between the anchor and the negative. It returns the share of triplets where the anchor embedding is closer to the positive embedding than to the negative. A result of 0.8 means that in 80 % of cases the anchor was closer to the positive than to the negative.
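
To make the metric concrete, here is a small sketch of the same idea on hand-made 2-dimensional vectors. It is not the library's implementation; the helper names and the toy vectors are made up for illustration:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_accuracy(triplets: list[tuple[list[float], list[float], list[float]]]) -> float:
    """Share of (anchor, positive, negative) triplets where the positive is closer."""
    hits = sum(
        1
        for anchor, positive, negative in triplets
        if cosine_similarity(anchor, positive) > cosine_similarity(anchor, negative)
    )
    return hits / len(triplets)


toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),   # positive is closer
    ([0.0, 1.0], [0.1, 0.9], [1.0, 0.0]),   # positive is closer
    ([1.0, 1.0], [1.0, -1.0], [1.0, 0.9]),  # negative is closer: a miss
    ([1.0, 0.0], [1.0, 0.2], [-1.0, 0.0]),  # positive is closer
]
accuracy = triplet_accuracy(toy_triplets)
print(accuracy)
```

Three of the four toy triplets rank the positive above the negative, so the accuracy is 0.75.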

We can run the model on our data using the following code. 

In [None]:
dev_evaluator = TripletEvaluator(
    anchors=dataset["test"]["anchor"],
    positives=dataset["test"]["positive"],
    negatives=dataset["test"]["negative"],
    name="dev_evaluator",
)

In [None]:
dev_evaluator(model)

{'dev_evaluator_cosine_accuracy': 0.9104166626930237}

With our test data we get a result of 0.91 using our baseline model. This means that in 91 % of cases the baseline model places the anchor closer to the positive than to the negative.

This is a great evaluation to start with, but it is not really what we are interested in. We want to know whether inputting a subject gives us related chunks of text. So let's write a custom evaluation script. I will base it on the Recall@k metric, which simply tests whether the positive is among the top k results returned for the anchor.

Since a subject can have several relevant chunks, we will adapt it to test whether any of the relevant documents are in the top k results. We will also add a second metric showing what fraction of the relevant documents appear in the top k results.
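
The two variants can be sketched on plain lists of ids, without a database. The function names and the example ids are hypothetical; capping the denominator at k is a choice that lets a query with more than k relevant chunks still reach a score of 1.0:

```python
def hit_at_k(retrieved: list[str], relevant: list[str], k: int) -> float:
    """1.0 if at least one relevant id is among the top k results, else 0.0."""
    return 1.0 if set(retrieved[:k]) & set(relevant) else 0.0


def fraction_at_k(retrieved: list[str], relevant: list[str], k: int) -> float:
    """Share of relevant ids found in the top k, capped at k possible hits."""
    found = len(set(retrieved[:k]) & set(relevant))
    return found / min(len(relevant), k)


ranked_ids = ["c3", "c7", "c1", "c9"]  # results ordered by similarity
relevant_ids = ["c1", "c2"]            # chunks tagged with the subject

print(hit_at_k(ranked_ids, relevant_ids, k=2), fraction_at_k(ranked_ids, relevant_ids, k=2))
print(hit_at_k(ranked_ids, relevant_ids, k=4), fraction_at_k(ranked_ids, relevant_ids, k=4))
```

With k=2 neither relevant chunk is retrieved, so both scores are 0. With k=4 one of the two relevant chunks shows up, giving a hit of 1.0 and a fraction of 0.5.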

Before we write this test, we have to store the embeddings somewhere we can query them. Many databases can do this, but I've chosen Postgres with the pgvector extension. This is how to set it up in your Postgres database.

First enable the extension (it must already be installed on the database server) by running the following command:
```sql
CREATE EXTENSION vector;
```
Then we will create two tables to store our embeddings, one for the baseline embeddings and one for the fine-tuned embeddings:

```sql
CREATE TABLE embeddings_base (id bigserial PRIMARY KEY, speech VARCHAR, context_id VARCHAR, context VARCHAR, embedding vector(384));
CREATE TABLE embeddings (id bigserial PRIMARY KEY, speech VARCHAR, context_id VARCHAR, context VARCHAR, embedding vector(384));
```

You might wonder about the *vector(384)* datatype. It simply means we are storing a vector with 384 dimensions. If you are wondering what size your embeddings are, you can run:
```python
len(model.encode("Hello"))
```

Now that our tables are ready for embeddings, we can load them using the following code.

In [24]:
def write_to_db(speech: str, context_id: str, context: str, embedding: List, write_to_base: bool) -> None:
    conn = pg.connect("dbname=vector_rag user=postgres password=postgres")
    conn.autocommit = True
    cur = conn.cursor()
    if write_to_base:
        cur.execute(
            "INSERT INTO embeddings_base (speech, context_id, context, embedding) VALUES (%s, %s, %s, %s)",
            (speech, context_id, context, str(embedding)),
        )
    else:
        cur.execute(
            "INSERT INTO embeddings (speech, context_id, context, embedding) VALUES (%s, %s, %s, %s)",
            (speech, context_id, context, str(embedding)),
        )
    cur.close()
    conn.close()

In [None]:
with tqdm(total=len(data["corpus"].keys()), desc="Saving embeddings") as pbar:
    for id, context in data["corpus"].items():
        speech_name = data["related_speech"][id]
        embedding = model.encode(context).tolist()
        write_to_db(speech_name, id, context, embedding, True)
        pbar.update(1)

Then we can implement our Recall@k function and test our baseline model.

In [17]:
data["query_lk"] = {}
for key, value in data["query"].items():
    data["query_lk"][value] = key

def recall_k(query: str, model: SentenceTransformer, k: int, data: dict, check_base: bool, check_percentage: bool) -> float:
    query_id = data["query_lk"][query]
    expected_ids = data["relevant_docs"][query_id]
    embedded_query = model.encode(query).tolist()

    conn = pg.connect("dbname=vector_rag user=postgres password=postgres")
    conn.autocommit = True
    cur = conn.cursor()
    if check_base:
        cur.execute(
            "SELECT context_id FROM embeddings_base ORDER BY embedding <=> %s::vector LIMIT %s;",
            (str(embedded_query), str(k)),
        )
    else: 
        cur.execute(
            "SELECT context_id FROM embeddings ORDER BY embedding <=> %s::vector LIMIT %s;",
            (str(embedded_query), str(k)),
        )
        
    results = [row[0] for row in cur.fetchall()]

    cur.close()
    conn.close()

    if check_percentage:
        min_res = min(len(expected_ids), k)
        result = len(set(results) & set(expected_ids)) / min_res
    else:
        result = 1.0 if set(results) & set(expected_ids) else 0.0

    return result

recall_10 = (
    test_triplet.apply(
        lambda x: recall_k(x["anchor"], model, 10, data, True, False), axis=1
    ).sum()
    / test_triplet.shape[0]
)
print("Recall@10 Metric: ", recall_10)
recall_4 = (
    test_triplet.apply(
        lambda x: recall_k(x["anchor"], model, 4, data, True, False), axis=1
    ).sum()
    / test_triplet.shape[0]
)
print("Recall@4 Metric: ", recall_4)


recall_10 = (
    test_triplet.apply(
        lambda x: recall_k(x["anchor"], model, 10, data, True, True), axis=1
    ).sum()
    / test_triplet.shape[0]
)
print("Recall@10 Metric %: ", recall_10)
recall_4 = (
    test_triplet.apply(
        lambda x: recall_k(x["anchor"], model, 4, data, True, True), axis=1
    ).sum()
    / test_triplet.shape[0]
)
print("Recall@4 Metric %: ", recall_4)

Recall@10 Metric:  0.403125
Recall@4 Metric:  0.29270833333333335
Recall@10 Metric %:  0.3677571097883598
Recall@4 Metric %:  0.2636284722222222


With our baseline in place, we can now try to fine-tune our model and see if we can improve the performance.

### Fine-tuning with limited compute and memory
Although we've chosen a very small model, my laptop does not have enough memory to fully fine-tune it locally.
This leaves me two options. The first is to simply rent a bigger machine, but since I have always found it more interesting to work within limitations, I've decided not to do that.
The second is to find a more memory-efficient way to train, which is what I've been researching.

One solution is to fine-tune using [Low Rank Adaptation (LoRA)](https://arxiv.org/pdf/2106.09685).
With LoRA we freeze all the parameters in the baseline model and train small low-rank adapter matrices, whose product is
added to the original weights. This greatly reduces the number of parameters we train and, according to the original paper, can reduce the
memory requirements threefold.
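
To get a feeling for the savings, the sketch below counts the parameters LoRA adds. For a weight matrix of shape d_in x d_out, a rank-r adapter consists of two matrices with r * (d_in + d_out) parameters in total. The architecture figures (12 layers, hidden size 384, feed-forward size 1536, one pooler layer) are my assumptions about multilingual-e5-small and are not stated in this post; the rank of 8 and the targeted module names follow the LoRA configuration used in this post:

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank matrices A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed architecture of multilingual-e5-small (not stated in the post)
hidden_size, feed_forward_size, num_layers = 384, 1536, 12
rank = 8  # r in the LoRA configuration

per_layer = (
    # query, key, value and the attention output "dense" layer
    4 * lora_param_count(hidden_size, hidden_size, rank)
    # the two feed-forward "dense" layers
    + lora_param_count(hidden_size, feed_forward_size, rank)
    + lora_param_count(feed_forward_size, hidden_size, rank)
)
pooler = lora_param_count(hidden_size, hidden_size, rank)  # assumed pooler "dense" layer
total_lora_params = num_layers * per_layer + pooler

full_matrix = hidden_size * hidden_size
print(f"One 384x384 matrix: {full_matrix:,} weights vs {lora_param_count(384, 384, rank):,} LoRA weights")
print(f"Estimated trainable LoRA parameters: {total_lora_params:,}")
```

Under these assumptions the estimate comes out at 669,696 trainable parameters.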

Let's see the difference in trainable parameters before and after applying the LoRA adapter:

In [21]:
trainable_params = 0
all_params = 0

for name, param in model.named_parameters():
    all_params += param.numel()
    if param.requires_grad:
        trainable_params += param.numel()
        # print(f"{name}: shape={param.shape}, params={param.numel()}")

print("-" * 50)

print(f"Total model parameters:    {all_params:,}")
print(f"Total trainable parameters before LoRA: {trainable_params:,}")
print(f"Trainable Percentage trainable before LoRA:     {100 * trainable_params / all_params:.2f}%")

## Adding LoRA Adapter
peft_config = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    target_modules=["query", "key", "value", "dense"],
    lora_dropout=0.1,
)
model.add_adapter(peft_config)

# Recount from zero: all_params still holds the pre-LoRA total, and the
# adapter adds its own parameters on top of the frozen base model
all_params = 0
trainable_params_lora = 0
for name, param in model.named_parameters():
    all_params += param.numel()
    if param.requires_grad:
        trainable_params_lora += param.numel()
print(f"Total trainable parameters after LoRA: {trainable_params_lora:,}")
print(f"Trainable Percentage trainable after LoRA:     {100 * trainable_params_lora / all_params:.2f}%")

--------------------------------------------------
Total model parameters:    117,653,760
Total trainable parameters before LoRA: 117,653,760
Trainable Percentage trainable before LoRA:     100.00%
Total trainable parameters after LoRA: 669,696
Trainable Percentage trainable after LoRA:     0.57%


As you can see, we reduce the trainable parameters by about 99.4 % (from 117,653,760 to 669,696), which is quite significant and ensures that this can run on my Mac.
Now let's get to the training.

First we need to decide on a loss function. I looked at the [Sentence Transformer Loss Overview](https://sbert.net/docs/sentence_transformer/loss_overview.html)
and found that MultipleNegativesRankingLoss fits my data well.
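
Roughly speaking, this loss treats every other positive and negative in the batch as an additional negative for a given anchor, and applies cross-entropy over the scaled cosine similarities. The sketch below illustrates the idea in plain Python on toy vectors. It is a simplification, not the library's code; the scale of 20 is what I understand the library default to be, and the vectors are made up:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def multiple_negatives_ranking_loss(anchors, positives, negatives, scale=20.0):
    """Mean cross-entropy where anchor i should pick positive i among all candidates."""
    candidates = positives + negatives  # in-batch positives plus explicit negatives
    losses = []
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine_similarity(anchor, c) for c in candidates]
        # log-sum-exp with the maximum subtracted for numerical stability
        top = max(scores)
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_sum_exp - scores[i])  # -log softmax of the true positive
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.1], [0.1, 1.0]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]

matched = multiple_negatives_ranking_loss(anchors, positives, negatives)
swapped = multiple_negatives_ranking_loss(anchors, positives[::-1], negatives)
print(f"loss with correct pairs: {matched:.4f}, with swapped positives: {swapped:.4f}")
```

The loss is close to zero when each anchor is most similar to its own positive, and much larger when the positives are swapped. This is also why duplicate samples within a batch hurt: a duplicate of the true positive would be counted as a negative.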

Second, we need to set some hyperparameters in our training arguments. There is no single right answer when choosing hyperparameters;
from what I've read, we should simply experiment. Below is what I've chosen. Note that the low batch size of 8
is mostly due to my memory limitations.

In [22]:
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="models/multilingual-e5-small-finetune-danish-subject",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    bf16=True,  
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch according to the documentation
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    logging_steps=25,
    logging_first_step=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    loss=loss,
)
trainer.train()

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

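| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 2.0809 | 1.892573 |
| 200  | 1.095  | 0.913434 |
| 300  | 0.8633 | 0.822999 |
| 400  | 0.8212 | 0.773398 |
| 500  | 0.6988 | 0.73371  |
| 600  | 0.7186 | 0.717924 |
| 700  | 0.8244 | 0.701685 |
| 800  | 0.798  | 0.69188  |
| 900  | 0.6338 | 0.687126 |
| 1000 | 0.7319 | 0.685912 |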


TrainOutput(global_step=1440, training_loss=0.8778682687216335, metrics={'train_runtime': 2595.8262, 'train_samples_per_second': 4.436, 'train_steps_per_second': 0.555, 'total_flos': 0.0, 'train_loss': 0.8778682687216335, 'epoch': 3.0})

Now that we have a trained model, and before we test it, I would like to say a bit about what I look for in the training log. First, I check that the validation loss falls steadily over the course of training.
What can sometimes happen is that the training loss keeps falling while the validation loss stops decreasing. This is usually a sign of overfitting.
Second, I see that even though my training loss is generally decreasing, it has some jumps. This is probably due to the small batch size, which I cannot change
due to memory limitations.
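
The first check can also be done programmatically. The sketch below uses the validation losses from the log above; the helper `first_stall` and the `patience` of 2 evaluations are illustrative assumptions:

```python
from typing import Optional

# Validation losses from the training log, one value per 100 steps
validation_loss = [1.892573, 0.913434, 0.822999, 0.773398, 0.73371,
                   0.717924, 0.701685, 0.69188, 0.687126, 0.685912]


def first_stall(losses: list[float], patience: int = 2) -> Optional[int]:
    """Index at which the loss has failed to improve `patience` times in a row."""
    best = losses[0]
    evaluations_without_improvement = 0
    for index, loss in enumerate(losses[1:], start=1):
        if loss < best:
            best = loss
            evaluations_without_improvement = 0
        else:
            evaluations_without_improvement += 1
            if evaluations_without_improvement >= patience:
                return index
    return None  # the loss kept improving throughout


print(first_stall(validation_loss))
```

For this run the function returns `None`, since the validation loss improves at every logged evaluation, although the gains are getting small towards the end.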

Now we can run our tests again to see if performance has improved. To do this we have to re-embed our corpus with the fine-tuned model.

In [25]:
with tqdm(total=len(data["corpus"].keys()), desc="Saving embeddings") as pbar:
    for id, context in data["corpus"].items():
        speech_name = data["related_speech"][id]
        embedding = model.encode(context).tolist()
        write_to_db(speech_name, id, context, embedding, False)
        pbar.update(1)

Saving embeddings: 100%|██████████| 1173/1173 [06:09<00:00,  3.18it/s]


And finally we can test:

In [26]:
test_triplet_eval = dev_evaluator(model)
print("Cosine Precision for Test Data: ", test_triplet_eval["dev_evaluator_cosine_accuracy"])
validation_evaluator = TripletEvaluator(
    anchors=dataset["validation"]["anchor"],
    positives=dataset["validation"]["positive"],
    negatives=dataset["validation"]["negative"],
    name="validation_evaluator",
)
val_triplet_eval = validation_evaluator(model)

print("Cosine Precision for Validation Data: ", val_triplet_eval["validation_evaluator_cosine_accuracy"])

recall_10 = (
    test_triplet.apply(
        lambda x: recall_k(x["anchor"], model, 10, data, False, False), axis=1
    ).sum()
    / test_triplet.shape[0]
)
print("Recall@10 Metric: ", recall_10)
recall_4 = (
    test_triplet.apply(
        lambda x: recall_k(x["anchor"], model, 4, data, False, False), axis=1
    ).sum()
    / test_triplet.shape[0]
)
print("Recall@4 Metric: ", recall_4)


recall_10 = (
    test_triplet.apply(
        lambda x: recall_k(x["anchor"], model, 10, data, False, True), axis=1
    ).sum()
    / test_triplet.shape[0]
)
print("Recall@10 Metric %: ", recall_10)
recall_4 = (
    test_triplet.apply(
        lambda x: recall_k(x["anchor"], model, 4, data, False, True), axis=1
    ).sum()
    / test_triplet.shape[0]
)
print("Recall@4 Metric %: ", recall_4)

Cosine Precision for Test Data:  0.9937499761581421
Cosine Precision for Validation Data:  0.996666669845581
Recall@10 Metric:  0.5052083333333334
Recall@4 Metric:  0.378125
Recall@10 Metric %:  0.4798400297619047
Recall@4 Metric %:  0.3585069444444444


GREAT SUCCESS!!!!! It looks like we've improved our model on all metrics: triplet accuracy on the test data went from 0.91 to 0.99, and Recall@10 from 0.40 to 0.51. And we did not have to send our data anywhere :D
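
To put numbers on the improvement, the sketch below computes the relative change for each metric, using the values printed above for the test data:

```python
baseline = {
    "triplet accuracy": 0.9104166626930237,
    "recall@10": 0.403125,
    "recall@4": 0.29270833333333335,
    "recall@10 %": 0.3677571097883598,
    "recall@4 %": 0.2636284722222222,
}
fine_tuned = {
    "triplet accuracy": 0.9937499761581421,
    "recall@10": 0.5052083333333334,
    "recall@4": 0.378125,
    "recall@10 %": 0.4798400297619047,
    "recall@4 %": 0.3585069444444444,
}

# Relative change: how much better the fine-tuned score is than the baseline
relative_change = {
    name: (fine_tuned[name] - baseline[name]) / baseline[name] for name in baseline
}
for name, change in relative_change.items():
    print(f"{name}: {baseline[name]:.3f} -> {fine_tuned[name]:.3f} ({change:+.1%})")
```

Recall@10 improves by roughly 25 % relative to the baseline and Recall@4 by roughly 29 %.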