## Fine-tuning our embedding model

### Thanks to [philschmid's](https://www.philschmid.de/fine-tune-embedding-model-for-rag) blog.

You can read the blog entry for the full background; this notebook is a copy with some modifications for our specific dataset.

In [None]:
# Install the requirements from the file
!pip install -r requirements/requirements_training.txt

In [4]:
from datasets import load_dataset
import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import (
    InformationRetrievalEvaluator,
    SequentialEvaluator,
)
from sentence_transformers.util import cos_sim

### Load our dataset and prepare it for training

Let's download the dataset with the chunks of the documentation and the synthetic queries.

The dataset consists of triplets (`anchor`, `positive`, `negative`), so we select those columns, split the data into train/test sets, and save each split to a local JSON file.

In [31]:
# Load dataset from the hub
dataset = load_dataset("plaguss/argilla_sdk_docs_queries", split="train")
dataset = (
    dataset
    .select_columns(["anchor", "positive", "negative"])  # Select the relevant columns
    .add_column("id", list(range(len(dataset))))         # Add an id column to the dataset
    .train_test_split(test_size=0.1)                     # Split off a 10% test set
)
 
# save datasets to disk
dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")


62366

### Load back the dataset

In [34]:
from datasets import load_dataset, concatenate_datasets

test_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")
corpus_dataset = concatenate_datasets([train_dataset, test_dataset])

Generating train split: 98 examples [00:00, 10590.04 examples/s]
Generating train split: 882 examples [00:00, 134025.65 examples/s]


Define the name of the model we want to fine-tune and some variables that determine the type of fine-tuning. Following the blog,
we train a [Matryoshka embedding model](https://huggingface.co/blog/matryoshka).

In [33]:
model_id = "BAAI/bge-base-en-v1.5"  # Hugging Face model ID
matryoshka_dimensions = [768, 512, 256, 128, 64]  # Important: large to small

device = "mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"
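
To make the idea behind the dimension list concrete, here is a minimal pure-Python sketch of how a Matryoshka embedding is meant to be used: keep only the first `dim` components and rescale to unit length before comparing. The 8-dimensional vectors and the helper names are hypothetical stand-ins for the real 768-dimensional embeddings, not part of the notebook.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 8-dimensional embeddings (assumption: toy values for illustration)
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.0, -0.1, 0.02]
doc = [0.8, 0.2, 0.25, -0.1, -0.3, 0.1, 0.0, 0.05]

for dim in [8, 4, 2]:  # large to small, mirroring matryoshka_dimensions
    q = truncate_and_normalize(query, dim)
    d = truncate_and_normalize(doc, dim)
    print(f"dim={dim}: cosine={cosine(q, d):.4f}")
```

Smaller dimensions are cheaper to store and compare; the fine-tuning below aims to keep the truncated prefixes useful on their own.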

## Prepare the evaluator

Let's define the evaluator and use it to score the base model first, so we have a baseline to beat.

In [35]:
# Convert the datasets to dictionaries
corpus = dict(
    zip(corpus_dataset["id"], corpus_dataset["positive"])
)  # Our corpus (cid => document)
queries = dict(
    zip(test_dataset["id"], test_dataset["anchor"])
)  # Our queries (qid => question)
 
# Create a mapping of relevant document (1 in our case) for each query
relevant_docs = {}  # Query ID to relevant documents (qid => set([relevant_cids]))
for q_id in queries:
    relevant_docs[q_id] = {q_id}
 
matryoshka_evaluators = []
# Iterate over the different dimensions
for dim in matryoshka_dimensions:
    ir_evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name=f"dim_{dim}",
        truncate_dim=dim,  # Truncate the embeddings to a certain dimension
        score_functions={"cosine": cos_sim},
    )
    matryoshka_evaluators.append(ir_evaluator)
 
# Create a sequential evaluator
evaluator = SequentialEvaluator(matryoshka_evaluators)
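
Conceptually, each evaluator embeds the queries and the corpus, ranks the corpus by similarity for every query, and checks where the relevant document lands. The following is a rough pure-Python sketch of that ranking step, assuming hypothetical 2-dimensional embeddings keyed by id; it is not the library's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_corpus(query_vec, corpus_vecs):
    """Return corpus ids sorted from most to least similar to the query."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in corpus_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [cid for cid, _ in scored]


# Hypothetical toy embeddings (assumption: real ones come from the model)
corpus_vecs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.7, 0.7]}
query_vecs = {1: [0.1, 0.9]}
toy_relevant_docs = {1: {1}}  # same shape as relevant_docs: qid => set of cids

for qid, qvec in query_vecs.items():
    ranking = rank_corpus(qvec, corpus_vecs)
    hit_at_1 = ranking[0] in toy_relevant_docs[qid]
    print(f"query {qid}: ranking={ranking}, hit@1={hit_at_1}")
```

Because each query shares its id with its source chunk, the relevant document for query `q_id` is simply the corpus entry with the same id.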

### Evaluate the model

In [36]:
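# Load the base model so we have a baseline to compare against
model = SentenceTransformer(model_id, device=device)

# Evaluate the model
results = evaluator(model)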
 
# # COMMENT IN for full results
# print(results)
 
# Print the main score
for dim in matryoshka_dimensions:
    key = f"dim_{dim}_cosine_ndcg@10"
    print(f"{key}: {results[key]}")

dim_768_cosine_ndcg@10: 0.30804996520618816
dim_512_cosine_ndcg@10: 0.29105806175342075
dim_256_cosine_ndcg@10: 0.27984055715264694
dim_128_cosine_ndcg@10: 0.24651526191432124
dim_64_cosine_ndcg@10: 0.2384123532612535
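
To help read these numbers, here is a small sketch of NDCG@10 with binary relevance, written in plain Python as an illustration rather than as the evaluator's actual code. With a single relevant document per query, as in our setup, the per-query score depends only on the rank at which that document appears.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance (1 if the doc is relevant, else 0)."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical rankings of 12 corpus ids for one query whose relevant doc is 7
first = [7] + [i for i in range(12) if i != 7]
third = [0, 1, 7] + [i for i in range(2, 12) if i != 7]
missed = [i for i in range(12) if i != 7] + [7]

print(ndcg_at_k(first, {7}))   # relevant doc at rank 1
print(ndcg_at_k(third, {7}))   # relevant doc at rank 3
print(ndcg_at_k(missed, {7}))  # relevant doc outside the top 10
```

The reported metric is the average of this per-query value over all test queries.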


### Load the model

In [37]:
from sentence_transformers import SentenceTransformerModelCardData, SentenceTransformer
 
# Hugging Face model ID: https://huggingface.co/BAAI/bge-base-en-v1.5
model_id = "BAAI/bge-base-en-v1.5"
 
# load the model; SDPA attention is used by default when available
model = SentenceTransformer(
    model_id,
    #model_kwargs={"attn_implementation": "sdpa"},  # sdpa will be used by default if available
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="BGE base ArgillaSDK Matryoshka",
    ),
)

#### Define the loss

In [2]:
from sentence_transformers.losses import MatryoshkaLoss, TripletLoss
 
matryoshka_dimensions = [768, 512, 256, 128, 64]  # Important: large to small
inner_train_loss = TripletLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)
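
As a rough illustration of what this combination computes, the sketch below applies a triplet loss to each truncated prefix of the embeddings and sums the results. It assumes Euclidean distance with a margin of 5 and equal weights per dimension, and it leaves out batching and any normalization the library applies, so treat it as a simplification rather than the library's implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=5.0):
    """Penalize triplets where the positive is not closer than the negative by `margin`."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


def matryoshka_triplet_loss(anchor, positive, negative, dims, margin=5.0):
    """Sum the triplet loss over every truncated prefix length in `dims`."""
    return sum(
        triplet_loss(anchor[:d], positive[:d], negative[:d], margin) for d in dims
    )


# Hypothetical 4-dimensional embeddings (assumption: toy values for illustration)
anchor = [0.0, 0.0, 0.0, 0.0]
positive = [0.0, 0.0, 0.0, 0.0]
negative = [3.0, 4.0, 0.0, 0.0]

print(matryoshka_triplet_loss(anchor, positive, negative, dims=[4, 2, 1]))
```

In this toy case the full vector and the 2-dimensional prefix already separate the negative by the margin, but the 1-dimensional prefix does not, so only that term contributes to the loss. This is the pressure that pushes useful information into the leading dimensions.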

## Define the training strategy

The training strategy was slightly modified from the original reference to run on an `Apple M2 Pro` instead of the original machine,
hence the changed values.

In [44]:
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers
  
# define training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="bge-base-argilla-sdk-matryoshka", # output directory and hugging face model ID
    num_train_epochs=3,                         # number of epochs
    per_device_train_batch_size=8,             # train batch size
    gradient_accumulation_steps=4,             # for an effective batch size of 32 (8 * 4)
    per_device_eval_batch_size=4,              # evaluation batch size
    warmup_ratio=0.1,                           # warmup ratio
    learning_rate=2e-5,                         # learning rate, 2e-5 is a good value
    lr_scheduler_type="cosine",                 # use cosine learning rate scheduler
# NOTE: In colab we can work with the optimizer at least, but neither tf32 nor bf16
#    optim="adamw_torch_fused",                  # use fused adamw optimizer
#    tf32=True,                                  # use tf32 precision
#    bf16=True,                                  # use bf16 precision
    #batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    eval_strategy="epoch",                      # evaluate after each epoch
    save_strategy="epoch",                      # save after each epoch
    logging_steps=5,                            # log every 5 steps
    save_total_limit=1,                         # limit the number of checkpoints kept on disk
    load_best_model_at_end=True,                # load the best model when training ends
    metric_for_best_model="eval_dim_512_cosine_ndcg@10",  # Optimizing for the best ndcg@10 score for the 512 dimension
)
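
Two of these settings are easier to reason about with a little arithmetic. The sketch below computes the effective batch size and traces the usual warmup-then-cosine learning rate shape. The step counts are hypothetical and the helper functions are illustrative, not the trainer's own code.

```python
import math


def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def cosine_lr_with_warmup(step, total_steps, warmup_steps, base_lr):
    """Linear warmup to `base_lr`, then cosine decay towards zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch_size(8, 4))  # batch size 8 with 4 accumulation steps

total_steps = 100  # hypothetical number of optimizer steps
warmup_steps = 10  # 10% of the total, mirroring warmup_ratio=0.1
for step in (0, 5, 10, 55, 100):
    lr = cosine_lr_with_warmup(step, total_steps, warmup_steps, base_lr=2e-5)
    print(f"step {step:3d}: lr={lr:.2e}")
```

Gradient accumulation trades wall-clock time for memory: the model sees 8 examples at a time, but weights are updated as if the batch held 32.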

### Remove None from datasets

The dataset can contain some `None` values; remove them before starting the training.

In [40]:
from datasets import Dataset

train_dataset_cleaned = train_dataset.select_columns(
    ["anchor", "positive", "negative"]
).to_pandas().dropna()
test_dataset_cleaned = test_dataset.select_columns(
    ["anchor", "positive", "negative"]
).to_pandas().dropna()

train_dataset_cleaned = Dataset.from_pandas(train_dataset_cleaned, preserve_index=False)
test_dataset_cleaned = Dataset.from_pandas(test_dataset_cleaned, preserve_index=False)
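
The same cleaning rule can be written as a plain predicate over rows, which makes the intent explicit: keep a triplet only when all three fields are present. The helper name and sample rows below are hypothetical; a predicate like this should also be usable with `Dataset.filter` as an alternative to the pandas round trip, though that variant was not run here.

```python
TRIPLET_COLUMNS = ("anchor", "positive", "negative")


def is_complete(row, columns=TRIPLET_COLUMNS):
    """Return True when every triplet field is present and not None."""
    return all(row.get(column) is not None for column in columns)


# Hypothetical rows (assumption: toy data, not taken from the real dataset)
rows = [
    {"anchor": "q1", "positive": "p1", "negative": "n1"},
    {"anchor": "q2", "positive": None, "negative": "n2"},
    {"anchor": "q3", "positive": "p3"},
]

kept = [row for row in rows if is_complete(row)]
print(f"kept {len(kept)} of {len(rows)} rows")
```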

### Prepare the trainer

In [45]:
from sentence_transformers import SentenceTransformerTrainer
 
trainer = SentenceTransformerTrainer(
    model=model, # bge-base-en-v1.5
    args=args,  # training arguments
    train_dataset=train_dataset_cleaned,  # training dataset without None values
    loss=train_loss,
    evaluator=evaluator,
)

### Train the model, and save it publicly on your account in the Hugging Face Hub

In [None]:
# start training; checkpoints are saved to the output directory
trainer.train()
 
# save the best model
trainer.save_model()
 
# push model to hub
trainer.model.push_to_hub("bge-base-argilla-sdk-matryoshka")

#### Evaluate the final model on the same eval data to see the potential improvement

In [48]:
from sentence_transformers import SentenceTransformer

fine_tuned_model = SentenceTransformer(
    "plaguss/bge-base-argilla-sdk-matryoshka", device=device
)
# Evaluate the model
results = evaluator(fine_tuned_model)
 
# # COMMENT IN for full results
# print(results)
 
# Print the main score
for dim in matryoshka_dimensions:
    key = f"dim_{dim}_cosine_ndcg@10"
    print(f"{key}: {results[key]}")



dim_768_cosine_ndcg@10: 0.3086125494748455
dim_512_cosine_ndcg@10: 0.29420081448590024
dim_256_cosine_ndcg@10: 0.2931450934182018
dim_128_cosine_ndcg@10: 0.2629197762336244
dim_64_cosine_ndcg@10: 0.2610977190273289
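
To compare these scores with the baseline printed earlier, a short script can compute the relative change per dimension. The numbers are copied from the two outputs in this notebook; the helper function is illustrative.

```python
# NDCG@10 of the base model, copied from the first evaluation output
baseline = {
    768: 0.30804996520618816,
    512: 0.29105806175342075,
    256: 0.27984055715264694,
    128: 0.24651526191432124,
    64: 0.2384123532612535,
}

# NDCG@10 of the fine-tuned model, copied from the final evaluation output
fine_tuned = {
    768: 0.3086125494748455,
    512: 0.29420081448590024,
    256: 0.2931450934182018,
    128: 0.2629197762336244,
    64: 0.2610977190273289,
}


def relative_gain(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


for dim in baseline:
    gain = relative_gain(baseline[dim], fine_tuned[dim])
    print(f"dim_{dim}: {baseline[dim]:.4f} -> {fine_tuned[dim]:.4f} ({gain:+.1f}%)")
```

Every dimension improves, and the gain grows as the dimension shrinks: it is marginal at 768 and largest at 64.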
