# Fine-tune Embedding models for Retrieval Augmented Generation (RAG)
* Based on:
    * [Phil Schmid - Fine-tune Embedding models for Retrieval Augmented Generation (RAG)](https://www.philschmid.de/fine-tune-embedding-model-for-rag?fbclid=IwY2xjawHfd8xleHRuA2FlbQIxMAABHajMHJeyzUk_IKqC3lS1-eZ7cejyi96lN1pJEqk_UmMGQhnWSspl53eW6w_aem_IygaSnTlA8yKtG_yiCMBlQ)
    * [Phil Schmid's github](https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/fine-tune-embedding-model-for-rag.ipynb)

* We'll fine-tune an embedding model for a financial RAG application using a synthetic dataset from the 2023_10 NVIDIA SEC Filing.
* We'll also leverage Matryoshka Representation Learning to boost efficiency.

In this notebook, we are going to:
1. Create & Prepare embedding dataset
2. Create baseline and evaluate pretrained model
3. Define loss function with Matryoshka Representation
4. Fine-tune embedding model with SentenceTransformerTrainer
5. Evaluate fine-tuned model against baseline


### What are 🪆 Matryoshka Embeddings?

Matryoshka Representation Learning (MRL) is a technique designed to create embeddings that can be truncated to various dimensions without significant loss of performance. This approach frontloads important information into earlier dimensions of the embedding, allowing for efficient storage and processing while maintaining high accuracy in downstream tasks such as retrieval, classification, and clustering.
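
To make the idea of truncation concrete, here is a minimal sketch in plain Python. The 8-dimensional vectors are hypothetical toy values (not produced by any real model), and `truncate_and_normalize` and `cosine` are illustrative helper names. The sketch keeps only the first `dim` values of each embedding, rescales them to unit length, and checks that the ranking of the two documents stays the same at smaller sizes.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings: most of the "signal" sits in the first values.
query = [0.9, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01, 0.01]
doc_a = [0.8, 0.4, 0.1, 0.2, 0.01, 0.03, 0.02, 0.00]
doc_b = [0.1, 0.2, 0.9, 0.3, 0.04, 0.01, 0.05, 0.02]

for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    sim_a = cosine(q, truncate_and_normalize(doc_a, dim))
    sim_b = cosine(q, truncate_and_normalize(doc_b, dim))
    print(f"dim={dim}: sim(query, doc_a)={sim_a:.3f}, sim(query, doc_b)={sim_b:.3f}")
```

In this toy setup `doc_a` stays the closer document at every size. A model trained with MRL aims for the same property on real data.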

## Imports & settings

In [None]:
import getpass
import torch

from datasets import concatenate_datasets, load_dataset
from huggingface_hub import login
from sentence_transformers import (
    SentenceTransformerModelCardData,
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments)
from sentence_transformers.evaluation import (
    InformationRetrievalEvaluator,
    SequentialEvaluator)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers
from sentence_transformers.util import cos_sim

## Login to HF 

In [None]:
login(token=getpass.getpass(prompt='Insert your HF token.'), add_to_git_credential=True) 

## Load dataset
We are going to use philschmid/finanical-rag-embedding-dataset, which includes 7,000 positive text pairs of questions and corresponding context from the 2023_10 NVIDIA SEC Filing.

The dataset has the following format:
```json
{"question": "<question>", "context": "<relevant context to answer>"}
{"question": "<question>", "context": "<relevant context to answer>"}
{"question": "<question>", "context": "<relevant context to answer>"}
```

In [None]:
# Load dataset from the hub
dataset = load_dataset("philschmid/finanical-rag-embedding-dataset", split="train")
 
# rename columns (to match what sentence-transformers expects)
dataset = dataset.rename_column("question", "anchor")
dataset = dataset.rename_column("context", "positive")
 
# Add an id column to the dataset
dataset = dataset.add_column("id", range(len(dataset)))
 
# split dataset into a 10% test set
dataset = dataset.train_test_split(test_size=0.1)
 
# save datasets to disk
dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")

## Create baseline and evaluate pretrained model
We will use the [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) model as our starting point. BAAI/bge-base-en-v1.5 is one of the strongest open embedding models for its size: with only 109M parameters and a hidden dimension of 768, it achieves 63.55 on the MTEB Leaderboard.

In [None]:
model_id = "BAAI/bge-base-en-v1.5"  # Hugging Face model ID
matryoshka_dimensions = [768, 512, 256, 128, 64] # Important: large to small
 
# Load a model
model = SentenceTransformer(
    model_id, device="cuda" if torch.cuda.is_available() else "cpu"
)
 
# load the test and train datasets
test_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")
corpus_dataset = concatenate_datasets([train_dataset, test_dataset])
 
# Convert the datasets to dictionaries
corpus = dict(
    zip(corpus_dataset["id"], corpus_dataset["positive"])
)  # Our corpus (cid => document)
queries = dict(
    zip(test_dataset["id"], test_dataset["anchor"])
)  # Our queries (qid => question)
 
# Create a mapping of relevant document (1 in our case) for each query
relevant_docs = {}  # Query ID to relevant documents (qid => [relevant_cids])
for q_id in queries:
    relevant_docs[q_id] = [q_id]
 
 
matryoshka_evaluators = []
# Iterate over the different dimensions
for dim in matryoshka_dimensions:
    ir_evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name=f"dim_{dim}",
        truncate_dim=dim,  # Truncate the embeddings to a certain dimension
        score_functions={"cosine": cos_sim},
    )
    matryoshka_evaluators.append(ir_evaluator)
 
# Create a sequential evaluator
evaluator = SequentialEvaluator(matryoshka_evaluators)

In [None]:
# Evaluate the model
results = evaluator(model)
 
# Uncomment for full results
# print(results)
 
# Print the main score
for dim in matryoshka_dimensions:
    key = f"dim_{dim}_cosine_ndcg@10"
    print(f"{key}: {results[key]}")

Latest results:

|Metric|NDCG@10|
|---|---|
|dim_768_cosine_ndcg@10|0.744207805513057|
|dim_512_cosine_ndcg@10|0.7374662163561584|
|dim_256_cosine_ndcg@10|0.7299773584859578|
|dim_128_cosine_ndcg@10|0.6960945771475592|
|dim_64_cosine_ndcg@10|0.6351348491423877|
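
To read these numbers, it helps to see how NDCG@10 behaves for a single query. The sketch below is a simplified binary-relevance version written for illustration. The function name `ndcg_at_k` and the toy document ids are assumptions, and the evaluator's internal implementation may differ in details. Since each query here has exactly one relevant document, the score depends only on the rank at which that document is retrieved.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k for one query."""
    # Discounted gain: a hit at rank r contributes 1 / log2(r + 1).
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # Ideal ordering: all relevant documents ranked first.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


print(ndcg_at_k([7, 3, 9], {7}))  # relevant document at rank 1
print(ndcg_at_k([3, 7, 9], {7}))  # relevant document at rank 2
print(ndcg_at_k([3, 9, 1], {7}))  # relevant document not retrieved
```

The reported metric is the average of this per-query score over all test queries.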

Now, let's see if we can improve this score by fine-tuning the model on our specific dataset.

## Define loss function with Matryoshka Representation
* For positive text pairs, we can use `MultipleNegativesRankingLoss` in combination with `MatryoshkaLoss`.
* `MultipleNegativesRankingLoss` works well when you only have positive pairs: it treats the other positives in the batch as negatives, so each sample in a batch of size n is contrasted against n-1 in-batch negatives.
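
The in-batch negatives idea can be sketched in a few lines of plain Python. This is an illustration rather than the library's implementation: the function name `in_batch_negatives_loss`, the toy similarity matrices, and the `scale` value of 20 are assumptions made for the example. Row `i` holds the similarities between anchor `i` and every positive in the batch, and the loss is a cross-entropy that rewards the diagonal entry (the true pair) for being the largest in its row.

```python
import math


def in_batch_negatives_loss(similarities, scale=20.0):
    """Mean cross-entropy where the correct column for row i is i.

    similarities[i][j] is the cosine similarity of anchor i and positive j.
    """
    total = 0.0
    for i, row in enumerate(similarities):
        logits = [scale * s for s in row]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(similarities)


# Toy batch of 3 pairs: each anchor is closest to its own positive.
well_separated = [
    [0.9, 0.1, 0.2],
    [0.1, 0.8, 0.3],
    [0.2, 0.1, 0.7],
]
# Same batch, but each anchor is closer to someone else's positive.
confused = [
    [0.3, 0.8, 0.2],
    [0.7, 0.2, 0.3],
    [0.2, 0.9, 0.1],
]

print("well separated:", in_batch_negatives_loss(well_separated))
print("confused:", in_batch_negatives_loss(confused))
```

The well-separated batch yields a much lower loss, which is the signal that pushes true pairs together and other pairs apart.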

In [None]:
# Hugging Face model ID: https://huggingface.co/BAAI/bge-base-en-v1.5
model_id = "BAAI/bge-base-en-v1.5"
 
# load model with PyTorch SDPA attention (which can dispatch to Flash Attention kernels)
model = SentenceTransformer(
    model_id,
    model_kwargs={"attn_implementation": "sdpa"},
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="BGE base Financial Matryoshka",
    ),
)

In [None]:
matryoshka_dimensions = [768, 512, 256, 128, 64]  # Important: large to small
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)
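
Conceptually, the Matryoshka wrapper applies the inner loss once per dimension to truncated embeddings and adds the results. The sketch below illustrates that idea in plain Python under simplifying assumptions: equal weights per dimension, toy 4-dimensional embeddings, and hypothetical helper names (`normalize`, `ranking_loss`, `matryoshka_loss`). It is not the library's code.

```python
import math


def normalize(vec):
    """Rescale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def ranking_loss(anchors, positives, scale=20.0):
    """In-batch-negatives cross-entropy on unit-length embeddings."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * sum(x * y for x, y in zip(anchor, p)) for p in positives]
        top = max(logits)
        total += top + math.log(sum(math.exp(l - top) for l in logits)) - logits[i]
    return total / len(anchors)


def matryoshka_loss(anchors, positives, dims, weights=None):
    """Weighted sum of the inner loss over truncated embeddings."""
    weights = weights or [1.0] * len(dims)
    total = 0.0
    for dim, weight in zip(dims, weights):
        truncated_anchors = [normalize(v[:dim]) for v in anchors]
        truncated_positives = [normalize(v[:dim]) for v in positives]
        total += weight * ranking_loss(truncated_anchors, truncated_positives)
    return total


# Hypothetical toy batch of two (anchor, positive) pairs.
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
positives = [[0.9, 0.1, 0.3, 0.0], [0.2, 0.8, 0.0, 0.4]]

print(matryoshka_loss(anchors, positives, dims=[4, 2]))
```

Because every truncated prefix contributes to the total, the model is pushed to keep the most useful information in the leading dimensions.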

## Fine-tune embedding model

In [None]:
# load train dataset again
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")
 
# define training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="bge-base-financial-matryoshka", # output directory and hugging face model ID
    num_train_epochs=4,                         # number of epochs
    per_device_train_batch_size=8,              # train batch size
    gradient_accumulation_steps=8,              # effective batch size of 8 * 8 = 64 per device
    per_device_eval_batch_size=4,               # evaluation batch size
    warmup_ratio=0.1,                           # warmup ratio
    learning_rate=2e-5,                         # learning rate, 2e-5 is a good value
    lr_scheduler_type="cosine",                 # use cosine learning rate scheduler
    optim="adamw_torch_fused",                  # use fused adamw optimizer
    tf32=False,                                 # do not use tf32 precision
    bf16=True,                                  # use bf16 precision
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    eval_strategy="epoch",                      # evaluate after each epoch
    save_strategy="epoch",                      # save after each epoch
    logging_steps=10,                           # log every 10 steps
    save_total_limit=3,                         # save only the last 3 models
    load_best_model_at_end=True,                # load the best model when training ends
    metric_for_best_model="eval_dim_128_cosine_ndcg@10",  # Optimizing for the best ndcg@10 score for the 128 dimension
)
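
As a quick sanity check on these settings, the arithmetic below estimates the effective batch size and the rough number of optimizer steps. The single-GPU setup and the training set size of 6,300 (90% of roughly 7,000 pairs) are assumptions, and the step counts are approximations: the trainer's exact rounding may differ.

```python
import math

per_device_train_batch_size = 8
gradient_accumulation_steps = 8
num_devices = 1  # assumption: single GPU

# Samples that contribute to one optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

train_examples = 6300  # assumption: 90% of roughly 7,000 pairs
num_train_epochs = 4
warmup_ratio = 0.1

# Rough estimates only; the trainer may round differently.
approx_steps_per_epoch = math.ceil(train_examples / effective_batch_size)
approx_total_steps = approx_steps_per_epoch * num_train_epochs
approx_warmup_steps = math.ceil(warmup_ratio * approx_total_steps)

print("effective batch size:", effective_batch_size)
print("approx. optimizer steps per epoch:", approx_steps_per_epoch)
print("approx. total optimizer steps:", approx_total_steps)
print("approx. warmup steps:", approx_warmup_steps)
```

Note that gradient accumulation does not enlarge the pool of in-batch negatives: each forward pass still sees only `per_device_train_batch_size` pairs, so with a batch of 8 every anchor is contrasted against 7 negatives.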

In [None]:
trainer = SentenceTransformerTrainer(
    model=model, # bge-base-en-v1.5
    args=args,  # training arguments
    train_dataset=train_dataset.select_columns(
        ["positive", "anchor"]
    ),  # training dataset
    loss=train_loss,
    evaluator=evaluator,
)

In [None]:
# start training; checkpoints are saved to the output directory
trainer.train()
 
# save the best model
trainer.save_model()
 
# push model to hub
trainer.model.push_to_hub("bge-base-financial-matryoshka")

## Evaluate fine-tuned model against baseline

In [None]:
fine_tuned_model = SentenceTransformer(
    args.output_dir, device="cuda" if torch.cuda.is_available() else "cpu"
)
# Evaluate the model
results = evaluator(fine_tuned_model)
 
# Uncomment for full results
# print(results)
 
# Print the main score
for dim in matryoshka_dimensions:
    key = f"dim_{dim}_cosine_ndcg@10"
    print(f"{key}: {results[key]}")

Latest results:

|Metric|Original|Fine-tuned|
|---|---|---|
|dim_768_cosine_ndcg@10|0.744207805513057|0.7918768662814302|
|dim_512_cosine_ndcg@10|0.7374662163561584|0.7941777731436057|
|dim_256_cosine_ndcg@10|0.7299773584859578|0.7887004228679669|
|dim_128_cosine_ndcg@10|0.6960945771475592|0.7753739833293516|
|dim_64_cosine_ndcg@10|0.6351348491423877|0.7482035790215306|

Even when truncated to 64 dimensions, the fine-tuned model (NDCG@10 of 0.748) slightly outperforms the original model at its full 768 dimensions (0.744).
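
The relative improvements can be computed directly from the scores in the table. The snippet below only does arithmetic on those reported numbers. The variable names are illustrative.

```python
# NDCG@10 scores copied from the results table above.
baseline = {
    768: 0.744207805513057,
    512: 0.7374662163561584,
    256: 0.7299773584859578,
    128: 0.6960945771475592,
    64: 0.6351348491423877,
}
fine_tuned = {
    768: 0.7918768662814302,
    512: 0.7941777731436057,
    256: 0.7887004228679669,
    128: 0.7753739833293516,
    64: 0.7482035790215306,
}

# Relative gain of the fine-tuned model over the baseline, per dimension.
relative_gain_pct = {
    dim: (fine_tuned[dim] - baseline[dim]) / baseline[dim] * 100
    for dim in baseline
}
for dim, gain in relative_gain_pct.items():
    print(f"dim {dim}: +{gain:.1f}%")

# Storage saving when going from 768 to 64 dimensions per vector.
print("size reduction factor:", 768 // 64)
```

The gain grows as the dimension shrinks, which matches the intent of training with the Matryoshka loss: smaller prefixes of the embedding benefit the most.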

# Some other examples

Here is an example that encodes sentences and then computes the cosine similarity between them for semantic search.

In [None]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

In [None]:
query_embedding = model.encode('How big is London')
passage_embedding = model.encode(['London has 9,787,426 inhabitants at the 2011 census',
                                  'London is known for its financial district'])

print("Similarity:", util.cos_sim(query_embedding, passage_embedding))
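
To turn similarity scores into search results, the passages are simply sorted by score. Here is a minimal sketch in plain Python. The 3-dimensional vectors are hypothetical stand-ins for real model embeddings, and `rank_passages` is an illustrative helper, not part of the library.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank_passages(query_vec, passage_vecs, passages, top_k=1):
    """Return the top_k (score, passage) pairs, best match first."""
    scored = [(cos_sim(query_vec, vec), text) for vec, text in zip(passage_vecs, passages)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


passages = [
    "London has 9,787,426 inhabitants at the 2011 census",
    "London is known for its financial district",
]
# Hypothetical toy embeddings (a real model would produce these).
query_vec = [0.9, 0.1, 0.3]
passage_vecs = [[0.8, 0.2, 0.4], [0.1, 0.9, 0.2]]

for score, text in rank_passages(query_vec, passage_vecs, passages, top_k=2):
    print(f"{score:.3f}  {text}")
```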