# Fine-tune Embedding models using Sentence Transformers 3 for better RAG performance

## Introduction
In this blog post, we will walk through the process of fine-tuning embedding models using Sentence Transformers 3 to enhance Retrieval-Augmented Generation (RAG) performance.

## Install the Necessary Libraries
Install the following libraries:
- PyTorch
- Sentence Transformers (HF)
- Transformers (HF)
- Datasets (HF)

We are currently using Python 3.11.5.

In [1]:
!pip install --upgrade \
    "torch==2.1.2" \
    "tensorboard==2.17.0" \
    "sentence-transformers==3.0.1" \
    "datasets==2.19.1"  \
    "transformers==4.41.2" \
    "accelerate==0.31.0"

Collecting torch==2.1.2
  Downloading torch-2.1.2-cp311-cp311-manylinux1_x86_64.whl.metadata (25 kB)
Collecting tensorboard
  Downloading tensorboard-2.17.0-py3-none-any.whl.metadata (1.6 kB)
Collecting filelock (from torch==2.1.2)
  Downloading filelock-3.15.4-py3-none-any.whl.metadata (2.9 kB)
Collecting sympy (from torch==2.1.2)
  Downloading sympy-1.12.1-py3-none-any.whl.metadata (12 kB)
Collecting networkx (from torch==2.1.2)
  Downloading networkx-3.3-py3-none-any.whl.metadata (5.1 kB)
Collecting fsspec (from torch==2.1.2)
  Downloading fsspec-2024.6.1-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.1.2)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.1.2)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.1.2)
  Downloading n

After installing the necessary libraries, you should register on [Hugging Face](https://huggingface.co/join) as we are going to use Hugging Face Hub to push our models and training logs.

Get your access token [here](https://huggingface.co/settings/tokens).

In [2]:
# Log into your HF account and store your token (access key) on the disk
from huggingface_hub import login

# login(token="ADD YOUR TOKEN HERE", add_to_git_credential=True)
login(token="ADD YOUR TOKEN HERE", add_to_git_credential=False)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## Dataset preparation

The Hugging Face Hub hosts many datasets that can be used to fine-tune embedding models. The [dataset overview](https://sbert.net/docs/sentence_transformer/dataset_overview.html) describes the structure your dataset should follow so that it can be used for fine-tuning embeddings.

We are going to use [enelpol/rag-mini-bioasq](https://huggingface.co/datasets/enelpol/rag-mini-bioasq), which includes 4,719 question-answer passages from the BioASQ challenges on biomedical semantic indexing and question answering (QA) ([dataset for task b BioASQ11](http://participants-area.bioasq.org/datasets/)). Its question-answer pairs can be used in the *Positive Pair* configuration.

We load the dataset with the HF datasets library.


In [3]:
from datasets import load_dataset
 
# Load dataset from HF hub
train_dataset = load_dataset("enelpol/rag-mini-bioasq", name="question-answer-passages", split="train")
test_dataset = load_dataset("enelpol/rag-mini-bioasq", name="question-answer-passages", split="test")

print(train_dataset[0])
print(test_dataset[0])


Downloading readme: 100%|██████████| 1.76k/1.76k [00:00<00:00, 3.91MB/s]
Downloading data: 100%|██████████| 1.12M/1.12M [00:00<00:00, 1.27MB/s]
Downloading data: 100%|██████████| 187k/187k [00:00<00:00, 874kB/s]
Generating train split: 100%|██████████| 4012/4012 [00:00<00:00, 126521.01 examples/s]
Generating test split: 100%|██████████| 707/707 [00:00<00:00, 145418.44 examples/s]


{'question': 'What is the implication of histone lysine methylation in medulloblastoma?', 'answer': 'Aberrant patterns of H3K4, H3K9, and H3K27 histone lysine methylation were shown to result in histone code alterations, which induce changes in gene expression, and affect the proliferation rate of cells in medulloblastoma.', 'id': 1682, 'relevant_passage_ids': [23179372, 19270706, 23184418]}
{'question': 'Is capmatinib effective for glioblastoma?', 'answer': 'No. Combination of capmatinib buparlisib resulted in no clear activity in patients with recurrent PTEN-deficient glioblastoma.', 'id': 4213, 'relevant_passage_ids': [31776899]}


The dataset has the following format:

```
{"question": "<question>", "answer": "<answer with some information>", "id": "<id>", "relevant_passage_ids": "<[list of ids of relevant passages]>"},
{"question": "<question>", "answer": "<answer with some information>", "id": "<id>", "relevant_passage_ids": "<[list of ids of relevant passages]>"},
{"question": "<question>", "answer": "<answer with some information>", "id": "<id>", "relevant_passage_ids": "<[list of ids of relevant passages]>"}, ...
```

Because this format differs from the one Sentence Transformers expects, we rename the columns to match: `question` becomes `anchor` and `answer` becomes `positive`.

Once the columns are renamed, we save the train and test datasets to disk.

In [4]:
# Rename the columns
train_dataset = train_dataset.rename_column("question", "anchor")
train_dataset = train_dataset.rename_column("answer", "positive")
test_dataset = test_dataset.rename_column("question", "anchor")
test_dataset = test_dataset.rename_column("answer", "positive")

# Add "id" column if not present
if "id" not in train_dataset.column_names:
    train_dataset = train_dataset.add_column("id", range(len(train_dataset)))
if "id" not in test_dataset.column_names:    
    test_dataset = test_dataset.add_column("id", range(len(test_dataset)))


# save datasets to disk
train_dataset.to_json("train_dataset.json", orient="records")
test_dataset.to_json("test_dataset.json", orient="records")


Creating json from Arrow format: 100%|██████████| 5/5 [00:00<00:00, 88.32ba/s]
Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 132.94ba/s]


306824

## Baseline and evaluation

Following dataset preparation, our next step is to establish a baseline method and evaluation protocol. This crucial step allows us to gauge the effectiveness of future model refinements against a known starting point. We'll assess how well a pre-existing model handles our specific data, and how it performs after fine-tuning it.

We've selected [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) as the base model that will be fine-tuned later. This model isn't particularly performant compared to other models of similar size, but let's see how far we can go with fine-tuning. With 110 million parameters and a 768-dimensional embedding space, it obtains a score of 57.78 on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard), which is lower than the 60.99 obtained by OpenAI's text-embedding-ada-002. We are also going to compare this model with [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5), which is of similar size (109 million parameters) and has the same 768-dimensional embedding space. The bge-base-en-v1.5 model achieves a score of 63.55 on the MTEB Leaderboard.

Because we want to improve the Information Retrieval (IR) capabilities of the embeddings, we quantify performance with the [InformationRetrievalEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#informationretrievalevaluator). This tool assesses how well our model can fetch the most relevant documents for given queries. It calculates various performance metrics, including Mean Reciprocal Rank (MRR), Recall@K, and Normalized Discounted Cumulative Gain (NDCG). A useful explanation of these IR metrics can be found [here](https://www.pinecone.io/learn/offline-evaluation/).
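To make these metrics concrete, here is a minimal, dependency-free sketch of how the reciprocal rank, Recall@K, and NDCG@K could be computed for a single query with binary relevance. The function names and the toy document ids are illustrative assumptions, not part of the Sentence Transformers API; the evaluator computes its metrics internally and averages them over all queries.

```python
import math

def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1/rank of the first relevant document within the top k, or 0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear within the top k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: discounted gain of the ranking divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg

# Hypothetical ranking for one query: the single relevant document is ranked second.
ranked_ids = ["d7", "d2", "d9", "d4"]
relevant_ids = {"d2"}

rr = reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10)
recall_1 = recall_at_k(ranked_ids, relevant_ids, k=1)
recall_3 = recall_at_k(ranked_ids, relevant_ids, k=3)
ndcg_10 = ndcg_at_k(ranked_ids, relevant_ids, k=10)
print(rr, recall_1, recall_3, round(ndcg_10, 4))
```

MRR@K is the mean of the per-query reciprocal ranks. When every query has exactly one relevant document, as in this post, Recall@K and Accuracy@K coincide, which is why those two groups of values are identical in the evaluator output.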


To conduct our evaluation, we will utilize a comprehensive document pool that combines both training and test data for the corpus, while queries will be sourced exclusively from the test set. This approach ensures the model is assessed on its ability to retrieve relevant documents from a larger corpus that includes unseen data, providing a more robust and realistic evaluation of its retrieval capabilities.

In [5]:
import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers.util import cos_sim
from datasets import load_dataset, concatenate_datasets
 
model_id = "sentence-transformers/all-mpnet-base-v2"
large_model_id = "BAAI/bge-base-en-v1.5"

# Load the models
model = SentenceTransformer(
    model_id, device="cuda" if torch.cuda.is_available() else "cpu"
)

large_model = SentenceTransformer(
    large_model_id, device="cuda" if torch.cuda.is_available() else "cpu"
)
 

# load the train and test datasets. Concatenate them into a single corpus dataset only to add retrieval difficulty
test_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")
corpus_dataset = concatenate_datasets([train_dataset, test_dataset])
 
# Convert the datasets to dictionaries
corpus = dict(zip(corpus_dataset["id"], corpus_dataset["positive"]))
queries = dict(zip(test_dataset["id"], test_dataset["anchor"]))
 
# Create a mapping of the relevant documents for each query.
# In this case, we only have 1 relevant document per query
relevant_docs = {}
for q_id in queries:
    relevant_docs[q_id] = [q_id]
 
 
model_evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name=model_id,
    score_functions={"cosine": cos_sim},
)





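We use `model_evaluator` to evaluate the baseline model and the bge-base reference model, and later we will reuse it to evaluate the fine-tuned model. Because the evaluator was created with `name=model_id`, every metric key in the results carries the `sentence-transformers/all-mpnet-base-v2` prefix, including the results for bge-base-en-v1.5.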

In [6]:
# Evaluate the models
model_results = model_evaluator(model)


In [7]:

large_model_results = model_evaluator(large_model)

In [8]:
model_results

{'sentence-transformers/all-mpnet-base-v2_cosine_accuracy@1': 0.785007072135785,
 'sentence-transformers/all-mpnet-base-v2_cosine_accuracy@3': 0.8755304101838756,
 'sentence-transformers/all-mpnet-base-v2_cosine_accuracy@5': 0.8967468175388967,
 'sentence-transformers/all-mpnet-base-v2_cosine_accuracy@10': 0.9264497878359265,
 'sentence-transformers/all-mpnet-base-v2_cosine_precision@1': np.float64(0.785007072135785),
 'sentence-transformers/all-mpnet-base-v2_cosine_precision@3': np.float64(0.2918434700612918),
 'sentence-transformers/all-mpnet-base-v2_cosine_precision@5': np.float64(0.17934936350777936),
 'sentence-transformers/all-mpnet-base-v2_cosine_precision@10': np.float64(0.09264497878359264),
 'sentence-transformers/all-mpnet-base-v2_cosine_recall@1': np.float64(0.785007072135785),
 'sentence-transformers/all-mpnet-base-v2_cosine_recall@3': np.float64(0.8755304101838756),
 'sentence-transformers/all-mpnet-base-v2_cosine_recall@5': np.float64(0.8967468175388967),
 'sentence-tran

In [9]:
large_model_results

{'sentence-transformers/all-mpnet-base-v2_cosine_accuracy@1': 0.8514851485148515,
 'sentence-transformers/all-mpnet-base-v2_cosine_accuracy@3': 0.9349363507779349,
 'sentence-transformers/all-mpnet-base-v2_cosine_accuracy@5': 0.9490806223479491,
 'sentence-transformers/all-mpnet-base-v2_cosine_accuracy@10': 0.958981612446959,
 'sentence-transformers/all-mpnet-base-v2_cosine_precision@1': np.float64(0.8514851485148515),
 'sentence-transformers/all-mpnet-base-v2_cosine_precision@3': np.float64(0.3116454502593117),
 'sentence-transformers/all-mpnet-base-v2_cosine_precision@5': np.float64(0.1898161244695898),
 'sentence-transformers/all-mpnet-base-v2_cosine_precision@10': np.float64(0.09589816124469587),
 'sentence-transformers/all-mpnet-base-v2_cosine_recall@1': np.float64(0.8514851485148515),
 'sentence-transformers/all-mpnet-base-v2_cosine_recall@3': np.float64(0.9349363507779349),
 'sentence-transformers/all-mpnet-base-v2_cosine_recall@5': np.float64(0.9490806223479491),
 'sentence-tra

## Define loss function that will be used for training

In this case, we are using the MultipleNegativesRankingLoss to fine-tune our embedding model. This choice is based on our dataset format, which consists of positive text pairs. You can take a look at [dataset format](https://sbert.net/docs/sentence_transformer/training_overview.html#dataset-format) information and [loss function](https://sbert.net/docs/sentence_transformer/loss_overview.html) information to determine which loss function to use based on your use case.
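To build intuition for why positive pairs are enough, the following dependency-free sketch mimics the idea behind an in-batch negatives loss: each anchor should score highest against its own positive, and the positives of the other pairs in the batch act as negatives. The cosine similarity and the scale of 20 are assumptions chosen to mirror the library defaults as we understand them; the function and variable names are hypothetical, and the real loss operates on model embeddings rather than hand-written vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where the correct 'class' for anchor i is positive i."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Similarity of this anchor to every positive in the batch.
        logits = [scale * cosine(anchor, positive) for positive in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        # Negative log-probability assigned to the matching positive.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy 2-dimensional "embeddings" for a batch of two pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_positives = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its own anchor
swapped_positives = [[0.1, 0.9], [0.9, 0.1]]   # each positive is close to the other anchor

loss_aligned = in_batch_negatives_loss(anchors, aligned_positives)
loss_swapped = in_batch_negatives_loss(anchors, swapped_positives)
print(loss_aligned, loss_swapped)
```

The aligned batch yields a loss close to zero, while the swapped batch is penalized heavily. This also illustrates why larger batches without duplicate texts tend to help: each anchor is contrasted against more, and cleaner, negatives.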


In [10]:
from sentence_transformers.losses import MultipleNegativesRankingLoss
 
model_id = "sentence-transformers/all-mpnet-base-v2"
 
model = SentenceTransformer(model_id)

train_loss = MultipleNegativesRankingLoss(model)

## Fine-tune embedding model with SentenceTransformerTrainer

Now that we've prepared our data and model, we're ready to fine-tune our embedding model using the SentenceTransformerTrainer.

To configure our training process, we'll use the SentenceTransformerTrainingArguments class. This tool allows us to specify various parameters that can impact training performance and help with tracking and debugging. We'll be using parameter values based on those recommended in the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/training_overview.html#training-arguments). However, it's important to note that these are just starting points. For optimal results, you should experiment with different values tailored to your specific dataset and task.


In [12]:
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")
 
args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="mpnet_base-bioasq",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=False,  # Set to True if you have a GPU that supports BF16
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # losses that use "in-batch negatives" benefit from no duplicates
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    logging_steps=100,
    run_name="mpnet-base-bioasq-basic-training-args",  # Will be used in W&B if `wandb` is installed
)
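As a rough sanity check on these arguments, the sketch below estimates how many optimizer steps the run performs and when evaluation is triggered. It assumes a single GPU, no gradient accumulation, that the final partial batch is kept, and that warmup steps are rounded up; the training set size of 4,012 comes from the dataset loading output earlier in this post.

```python
import math

# Values taken from the training arguments and the dataset loading output.
num_train_examples = 4012
per_device_train_batch_size = 32
num_train_epochs = 1
warmup_ratio = 0.1
eval_steps = 100

# Assumption: one device and no gradient accumulation, so one batch equals one step.
steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Assumption: the number of warmup steps is rounded up.
warmup_steps = math.ceil(total_steps * warmup_ratio)

# Steps at which the evaluator (and checkpoint saving) would run.
eval_points = list(range(eval_steps, total_steps + 1, eval_steps))

print(steps_per_epoch, total_steps, warmup_steps, eval_points)
```

Under these assumptions the run has 126 steps, so with `eval_steps=100` the evaluator fires only once during training, which is consistent with the training log containing a single row at step 100. For more frequent feedback on a dataset this small, a lower `eval_steps` value may be worth trying.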

In [13]:
from sentence_transformers import SentenceTransformerTrainer
 
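trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    # Sentence Transformers maps dataset columns to loss inputs by position, not by name,
    # so the order given to select_columns determines each column's role in the loss.
    # Here "positive" is passed first and "anchor" second.
    train_dataset=train_dataset.select_columns(["positive", "anchor"]),
    loss=train_loss,
    evaluator=model_evaluator,
)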

Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [14]:
# Start training the model
trainer.train()

# Save the final model to the output directory (args.output_dir)
trainer.save_model()

# Alternative way to save the model: model.save_pretrained("models/mpnet-base-all-rest-of-name/final")

# Push the model to the Hugging Face Hub
trainer.model.push_to_hub("all-mpnet-base-v2-bioasq-1epoc-batch32-100")

Training log at step 100 (evaluator metrics are reported under the name `sentence-transformers/all-mpnet-base-v2`, using cosine similarity):

| Metric | Step 100 |
|--------|----------|
| Training Loss | 0.1155 |
| Validation Loss | No log |
| Cosine Accuracy@1 | 0.845827 |
| Cosine Accuracy@3 | 0.934936 |
| Cosine Accuracy@5 | 0.947666 |
| Cosine Accuracy@10 | 0.960396 |
| Cosine Precision@1 | 0.845827 |
| Cosine Precision@3 | 0.311645 |
| Cosine Precision@5 | 0.189533 |
| Cosine Precision@10 | 0.09604 |
| Cosine Recall@1 | 0.845827 |
| Cosine Recall@3 | 0.934936 |
| Cosine Recall@5 | 0.947666 |
| Cosine Recall@10 | 0.960396 |
| Cosine NDCG@10 | 0.909272 |
| Cosine MRR@10 | 0.892253 |
| Cosine MAP@100 | 0.89366 |


model.safetensors: 100%|██████████| 438M/438M [00:18<00:00, 23.8MB/s] 


'https://huggingface.co/juanpablomesa/all-mpnet-base-v2-bioasq-1epoc-batch32-100/commit/b0a391ef14e024878ab2a3ebde7a3c6631d9ef78'

Training on roughly 4k samples took around 1 minute on an NVIDIA A10G instance from [Modal Labs](https://modal.com/pricing). At the time of writing (July 2024), the instance costs 1.1 USD/hour, so one minute of training comes to roughly 0.02 USD, well under 0.1 USD.

What remains is to evaluate the fine-tuned model with the `model_evaluator` from earlier.



In [15]:
from sentence_transformers import SentenceTransformer
 
fine_tuned_model = SentenceTransformer(
    args.output_dir, device="cuda" if torch.cuda.is_available() else "cpu"
)
# Evaluate the model
fine_tuned_results = model_evaluator(fine_tuned_model)
 
fine_tuned_results

{'sentence-transformers/all-mpnet-base-v2_cosine_accuracy@1': 0.8458274398868458,
 'sentence-transformers/all-mpnet-base-v2_cosine_accuracy@3': 0.9335219236209336,
 'sentence-transformers/all-mpnet-base-v2_cosine_accuracy@5': 0.9476661951909476,
 'sentence-transformers/all-mpnet-base-v2_cosine_accuracy@10': 0.9618104667609618,
 'sentence-transformers/all-mpnet-base-v2_cosine_precision@1': np.float64(0.8458274398868458),
 'sentence-transformers/all-mpnet-base-v2_cosine_precision@3': np.float64(0.31117397454031115),
 'sentence-transformers/all-mpnet-base-v2_cosine_precision@5': np.float64(0.1895332390381895),
 'sentence-transformers/all-mpnet-base-v2_cosine_precision@10': np.float64(0.09618104667609616),
 'sentence-transformers/all-mpnet-base-v2_cosine_recall@1': np.float64(0.8458274398868458),
 'sentence-transformers/all-mpnet-base-v2_cosine_recall@3': np.float64(0.9335219236209336),
 'sentence-transformers/all-mpnet-base-v2_cosine_recall@5': np.float64(0.9476661951909476),
 'sentence-t

If we focus on only a couple of metrics that are more relevant in our case, we get the following information:

| Model | MRR@10 | NDCG@10 |
|-------|--------|---------|
| all-mpnet-base-v2 (Baseline) | 0.8347 | 0.8571 |
| bge-base-en-v1.5 | 0.8965 | 0.9122 |
| all-mpnet-base-v2 Fine-tuned | 0.8919 | 0.9093 |

The fine-tuned model shows significant improvements over the baseline model, with a relative increase of 6.85% in MRR@10 and 6.09% in NDCG@10. It comes within 0.005 of the bge-base-en-v1.5 embeddings on both metrics, although it still scores slightly lower.
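The percentages quoted here are relative gains over the baseline rather than absolute differences. A short sketch reproduces them from the table values; the dictionary layout and the helper name are illustrative only.

```python
# Metric values copied from the comparison table.
results = {
    "all-mpnet-base-v2 (Baseline)": {"MRR@10": 0.8347, "NDCG@10": 0.8571},
    "bge-base-en-v1.5": {"MRR@10": 0.8965, "NDCG@10": 0.9122},
    "all-mpnet-base-v2 Fine-tuned": {"MRR@10": 0.8919, "NDCG@10": 0.9093},
}

def relative_gain_pct(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100

baseline = results["all-mpnet-base-v2 (Baseline)"]
fine_tuned = results["all-mpnet-base-v2 Fine-tuned"]
reference = results["bge-base-en-v1.5"]

mrr_gain = relative_gain_pct(baseline["MRR@10"], fine_tuned["MRR@10"])
ndcg_gain = relative_gain_pct(baseline["NDCG@10"], fine_tuned["NDCG@10"])

# Remaining absolute gap between the reference model and the fine-tuned model.
mrr_gap = reference["MRR@10"] - fine_tuned["MRR@10"]
ndcg_gap = reference["NDCG@10"] - fine_tuned["NDCG@10"]

print(f"MRR@10 gain: {mrr_gain:.2f}%, NDCG@10 gain: {ndcg_gain:.2f}%")
print(f"Gap to bge-base-en-v1.5: MRR@10 {mrr_gap:.4f}, NDCG@10 {ndcg_gap:.4f}")
```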



## Conclusion

Embedding models play a crucial role in the success of Retrieval-Augmented Generation (RAG) applications, as the quality of retrieved context directly impacts the generated answers. Using the Sentence Transformers 3 library, we fine-tuned the all-mpnet-base-v2 model on a biomedical question-answering dataset. The results show substantial improvements:

- MRR@10 increased from 0.8347 to 0.8919 (6.85% relative improvement)
- NDCG@10 improved from 0.8571 to 0.9093 (6.09% relative improvement)

Our fine-tuned model achieved performance comparable to the more advanced bge-base-en-v1.5 model despite starting from a lower baseline.

The fine-tuning process has become highly accessible and efficient. With only 4,012 training question-answer pairs (4,719 pairs in total when the 707-pair test split is included), we were able to achieve these improvements in approximately 1 minute of training time on an NVIDIA A10G GPU. The estimated cost for this training was less than 0.1 USD, making it a cost-effective approach for enhancing domain-specific retrieval tasks.

This shows the value of customizing embedding models for specific domains or use cases. Significant performance gains can be realized even with a relatively small dataset and minimal training time.

