# Embeddings - Fine Tuning Models

The goal of this notebook is to explore different techniques for fine-tuning embedding models.

**Resources**
- [Sentence Transformer - Loss Functions](https://sbert.net/docs/sentence_transformer/loss_overview.html#custom-loss-functions)

# Notebook Setup

## Imports

In [1]:
# Import third-party libraries
from datasets import load_dataset
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    SentenceTransformerModelCardData,
)
from sentence_transformers.losses import (
    MultipleNegativesRankingLoss,
    CoSENTLoss
)
from sentence_transformers.training_args import BatchSamplers
from sentence_transformers.evaluation import TripletEvaluator

# Read Data

## All NLI - Pair Class

In [2]:
# Load the AllNLI "pair-class" subset for SentenceTransformerTrainer
all_nli_pair_class_train = load_dataset("sentence-transformers/all-nli", "pair-class", split="train")
# Note: despite the `_test` suffix, this variable holds the "dev" split
all_nli_pair_class_test = load_dataset("sentence-transformers/all-nli", "pair-class", split="dev")

In [3]:
all_nli_pair_class_train[5]

{'premise': 'Children smiling and waving at camera',
 'hypothesis': 'The kids are frowning',
 'label': 2}

The `label` values map to classes as follows: `{"0": "entailment", "1": "neutral", "2": "contradiction"}`.
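
The mapping above can be applied to decode a sample into a readable class name. This is a minimal sketch: `ID2LABEL` and `decode_label` are illustrative names, not part of the dataset or of any library.

```python
# Label ids used by the AllNLI "pair-class" subset, as listed above
ID2LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}


def decode_label(sample: dict) -> dict:
    """Return a copy of the sample with a readable `label_name` field added."""
    return {**sample, "label_name": ID2LABEL[sample["label"]]}


# Same sample as the one printed by the cell above
sample = {
    "premise": "Children smiling and waving at camera",
    "hypothesis": "The kids are frowning",
    "label": 2,
}
print(decode_label(sample)["label_name"])  # contradiction
```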

## All NLI - Triplets

In [4]:
# Load the AllNLI "triplet" subset and keep a small sample of each split for SentenceTransformerTrainer
all_nli_triplets_dataset = load_dataset("sentence-transformers/all-nli", "triplet")
all_nli_triplets_train = all_nli_triplets_dataset["train"].select(range(1_000))
all_nli_triplets_eval = all_nli_triplets_dataset["dev"].select(range(300))
all_nli_triplets_test = all_nli_triplets_dataset["test"].select(range(300))

In [5]:
all_nli_triplets_train[5]

{'anchor': 'An older man is drinking orange juice at a restaurant.',
 'positive': 'A man is drinking juice.',
 'negative': 'Two women are at a restaurant drinking wine.'}

# Data Preparation

The dataset must be prepared so that it respects the format expected by the loss function.

- **Sentence Transformer** - The format expected by each loss function is reported in the [Loss Table](https://sbert.net/docs/sentence_transformer/loss_overview.html). The *label* column is generally named `label` or `score`.
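
As a rough illustration of that convention: the label column is recognised by its name, while the remaining columns are treated as text inputs in their order of appearance. The helper `split_columns` below is hypothetical (it is not a library function) and only mimics this rule on a list of column names.

```python
from typing import List, Optional, Tuple

# Column names treated as the label column, following the convention above
LABEL_COLUMNS = ("label", "score")


def split_columns(column_names: List[str]) -> Tuple[List[str], Optional[str]]:
    """Separate input columns (order preserved) from the label column, if any."""
    inputs = [name for name in column_names if name not in LABEL_COLUMNS]
    labels = [name for name in column_names if name in LABEL_COLUMNS]
    if len(labels) > 1:
        raise ValueError(f"Expected at most one label column, found {labels}")
    return inputs, (labels[0] if labels else None)


# Pair with a similarity score, and a triplet without any label
print(split_columns(["sentence1", "sentence2", "score"]))
print(split_columns(["anchor", "positive", "negative"]))
```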

## Dataset - from_dict

If your data needs to be prepared first, you can build the list of values for each column and pass them to `Dataset.from_dict`.

In [6]:
# Initialise the data
anchors = []
positives = []

# Open a file, perform preprocessing, filtering, cleaning, etc.
# and append to the lists

dataset = Dataset.from_dict({
    "anchor": anchors,
    "positive": positives,
})
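
The preprocessing step left as a comment above could look like the following. This is only a sketch: the `raw_records` list, its `question`/`answer` keys, and the `build_columns` helper are hypothetical stand-ins for your own data source and cleaning logic.

```python
from typing import Dict, List


def build_columns(raw_records: List[dict]) -> Dict[str, List[str]]:
    """Clean raw (question, answer) records into anchor/positive columns."""
    anchors, positives = [], []
    for record in raw_records:
        anchor = record.get("question", "").strip()
        positive = record.get("answer", "").strip()
        # Skip incomplete pairs so both columns stay aligned
        if not anchor or not positive:
            continue
        anchors.append(anchor)
        positives.append(positive)
    return {"anchor": anchors, "positive": positives}


raw_records = [
    {"question": "  What is an embedding? ", "answer": "A dense vector representation of text."},
    {"question": "Empty answer example", "answer": "   "},
]
columns = build_columns(raw_records)
print(columns)
# The resulting dict can then be passed to Dataset.from_dict(columns)
```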

# SentenceTransformerTrainer

The trainer uses `datasets.Dataset` ([Reference](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset)) or `datasets.DatasetDict` ([Reference](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)) instances for both training and evaluation.

These datasets can be loaded from CSV, JSON, Parquet, Arrow or SQL sources.

Datasets that work with Sentence Transformers out of the box are tagged `sentence-transformers` on the Hugging Face Datasets Hub.

## Loss Functions - Cosine Sentence Similarity

In [7]:
# Instantiate the model
model = SentenceTransformer("microsoft/mpnet-base")

# Fine-tuning dataset with 2 samples
# Data point: {text_1, text_2, expected_similarity}
dataset = Dataset.from_dict({
    "sentence1": ["It's nice weather outside today.", "He drove to work."],
    "sentence2": ["It's so sunny.", "She walked to the store."],
    "score": [1.0, 0.3]
})

# CoSENT (Cosine Sentence) loss -> Text 1, Text 2, Expected Similarity
loss_function = CoSENTLoss(model)

# Fine-tune
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=dataset,
    loss=loss_function
)
trainer.train()

No sentence-transformers model found with name microsoft/mpnet-base. Creating a new one with mean pooling.
Some weights of MPNetModel were not initialized from the model checkpoint at microsoft/mpnet-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


TrainOutput(global_step=3, training_loss=0.3688153823216756, metrics={'train_runtime': 2.5685, 'train_samples_per_second': 2.336, 'train_steps_per_second': 1.168, 'total_flos': 0.0, 'train_loss': 0.3688153823216756, 'epoch': 3.0})
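
To give an intuition for what `CoSENTLoss` optimises: it is a ranking loss that penalises any two samples whose predicted cosine similarities are ordered differently from their `score` labels. The pure-Python sketch below is an illustration under assumptions, not the library's implementation: the function name `cosent_loss` is made up, the cosine similarities are hand-picked toy values, and the scale of 20 is assumed to match the library default.

```python
import math
from typing import List


def cosent_loss(cos_sims: List[float], scores: List[float], scale: float = 20.0) -> float:
    """log(1 + sum of exp(scale * (cos_i - cos_j))) over all pairs with score_i < score_j."""
    total = 1.0  # the "1 +" term inside the logarithm
    for cos_i, score_i in zip(cos_sims, scores):
        for cos_j, score_j in zip(cos_sims, scores):
            # Sample i is labelled less similar than j, so cos_i should be below cos_j
            if score_i < score_j:
                total += math.exp(scale * (cos_i - cos_j))
    return math.log(total)


scores = [1.0, 0.3]  # expected similarities from the toy dataset above
well_ranked = cosent_loss(cos_sims=[0.9, 0.2], scores=scores)
badly_ranked = cosent_loss(cos_sims=[0.2, 0.9], scores=scores)
print(well_ranked < badly_ranked)  # True
```

Only the relative order of the scores matters in this formulation, so the labels do not need to be calibrated similarity values.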

## Fine-Tune

### Load base Model

In [8]:
# Load the model to fine-tune
model = SentenceTransformer(
    "microsoft/mpnet-base",
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="MPNet base trained on AllNLI triplets",
    )
)

# Define the loss function
loss = MultipleNegativesRankingLoss(model)

No sentence-transformers model found with name microsoft/mpnet-base. Creating a new one with mean pooling.
Some weights of MPNetModel were not initialized from the model checkpoint at microsoft/mpnet-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
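
`MultipleNegativesRankingLoss` asks each anchor to pick its own positive among all candidates in the batch: the other anchors' positives and any explicit negatives all act as negatives. A minimal pure-Python sketch of that idea, as a cross-entropy over scaled cosine similarities, follows. The names `cosine` and `mnr_loss`, the toy 2-D embeddings, and the scale of 20 are assumptions made for illustration; this is not the library's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def mnr_loss(anchors, positives, negatives=(), scale=20.0):
    """Mean cross-entropy where anchor i must pick positive i among all candidates."""
    candidates = list(positives) + list(negatives)
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, candidate) for candidate in candidates]
        log_sum_exp = math.log(sum(math.exp(logit) for logit in logits))
        losses.append(log_sum_exp - logits[i])  # -log softmax of the true positive
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]
aligned = mnr_loss(anchors, positives=[[0.9, 0.1], [0.1, 0.9]], negatives=negatives)
swapped = mnr_loss(anchors, positives=[[0.1, 0.9], [0.9, 0.1]], negatives=negatives)
print(aligned < swapped)  # True
```

Because every other sample in the batch serves as a negative, larger batches give each anchor more candidates to discriminate against.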


### Fine-Tuning Arguments

In [9]:
# Define the training arguments
train_args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="models/mpnet-base-all-nli-triplet",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_ratio=0.1,
    fp16=False,  # GPU-specific: set to True if the GPU does not support bf16
    bf16=True,  # GPU-specific: set to True only if the GPU supports bf16
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    logging_steps=100,
    run_name="all-nli-triplet"
)
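
With these arguments and the 1,000-sample training subset, it is worth checking how many optimizer steps one epoch actually contains. The arithmetic below assumes a single device and no gradient accumulation (neither is stated in this notebook), and that the warmup steps are rounded up from `warmup_ratio`.

```python
import math

train_samples = 1_000  # size of all_nli_triplets_train
batch_size = 16        # per_device_train_batch_size
num_epochs = 1         # num_train_epochs
warmup_ratio = 0.1
eval_steps = 100       # also the value of save_steps and logging_steps

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(total_steps)                # 63
print(warmup_steps)               # 7
print(total_steps >= eval_steps)  # False
```

Under these assumptions the run ends after 63 steps, which matches the `global_step=63` reported by the training cell, and never reaches the first 100-step evaluation or logging point. Lower `eval_steps` and `logging_steps` values would be needed to see intermediate metrics on a subset this small.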

### Base Model Evaluation

The `sentence_transformers.evaluation` package offers several evaluators, each suited to a specific use case, for example pair-class or triplet data. The base model is evaluated here before training to obtain a baseline.

In [10]:
# Base model evaluator
base_evaluator = TripletEvaluator(
    anchors=all_nli_triplets_eval["anchor"],
    positives=all_nli_triplets_eval["positive"],
    negatives=all_nli_triplets_eval["negative"],
    name="all-nli-dev",
)
base_evaluator(model)

{'all-nli-dev_cosine_accuracy': 0.5933333039283752}
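
The `cosine_accuracy` reported above is the fraction of triplets for which the anchor is more similar (by cosine similarity) to the positive than to the negative. A minimal pure-Python sketch on toy 2-D vectors follows; `cosine` and `triplet_cosine_accuracy` are illustrative names, not the evaluator's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def triplet_cosine_accuracy(anchors, positives, negatives):
    """Share of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    correct = sum(
        cosine(a, p) > cosine(a, n)
        for a, p, n in zip(anchors, positives, negatives)
    )
    return correct / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
positives = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.9], [0.0, 1.0]]
negatives = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0], [1.0, 0.1]]
# The last triplet is ranked incorrectly, so 3 out of 4 are correct
print(triplet_cosine_accuracy(anchors, positives, negatives))  # 0.75
```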

### Training

In [11]:
# Define the trainer and start training
trainer = SentenceTransformerTrainer(
    model=model,
    args=train_args,
    train_dataset=all_nli_triplets_train,
    eval_dataset=all_nli_triplets_eval,
    loss=loss,
    evaluator=base_evaluator,
)
trainer.train()

TrainOutput(global_step=63, training_loss=1.8162882971385168, metrics={'train_runtime': 35.3008, 'train_samples_per_second': 28.328, 'train_steps_per_second': 1.785, 'total_flos': 0.0, 'train_loss': 1.8162882971385168, 'epoch': 1.0})

### Fine-Tuned Model Evaluation

In [12]:
# Evaluate the fine-tuned model on the test split
test_evaluator = TripletEvaluator(
    anchors=all_nli_triplets_test["anchor"],
    positives=all_nli_triplets_test["positive"],
    negatives=all_nli_triplets_test["negative"],
    name="all-nli-test",
)
test_evaluator(model)

{'all-nli-test_cosine_accuracy': 0.8766666650772095}