# Embeddings - Fine Tuning Models

The goal is to research different techniques for fine-tuning embedding models.

**Resources**
- [Sentence Transformer - Loss Functions](https://sbert.net/docs/sentence_transformer/loss_overview.html#custom-loss-functions)

# Notebook Setup

## Imports

In [1]:
# Import Libraries
from datasets import Dataset, load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import CoSENTLoss

# Read Data

## All NLI - Pair Class

In [2]:
# Load the All NLI "pair-class" subset for SentenceTransformerTrainer
all_nli_pair_class_train = load_dataset("sentence-transformers/all-nli", "pair-class", split="train")
# The "dev" split is used as the held-out set here
all_nli_pair_class_test = load_dataset("sentence-transformers/all-nli", "pair-class", split="dev")

In [4]:
all_nli_pair_class_train[5]

{'premise': 'Children smiling and waving at camera',
 'hypothesis': 'The kids are frowning',
 'label': 2}

The `label` mapping is `{"0": "entailment", "1": "neutral", "2": "contradiction"}`.
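As a small illustration of the mapping above, the integer `label` of a sample can be decoded into its class name. This is only a sketch: the helper `decode_label` and the `label_name` field are hypothetical names introduced here, not part of the dataset or of any library.

```python
# Mapping taken from the note above; keys are the integer class ids.
ID_TO_LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}


def decode_label(sample):
    """Return a copy of the sample with a readable `label_name` field added."""
    return {**sample, "label_name": ID_TO_LABEL[sample["label"]]}


# Same record as the one printed above (index 5 of the train split).
sample = {
    "premise": "Children smiling and waving at camera",
    "hypothesis": "The kids are frowning",
    "label": 2,
}
print(decode_label(sample)["label_name"])  # prints "contradiction"
```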

# Data Preparation

The dataset must be prepared so that it respects the format expected by the loss function.

- **Sentence Transformer** - The format expected by each loss function is listed in the [Loss Table](https://sbert.net/docs/sentence_transformer/loss_overview.html), and the *label* column is generally named `label` or `score`
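A quick way to reason about the expected format is to separate the label column from the input columns. The sketch below assumes the convention described in the Sentence Transformers training documentation: a column named `label` or `score` is treated as the label, and every other column is treated as an input, in order. The helpers `split_columns` and `matches_format` are hypothetical names used for illustration only.

```python
# Column names that are treated as the label (assumption: per the
# Sentence Transformers convention mentioned above).
LABEL_COLUMN_NAMES = ("label", "score")


def split_columns(column_names):
    """Split dataset column names into (input columns, label columns)."""
    inputs = [name for name in column_names if name not in LABEL_COLUMN_NAMES]
    labels = [name for name in column_names if name in LABEL_COLUMN_NAMES]
    return inputs, labels


def matches_format(column_names, n_inputs, needs_label):
    """Check that the columns provide `n_inputs` inputs and, if needed, one label."""
    inputs, labels = split_columns(column_names)
    return len(inputs) == n_inputs and (len(labels) == 1) == needs_label


# All NLI "pair-class": two text inputs plus a class label.
print(matches_format(["premise", "hypothesis", "label"], n_inputs=2, needs_label=True))
# An (anchor, positive) dataset has no label column.
print(matches_format(["anchor", "positive"], n_inputs=2, needs_label=False))
```

The values for `n_inputs` and `needs_label` have to be read from the Loss Table for the chosen loss; they are not inferred here.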

## Dataset - from_dict

If your data needs to be prepared first, you can use `Dataset.from_dict` and build the lists of values to insert into your dataset.

In [6]:
# Initialise the data
anchors = []
positives = []

# Open a file, perform preprocessing, filtering, cleaning, etc.
# and append to the lists

dataset = Dataset.from_dict({
    "anchor": anchors,
    "positive": positives,
})
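The preprocessing step left as a comment above could look roughly like the following. This is a minimal sketch in plain Python: the function `build_pair_columns`, the cleaning rules (strip whitespace, drop empty and duplicate pairs) and the sample pairs are all assumptions made for illustration, not the author's actual pipeline.

```python
def build_pair_columns(raw_pairs):
    """Clean (anchor, positive) pairs and return them as column lists."""
    anchors, positives = [], []
    seen = set()
    for anchor, positive in raw_pairs:
        anchor, positive = anchor.strip(), positive.strip()
        # Skip pairs where either side is empty after cleaning.
        if not anchor or not positive:
            continue
        # Skip exact duplicates so the same pair is not trained on twice.
        if (anchor, positive) in seen:
            continue
        seen.add((anchor, positive))
        anchors.append(anchor)
        positives.append(positive)
    return {"anchor": anchors, "positive": positives}


# Hypothetical raw data, standing in for rows read from a file.
raw_pairs = [
    ("  What is fine-tuning? ", "Adapting a pretrained model to a task."),
    ("What is fine-tuning?", "Adapting a pretrained model to a task."),
    ("Empty positive", "   "),
]
columns = build_pair_columns(raw_pairs)
print(len(columns["anchor"]))  # prints 1
```

The returned dictionary has the same shape as the one passed to `Dataset.from_dict` above, so it can be handed over directly.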

# SentenceTransformerTrainer

`SentenceTransformerTrainer` uses `datasets.Dataset` ([Reference](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset)) or `datasets.DatasetDict` ([Reference](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)) instances for both training and evaluation.

These instances can be loaded from CSV, JSON, Parquet, Arrow or SQL sources.

Datasets that work out of the box are tagged with `sentence-transformers` on the Hugging Face Datasets Hub.

## Loss Functions - Cosine Sentence Similarity

In [2]:
# Instantiate the model
model = SentenceTransformer("microsoft/mpnet-base")

# Fine-tuning dataset with 2 samples
# Data point: {text_1, text_2, expected_similarity}
dataset = Dataset.from_dict({
    "sentence1": ["It's nice weather outside today.", "He drove to work."],
    "sentence2": ["It's so sunny.", "She walked to the store."],
    "score": [1.0, 0.3]
})

# Instantiate the loss
loss_function = CoSENTLoss(model)

# Fine-tune
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=dataset,
    loss=loss_function
)
trainer.train()

No sentence-transformers model found with name microsoft/mpnet-base. Creating a new one with mean pooling.
Some weights of MPNetModel were not initialized from the model checkpoint at microsoft/mpnet-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



TrainOutput(global_step=3, training_loss=0.3688153823216756, metrics={'train_runtime': 6.7357, 'train_samples_per_second': 0.891, 'train_steps_per_second': 0.445, 'total_flos': 0.0, 'train_loss': 0.3688153823216756, 'epoch': 3.0})
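For intuition about what `CoSENTLoss` optimizes, the objective can be sketched in plain Python. This assumes the formulation given in the Sentence Transformers documentation: for every two pairs in a batch where the first has a higher expected score than the second, the loss penalizes the second pair having a higher cosine similarity, using a log-sum-exp with a scale factor (20 is assumed here as the default). The function `cosent_loss` is a hypothetical illustration, not the library implementation, and it takes precomputed cosine similarities instead of a model.

```python
import math


def cosent_loss(cosines, scores, scale=20.0):
    """Sketch of the CoSENT objective for one batch of sentence pairs.

    cosines: predicted cosine similarity of each pair
    scores:  expected similarity of each pair (same order)
    """
    # The leading 0.0 is the "1 +" term inside the logarithm, since exp(0) = 1.
    terms = [0.0]
    for cos_high, score_high in zip(cosines, scores):
        for cos_low, score_low in zip(cosines, scores):
            # Only compare pairs whose expected ordering is known.
            if score_high > score_low:
                terms.append(scale * (cos_low - cos_high))
    # Numerically stable log-sum-exp.
    largest = max(terms)
    return largest + math.log(sum(math.exp(t - largest) for t in terms))


# Expected scores from the toy dataset above: pair 0 should be more similar.
well_ordered = cosent_loss(cosines=[0.9, 0.1], scores=[1.0, 0.3])
mis_ordered = cosent_loss(cosines=[0.1, 0.9], scores=[1.0, 0.3])
print(well_ordered < mis_ordered)  # prints True
```

Only the relative ordering of the `score` values matters in this formulation, which is why the toy dataset can use arbitrary values such as `1.0` and `0.3`.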

## Fine-Tune - FacebookAI/XLM Roberta Base

In [None]:
# Load the model to fine-tune
model = SentenceTransformer("FacebookAI/xlm-roberta-base")

# CoSENT loss expects (text 1, text 2, expected similarity)
loss_function = CoSENTLoss(model)