# Embeddings - Fine Tuning Models

The goal is to research different techniques for fine-tuning embedding models.

**Resources**
- [Sentence Transformer - Loss Functions](https://sbert.net/docs/sentence_transformer/loss_overview.html#custom-loss-functions)

# Notebook Setup

## Imports

In [2]:
# Import third-party libraries
from datasets import Dataset, load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    SentenceTransformerModelCardData,
)
from sentence_transformers.losses import CoSENTLoss, MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers
from sentence_transformers.evaluation import TripletEvaluator

# Read Data

## All NLI - Pair Class

In [2]:
# Load the All NLI "pair-class" subset for SentenceTransformerTrainer
all_nli_pair_class_train = load_dataset("sentence-transformers/all-nli", "pair-class", split="train")
# Note: this variable holds the "dev" split
all_nli_pair_class_test = load_dataset("sentence-transformers/all-nli", "pair-class", split="dev")

In [4]:
all_nli_pair_class_train[5]

{'premise': 'Children smiling and waving at camera',
 'hypothesis': 'The kids are frowning',
 'label': 2}

The `label` mapping is `{"0": "entailment", "1": "neutral", "2": "contradiction"}`.
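As a minimal sketch, the integer `label` can be decoded into its class name with a small lookup. The helper name `describe_pair` is hypothetical; the mapping and the sample record are the ones shown above.

```python
# Label ids and names as listed in the note above.
NLI_LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}


def describe_pair(example: dict) -> str:
    """Return a readable one-line summary of a pair-class example (hypothetical helper)."""
    label_name = NLI_LABELS[example["label"]]
    return f"{example['premise']!r} -> {example['hypothesis']!r}: {label_name}"


# Same record as all_nli_pair_class_train[5] shown above.
example = {
    "premise": "Children smiling and waving at camera",
    "hypothesis": "The kids are frowning",
    "label": 2,
}
summary = describe_pair(example)
print(summary)
```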

## All NLI - Triplets

In [3]:
# Load the All NLI "triplet" subset for SentenceTransformerTrainer
all_nli_triplets_dataset = load_dataset("sentence-transformers/all-nli", "triplet")
all_nli_triplets_train = all_nli_triplets_dataset["train"].select(range(100_000))
all_nli_triplets_eval = all_nli_triplets_dataset["dev"]
all_nli_triplets_test = all_nli_triplets_dataset["test"]

In [4]:
all_nli_triplets_train[5]

{'anchor': 'An older man is drinking orange juice at a restaurant.',
 'positive': 'A man is drinking juice.',
 'negative': 'Two women are at a restaurant drinking wine.'}

# Data Preparation

The dataset must follow the format expected by the chosen loss function.

- **Sentence Transformer** - The expected format for each loss function is listed in the [Loss Table](https://sbert.net/docs/sentence_transformer/loss_overview.html); the *label* column is generally named `label` or `score`.
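The format rule can be sketched as a quick check on column names. This assumes the convention described in the Sentence Transformers documentation: columns named `label` or `score` are treated as the label, and every other column counts as an input whose order matters. The helpers `split_columns` and `check_format` are hypothetical, not library functions, and the expected input counts must be read from the Loss Table.

```python
# Column names that are treated as the label rather than as a text input.
LABEL_COLUMN_NAMES = {"label", "score"}


def split_columns(column_names):
    """Split column names into (input columns, label columns), keeping order."""
    inputs = [name for name in column_names if name not in LABEL_COLUMN_NAMES]
    labels = [name for name in column_names if name in LABEL_COLUMN_NAMES]
    return inputs, labels


def check_format(column_names, expected_inputs, needs_label):
    """Return True if the columns match what a loss expects (hypothetical check)."""
    inputs, labels = split_columns(column_names)
    has_single_label = len(labels) == 1
    return len(inputs) == expected_inputs and has_single_label == needs_label


# Triplets (anchor, positive, negative) without a label column.
triplet_ok = check_format(["anchor", "positive", "negative"], 3, needs_label=False)
# Sentence pairs with a float similarity score.
pair_score_ok = check_format(["sentence1", "sentence2", "score"], 2, needs_label=True)
# A labelled pair dataset does not match a label-free triplet format.
mismatch = check_format(["premise", "hypothesis", "label"], 3, needs_label=False)
print(triplet_ok, pair_score_ok, mismatch)
```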

## Dataset - from_dict

If your data needs preparation, you can build each column as a list of values and pass the lists to `Dataset.from_dict`.

In [6]:
# Initialise the data
anchors = []
positives = []

# Open a file, perform preprocessing, filtering, cleaning, etc.
# and append to the lists

dataset = Dataset.from_dict({
    "anchor": anchors,
    "positive": positives,
})
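The preprocessing step left as a comment above could look like the following sketch. The `raw_pairs` records are made up for illustration, and the cleaning rules (strip whitespace, drop empty texts, drop exact duplicates) are one assumed choice among many.

```python
# Hypothetical raw data: (anchor, positive) text pairs, e.g. parsed from a file.
raw_pairs = [
    ("  How do I reset my password? ", "Steps to reset a forgotten password"),
    ("", "Orphan positive without an anchor"),
    ("How do I reset my password?", "Steps to reset a forgotten password"),
    ("Where is my invoice?", "Download invoices from the billing page"),
]

anchors = []
positives = []
seen = set()

for anchor, positive in raw_pairs:
    # Normalise surrounding whitespace before filtering.
    anchor, positive = anchor.strip(), positive.strip()
    # Skip incomplete pairs.
    if not anchor or not positive:
        continue
    # Skip exact duplicates of a pair that was already kept.
    if (anchor, positive) in seen:
        continue
    seen.add((anchor, positive))
    anchors.append(anchor)
    positives.append(positive)

# Column-oriented dict, the shape that Dataset.from_dict accepts.
columns = {"anchor": anchors, "positive": positives}
print(columns)
```

Both lists must stay the same length, since each index forms one row of the resulting dataset.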

# SentenceTransformerTrainer

The trainer uses `datasets.Dataset` ([Reference](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset)) or `datasets.DatasetDict` ([Reference](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict)) instances for both training and evaluation.

These datasets can be loaded from CSV, JSON, Parquet, Arrow, or SQL sources.

Datasets that work with Sentence Transformers out of the box are tagged with `sentence-transformers` on the Hugging Face Datasets Hub.

## Loss Functions - Cosine Sentence Similarity

In [2]:
# Instantiate the model
model = SentenceTransformer("microsoft/mpnet-base")

# Fine-tuning dataset with 2 samples
# Data point: {text_1, text_2, expected_similarity}
dataset = Dataset.from_dict({
    "sentence1": ["It's nice weather outside today.", "He drove to work."],
    "sentence2": ["It's so sunny.", "She walked to the store."],
    "score": [1.0, 0.3]
})

# CoSENT (Cosine Sentence) loss expects: text 1, text 2, expected similarity score
loss_function = CoSENTLoss(model)

# Fine-tune
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=dataset,
    loss=loss_function
)
trainer.train()

No sentence-transformers model found with name microsoft/mpnet-base. Creating a new one with mean pooling.
Some weights of MPNetModel were not initialized from the model checkpoint at microsoft/mpnet-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

TrainOutput(global_step=3, training_loss=0.3688153823216756, metrics={'train_runtime': 6.7357, 'train_samples_per_second': 0.891, 'train_steps_per_second': 0.445, 'total_flos': 0.0, 'train_loss': 0.3688153823216756, 'epoch': 3.0})
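To give an intuition for what the loss above optimises, here is a plain-Python sketch of a CoSENT-style ranking objective. It is not the library implementation: it assumes the loss penalises every pair of examples whose predicted cosine similarities are ordered differently from their expected scores, and it assumes a scale factor of 20. The function name `cosent_loss` is hypothetical.

```python
import math


def cosent_loss(cosine_scores, expected_scores, scale=20.0):
    """Sketch of a CoSENT-style loss over one batch (assumed formulation)."""
    # Start with 0.0 because exp(0) = 1 provides the "1 +" inside the log.
    terms = [0.0]
    for cos_low, exp_low in zip(cosine_scores, expected_scores):
        for cos_high, exp_high in zip(cosine_scores, expected_scores):
            # Only compare pairs where one expected score is strictly lower.
            if exp_low < exp_high:
                # Large when the lower-ranked pair gets the higher cosine.
                terms.append(scale * (cos_low - cos_high))
    # Numerically stable log-sum-exp.
    largest = max(terms)
    return largest + math.log(sum(math.exp(t - largest) for t in terms))


# Expected scores from the two-sample dataset above: 1.0 and 0.3.
well_ordered = cosent_loss([0.9, 0.2], [1.0, 0.3])
mis_ordered = cosent_loss([0.2, 0.9], [1.0, 0.3])
print(well_ordered, mis_ordered)
```

The loss is close to zero when the predicted similarities rank the pairs in the same order as the expected scores, and it grows when the ranking is inverted.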

## Fine-Tune

In [3]:
# Load the model to fine-tune
model = SentenceTransformer(
    "microsoft/mpnet-base",
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="MPNet base trained on AllNLI triplets",
    )
)

# Define the loss function
loss = MultipleNegativesRankingLoss(model)

No sentence-transformers model found with name FacebookAI/xlm-roberta-base. Creating a new one with mean pooling.


In [5]:
# Define the training arguments
train_args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="models/all-nli-pair-class",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_ratio=0.1,
    fp16=True,  # Set to False if your GPU does not support FP16
    bf16=False,  # Set to True if your GPU supports BF16
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    logging_steps=100,
    run_name="all-nli-pair-class",
)
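A rough back-of-the-envelope calculation shows what these arguments imply for the 100,000-sample triplet training subset selected earlier. The sketch assumes a single device and no gradient accumulation, and it assumes the warmup length is the ceiling of `warmup_ratio` times the total number of optimizer steps.

```python
import math

# Values taken from the cells above.
train_samples = 100_000
per_device_train_batch_size = 16
num_train_epochs = 1
warmup_ratio = 0.1
eval_steps = 100

# Assumption: one device, no gradient accumulation.
steps_per_epoch = math.ceil(train_samples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Assumption: warmup covers the first warmup_ratio share of all steps.
warmup_steps = math.ceil(total_steps * warmup_ratio)

# Evaluation (and checkpointing, since save_steps matches) every 100 steps.
num_evaluations = total_steps // eval_steps

print(total_steps, warmup_steps, num_evaluations)
```

With `save_total_limit=2`, only the two most recent checkpoints are kept on disk even though a checkpoint is written at every evaluation step.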