<a href="https://colab.research.google.com/github/anshupandey/Generative-AI-for-Professionals/blob/main/Embeddings_Fine_Tuning_Embedding_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning an Embedding Model

Fine-tuning an embedding model such as sentence-transformers/all-MiniLM-L6-v2 with the Transformers library can improve its performance on a specific task or dataset. Below, I'll outline the steps and provide a code example for fine-tuning this model on a dataset from the Hugging Face Hub. We'll use an emotion classification dataset for this example, but you can replace it with any dataset that suits your needs.

1. **Install Necessary Libraries**: Make sure the Transformers, Datasets, Sentence Transformers, and Accelerate libraries are installed in your environment.

2. **Load the Dataset**: We'll use the emotion dataset from Hugging Face as an example. It's a text classification dataset, which is a good fit for fine-tuning an embedding model.

3. **Preprocess the Dataset**: Tokenize the dataset and prepare it for training.

4. **Load the Pre-trained Model and Tokenizer**: We'll use the all-MiniLM-L6-v2 model.

5. **Training**: Define a training loop or use the Trainer API from Hugging Face to fine-tune the model.

6. **Evaluation**: Evaluate the fine-tuned model on a test set to see the improvements.

In [1]:
!pip install datasets sentence-transformers accelerate -U --quiet

In [2]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
from sentence_transformers import models, SentenceTransformer


In [3]:
# Step 1.1: Load the dataset
dataset = load_dataset("emotion")

# Step 1.2: Preprocess the dataset
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def preprocess_data(examples):
    # Pad or truncate every text to exactly 128 tokens so all inputs share one shape
    return tokenizer(examples['text'], padding="max_length", truncation=True, max_length=128)

tokenized_dataset = dataset.map(preprocess_data, batched=True)



In [4]:
# 1.3 Prepare the dataset for training
train_dataset = tokenized_dataset["train"].shuffle(seed=42).select(range(5000)) # Smaller subset for training
eval_dataset = tokenized_dataset["validation"].shuffle(seed=42).select(range(1000)) # Subset for evaluation

In [5]:

# Step 2.1: Load the pre-trained model
model = AutoModelForSequenceClassification.from_pretrained("sentence-transformers/all-MiniLM-L6-v2", num_labels=6)

# Step 2.2: Training setup
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)

# Step 2.3: Train the model
trainer.train()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/all-MiniLM-L6-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.2605        | 1.142817        |
| 2     | 0.4753        | 0.517019        |
| 3     | 0.2492        | 0.3398          |


TrainOutput(global_step=939, training_loss=0.9084070782564954, metrics={'train_runtime': 77.802, 'train_samples_per_second': 192.797, 'train_steps_per_second': 12.069, 'total_flos': 124389527040000.0, 'train_loss': 0.9084070782564954, 'epoch': 3.0})
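The `global_step=939` value in the output above can be reproduced by hand. The sketch below is a small arithmetic check, assuming a single device, no gradient accumulation, and that the final partial batch is kept; the helper name `training_schedule` is made up for illustration. It also shows what share of the run is covered by `warmup_steps=500`.

```python
import math

def training_schedule(num_examples, batch_size, num_epochs, warmup_steps):
    """Return (steps_per_epoch, total_steps, warmup_fraction) for one device."""
    # The last partial batch still counts as an optimizer step, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    # Warmup cannot cover more than the whole run.
    warmup_fraction = min(warmup_steps, total_steps) / total_steps
    return steps_per_epoch, total_steps, warmup_fraction

# Values taken from the training cell: 5000 examples, batch size 16, 3 epochs.
steps_per_epoch, total_steps, warmup_fraction = training_schedule(
    num_examples=5000, batch_size=16, num_epochs=3, warmup_steps=500
)
print(steps_per_epoch, total_steps, round(warmup_fraction, 2))  # 313 939 0.53
```

With these settings the learning rate is still warming up for roughly half of training, so a smaller `warmup_steps` may be worth trying on a subset this size.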

This example code fine-tunes the all-MiniLM-L6-v2 model on the emotion dataset as a six-class emotion classifier. To use another dataset from the Hugging Face Hub, replace "emotion" with its name and set the num_labels parameter in the model loading step to the number of classes in that dataset.
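One way to keep `num_labels` in sync with the data is to derive it from the list of class names instead of hard-coding it. The sketch below assumes the six label names given on the emotion dataset card; in a live session they can be checked against `dataset["train"].features["label"].names`. The variable `checkpoint` in the trailing comment is a placeholder.

```python
# Label names as listed on the emotion dataset card (an assumption here;
# verify against dataset["train"].features["label"].names in your session).
label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Mappings between integer class ids and readable names.
id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}
num_labels = len(label_names)

print(num_labels)

# These can then be passed to the model loader, for example:
# AutoModelForSequenceClassification.from_pretrained(
#     checkpoint, num_labels=num_labels, id2label=id2label, label2id=label2id)
```

Passing the mappings along with `num_labels` means predictions from the saved model can be reported by name instead of by integer id.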

In [8]:
# Step 3.1: Evaluate the model
trainer.evaluate()

{'eval_loss': 0.33979976177215576,
 'eval_runtime': 1.554,
 'eval_samples_per_second': 643.513,
 'eval_steps_per_second': 10.296,
 'epoch': 3.0}
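The evaluation output above reports only the loss, because the Trainer was created without a metrics function. A minimal sketch of an accuracy metric follows; it relies on the Trainer passing a `(logits, labels)` pair to the callback, and the toy arrays are made-up numbers used only to exercise the function.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Compute accuracy from the (logits, labels) pair that Trainer supplies."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = np.argmax(logits, axis=-1)
    accuracy = float((predictions == labels).mean())
    return {"accuracy": accuracy}

# Tiny synthetic check with 3 examples and 2 classes (made-up numbers).
fake_logits = np.array([[2.0, 0.1], [0.3, 1.5], [0.9, 0.2]])
fake_labels = np.array([0, 1, 1])
print(compute_metrics((fake_logits, fake_labels)))
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`; `trainer.evaluate()` should then include an `eval_accuracy` entry alongside `eval_loss`.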

In [6]:


# Step 3.2: Save the fine-tuned model and tokenizer
model_save_path = "./fine_tuned_model"
model.save_pretrained(model_save_path)
tokenizer.save_pretrained("./fine_tuned_tokenizer")


('./fine_tuned_tokenizer/tokenizer_config.json',
 './fine_tuned_tokenizer/special_tokens_map.json',
 './fine_tuned_tokenizer/vocab.txt',
 './fine_tuned_tokenizer/added_tokens.json',
 './fine_tuned_tokenizer/tokenizer.json')

## Evaluating the Effectiveness of the Fine-tuned Embedding Model

To evaluate a model on semantic similarity tasks before and after fine-tuning, compare the cosine similarity of the sentence embeddings generated by the standard pre-trained model with those generated by the fine-tuned model. The comparison shows how fine-tuning changed the model's handling of semantic nuances. A single sentence pair is only illustrative, so a larger set of pairs gives a more reliable picture.

The following code example demonstrates how to compute the semantic similarity between two given sentences (s1 and s2) using both the original all-MiniLM-L6-v2 model and the fine-tuned version of the same model. We'll use the cosine similarity measure for this purpose.
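For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. SciPy's `cosine` function returns the cosine distance, which is one minus this value, which is why the notebook's helper computes `1 - cosine(v1, v2)`. The dependency-free sketch below spells out the formula; the function name `cosine_sim` is chosen here for illustration.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_sim([1.0, 2.0], [2.0, 4.0]))  # parallel vectors, similarity close to 1
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors, similarity 0
```

Because the measure depends only on direction, scaling an embedding by a positive constant does not change its similarity to any other embedding.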

In [11]:
from transformers import AutoTokenizer, AutoModel
import torch
from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer


In [9]:

# Function to calculate cosine similarity
def cosine_similarity(v1, v2):
    return 1 - cosine(v1, v2)

# Function to get embeddings from a model
def get_embedding(model, tokenizer, sentence):
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state[:,0,:].numpy()  # Use the [CLS] token's embeddings
    return embeddings
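The function above takes the [CLS] token vector as the sentence embedding. The sentence-transformers pipeline for all-MiniLM-L6-v2 instead averages the token vectors (mean pooling), so it may be worth comparing both. The sketch below is one possible NumPy version of mean pooling, not the notebook's method; the name `mean_pool` and the toy arrays are illustrative.

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    last_hidden_state: array of shape (batch, seq_len, hidden)
    attention_mask:    array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., np.newaxis].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# One sentence with two real tokens and one padded position (toy values).
hidden = np.array([[[1.0, 1.0], [3.0, 5.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(hidden, mask))  # [[2. 3.]]
```

Inside `get_embedding`, this could be called as `mean_pool(outputs.last_hidden_state.numpy(), inputs["attention_mask"].numpy())` in place of the [CLS] slice.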



In [17]:
# Load tokenizer and model (standard model)
tokenizer_standard = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model_standard = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Load the fine-tuned model saved earlier at model_save_path.
# AutoModel loads only the encoder, so the classification head is not used.
model_finetuned = AutoModel.from_pretrained(model_save_path)

# Example sentences
s1 = "I have a pen."
s2 = "I own a pencil."

# Get embeddings from the standard model
embedding1_standard = get_embedding(model_standard, tokenizer_standard, s1)
embedding2_standard = get_embedding(model_standard, tokenizer_standard, s2)

# Calculate similarity with the standard model
similarity_standard = cosine_similarity(embedding1_standard[0], embedding2_standard[0])
print(f"Semantic similarity (standard model): {similarity_standard}")

# Get embeddings from the fine-tuned model
embedding1_finetuned = get_embedding(model_finetuned, tokenizer_standard, s1)  # Assuming same tokenizer
embedding2_finetuned = get_embedding(model_finetuned, tokenizer_standard, s2)

# Calculate similarity with the fine-tuned model
similarity_finetuned = cosine_similarity(embedding1_finetuned[0], embedding2_finetuned[0])
print(f"Semantic similarity (fine-tuned model): {similarity_finetuned}")

Semantic similarity (standard model): 0.794217050075531
Semantic similarity (fine-tuned model): 0.849801242351532
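A single sentence pair gives only an anecdotal comparison. Since the model was fine-tuned on emotion labels, one plausible broader check is whether sentences with the same label end up closer together than sentences with different labels. The sketch below computes that gap for any array of embeddings and labels. The function name `class_separation` and the toy vectors are hypothetical, and the function assumes at least one same-label pair and one different-label pair.

```python
import numpy as np

def class_separation(embeddings, labels):
    """Mean cosine similarity of same-label pairs minus that of different-label pairs."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    # Normalize rows so that a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Upper-triangle indices visit each unordered pair exactly once.
    i, j = np.triu_indices(len(labels), k=1)
    same = labels[i] == labels[j]
    return sims[i, j][same].mean() - sims[i, j][~same].mean()

# Toy 2-D "embeddings": two points per class, clustered by class.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [0, 0, 1, 1]
print(class_separation(toy_embeddings, toy_labels))
```

Running this on embeddings of held-out sentences from both the standard and the fine-tuned model, and comparing the two gaps, would give a more systematic view than one pair of sentences.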

