<center>
<img src="https://supportvectors.ai/logo-poster-transparent.png" width=400px style="opacity:0.8">
</center>

In [1]:
%run supportvectors-common.ipynb


<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors, and part of its training material. 
Currently, only participants in SupportVectors workshops are allowed to study the notebooks for educational purposes; copying them or using them for any other purpose without written permission is prohibited.

<b> These notebooks are chapters and sections from Asif Qamar's textbook that he is writing on Data Science. So we request you to not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



# Matryoshka Embeddings

## Introduction

Matryoshka embeddings are an approach to representation learning in machine learning and natural language processing, named after Russian nesting dolls. The embedding is trained so that its leading dimensions carry the most significant information, which means that a truncated prefix of the vector is itself a usable, smaller embedding. Just as each doll contains a smaller complete doll, each embedding contains smaller complete embeddings nested within it.

## Understanding Matryoshka Embeddings

Matryoshka embeddings aim to store more important information in earlier dimensions, and less important information in later dimensions. This allows the embeddings to be truncated to smaller sizes without significant loss of information, making them highly efficient for various downstream tasks.
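Truncation itself is just slicing. The following is a minimal sketch in plain Python, using made-up 8-dimensional vectors rather than output from any model: keep the first `k` components, then rescale to unit length so that dot products are cosine similarities again. The helper names are hypothetical and chosen only for this illustration.

```python
import math

def truncate_and_normalize(vector, k):
    """Keep the first k components and rescale the result to unit length."""
    prefix = vector[:k]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in prefix]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only; most of the mass sits in the leading dimensions.
a = [0.9, 0.3, 0.1, 0.05, 0.02, 0.01, 0.01, 0.0]
b = [0.8, 0.4, -0.1, 0.02, -0.03, 0.0, 0.02, 0.01]

for k in (8, 4, 2):
    a_k = truncate_and_normalize(a, k)
    b_k = truncate_and_normalize(b, k)
    print(k, round(cosine(a_k, b_k), 4))
```

When the leading dimensions dominate, as in these toy vectors, the similarity changes only slightly as `k` shrinks. A Matryoshka-trained model is encouraged to have this property; an ordinary model is not.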

### Key Characteristics:
- **Variable Size**: Matryoshka embeddings can be truncated to different sizes, allowing for flexibility in storage and processing.
- **Efficiency**: They enable efficient shortlisting and reranking by using smaller embeddings for initial tasks and larger ones for detailed analysis.
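The shortlisting and reranking idea can be sketched as a two-stage search. This is a toy illustration in plain Python with made-up 4-dimensional vectors; the function name and parameters are hypothetical, and a real system would typically use a vector index instead of sorting the whole corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def shortlist_then_rerank(query, corpus, short_dim, shortlist_size, top_k):
    """Two-stage search: coarse pass on truncated vectors, exact pass on full ones."""
    # Stage 1: score every document using only the first short_dim dimensions.
    coarse = sorted(
        range(len(corpus)),
        key=lambda i: cosine(query[:short_dim], corpus[i][:short_dim]),
        reverse=True,
    )[:shortlist_size]
    # Stage 2: rescore only the shortlisted documents with the full vectors.
    reranked = sorted(coarse, key=lambda i: cosine(query, corpus[i]), reverse=True)
    return reranked[:top_k]

# Made-up corpus and query for illustration only.
corpus = [
    [1.0, 0.0, 0.2, 0.1],
    [0.9, 0.1, -0.3, 0.4],
    [0.0, 1.0, 0.1, 0.0],
    [-1.0, 0.2, 0.0, 0.1],
]
query = [1.0, 0.05, 0.25, 0.1]

top = shortlist_then_rerank(query, corpus, short_dim=2, shortlist_size=2, top_k=1)
print(top)  # indices into corpus, best match first
```

The cheap first stage touches every document but only a fraction of each vector; the expensive second stage touches full vectors but only for the shortlist.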

## Applications of Matryoshka Embeddings

Matryoshka embeddings are particularly useful in scenarios where data efficiency and scalability are crucial, such as:
- **Natural Language Processing**: Enhancing tasks like nearest neighbor search and classification by using variable-size embeddings.
- **Data Storage**: Reducing storage requirements while maintaining performance by using truncated embeddings.
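The storage saving is simple arithmetic, since a dense vector table grows linearly with the embedding dimension. The sketch below assumes 32-bit floats and an illustrative corpus of one million vectors; neither figure comes from this notebook, and index overhead is ignored.

```python
BYTES_PER_FLOAT32 = 4

def index_size_bytes(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of a dense vector table, ignoring any index overhead."""
    return num_vectors * dim * bytes_per_value

num_vectors = 1_000_000  # illustrative corpus size, not from the notebook
for dim in (768, 256, 64):
    size_mb = index_size_bytes(num_vectors, dim) / 1_000_000
    print(f"dim={dim:>3}: {size_mb:,.0f} MB")
```

Going from 768 to 64 dimensions cuts the raw table to one twelfth of its size.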

## Advantages of Matryoshka Embeddings

- **Scalability**: They allow for scaling solutions to desired storage costs and processing speeds.
- **Performance**: With moderate truncation, they retain most of the performance of the full-size embedding, making them suitable for a wide range of applications. Quality does eventually degrade at very small sizes.

## Training Matryoshka Embeddings

Matryoshka embeddings are trained using a loss function that evaluates the quality of embeddings at various dimensions. This incentivizes the model to prioritize important information in the initial dimensions, ensuring that truncated embeddings remain effective.

## Detailed Explanation of Loss Functions

In the `MatryoshkaLoss` fine-tuning done in [sbert_subjects_matryoshka.py](../../src/contrastive_loss/sbert_subjects_matryoshka.py), we use an inner contrastive loss, `CoSENTLoss`, wrapped with `MatryoshkaLoss`.

### CoSENTLoss

`CoSENTLoss` is one of the popular choices for contrastive loss and has already been covered in the theory as well as in the previous lab exercise.

### MatryoshkaLoss

`MatryoshkaLoss` is a wrapper around an inner contrastive loss (like `CoSENTLoss`). It computes a total loss by taking a weighted sum of the losses at multiple truncated dimensions.

- **Multi-Dimensional Evaluation**: Unlike traditional loss functions that evaluate embeddings at a single size, `MatryoshkaLoss` evaluates them at multiple specified truncation sizes (e.g., 768, 512, 256).
- **Weighted Loss**: Each dimension's loss can be weighted differently, allowing for flexibility in how much importance is given to each dimension. In the example, equal weights are assigned to each dimension.
- **Nested Information**: By applying the inner loss to multiple truncated dimensions, `MatryoshkaLoss` pushes the most significant information into the initial dimensions, which is what gives Matryoshka embeddings their nested structure.
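Written as a formula, the weighted sum looks as follows. The notation is our own and is not taken from the library: $e_{1:d}$ denotes the first $d$ components of an embedding $e$, $d_1, \dots, d_m$ are the chosen truncation dimensions, $w_1, \dots, w_m$ are their weights, and $\mathcal{L}_{\text{inner}}$ is the wrapped loss (here `CoSENTLoss`).

$$
\mathcal{L}_{\text{Matryoshka}} = \sum_{i=1}^{m} w_i \, \mathcal{L}_{\text{inner}}\left(e_{1:d_i}\right)
$$

With equal weights $w_i = 1$, this reduces to the plain sum of the inner losses over all truncation sizes.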

#### Pseudocode for MatryoshkaLoss Calculation

Here's a pseudocode representation of how `MatryoshkaLoss` might be calculated from a `base_loss`:

In [2]:
def calculate_matryoshka_loss(embeddings, base_loss, matryoshka_dims, matryoshka_weights):
    # Expect exactly one weight per truncation dimension
    assert len(matryoshka_dims) == len(matryoshka_weights)

    total_loss = 0.0
    for dim, weight in zip(matryoshka_dims, matryoshka_weights):
        # Truncate the embeddings to the current dimension
        truncated_embeddings = truncate_embeddings(embeddings, dim)

        # Calculate the base loss for the truncated embeddings
        loss = base_loss(truncated_embeddings)

        # Accumulate the loss, weighted for the current dimension
        total_loss += weight * loss

    return total_loss

def truncate_embeddings(embeddings, dimension):
    # Keep only the first `dimension` components of each embedding
    return [embedding[:dimension] for embedding in embeddings]
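To see the weighted sum produce a number, here is a self-contained toy run of the same idea. The base loss is a hypothetical stand-in (one minus the cosine similarity of the first two rows), not `CoSENTLoss`, and the two 4-dimensional vectors are made up so that they agree perfectly in their first two dimensions but not in the last two.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_pair_loss(embeddings):
    """Hypothetical stand-in for a contrastive loss: 1 - cosine of the first two rows."""
    return 1.0 - cosine(embeddings[0], embeddings[1])

def matryoshka_total(embeddings, base_loss, dims, weights):
    """Weighted sum of the base loss over prefix-truncated embeddings."""
    return sum(
        weight * base_loss([e[:dim] for e in embeddings])
        for dim, weight in zip(dims, weights)
    )

# Made-up pair: identical in the first two dimensions, different afterwards.
embeddings = [[1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]]

total = matryoshka_total(embeddings, toy_pair_loss, dims=[4, 2], weights=[1.0, 1.0])
print(total)  # approximately 0.5: loss 0.5 at dim 4 plus loss 0.0 at dim 2
```

In this toy case the 2-dimensional prefix contributes no loss at all, so the total is driven entirely by the full-size term; during training, gradients from every term shape the leading dimensions.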

### Training using `SentenceTransformer`

The training script is available [here](../../src/contrastive_loss/sbert_subjects_matryoshka.py).

The training goal is to bring the embeddings of sentences on the same subject closer together and to push those of different subjects apart. With the Matryoshka loss added, this separation holds even when only a small number of leading dimensions (the truncated dimensions) are kept.

## Inference with Matryoshka Embeddings

Once trained, Matryoshka embeddings can be used in inference just like any other embeddings. The key difference is the ability to truncate the embeddings to a desired size, which can significantly speed up downstream tasks such as retrieval and save on storage space.

A base model, `BAAI/bge-base-en-v1.5`, was fine-tuned with `MatryoshkaLoss` on a dataset of text chunks drawn from three different subjects. Below, we compare inference with the base model and with the fine-tuned model to show how well the fine-tuned model keeps performing at lower dimensions.

### Imports

In [3]:
import torch
from sentence_transformers import SentenceTransformer
from contrastive_loss import config
from contrastive_loss.sbert_subjects_matryoshka import (convert_to_pair_dataset, 
                                                       sampled_dataset, 
                                                       get_evaluator,
                                                       get_train_test_lists,
                                                       tuples_list_to_dataset)



### Load the base model

In [4]:
# Prefer CUDA, then Apple's MPS backend, and fall back to the CPU
device = "mps" if torch.backends.mps.is_available() else "cpu"
device = "cuda" if torch.cuda.is_available() else device
# Get the base sentence transformer model
model_name = "BAAI/bge-base-en-v1.5"
model = SentenceTransformer(model_name).to(device)

### Check cosine similarities for some sample sentences

In [5]:
sentence1 = "The rate of change of displacement is velocity"
sentence2 = "Kidney plays an important role in purifying blood"
sentence3 = "Many countries obtained their freedom by 1950"
sentence4 = "Force is proportional to mass"
sentence5 = "Vaccines train our immune system to create antibodies"
sentence6 = "World war 2 was a global conflict between two coalitions - the allies and the axis powers"

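# Order the sentences so that each subject's pair is adjacent (physics, biology, history);
# same-subject similarities then appear as 2x2 blocks along the diagonal of the matrix
sentences = [sentence1, sentence4, sentence2, sentence5, sentence3, sentence6]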

In [6]:
from rich import print as rprint

In [7]:
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
rprint(similarities)

### Create the evaluator to evaluate the model before and after training

In [8]:
# pick chunks labeled with subjects (biology, physics, history assigned to labels 0, 1, 2 respectively)
_, test = get_train_test_lists(cfg=config)

# Convert to Dataset format
test_dataset = tuples_list_to_dataset(test)

# Sample at most 500 chunks per label, so the paired dataset has at most 1500*1499/2 pairs
test_dataset = sampled_dataset(test_dataset)

# Create the paired dataset consisting of (sentence1, sentence2, score) from the text/label dataset
test_dataset = convert_to_pair_dataset(test_dataset)

binary_acc_evaluator = get_evaluator(test_dataset=test_dataset)

Filter:   0%|          | 0/5331 [00:00<?, ? examples/s]
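The pairing step can be pictured as taking every unordered pair of labeled texts. The sketch below is an assumption about what such a conversion might look like, not the code of `convert_to_pair_dataset`: the function name is hypothetical, and scoring same-subject pairs as 1.0 and different-subject pairs as 0.0 is our guess at the convention. It also confirms the pair count quoted in the cell above.

```python
from itertools import combinations

def labeled_texts_to_pairs(labeled_texts):
    """Turn (text, label) tuples into (text_a, text_b, score) triples.

    Assumed convention: score 1.0 for the same label, 0.0 otherwise.
    """
    return [
        (text_a, text_b, 1.0 if label_a == label_b else 0.0)
        for (text_a, label_a), (text_b, label_b) in combinations(labeled_texts, 2)
    ]

# Made-up examples; labels follow the notebook's mapping (biology=0, physics=1).
toy = [("cells divide", 0), ("force and mass", 1), ("blood and kidneys", 0)]
pairs = labeled_texts_to_pairs(toy)
print(len(pairs))

# Upper bound on pairs for 3 labels x 500 samples = 1500 texts
print(1500 * 1499 // 2)
```

The number of pairs grows quadratically with the number of texts, which is why the dataset is sampled down before pairing.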


### Evaluate model before training

In [9]:
results = binary_acc_evaluator(model)
rprint(results)
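Assuming `get_evaluator` returns a binary classification evaluator, as the variable name suggests, the reported scores come from thresholding a similarity: a pair is predicted "same subject" when its similarity is at or above a cutoff. The sketch below shows how an F1 score follows from that rule. The function name, similarities, and labels are made up for illustration and are not the evaluator's actual code or output.

```python
def f1_at_threshold(similarities, labels, threshold):
    """F1 for the rule 'predict same-subject when similarity >= threshold'."""
    predictions = [s >= threshold for s in similarities]
    tp = sum(p and l == 1 for p, l in zip(predictions, labels))
    fp = sum(p and l == 0 for p, l in zip(predictions, labels))
    fn = sum((not p) and l == 1 for p, l in zip(predictions, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up pair similarities and ground-truth labels (1 = same subject).
sims = [0.91, 0.85, 0.40, 0.62, 0.30]
labels = [1, 1, 0, 1, 0]

for threshold in (0.5, 0.7):
    print(threshold, round(f1_at_threshold(sims, labels, threshold), 3))
```

The score depends on the chosen cutoff, so an evaluator of this kind usually searches for the threshold that gives the best result.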

### Load the fine-tuned Matryoshka model and repeat

In [16]:
import glob

results_dir = config["paths"]["results_dir"]
results_sub_dir = "subject-based-encoder-matryoshka"

# Find the latest checkpoint directory
checkpoint_pattern = f'{results_dir}/{results_sub_dir}/checkpoint-*' 
checkpoint_dirs = glob.glob(checkpoint_pattern)

# Sort numerically by the step number in the directory name and take the latest
latest_checkpoint = sorted(checkpoint_dirs, key=lambda x: int(x.split('checkpoint-')[-1]))[-1]
finetuned_model_dir = latest_checkpoint
print(f"Using latest checkpoint: {finetuned_model_dir}")

# Load the model
model = SentenceTransformer(finetuned_model_dir).to(device)

embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
rprint(similarities)

Using latest checkpoint: /Users/chandarl/results/subject-based-encoder-matryoshka/checkpoint-1500
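The checkpoint lookup above raises an `IndexError` when no checkpoint exists, and a `ValueError` if a stray directory such as `checkpoint-final` matches the pattern. A slightly more defensive variant, sketched with `pathlib` and a hypothetical helper name, returns `None` in the first case and skips non-numeric suffixes. The demonstration uses a temporary directory rather than the real results folder.

```python
import re
import tempfile
from pathlib import Path

def latest_checkpoint(run_dir):
    """Return the checkpoint-<step> directory with the largest step, or None."""
    candidates = []
    for path in Path(run_dir).glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if path.is_dir() and match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare step numbers as integers, so that 1500 ranks above 500.
    return max(candidates, key=lambda item: item[0])[1]

# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 1500):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    print(latest_checkpoint(tmp).name)
```

Comparing the step numbers as integers matters: as strings, `checkpoint-500` sorts after `checkpoint-1500`.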


### Evaluate model after training

In [11]:
results = binary_acc_evaluator(model)
rprint(results)

### Evaluate with truncated dimensions

Because the model was fine-tuned with the added Matryoshka loss terms, its sentence embeddings are trained to work well even at dimensions lower than the one the original model was pre-trained with.

We will try a reduced dimension of 64, which is simply a configurable parameter (`truncate_dim`) in the `SentenceTransformer` call below. Of course, the model must first have been trained with these additional Matryoshka loss terms and embedding dimensions. The training of these embeddings is covered in [`sbert_subjects_matryoshka.py`](../../src/contrastive_loss/sbert_subjects_matryoshka.py).

In [12]:
model = SentenceTransformer(finetuned_model_dir, truncate_dim=64).to(device)
embeddings = model.encode(sentences)
rprint(embeddings.shape)
similarities = model.similarity(embeddings, embeddings)
rprint(similarities)

In [13]:
results = binary_acc_evaluator(model)
rprint(results)

### Evaluate at even lower dimensions (8)

In [14]:
model = SentenceTransformer(finetuned_model_dir, truncate_dim=8).to(device)
results = binary_acc_evaluator(model)
rprint(results)

### Evaluate at 4 dimensions

In [17]:
model = SentenceTransformer(finetuned_model_dir, truncate_dim=4).to(device)
results = binary_acc_evaluator(model)
rprint(results)

### Evaluate at 2 dimensions

In [18]:
model = SentenceTransformer(finetuned_model_dir, truncate_dim=2).to(device)
results = binary_acc_evaluator(model)
rprint(results)

### And finally, evaluate at 1 dimension (of course, it does not do well here)

In [19]:
model = SentenceTransformer(finetuned_model_dir, truncate_dim=1).to(device)
results = binary_acc_evaluator(model)
rprint(results)

The key takeaway from this example is that the F1 score remains very high at all of the lower dimensions, all the way down to 2. Only when we go down to 1 dimension does the F1 score degrade significantly.

This is partly because, in this example, the only goal of the sentence encoder was to place text of the same subject close together and text of different subjects far apart, and there are only three distinct subjects. Hence we can achieve this separation even at a dimensionality as low as 2. This will not be the case in general.
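A small geometric check makes the 2-versus-1 dimension gap plausible. This is an idealized illustration, not a description of what the fine-tuned model actually learns: in 2 dimensions, three subjects can sit on directions 120 degrees apart, so every cross-subject cosine is negative. In 1 dimension, cosine similarity depends only on sign, and with three subjects but two signs, at least two subjects must share a sign and become indistinguishable.

```python
import math
from itertools import combinations, product

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# 2D: one direction per subject, spaced 120 degrees apart.
directions_2d = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3)) for k in range(3)
]
cross_2d = [cosine(u, v) for u, v in combinations(directions_2d, 2)]
print([round(c, 3) for c in cross_2d])  # every cross-subject cosine is -0.5

# 1D: try every assignment of a sign to the three subjects and record the
# highest cross-subject cosine; take the best (lowest) case over all assignments.
best_case_1d = min(
    max(cosine((a,), (b,)) for a, b in combinations(signs, 2))
    for signs in product([-1.0, 1.0], repeat=3)
)
print(best_case_1d)  # 1.0: some pair of subjects always coincides
```

With more than three subjects, or with finer-grained notions of similarity, 2 dimensions would run out of room in the same way, which is why the result here does not generalize.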

## Conclusion

Matryoshka embeddings offer a flexible alternative to fixed-size embeddings, especially where storage cost and retrieval speed matter. Because a single model yields usable embeddings at several sizes, the embedding dimension can be chosen at inference time to trade accuracy against cost, which makes them a valuable tool in the data scientist's toolkit.

## References

- [Matryoshka Representation Learning - Original Paper](https://arxiv.org/abs/2205.13147)
- [Matryoshka Embeddings - Sentence Transformers Documentation](https://sbert.net/examples/sentence_transformer/training/matryoshka/README.html)
- [MatryoshkaLoss Implementation on GitHub](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py)