<a href="https://colab.research.google.com/github/anastaszi/GenAI/blob/main/Fine_Tuning_LLMs_and_Embedding_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Embedding Models + LLMs

As we saw in the previous notebooks, embedding models are useful for mapping natural language to vectors that downstream LLMs consume. Pipelines often include multiple models, so when fine-tuning a pipeline, several of those models can be fine-tuned to better account for the nuances in your data. LLMs may leverage pre-trained word embeddings as part of their input or initialization, allowing them to benefit from the semantic information captured by the embedding models. The embedding models provide a foundation for understanding the meanings and relationships of individual words, which LLMs can build upon to generate coherent and contextually appropriate text.


## Why Fine Tune an Embedding Model
For example, say you're working with very specific legal text, text that's very different from what most embedding models are trained on. If we used an embedding model out of the box, we'd likely lose the nuanced context in the legal data. To avoid that, we could instead fine-tune the last several layers of our embedding model to better account for the context. As with CNNs for image recognition, the lower layers of an embedding model are good at learning general patterns in words: parts of speech, basic syntax, basic grammar, and so on. The higher layers in the network learn much more contextual and task-specific information, such as contextualized representations and semantic relationships. If we fine-tune only those layers, we can use compute efficiently: we keep the general relationships learned by the lower layers while customizing the higher layers to our task and context.
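
As a rough illustration of the layer-selection idea, the sketch below marks only the top encoder layers as trainable. It is a minimal sketch and not what the rest of this notebook does: the helper `is_trainable`, the choice of 6 total and 2 trainable layers, and the assumption that parameter names contain `encoder.layer.<index>.` (the naming convention of Hugging Face BERT-style models) are all illustrative.

```python
import re

# Assumption: parameter names embed the layer index as "encoder.layer.<index>."
LAYER_PATTERN = re.compile(r"encoder\.layer\.(\d+)\.")


def is_trainable(param_name: str, num_layers: int, num_trainable: int) -> bool:
    """Return True if the parameter belongs to one of the top `num_trainable` layers."""
    match = LAYER_PATTERN.search(param_name)
    if match is None:
        # Embeddings and other non-layer parameters stay frozen in this sketch.
        return False
    layer_index = int(match.group(1))
    return layer_index >= num_layers - num_trainable


# Hypothetical parameter names, used only to exercise the helper.
example_names = [
    "embeddings.word_embeddings.weight",
    "encoder.layer.0.attention.self.query.weight",
    "encoder.layer.4.output.dense.weight",
    "encoder.layer.5.output.dense.weight",
]
trainable = [
    name for name in example_names
    if is_trainable(name, num_layers=6, num_trainable=2)
]
print(trainable)

# With a real PyTorch model, the same rule could be applied as:
# for name, param in model.named_parameters():
#     param.requires_grad = is_trainable(name, num_layers=6, num_trainable=2)
```

Freezing by parameter name keeps the rule independent of any particular model class, at the cost of relying on the naming convention assumed above.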

## Existing Solutions
Fine-tuning embedding models isn't new: institutions have been doing it for years and releasing the resulting models to the public. It's always best to check whether an existing model has already been trained on a similar task with similar text. For example, [here](https://huggingface.co/ipuneetrathore/bert-base-cased-finetuned-finBERT) is a Hugging Face link to a fine-tuned version of BERT.


### Hardware Considerations
Research and documentation on fine-tuning embedding models often refer to using a GPU. This is because fine-tuning often implies large amounts of data. Depending on data size and model size, it's certainly possible to fine-tune on CPUs. We'll use a small dataset in the example below, and a CPU will provide more than enough computing power.



The example below is adapted from [here](https://huggingface.co/blog/how-to-train-sentence-transformers).

We'll be using a Hugging Face sample dataset, so we'll need to ensure the `datasets` library is installed.

In [None]:
%pip install datasets



We'll be fine-tuning a model with the `sentence-transformers` library, namely the `distilroberta-base` model. `distilroberta-base` is a variant of RoBERTa (a robustly optimized BERT pretraining approach), which itself is based on the Transformer architecture. The "distil" in the name stands for "distillation" and indicates that this model is a distilled version of the original RoBERTa model, aimed at being smaller and faster while maintaining performance similar to the larger model.

The distilroberta-base model follows the Transformer architecture, which includes stacked self-attention layers and feed-forward neural networks. It consists of multiple layers, each having a certain number of attention heads and hidden units.

The "base" in the model name suggests that it is one of the base configurations available for the RoBERTa model. It has a smaller number of parameters compared to larger variants of the RoBERTa model, making it more lightweight.

Like RoBERTa, `distilroberta-base` uses a subword tokenization technique called Byte-Pair Encoding (BPE), which breaks words down into subword units to handle out-of-vocabulary words and improve generalization.
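
To make the merging idea behind BPE concrete, here is a toy character-level sketch. It is only an illustration under simplifying assumptions: the corpus, its word frequencies, and the helper names are made up, and the real tokenizer works on bytes with a much larger learned vocabulary.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Toy corpus: each word is a tuple of characters with a made-up frequency.
words = {tuple("lower"): 5, tuple("lowest"): 3, tuple("newer"): 6}
merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)  # [('w', 'e'), ('we', 'r'), ('l', 'o')]
print(words)
```

Each merge turns the most frequent adjacent pair into a new vocabulary symbol, so frequent fragments such as `wer` end up as single tokens while rare words remain split into smaller pieces.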

In [None]:
from sentence_transformers import SentenceTransformer, models

## Step 1: use an existing language model
word_embedding_model = models.Transformer('distilroberta-base')

## Step 2: use a pool function over the token embeddings
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

## Join steps 1 and 2 using the modules argument
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])


Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
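
The pooling step in the code above collapses the per-token vectors into a single sentence vector. Assuming mean pooling (which, to the best of our knowledge, is the default mode of `models.Pooling`), the operation can be sketched in plain Python as follows. The 2-dimensional vectors and the helper name `mean_pool` are made up for illustration.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                totals[j] += value
    return [total / count for total in totals]


# Three real tokens plus one padding position, each a made-up 2-dimensional vector.
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)  # [3.0, 4.0]
```

The padding vector is ignored, so sentences of different lengths in the same batch still produce comparable fixed-size embeddings.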


We'll use a dataset hosted on the Hugging Face Hub: the `QQP_triplets` dataset, which contains content from the Quora website. Each example is a dictionary with three keys (`query`, `pos`, and `neg`) that together form a set of triplets. The first key contains an anchor sentence, the second a list with a positive sentence, and the third a list of negative sentences. As the printed example further below shows, each record is additionally wrapped in an outer `set` key.

```
{"query": [anchor], "pos": [positive], "neg": [negative1, negative2, ..., negativeN]}
{"query": [anchor], "pos": [positive], "neg": [negative1, negative2, ..., negativeN]}
...
{"query": [anchor], "pos": [positive], "neg": [negative1, negative2, ..., negativeN]}
```

Note that this dataset is organized around semantic meaning. Because of this, we can use its labeled positives and negatives to train models of semantic equivalence.
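
To see how one such record turns into training triplets, the sketch below expands a single made-up record into (anchor, positive, negative) tuples. The record contents and the helper name `expand_triplets` are illustrative assumptions. Note also that the notebook itself later keeps only the first positive and first negative of each record rather than every combination.

```python
def expand_triplets(record):
    """Yield one (anchor, positive, negative) tuple per positive/negative combination."""
    anchor = record["query"]
    for positive in record["pos"]:
        for negative in record["neg"]:
            yield (anchor, positive, negative)


# Made-up record following the query / pos / neg layout described above.
record = {
    "query": "How do I learn Python?",
    "pos": ["What is the best way to learn Python?"],
    "neg": ["How do I learn Java?", "What is a python snake?"],
}

triplets = list(expand_triplets(record))
for triplet in triplets:
    print(triplet)
```

With one positive and two negatives, the record yields two triplets that share the same anchor and positive.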

In [None]:
from datasets import load_dataset

dataset_id = "embedding-data/QQP_triplets"
dataset = load_dataset(dataset_id)

Found cached dataset json (/Users/marymoesta/.cache/huggingface/datasets/embedding-data___json/embedding-data--QQP_triplets-1f161ec5c28ee86f/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)


  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
dataset

Let's look at some basic stats about the Quora data we're using for fine-tuning:

In [None]:
print(f"- The {dataset_id} dataset has {dataset['train'].num_rows} examples.")
print(f"- Each example is a {type(dataset['train'][0])} with a {type(dataset['train'][0]['set'])} as value.")
print(f"- Examples look like this: {dataset['train'][0]}")

- The embedding-data/QQP_triplets dataset has 101762 examples.
- Each example is a <class 'dict'> with a <class 'dict'> as value.
- Examples look like this: {'set': {'query': 'Why in India do we not have one on one political debate as in USA?', 'pos': ['Why cant we have a public debate between politicians in India like the one in US?'], 'neg': ['Can people on Quora stop India Pakistan debate? We are sick and tired seeing this everyday in bulk?', 'Why do politicians, instead of having a decent debate on issues going in and around the world, end up fighting always?', 'Can educated politicians make a difference in India?', 'What are some unusual aspects about politics and government in India?', 'What is debate?', 'Why does civic public communication and discourse seem so hollow in modern India?', 'What is a Parliamentary debate?', "Why do we always have two candidates at the U.S. presidential debate. yet the ballot has about 7 candidates? Isn't that a misrepresentation of democracy?", 'Wh

As mentioned above, fine-tuning can be costly and time-consuming. To speed things up for this course, we'll use only half of the available data. In the code below, we loop over the first `n_examples` records and append each one to a `train_examples` list for fine-tuning. Note that each `InputExample` keeps only the first positive and the first negative of a record.

In [None]:
from sentence_transformers import InputExample

train_examples = []
train_data = dataset['train']['set']
# For agility, we use only half of the available data
n_examples = dataset['train'].num_rows // 2

for i in range(n_examples):
    example = train_data[i]
    # Build one (anchor, positive, negative) triplet from the first positive and first negative
    train_examples.append(InputExample(texts=[example['query'], example['pos'][0], example['neg'][0]]))


`sentence-transformers` trains with `PyTorch`, so the `.fit` call expects a `DataLoader` object. We'll instantiate the `DataLoader` with `shuffle` set to `True`, so the data is reshuffled before each epoch, and `batch_size` set to 16, so each batch contains 16 records.

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
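
As a quick sanity check on these settings, the arithmetic below works out how many batches one epoch contains. It assumes the 101762-row train split reported earlier and that the final, smaller batch is kept rather than dropped (the `DataLoader` default).

```python
import math

total_rows = 101762            # size of the train split reported earlier
n_examples = total_rows // 2   # half of the data, as in the loop above
batch_size = 16

# The last batch is smaller when n_examples is not a multiple of batch_size.
batches_per_epoch = math.ceil(n_examples / batch_size)
last_batch_size = n_examples - (batches_per_epoch - 1) * batch_size

print(n_examples, batches_per_epoch, last_batch_size)  # 50881 3181 1
```

The 3181 batches match the iteration count in the training progress bar at the end of this notebook.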

Triplet loss is a loss function used in training neural networks for learning embeddings, particularly in tasks involving similarity or distance learning. It is commonly used in tasks like face recognition, image retrieval, and information retrieval, where the goal is to learn representations that can accurately measure the similarity or dissimilarity between data points.

The basic idea behind triplet loss is to encourage the neural network to map similar data points closer together in the embedding space and push dissimilar data points farther apart. The loss is computed using triplets of data points: an anchor point, a positive example (similar to the anchor), and a negative example (dissimilar to the anchor).

Training minimizes this loss by adjusting the model's parameters. The loss for a triplet reaches zero once the anchor is closer to the positive than to the negative by at least a specified margin, so the model is pushed to separate the two distances by that margin.

By optimizing the triplet loss, the model learns to map similar data points closer together and to separate dissimilar data points in the embedding space, enabling better similarity measurements and more effective retrieval or recognition. However, collecting suitable triplets (anchor, positive, and negative examples) can be challenging, and triplet quality is crucial to the success of triplet-loss-based learning.
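
The margin-based formulation described above is commonly written as `max(d(anchor, positive) - d(anchor, negative) + margin, 0)`. The sketch below computes it in plain Python, assuming Euclidean distance and an arbitrary margin of 2.0. The vectors are made up, and the helper names are illustrative rather than part of any library.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)"""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(gap + margin, 0.0)


anchor = [0.0, 0.0]
positive = [3.0, 4.0]        # distance 5 from the anchor
far_negative = [6.0, 8.0]    # distance 10 from the anchor
near_negative = [0.0, 6.0]   # distance 6 from the anchor

easy = triplet_loss(anchor, positive, far_negative, margin=2.0)
hard = triplet_loss(anchor, positive, near_negative, margin=2.0)
print(easy)  # 0.0 (5 - 10 + 2 is negative, so the triplet is already satisfied)
print(hard)  # 1.0 (5 - 6 + 2: the negative is not yet far enough away)
```

Only triplets that violate the margin contribute a gradient, which is why hard negatives tend to matter most during training.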

In [None]:
from sentence_transformers import losses

train_loss = losses.TripletLoss(model=model)

Let's now fit the model. Note that we are only training for 1 epoch for the sake of time and compute resources. If you have the time and resources, you can try training for more epochs.

In [None]:
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/3181 [00:00<?, ?it/s]
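
Once training completes, the resulting embeddings are usually compared with a similarity measure such as cosine similarity. The sketch below assumes made-up 2-dimensional vectors standing in for real model outputs (for example from `model.encode`), and the helper name `cosine_similarity` is illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up embeddings: a query, a paraphrase of it, and an unrelated sentence.
query = [1.0, 0.0]
paraphrase = [3.0, 4.0]
unrelated = [0.0, 2.0]

sim_paraphrase = cosine_similarity(query, paraphrase)
sim_unrelated = cosine_similarity(query, unrelated)
print(sim_paraphrase, sim_unrelated)
```

A well fine-tuned model should score paraphrases noticeably higher than unrelated sentences, as the toy vectors do here.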