## Training an embedding model

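Let's try contrastive learning: fine-tune a small model (`all-MiniLM-L6-v2`) so that its search results move toward those of a target model (`quora-distilbert-multilingual`).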

In [1]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
target_model = SentenceTransformer("quora-distilbert-multilingual")



Dataset to train on: Quora

In [2]:
from datasets import load_dataset

dataset = load_dataset("quora", trust_remote_code=True)["train"]
dataset

Dataset({
    features: ['questions', 'is_duplicate'],
    num_rows: 404290
})

In [3]:
# Collect all the questions, then deduplicate
corpus_questions = []
for d in dataset:
    corpus_questions.append(d["questions"]["text"][0])
    corpus_questions.append(d["questions"]["text"][1])
corpus_questions = list(set(corpus_questions))  # Remove duplicates
print(len(corpus_questions))

537362


In [4]:
questions_to_embed = 10_000

target_embeddings = target_model.encode(
    corpus_questions[:questions_to_embed],
    # show_progress_bar=True,
    convert_to_tensor=True,
)

current_embeddings = model.encode(
    corpus_questions[:questions_to_embed],
    # show_progress_bar=True,
    convert_to_tensor=True,
)

In [5]:
print(target_embeddings.shape)
print(current_embeddings.shape)

torch.Size([10000, 768])
torch.Size([10000, 384])


In [57]:
from sentence_transformers import util

def search(query, model, embeddings, n=5, show=True):
    """Return the top-n corpus hits for `query`, ranked by cosine similarity."""
    query_embedding = model.encode(query, convert_to_tensor=True)
    # semantic_search scores with cosine similarity by default
    results = util.semantic_search(query_embedding, embeddings, top_k=n)

    # Optionally print the top-n results as "score<TAB>question"
    if show:
        for result in results[0][:n]:
            print(
                "{:.3f}\t{}".format(result["score"], corpus_questions[result["corpus_id"]])
            )
    return results[0][:n]
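
To make the ranking step concrete, here is a minimal pure-Python sketch of what a cosine-similarity top-k search does. It is not the library's implementation: `cosine_similarity`, `toy_semantic_search`, and the 2-d vectors are hypothetical names and made-up data, chosen only to mirror the `corpus_id` / `score` result shape used above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_semantic_search(query_vec, corpus_vecs, top_k=5):
    """Hypothetical helper: score every corpus vector, keep the top_k best."""
    hits = [
        {"corpus_id": i, "score": cosine_similarity(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    ]
    return sorted(hits, key=lambda hit: hit["score"], reverse=True)[:top_k]

toy_corpus = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]  # made-up 2-d "embeddings"
toy_hits = toy_semantic_search([1.0, 0.0], toy_corpus, top_k=2)
print([hit["corpus_id"] for hit in toy_hits])  # [1, 2]
```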

In [7]:
python_query = "How can I learn Python online?"
search(python_query, target_model, target_embeddings, 10)

0.961	What is the best way to learn Python and django?
0.949	How do I run Python Code on Sublime Text 3?
0.949	How can I run Python 2.7 code if I have Python 3.4 installed?
0.948	Where do I learn python in Mumbai?
0.944	How can I learn to write idiomatic Python?
0.942	How long does it take to learn the basics of Python before you can do anything fun/interesting?
0.936	How is Python being used at Facebook?
0.928	Which is the best book to learn Python?
0.926	Should I learn c++ or Java before learning Python?
0.926	What is best book to learn GUI programming with Python?


In [8]:
search(python_query, model, current_embeddings, 10)

0.744	Which is the best books or online courses for learning python from basic to advanced?
0.647	Which is the best book to learn Python?
0.628	What is the best way to learn Python and django?
0.615	Where do I learn python in Mumbai?
0.597	How long does it take to learn the basics of Python before you can do anything fun/interesting?
0.581	How can I learn to write idiomatic Python?
0.559	What is best book to learn GUI programming with Python?
0.559	How do I learn programming for free?
0.521	Should I learn c++ or Java before learning Python?
0.512	What is the best place to learn Spanish online?


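Observations:

- The target model gives much higher similarity scores to its top 5 results (0.944 to 0.961) than the current model does (0.597 to 0.744).
- The two models return different questions, and the ones they share appear in a different order.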
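
One way to put a number on "different questions returned" is the overlap between the two top-5 lists. The sketch below uses Jaccard overlap on the top-5 outputs printed above; `topk_overlap` is a hypothetical helper, not something used elsewhere in this notebook.

```python
def topk_overlap(results_a, results_b):
    """Jaccard overlap between two result lists, ignoring order."""
    set_a, set_b = set(results_a), set(results_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Top 5 for "How can I learn Python online?" copied from the outputs above
target_top5 = [
    "What is the best way to learn Python and django?",
    "How do I run Python Code on Sublime Text 3?",
    "How can I run Python 2.7 code if I have Python 3.4 installed?",
    "Where do I learn python in Mumbai?",
    "How can I learn to write idiomatic Python?",
]
current_top5 = [
    "Which is the best books or online courses for learning python from basic to advanced?",
    "Which is the best book to learn Python?",
    "What is the best way to learn Python and django?",
    "Where do I learn python in Mumbai?",
    "How long does it take to learn the basics of Python before you can do anything fun/interesting?",
]

overlap = topk_overlap(target_top5, current_top5)
print(overlap)  # 0.25 (2 shared questions out of 8 distinct ones)
```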

### Building a dataset to train

We want to build a contrastive learning dataset.

Can we actually "move the needle?"

Example dataset: https://huggingface.co/datasets/sentence-transformers/all-nli



In [9]:
sample_nli_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="train")

In [10]:
print(sample_nli_dataset)

Dataset({
    features: ['anchor', 'positive', 'negative'],
    num_rows: 557850
})


In [11]:
sample_nli_dataset[0]

{'anchor': 'A person on a horse jumps over a broken down airplane.',
 'positive': 'A person is outdoors, on a horse.',
 'negative': 'A person is at a diner, ordering an omelette.'}

Let's think of this as `query`, `good answer`, `bad answer` and use this to construct a toy example on the Quora dataset.

In [None]:
from datasets import Dataset


train_data_size = 1_000
# train_data_size = 100
queries = corpus_questions[questions_to_embed:questions_to_embed+train_data_size]


def build_example_from_query(query, n=5):
    """Build one (anchor, positive, negative) triplet for `query`."""
    # Search the top n results with both the target and the current model
    target_results = search(query, target_model, target_embeddings, n, show=False)
    current_results = search(query, model, current_embeddings, n, show=False)
    # Positive: the target model's best hit; negative: the current model's n-th (last) hit
    most_positive = target_results[0]["corpus_id"]
    most_negative = current_results[n-1]["corpus_id"]
    return {
        "anchor": query,
        "positive": corpus_questions[most_positive],
        "negative": corpus_questions[most_negative],
    }

# Build the training dataset as a Hugging Face Dataset

def gen():
    for query in queries:
        yield build_example_from_query(query)

quora_train_ds = Dataset.from_generator(gen)


# train_data = []
# # takes 40s on Mac
# for query in queries:
#     triplet = build_example_from_query(query)
#     train_data.append(triplet)
#     # print(triplet)
    
    
    

Generating train split: 1000 examples [00:35, 28.15 examples/s]


In [48]:
quora_train_ds[100]

{'anchor': 'Was Um Bongo ever available in the Congo?',
 'positive': 'What is Mongoose?',
 'negative': 'Does Brazil have a special economic zone?'}

### Train the model

With triplet loss
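
As a reminder of what a triplet loss computes, here is a minimal pure-Python sketch: the loss is zero once the negative is at least `margin` farther from the anchor than the positive. Euclidean distance and a margin of 5.0 reflect my understanding of the `losses.TripletLoss` defaults, so treat both as assumptions; the 2-d vectors are made up.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=5.0):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0).

    margin=5.0 is an assumed default, not read from the library.
    """
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

anchor = [0.0, 0.0]
positive = [0.0, 1.0]        # distance 1 from the anchor
near_negative = [3.0, 4.0]   # distance 5 from the anchor
far_negative = [6.0, 8.0]    # distance 10 from the anchor

print(triplet_loss(anchor, positive, near_negative))  # 1.0
print(triplet_loss(anchor, positive, far_negative))   # 0.0
```

Under that assumed margin, an average training loss close to 5 would suggest that positives are, on average, only slightly closer to their anchors than negatives are.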

In [49]:
from sentence_transformers import SentenceTransformerTrainer, losses

In [None]:
loss = losses.TripletLoss(model=model)
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=quora_train_ds,
    loss=loss,
)

                                                                     

In [52]:
trainer.train()



TrainOutput(global_step=375, training_loss=4.7815302734375, metrics={'train_runtime': 89.1778, 'train_samples_per_second': 33.641, 'train_steps_per_second': 4.205, 'total_flos': 0.0, 'train_loss': 4.7815302734375, 'epoch': 3.0})

In [53]:
trainer.save_model(output_dir="./finetuned-quora-model")

Did we change the behavior of the model?

In [54]:
trained_model = SentenceTransformer("./finetuned-quora-model")
og_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

In [55]:
new_embeddings = trained_model.encode(
    corpus_questions[:questions_to_embed],
    # show_progress_bar=True,
    convert_to_tensor=True,
)

In [59]:
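# compare some examples
# Caveat: `model` was fine-tuned in place by `trainer.train()`, so the middle block below
# pairs a fine-tuned query encoder with pre-training corpus embeddings;
# encode the query with `og_model` instead for a true "before" baseline.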
python_query = "How can I learn Python online?"
_ = search(python_query, target_model, target_embeddings, 5)
print('---------------------------------------------------------')
_ = search(python_query, model, current_embeddings, 5)
print('---------------------------------------------------------')
_ = search(python_query, trained_model, new_embeddings, 5)

0.961	What is the best way to learn Python and django?
0.949	How do I run Python Code on Sublime Text 3?
0.949	How can I run Python 2.7 code if I have Python 3.4 installed?
0.948	Where do I learn python in Mumbai?
0.944	How can I learn to write idiomatic Python?
---------------------------------------------------------
0.662	Which is the best book to learn Python?
0.635	Which is the best books or online courses for learning python from basic to advanced?
0.602	How long does it take to learn the basics of Python before you can do anything fun/interesting?
0.602	What is the best way to learn Python and django?
0.598	How do I install a python program on a random PC?
---------------------------------------------------------
0.978	How can I learn to write idiomatic Python?
0.978	How long does it take to learn the basics of Python before you can do anything fun/interesting?
0.975	Which is the best books or online courses for learning python from basic to advanced?
0.975	Which is the best boo

In [62]:
query = corpus_questions[200001]
print(f'{query}')
print('---------------------------------------------------------')
_ = search(query, target_model, target_embeddings, 5)
print('---------------------------------------------------------')
_ = search(query, model, current_embeddings, 5)
print('---------------------------------------------------------')
_ = search(query, trained_model, new_embeddings, 5)

What u all think giving a bunch of assignments to complete in engineering is worthful.if not, then why it is not stopped yet?
---------------------------------------------------------
0.935	What are the best part time jobs while doing engineering?
0.934	I am having trouble getting a job in the Mechanical Engineering field.  My resume’ is excellent.  My interviewing I have been told is “very good”.  My references are good.  I had my resume’ professionally done.  Should I get a Masters?
0.930	I am 25 years old and still not finished my engineering degree. I still two more years remaining. What should I do?
0.928	Hi Ashwin, I wish to apply for masters in mechanical Engineering management. Will work experience play an important role?
0.922	Which is better, mechanical or computer science engineering in terms of jobs after graduation?
---------------------------------------------------------
0.579	Why are Germans so good at engineering?
0.575	What happens when we complete engineering in 5 ye

In [65]:
query = corpus_questions[300_007]
print(f'{query}')
print('---------------------------------------------------------')
_ = search(query, target_model, target_embeddings, 5)
print('---------------------------------------------------------')
_ = search(query, model, current_embeddings, 5)
print('---------------------------------------------------------')
_ = search(query, trained_model, new_embeddings, 5)

What is a SaaS company?
---------------------------------------------------------
0.907	What is the typical revenue per employee for a B2B SaaS company?
0.887	How do I validate SaaS idea?
0.887	What is DaaS?
0.878	Who are some of the leading local IaaS cloud service providers in India?
0.866	What are good product companies where SAP ABAP/SAP HCM jobs are there?
---------------------------------------------------------
0.561	What's the difference between SaaS and the Cloud?
0.528	How do I validate SaaS idea?
0.388	What are un, saarc, oecd?
0.316	What are the commonly used final gear ratios and tyre size combinations for supra sae cars for a 43 bhp KTM engine?
0.305	What is the typical revenue per employee for a B2B SaaS company?
---------------------------------------------------------
0.978	How do I validate SaaS idea?
0.958	What is the typical revenue per employee for a B2B SaaS company?
0.532	What's the difference between SaaS and the Cloud?
0.464	How is South Africa's economy doing 

Still a long way to go: the fine-tuned model's top results still differ from the target model's.

In [66]:
quora_train_ds.to_csv('./data/quora-own-dataset.csv')

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 45.96ba/s]


189021

Prompt to an LLM:

I am building a contrastive learning dataset with TripletLoss. As such, I must determine the quality of each record of data. Can you tell me if this is a good triplet for my dataset or it's way off?

{'anchor': 'What are differences between IELTS and IELTS for ukvi?', 'positive': 'How are Verilog-A and Verilog-AMS utilized in the industry?', 'negative': 'Can I crack IELTS exam without coaching?'}


Response:

```
This triplet is way off for a contrastive learning dataset using TripletLoss.

Let's break it down:

✅ What Makes a Good Triplet?

A good triplet follows this structure:

- Anchor: A sample query or data point.
- Positive: A sample that is semantically similar to the anchor (close in the latent space).
- Negative: A sample that is semantically different from the anchor (far in the latent space).

What I See in Your Triplet:

| Element  | Content                                                      | Quality |
| -------- | ------------------------------------------------------------ | ------- |
| Anchor   | What are differences between IELTS and IELTS for ukvi?       | Question about IELTS exam types |
| Positive | How are Verilog-A and Verilog-AMS utilized in the industry?  | 🚫 Completely unrelated (electronics design languages vs IELTS) |
| Negative | Can I crack IELTS exam without coaching?                     | 🚫 This is too close to the anchor (both are about IELTS exams; it could confuse the model) |
```

## Evaluating the embedding model

`torchmetrics` looks promising for this.

Perplexity query: https://www.perplexity.ai/search/is-there-a-python-package-that-CHyy87UMRnWGhzNQ4lRbmA

```py
from torchmetrics.retrieval import RetrievalMRR
import torch

indexes = torch.tensor([0, 0, 0, 1, 1, 1, 1])
preds = torch.tensor([0.2, 0.3, 0.5, 0.1, 0.3, 0.5, 0.2])
target = torch.tensor([False, False, True, False, True, False, True])

mrr = RetrievalMRR()
mrr(preds, target, indexes=indexes)
```

MRR (mean reciprocal rank) is a good fit if all we care about is whether a correct result is returned and how high the first one ranks.
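
The `torchmetrics` example can be checked by hand. The sketch below recomputes MRR in plain Python on the same `indexes`, `preds`, and `target` values: group by query index, sort each group by score, and take the reciprocal of the rank of the first relevant item. `mean_reciprocal_rank` is a hypothetical helper written for illustration, not part of `torchmetrics`.

```python
from collections import defaultdict

def mean_reciprocal_rank(preds, target, indexes):
    """Group by query index, rank by score, score the first relevant hit."""
    groups = defaultdict(list)
    for score, relevant, query_id in zip(preds, target, indexes):
        groups[query_id].append((score, relevant))

    reciprocal_ranks = []
    for items in groups.values():
        ranked = sorted(items, key=lambda pair: pair[0], reverse=True)
        rr = 0.0  # stays 0 when the query has no relevant item
        for rank, (_, relevant) in enumerate(ranked, start=1):
            if relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

indexes = [0, 0, 0, 1, 1, 1, 1]
preds = [0.2, 0.3, 0.5, 0.1, 0.3, 0.5, 0.2]
target = [False, False, True, False, True, False, True]

mrr_value = mean_reciprocal_rank(preds, target, indexes)
print(mrr_value)  # 0.75: query 0 hits at rank 1, query 1 at rank 2
```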

In [78]:
target_results = search(query, target_model, target_embeddings, 5)
print('---------------------------------------------------------')
current_results = search(query, model, current_embeddings, 5)
print('---------------------------------------------------------')
new_results = search(query, trained_model, new_embeddings, 5)

0.907	What is the typical revenue per employee for a B2B SaaS company?
0.887	How do I validate SaaS idea?
0.887	What is DaaS?
0.878	Who are some of the leading local IaaS cloud service providers in India?
0.866	What are good product companies where SAP ABAP/SAP HCM jobs are there?
---------------------------------------------------------
0.561	What's the difference between SaaS and the Cloud?
0.528	How do I validate SaaS idea?
0.388	What are un, saarc, oecd?
0.316	What are the commonly used final gear ratios and tyre size combinations for supra sae cars for a 43 bhp KTM engine?
0.305	What is the typical revenue per employee for a B2B SaaS company?
---------------------------------------------------------
0.978	How do I validate SaaS idea?
0.958	What is the typical revenue per employee for a B2B SaaS company?
0.532	What's the difference between SaaS and the Cloud?
0.464	How is South Africa's economy doing in 2014?
0.429	What should be the job preference in SSC CGL 2016, if I want to pre

In [87]:
def result_rank(results):
    return [r["corpus_id"] for r in results]

In [70]:
# target_results
# [{'corpus_id': 1086, 'score': 0.9072297811508179},
#  {'corpus_id': 5293, 'score': 0.8874691724777222},
#  {'corpus_id': 9711, 'score': 0.8872179985046387},
#  {'corpus_id': 6128, 'score': 0.8783174753189087},
#  {'corpus_id': 7710, 'score': 0.8664107322692871}]
current_results

[{'corpus_id': 7107, 'score': 0.560981810092926},
 {'corpus_id': 5293, 'score': 0.5279558897018433},
 {'corpus_id': 9441, 'score': 0.3883669078350067},
 {'corpus_id': 1864, 'score': 0.31555840373039246},
 {'corpus_id': 1086, 'score': 0.30493882298469543}]

In [79]:
target_ranking = [r["corpus_id"] for r in target_results]
current_ranking = [r["corpus_id"] for r in current_results]
new_ranking = [r["corpus_id"] for r in new_results]

In [72]:
target_ranking

[1086, 5293, 9711, 6128, 7710]

In [73]:
current_ranking

[7107, 5293, 9441, 1864, 1086]

In [81]:
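def reciprocal_rank(pred, target):
    """Return 1 / rank of the first id in `pred` that also appears in `target`, or 0 if none does."""
    assert len(pred) == len(target)

    target = set(target)
    for rank, doc_id in enumerate(pred):
        if doc_id in target:
            return 1 / (rank + 1)

    return 0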

In [82]:
print(reciprocal_rank(current_ranking, target_ranking))
print(reciprocal_rank(new_ranking, target_ranking))

0.5
1.0


In [88]:
# run this with 100 examples
start_position = 300_000
test_set_size = 100

mrr_untrained_sum = 0
mrr_trained_sum = 0

for query in corpus_questions[start_position:start_position+test_set_size]:
    target_rank = result_rank(search(query, target_model, target_embeddings, 5, show=False))
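    # Caveat: `model` was fine-tuned in place above; use `og_model` here for a true "before training" baseline
    untrained_rank = result_rank(search(query, model, current_embeddings, 5, show=False))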
    trained_rank = result_rank(search(query, trained_model, new_embeddings, 5, show=False))
    
    mrr_untrained_sum += reciprocal_rank(untrained_rank, target_rank)
    mrr_trained_sum += reciprocal_rank(trained_rank, target_rank)
    
mrr_untrained = mrr_untrained_sum / test_set_size
mrr_trained = mrr_trained_sum / test_set_size

print(f'MRR before training: {mrr_untrained}')
print(f'MRR after training: {mrr_trained}')    
    
    

MRR before training: 0.4378333333333334
MRR after training: 0.5060000000000001


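Another metric to try: nDCG (normalized discounted cumulative gain), which scores every position in the ranking, not only the first hit.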
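
A minimal sketch of nDCG@k against the target model's ranking, using the SaaS-query rankings shown earlier. The grading scheme is an assumption made for illustration: the target's first hit gets relevance 5, the next 4, and so on, and anything the target did not return gets 0. `dcg` and `ndcg_at_k` are hypothetical helpers.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg_at_k(pred_ids, target_ids, k=5):
    """nDCG@k with graded relevance derived from the target ranking (assumed scheme)."""
    relevance = {doc_id: len(target_ids) - i for i, doc_id in enumerate(target_ids)}
    gains = [relevance.get(doc_id, 0) for doc_id in pred_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

target_ranking = [1086, 5293, 9711, 6128, 7710]
current_ranking = [7107, 5293, 9441, 1864, 1086]

print(ndcg_at_k(target_ranking, target_ranking))   # 1.0 (identical rankings)
print(ndcg_at_k(current_ranking, target_ranking))  # strictly between 0 and 1
```

Unlike the reciprocal rank above, this rewards every overlapping result and penalizes placing the target's top hits low in the list.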

## Train again with a smaller learning rate?

In [97]:
from sentence_transformers import SentenceTransformerTrainingArguments

bs = 64
lr = 2e-5
epochs = 4

# I used args from fast.ai's course
# https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners

training_args = SentenceTransformerTrainingArguments(
    output_dir="./finetuned-quora-model-run-2",
    num_train_epochs=epochs,
    learning_rate=lr,
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs*2,
    weight_decay=0.01,
    # evaluation_strategy="epoch",
    # warmup in deep learning: https://datascience.stackexchange.com/questions/55991/in-the-context-of-deep-learning-what-is-training-warmup-steps
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',
    # fp16=True,
    # log_level="error",
    # report_to='none',
    )

In [98]:
model2 = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# Train the fresh copy, not the already fine-tuned `model`
loss = losses.TripletLoss(model=model2)
trainer = SentenceTransformerTrainer(
    model=model2,
    train_dataset=quora_train_ds,
    loss=loss,
    args=training_args,
)
trainer.train()



RuntimeError: User specified an unsupported autocast device_type 'mps'

TODO: Use one eval metric to compare the model before and after training. Does it get better? Does it get better with more data?

Explanation of relevance metrics

Set up for LLM