# embedding-models

Embedding models map text to dense numeric vectors with many dimensions. They can be trained for many purposes, such as sentiment classification or capturing semantic similarity.

![Contrastive learning](contrastive_learning.png)

In [None]:
# Helper to display an image; the file path is required
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

def imggen(im_path):
    """Read the image at im_path and show it without axes."""
    img = mpimg.imread(im_path)
    plt.axis("off")
    return plt.imshow(img)

In [None]:
imggen('contrastive_learning.png')

## contrastive learning
To accurately capture the semantic nature of a document, it often needs to be contrasted with another document so the model can learn what makes the two similar or different.

In practice this means feeding the model pairs of similar and dissimilar examples.

Instead of asking "why P?", we frame the question with a contrast: "why P and not Q?". Answering that forces the model to learn more about the subject and to pick up the features that make it unique.
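A minimal sketch of the idea in plain Python. The 3-dimensional vectors, the margin value and the function names are illustrative assumptions, not a library API: the loss is small when a similar pair sits close together and when a dissimilar pair is at least a margin apart.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_pair_loss(u, v, is_similar, margin=0.5):
    """Toy contrastive loss on cosine distance (1 - cosine similarity)."""
    distance = 1.0 - cosine_similarity(u, v)
    if is_similar:
        # Similar pairs are penalised for being far apart
        return distance ** 2
    # Dissimilar pairs are penalised only while they are closer than the margin
    return max(0.0, margin - distance) ** 2

anchor = [1.0, 0.0, 0.0]
positive = [0.9, 0.1, 0.0]   # made-up vector close to the anchor
negative = [0.0, 1.0, 0.0]   # made-up vector far from the anchor

loss_pos = contrastive_pair_loss(anchor, positive, is_similar=True)
loss_neg = contrastive_pair_loss(anchor, negative, is_similar=False)
print(loss_pos, loss_neg)
```

Both losses are near zero here because the toy vectors are already well arranged; labelling the close pair as dissimilar would produce a clearly positive loss.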

### one of the most popular ways to apply contrastive learning is through sentence-transformers



# SBERT
### (or sentence-BERT)
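One limitation of BERT for sentence similarity is the computational overhead. Used as a cross-encoder, it has to process both sentences together in one forward pass, so every pair we want to compare needs its own pass. For a single input, BERT outputs up to 512 x 768 values (one 768-dimensional vector per token) before any similarity is computed. We can reduce that to one vector, for example by taking the [CLS] token, which acts as a summary of all the token embeddings, but we still spend a lot of compute producing those embeddings.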
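To get a feel for that overhead, here is a back-of-the-envelope count (the function names are illustrative). It assumes a cross-encoder needs one model pass per sentence pair, while a model that embeds each sentence on its own needs one pass per sentence.

```python
def cross_encoder_passes(n_sentences):
    # Every unordered pair of sentences goes through the model together
    return n_sentences * (n_sentences - 1) // 2

def bi_encoder_passes(n_sentences):
    # Each sentence is encoded once; pairs are then compared with cheap vector math
    return n_sentences

n = 10_000
print(cross_encoder_passes(n))  # 49995000
print(bi_encoder_passes(n))     # 10000
```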

We can also use GloVe (Global Vectors for Word Representation). It’s not a neural network like BERT; it’s a matrix factorization method trained on global word co-occurrence statistics from a giant corpus, and it produces one static vector per word.

### sentence transformers:

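A sentence transformer uses two BERT models in parallel with tied weights (a siamese setup). It removes the classification head and applies mean pooling over the token embeddings to produce a fixed-size embedding vector for each sentence, which can later be used for similarity search. This saves compute: each sentence is embedded once, and at query time we only embed the query and compare vectors instead of running the model on every sentence pair.

When training with a softmax classifier, the two sentence embeddings are also concatenated together with their element-wise difference before being passed to the classifier.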
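The pooling and concatenation steps can be sketched in plain Python. The 2-dimensional token vectors are toy values (real BERT token embeddings have 768 dimensions) and the helper names are made up for illustration.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, ignoring padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def classifier_features(u, v):
    """Concatenate u, v and |u - v| as the input to the softmax classifier."""
    diff = [abs(a - b) for a, b in zip(u, v)]
    return u + v + diff

tokens_a = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last row is padding
mask_a = [1, 1, 0]
tokens_b = [[2.0, 2.0], [4.0, 0.0]]
mask_b = [1, 1]

u = mean_pool(tokens_a, mask_a)   # [2.0, 3.0]
v = mean_pool(tokens_b, mask_b)   # [3.0, 1.0]
features = classifier_features(u, v)
print(features)                   # [2.0, 3.0, 3.0, 1.0, 1.0, 2.0]
```

The feature vector is three times the embedding size, which is why the softmax loss needs to know the sentence embedding dimension.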

In [None]:
imggen("sentra.png")

### creating an embedding model 

To create an embedding model we first need some contrastive data for training.

Natural language inference (NLI) datasets provide this kind of data. One of them is MNLI, which has premise and hypothesis pairs along with labels for contradiction, entailment, or neutral relations.
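As a rough picture of what such data looks like, the sketch below uses made-up sentences in the same shape as the dataset (premise, hypothesis, label). The label ids follow the GLUE MNLI convention of 0 = entailment, 1 = neutral, 2 = contradiction.

```python
# GLUE MNLI label ids
MNLI_LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Made-up rows for illustration only; they are not taken from the dataset
rows = [
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "A man is performing music.", "label": 0},
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "The concert is sold out.", "label": 1},
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "Nobody is on the stage.", "label": 2},
]

for row in rows:
    print(f'{MNLI_LABELS[row["label"]]:>13}: {row["hypothesis"]}')
```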

In [None]:
imggen("mnli.png")

In [None]:
# Load the MNLI dataset from GLUE and keep the first 50,000 training examples
from datasets import load_dataset

train_dataset = load_dataset("glue", "mnli", split="train").select(range(50_000))
train_dataset = train_dataset.remove_columns("idx")


In [None]:
train_dataset = train_dataset.select_columns(['hypothesis', 'premise', 'label'])


### Fixing TensorFlow Registry Error

If you encounter the error "Name tf.RaggedTensorSpec has already been registered", restart the kernel before running the model training cells. This error occurs due to TensorFlow being imported multiple times.

In [None]:
# Best-effort cleanup of already-imported TensorFlow/Keras modules.
# Restarting the kernel is the reliable fix; this cell is only a workaround.
import sys

# Remove tensorflow/keras modules if they were already imported
modules_to_remove = [name for name in list(sys.modules) if 'tensorflow' in name or 'keras' in name]
for name in modules_to_remove:
    del sys.modules[name]

# Reset Keras global state if TensorFlow is installed
try:
    import tensorflow as tf
    tf.keras.backend.clear_session()
except Exception as e:
    print(f"TensorFlow cleanup skipped: {e}")

print("TensorFlow modules cleared. Ready for clean import.")

In [None]:
train_dataset[:1]

In [None]:
# Load the base model that will be fine-tuned into an embedding model
import os

# Set environment variables to avoid tokenizer warnings and noisy TensorFlow logs
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

try:
    from sentence_transformers import SentenceTransformer, losses

    # Initialize the embedding model on the default device
    embedding_model = SentenceTransformer('bert-base-uncased')
    print("Model loaded successfully!")

except Exception as e:
    print(f"Error loading model: {e}")
    # Alternative approach if there are issues
    try:
        from sentence_transformers import SentenceTransformer

        # Try loading with the device set explicitly to CPU
        embedding_model = SentenceTransformer('bert-base-uncased', device='cpu')
        print("Model loaded on CPU successfully!")

    except Exception as e2:
        print(f"Fallback loading also failed: {e2}")
        print("Please restart the kernel and try again.")


In [None]:

# Softmax loss: a classifier over the 3 NLI labels on top of the sentence embeddings
from sentence_transformers import losses

loss = losses.SoftmaxLoss(
    model=embedding_model,
    sentence_embedding_dimension=embedding_model.get_sentence_embedding_dimension(),
    num_labels=3
)

In [None]:
# Add evaluation for the model using STS-B, the Semantic Textual Similarity Benchmark
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

val_sts = load_dataset("glue", "stsb", split="validation")
# Note: the name `eval` shadows Python's built-in eval() for the rest of the notebook
eval = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score / 5 for score in val_sts["label"]],  # rescale the 0-5 labels to 0-1
    main_similarity="cosine"
)
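What the evaluator measures can be sketched in plain Python: take the cosine similarity of each sentence pair, then check how well its ranking agrees with the gold scores (Spearman correlation). The vectors and gold scores below are made up and the helper names are illustrative; the real evaluator works on the model's embeddings.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks
    return pearson(average_ranks(x), average_ranks(y))

gold = [score / 5 for score in [5.0, 3.0, 0.0]]   # toy 0-5 labels rescaled to 0-1
pairs = [([1.0, 0.0], [1.0, 0.1]),
         ([1.0, 0.0], [1.0, 1.0]),
         ([1.0, 0.0], [0.0, 1.0])]
predicted = [cosine_similarity(u, v) for u, v in pairs]
print(spearman(predicted, gold))
```

In this toy case the predicted similarities rank the pairs in the same order as the gold scores, so the correlation is essentially 1.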

In [None]:
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="base_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    # fp16=True,  # mixed precision, not for Macs :(
    eval_steps=100,
    logging_steps=100
)


In [None]:
# from sentence_transformers import SentenceTransformerTrainer

# trainer=SentenceTransformerTrainer(
#     args=args,
#     model=embedding_model,
#     evaluator=eval,
#     loss=loss,
#     train_dataset=train_dataset

# )
# trainer.train()

In [None]:
# After training your model, always restart the notebook to make sure all VRAM is cleared.
# Training kept crashing locally, so further training was moved to a Colab notebook; the trained model is loaded here.

In [None]:
from sentence_transformers import SentenceTransformer
trained_model=SentenceTransformer("/Users/abhimanyu/Downloads/bert_base_trained_glue2")


In [None]:
# sentences=["this is an example","i am not the example"]
# embeddings=trained_model.encode(sentences)
# print(embeddings)

In [None]:
eval(trained_model)

In [None]:
# MTEB (Massive Text Embedding Benchmark) is a widely used library for evaluating embedding models
import mteb
from mteb import MTEB

# Choose the evaluation task
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = MTEB(tasks=tasks)

# Run the evaluation and display the results
results = evaluation.run(trained_model)
results

In [None]:
import mteb
from sentence_transformers import SentenceTransformer

# Reload the trained model and set up the same evaluation again
trained_model = SentenceTransformer("/Users/abhimanyu/Downloads/bert_base_trained_glue2")

tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)



In [None]:
import mteb

# Fetch the Banking77Classification task object
task = mteb.get_task("Banking77Classification")

# task.dataset is only populated after the data has been loaded
task.load_data()
dataset = task.dataset


In [None]:
# Inspect the first few examples (adjust the split name if it differs)
print(dataset['test'][:5])

# Print the types of the labels (adjust the key if needed)
labels = dataset['train']['label']
print([type(label) for label in labels[:10]])


In [None]:
print(dir(tasks))


In [None]:
import mteb

task = mteb.get_task("Banking77Classification")
print(task.dataset)  # Expected to be None until the data is loaded

task.load_data()  # Downloads the data and populates task.dataset
print(task.metadata.eval_splits)  # Splits that MTEB evaluates on
print(task.dataset["train"])      # Inspect the train split


In [None]:
import mteb

# Path to the trained sentence-transformers model
model_name = "/Users/abhimanyu/Downloads/bert_base_trained_glue2"

model = mteb.get_model(model_name)  # falls back to SentenceTransformer(model_name) if MTEB has no dedicated implementation
tasks = mteb.get_tasks(tasks=["STSBenchmark"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/STSBenchmark")

In [None]:
# {
#   "dataset_revision": "b0fddb56ed78048fa8b90373c8a3cfc37b684831",
#   "evaluation_time": 5.260719060897827,
#   "kg_co2_emissions": null,
#   "mteb_version": "1.12.39",
#   "scores": {
#     "test": [
#       {
#         "cosine_pearson": 0.5660132977293474,
#         "cosine_spearman": 0.6237134696545672,
#         "euclidean_pearson": 0.6112715246993904,
#         "euclidean_spearman": 0.6307245433589723,
#         "hf_subset": "default",
#         "languages": [
#           "eng-Latn"
#         ],
#         "main_score": 0.6237134696545672,
#         "manhattan_pearson": 0.6205697391835063,
#         "manhattan_spearman": 0.6330864771670663,
#         "pearson": [
#           0.5660132935950654,
#           1.2533873237354529e-117
#         ],
#         "spearman": [
#           0.6237134696545672,
#           1.6490706502858626e-149
#         ]
#       }
#     ]
#   },
#   "task_name": "STSBenchmark"
# }
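To pull the headline number out of a result like the one above, the sketch below parses a trimmed copy of that JSON. Only a few of the fields are kept, and where the file lives on disk is left out on purpose.

```python
import json

# Trimmed copy of the STSBenchmark result shown above
raw = """
{
  "scores": {
    "test": [
      {
        "cosine_pearson": 0.5660132977293474,
        "cosine_spearman": 0.6237134696545672,
        "main_score": 0.6237134696545672
      }
    ]
  },
  "task_name": "STSBenchmark"
}
"""

result = json.loads(raw)
test_scores = result["scores"]["test"][0]

# For this result, main_score is the same value as cosine_spearman
print(result["task_name"], round(test_scores["main_score"], 4))
```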

In [None]:
from datasets import Dataset, load_dataset

# Load MNLI dataset from GLUE
# 0 = entailment, 1 = neutral, 2 = contradiction
train_dataset = load_dataset("glue", "mnli", split="train").select(range(50_000))
train_dataset = train_dataset.remove_columns("idx")

# (neutral/contradiction)=0 and (entailment)=1
mapping = {2: 0, 1: 0, 0: 1}
train_dataset = Dataset.from_dict({
    "sentence1": train_dataset["premise"],
    "sentence2": train_dataset["hypothesis"],
    "label": [float(mapping[label]) for label in train_dataset["label"]]
})

In [None]:
train_dataset[:2]

In [None]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Create an embedding similarity evaluator for stsb
val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score/5 for score in val_sts["label"]],
    main_similarity="cosine"
)

In [None]:
val_sts[:3]

In [None]:
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers import SentenceTransformerTrainer
from sentence_transformers import SentenceTransformerTrainingArguments

embedding_model= SentenceTransformer('bert-base-uncased')
train_loss= losses.CosineSimilarityLoss(model=embedding_model)

args= SentenceTransformerTrainingArguments(
    output_dir="cosineloss_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    eval_steps=100,
    logging_steps=100,
)
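With its default settings, CosineSimilarityLoss compares the cosine similarity of the two sentence embeddings with the target label using mean squared error. A plain-Python sketch of that objective, with toy 2-dimensional vectors and illustrative helper names:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_similarity_loss(pairs, labels):
    """Mean squared error between cos(u, v) and the target label."""
    errors = [(cosine_similarity(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)

# Toy embeddings: the first pair points the same way, the second is orthogonal
pairs = [([1.0, 0.0], [1.0, 0.0]),
         ([1.0, 0.0], [0.0, 1.0])]

# Labels in the binary scheme used for training: 1.0 = entailment, 0.0 = neutral/contradiction
labels = [1.0, 0.0]

print(cosine_similarity_loss(pairs, labels))      # 0.0
print(cosine_similarity_loss(pairs, [0.0, 1.0]))  # 1.0
```

The loss is zero when entailment pairs have similarity 1 and the other pairs have similarity 0, which is what training pushes the embeddings toward.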

In [None]:
trainer=SentenceTransformerTrainer(
    args=args,
    model= embedding_model,
    evaluator=evaluator,
    train_dataset=train_dataset,
    loss=train_loss    
)
trainer.train()