## Fine-tune Embedding Models

Fine-tuning the embedding model is a critical step in enhancing the performance of RAG systems. These systems rely on retrieving relevant information from a corpus to augment the language model's generation capabilities. However, pre-trained embedding models are often trained on general-purpose datasets, which may not accurately capture the nuances and semantics specific to a particular domain or use case. Fine-tuning the embedding model on domain-specific data allows the RAG system to adapt to the target domain, improving the relevance and accuracy of retrieved information. 

In this notebook, we will fine-tune an open-source sentence transformers embedding model using Amazon SageMaker. Hugging Face Sentence Transformers is a Python framework for generating high-quality sentence, text, and image embeddings using state-of-the-art models. The example model we will use is `sentence-transformers/msmarco-bert-base-dot-v5`. The same technique applies to all other sentence transformer models.

In [None]:
!pip install -Uq sagemaker
!pip install -Uq sentence-transformers
!pip install -Uq PyPDF2==3.0.1
!pip install -Uq langchain==0.1.5
!pip install -Uq faiss-cpu

In [None]:
import sagemaker
import boto3
import logging
import json

logging.getLogger().setLevel(logging.ERROR)
sess = sagemaker.Session()
# SageMaker session bucket: used for uploading data, models, and logs
# SageMaker automatically creates this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
prefix = "finetune-embedding"
model_id = "sentence-transformers/msmarco-bert-base-dot-v5"
bucket = sess.default_bucket()
region = sess.boto_region_name

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {bucket}")
print(f"sagemaker session region: {region}")

In [None]:
train_data = "data/train_dataset.json"

train_s3_path = f"s3://{bucket}/{prefix}/{train_data}"

!aws s3 cp {train_data} {train_s3_path}

In [None]:
valid_data = "data/val_dataset.json"

valid_s3_path = f"s3://{bucket}/{prefix}/{valid_data}"

!aws s3 cp {valid_data} {valid_s3_path}
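
The evaluation cells later in this notebook read three keys from each dataset file: `corpus`, `queries`, and `relevant_docs`. The sketch below shows that layout with invented IDs and text. It also includes a hypothetical `validate_dataset` helper (not part of this notebook) that checks the cross-references between the three keys before uploading.

```python
import json

# Toy dataset in the layout the evaluation cells expect.
# All IDs and texts are invented for illustration.
toy_dataset = {
    "corpus": {
        "doc-1": "Training jobs read their input data from an S3 prefix.",
        "doc-2": "A vector index supports nearest-neighbor search over embeddings.",
    },
    "queries": {
        "q-1": "Where do training jobs read their input from?",
        "q-2": "What supports nearest-neighbor search over embeddings?",
    },
    "relevant_docs": {
        "q-1": ["doc-1"],
        "q-2": ["doc-2"],
    },
}


def validate_dataset(dataset):
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    for key in ("corpus", "queries", "relevant_docs"):
        if key not in dataset:
            problems.append(f"missing top-level key: {key}")
    if problems:
        return problems
    for query_id in dataset["queries"]:
        doc_ids = dataset["relevant_docs"].get(query_id)
        if not doc_ids:
            problems.append(f"query {query_id} has no relevant docs")
            continue
        for doc_id in doc_ids:
            if doc_id not in dataset["corpus"]:
                problems.append(f"query {query_id} references unknown doc {doc_id}")
    return problems


# Round-trip through JSON to confirm the structure serializes cleanly.
restored = json.loads(json.dumps(toy_dataset))
print(validate_dataset(restored))
```

Running the same check on `data/train_dataset.json` and `data/val_dataset.json` before the upload can surface a malformed file before it reaches the training job.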

In [None]:
import time
from sagemaker.huggingface import HuggingFace

# define the base name for the training job
job_name = f'huggingface-sentence-transformer-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# hyperparameters, which are passed into the training job
hyperparameters = {
    "model_id": model_id,                             # pre-trained model
    "epochs": 5,
    "batch_size": 10,
    "evaluation_steps": 50
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = "train.py",        # training script
    source_dir           = "scripts",         # directory that contains all files needed for training
    instance_type        = "ml.p3.2xlarge",   # instance type used for the training job
    instance_count       = 1,                 # number of instances used for training
    base_job_name        = job_name,          # base name of the training job
    role                 = role,              # IAM role used by the training job to access AWS resources, e.g. S3
    volume_size          = 100,               # size of the EBS volume in GB
    transformers_version = "4.28",            # transformers version used in the training job
    pytorch_version      = "2.0",             # PyTorch version used in the training job
    py_version           = "py310",           # Python version used in the training job
    hyperparameters      = hyperparameters,   # hyperparameters passed to the training job
    environment          = {"HUGGINGFACE_HUB_CACHE": "/tmp/.cache"},  # cache downloaded models in /tmp
)

In [None]:
# define a data input dictionary with the uploaded S3 URIs
data = {"train": train_s3_path, "valid": valid_s3_path}

# start the training job with the uploaded datasets as input
huggingface_estimator.fit(data, wait=True)
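
The training script `scripts/train.py` is not shown in this notebook. One common way to fine-tune a retrieval embedding model is to train on (query, relevant passage) pairs. The sketch below assumes that approach and the dataset layout read by the evaluation cells (`corpus`, `queries`, `relevant_docs`). `build_training_pairs` is a hypothetical helper, not taken from the actual script.

```python
def build_training_pairs(dataset):
    """Turn a dataset dict into a list of (query_text, passage_text) pairs.

    Assumes the layout used elsewhere in this notebook:
    corpus: {doc_id: text}, queries: {query_id: text},
    relevant_docs: {query_id: [doc_id, ...]}.
    """
    pairs = []
    for query_id, query_text in dataset["queries"].items():
        for doc_id in dataset["relevant_docs"].get(query_id, []):
            passage = dataset["corpus"].get(doc_id)
            if passage is not None:  # skip references to missing documents
                pairs.append((query_text, passage))
    return pairs


# Invented example data, for illustration only.
example = {
    "corpus": {"doc-1": "passage one", "doc-2": "passage two"},
    "queries": {"q-1": "question one", "q-2": "question two"},
    "relevant_docs": {"q-1": ["doc-1"], "q-2": ["doc-2", "doc-404"]},
}
for query_text, passage in build_training_pairs(example):
    print(query_text, "->", passage)
```

In a Sentence Transformers training script, each pair could then be wrapped in an `InputExample` and trained with a loss such as `MultipleNegativesRankingLoss`. Whether the actual script does this is an assumption.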

### > Download the model artifact locally

In [None]:
filename = huggingface_estimator.model_data.split('/')[-1]

!aws s3 cp {huggingface_estimator.model_data} {filename}

In [None]:
!rm -rf model && mkdir model
!tar -xzf {filename} -C model

## Evaluate Embedding Models

In [None]:
with open(train_data, 'r') as f:
    train_dataset = json.load(f)

with open(valid_data, 'r') as f:
    val_dataset = json.load(f)

In [None]:
from langchain_community.vectorstores import FAISS
from langchain.schema import Document
from tqdm.notebook import tqdm
import pandas as pd

def evaluate_top_hit(dataset, embeddings, top_k=5):
    
    corpus = dataset['corpus']
    queries = dataset['queries']
    relevant_docs = dataset['relevant_docs']

    docs = [Document(metadata=dict(id_=id_), page_content=text) for id_, text in corpus.items()] 

    db = FAISS.from_documents(docs, embeddings)

    eval_results = []
    for query_id, query in tqdm(queries.items()):
        retrieved_docs = db.similarity_search(query, k=top_k)
        retrieved_ids = [doc.metadata['id_'] for doc in retrieved_docs]
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc

        eval_result = {
            'is_hit': is_hit,
            'retrieved': retrieved_ids,
            'expected': expected_id,
            'query': query_id,
        }
        eval_results.append(eval_result)

    return eval_results
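
`evaluate_top_hit` counts a query as a hit when its expected document ID appears among the `top_k` retrieved IDs. The sketch below applies the same rule to hand-written retrieval results, so it runs without a vector store. The IDs are invented and `hit_rate` is a hypothetical helper.

```python
def hit_rate(results, top_k=5):
    """Fraction of queries whose expected doc ID is within the first top_k retrieved IDs."""
    hits = [r["expected"] in r["retrieved"][:top_k] for r in results]
    return sum(hits) / len(hits) if hits else 0.0


# Invented retrieval results, ordered from most to least similar.
toy_results = [
    {"query": "q-1", "expected": "doc-1", "retrieved": ["doc-1", "doc-7", "doc-3"]},
    {"query": "q-2", "expected": "doc-2", "retrieved": ["doc-9", "doc-2", "doc-4"]},
    {"query": "q-3", "expected": "doc-5", "retrieved": ["doc-8", "doc-6", "doc-1"]},
]

print(hit_rate(toy_results, top_k=3))  # q-1 and q-2 hit, q-3 misses: 2 of 3
print(hit_rate(toy_results, top_k=1))  # only q-1 hits at rank 1: 1 of 3
```

A larger `top_k` can only keep or raise the hit rate, so two models should be compared at the same `top_k`.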

### > Evaluate the top-k hit rate with the base model

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings

base_embeddings = HuggingFaceEmbeddings(model_name=model_id)

eval_results = evaluate_top_hit(val_dataset, base_embeddings)

In [None]:
df_base = pd.DataFrame(eval_results)
top_hits = df_base['is_hit'].mean()

print("percent of top hits: {:.2f} %".format(top_hits*100))

### > Evaluate the top-k hit rate with the fine-tuned model

In [None]:
model_path = "./model"
finetuned_embeddings = HuggingFaceEmbeddings(model_name=model_path)

eval_results = evaluate_top_hit(val_dataset, finetuned_embeddings)

In [None]:
df_finetuned = pd.DataFrame(eval_results)
top_hits = df_finetuned['is_hit'].mean()

print("percent of top hits: {:.2f} %".format(top_hits*100))

In [None]:
df_base['model'] = 'base'
df_finetuned['model'] = 'fine_tuned'
df_all = pd.concat([df_base, df_finetuned])
df_all.groupby('model')['is_hit'].mean()
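
The aggregate hit rate does not show which queries changed between the two models. The sketch below lines up per-query outcomes from two result lists shaped like the output of `evaluate_top_hit`. `compare_runs` is a hypothetical helper and the sample records are invented.

```python
def compare_runs(base_results, finetuned_results):
    """List the query IDs that became hits ("gained") or stopped being hits ("lost")."""
    base_hits = {r["query"]: r["is_hit"] for r in base_results}
    gained, lost = [], []
    for r in finetuned_results:
        before = base_hits.get(r["query"])
        if before is None:
            continue  # query not present in the base run
        if r["is_hit"] and not before:
            gained.append(r["query"])
        elif before and not r["is_hit"]:
            lost.append(r["query"])
    return {"gained": gained, "lost": lost}


# Invented per-query outcomes for illustration.
sample_base = [
    {"query": "q-1", "is_hit": True},
    {"query": "q-2", "is_hit": False},
    {"query": "q-3", "is_hit": True},
]
sample_finetuned = [
    {"query": "q-1", "is_hit": True},
    {"query": "q-2", "is_hit": True},
    {"query": "q-3", "is_hit": False},
]
print(compare_runs(sample_base, sample_finetuned))
```

Applied to the real data, the inputs would be `df_base.to_dict('records')` and `df_finetuned.to_dict('records')`. Inspecting the "lost" queries is a quick way to spot regressions introduced by fine-tuning.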

## Evaluate using `InformationRetrievalEvaluator` from sentence_transformers

This evaluator provides a more comprehensive set of embedding metrics for models compatible with Sentence Transformers.
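
Unlike the hit-rate check above, `InformationRetrievalEvaluator` also reports rank-aware metrics such as MRR@k, recall@k, and NDCG@k. As a rough illustration of what these measure, here is a pure-Python sketch for a single query with binary relevance. The evaluator's own implementation may differ in details, the document IDs are invented, and the retrieved list is assumed to contain no duplicates.

```python
import math


def reciprocal_rank(retrieved, relevant, k=10):
    """1 / rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(retrieved, relevant, k=10):
    """Share of the relevant documents that appear within the top k."""
    if not relevant:
        return 0.0
    found = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return found / len(relevant)


def ndcg_at_k(retrieved, relevant, k=10):
    """Discounted cumulative gain, normalized by the best achievable ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


retrieved = ["doc-9", "doc-2", "doc-4"]  # ranked best-first, invented IDs
relevant = {"doc-2"}

print(reciprocal_rank(retrieved, relevant))  # relevant doc is at rank 2
print(recall_at_k(retrieved, relevant))
print(ndcg_at_k(retrieved, relevant))
```

Averaging `reciprocal_rank` over all queries gives MRR@k. A model can keep the same hit rate while improving MRR or NDCG if it moves the relevant document closer to the top.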

In [None]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer

def evaluate_sentence_transformers(
    dataset,
    model,
    name,
):
    corpus = dataset['corpus']
    queries = dataset['queries']
    relevant_docs = dataset['relevant_docs']

    evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name=name)
    return evaluator(model, output_path='results/')

In [None]:
base_model = SentenceTransformer(model_id, device="cuda")
finetuned_model = SentenceTransformer(model_path, device="cuda")

In [None]:
!rm -rf results && mkdir results

In [None]:
evaluate_sentence_transformers(val_dataset, base_model, name='base')
evaluate_sentence_transformers(val_dataset, finetuned_model, name='finetuned')

In [None]:
df_st_base = pd.read_csv('results/Information-Retrieval_evaluation_base_results.csv')
df_st_finetuned = pd.read_csv('results/Information-Retrieval_evaluation_finetuned_results.csv')

df_st_base['model'] = 'base'
df_st_finetuned['model'] = 'fine_tuned'
df_st = pd.concat([df_st_base, df_st_finetuned])
df_st = df_st.set_index('model')
df_st

In [None]:
%store train_s3_path
%store valid_s3_path
%store prefix
%store model_id