# Improve RAG Accuracy with Finetuned Embedding Models on Amazon SageMaker

## Table of Contents
1. [Introduction](#introduction)
2. [Install Dependencies](#install-dependencies)
3. [Load Data and Train the Model](#load-data-and-train-the-model)
4. [Create Inference Script](#create-inference-script)
5. [Upload the Model](#upload-the-model)
6. [Deploy Model on SageMaker](#deploy-model-on-sagemaker)
7. [Invoke the Model](#invoke-the-model)
8. [Compare Predictions](#compare-predictions)

<a id='introduction'></a>
## Introduction

This notebook demonstrates how to use Amazon SageMaker to fine-tune a Sentence Transformer embedding model and deploy it to an Amazon SageMaker endpoint. For more information about fine-tuning Sentence Transformer models, see the [Sentence Transformer Training Overview](https://www.sbert.net/docs/sentence_transformer/training_overview.html).

We will fine-tune the embedding model [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). It is an open-source sentence-transformers model that was itself fine-tuned on a dataset of 1 billion sentence pairs. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

To fine-tune it, we will use the [Bedrock FAQ](https://aws.amazon.com/bedrock/faqs/), a dataset of question-and-answer pairs, together with the [Multiple Negatives Ranking Loss function](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss).
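
To make the loss more concrete, the following is a minimal pure-Python sketch of the idea behind Multiple Negatives Ranking Loss, not the library's implementation. It assumes each question in a batch is paired with its own answer and treats every other answer in the batch as a negative. The toy 2-dimensional vectors, the helper names, and the `scale` value are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positive in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch
        max_score = max(scores)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_sum_exp - scores[i]
    return total / len(anchors)

# Toy embeddings (hypothetical values, not produced by a real model)
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned_answers = [[0.9, 0.1], [0.1, 0.9]]   # each answer close to its own question
swapped_answers = [[0.1, 0.9], [0.9, 0.1]]   # each answer close to the other question

loss_aligned = in_batch_ranking_loss(questions, aligned_answers)
loss_swapped = in_batch_ranking_loss(questions, swapped_answers)
print(loss_aligned < loss_swapped)  # True
```

The loss is small when each question is most similar to its own answer and large when another answer in the batch scores higher, which is the behavior that pulls matching question-answer pairs together during fine-tuning.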

<a id='install-dependencies'></a>
## Install Dependencies


In [None]:
!pip install pathos==0.3.2
!pip install datasets==2.19.2
!pip install transformers==4.40.2
!pip install transformers[torch]==4.40.2
!pip install sentence_transformers==3.1.1
!pip install accelerate==1.0.0
!pip install sagemaker==2.224.1

<a id='load-data-and-train-the-model'></a>
## Load Data and Train the Model

The following code snippet demonstrates how to load a training dataset from a JSON file, prepare the data for training, and then fine-tune the pre-trained model. After fine-tuning, the updated model is saved.
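
The training code reads a `sentence1` and a `sentence2` field from each record, so `training.json` is expected to hold a list of such objects. A minimal sketch of that layout with a small validation helper follows. The two records and the `validate_records` helper are hypothetical illustrations, not the actual dataset.

```python
import json

# Hypothetical records in the shape the training code reads:
# a question in "sentence1" and its matching answer in "sentence2".
sample_json = """
[
  {"sentence1": "Example question one?", "sentence2": "Example answer one."},
  {"sentence1": "Example question two?", "sentence2": "Example answer two."}
]
"""

def validate_records(records):
    """Return the indices of records that lack a non-empty string in either field."""
    bad = []
    for i, rec in enumerate(records):
        for key in ("sentence1", "sentence2"):
            value = rec.get(key)
            if not isinstance(value, str) or not value.strip():
                bad.append(i)
                break
    return bad

records = json.loads(sample_json)
invalid = validate_records(records)
print(f"{len(records)} records, {len(invalid)} invalid")
```

Running a check like this before training can surface missing or empty fields early, instead of failing partway through the fine-tuning loop.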

The `EPOCHS` variable determines the number of times the model will iterate over the entire training dataset during the fine-tuning process. A higher number of epochs typically leads to better convergence and potentially improved performance, but may also increase the risk of overfitting if not properly regularized.

In this example, the training set is small, consisting of only 100 records, so we use a high value for the `EPOCHS` parameter. In real-world scenarios you would typically have a much larger training set. In such cases, `EPOCHS` should be a one- or two-digit number to avoid overfitting the model to the training data.
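
As a rough sense of scale, the approximate number of training steps follows from the dataset size, batch size, and epoch count. The sketch below uses the values from this notebook (100 records, batch size 8, 100 epochs) and assumes the last partial batch is kept, which is the default `DataLoader` behavior (`drop_last=False`).

```python
import math

num_records = 100   # size of the training set used in this notebook
batch_size = 8      # matches the DataLoader batch size used for training
epochs = 100        # matches EPOCHS

steps_per_epoch = math.ceil(num_records / batch_size)  # last partial batch is kept
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 13 1300
```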


In [None]:
import json

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

def load_data(path):
    """Load the dataset from a JSON file."""
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return data

dataset = load_data("training.json")


# Load the pre-trained model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Convert the dataset to the required format
train_examples = [InputExample(texts=[data["sentence1"], data["sentence2"]]) for data in dataset]

# Create a DataLoader object
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)

# Define the loss function
train_loss = losses.MultipleNegativesRankingLoss(model)

# Number of passes over the training set (high because the dataset is small)
EPOCHS = 100

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=EPOCHS,
    show_progress_bar=True,
)

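# Save the fine-tuned model to a local directory.
# safe_serialization=False writes PyTorch .bin weights instead of safetensors.
model.save("opt/ml/model/", safe_serialization=False)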

<a id='create-inference-script'></a>
## Create inference.py File

To deploy and serve the fine-tuned embedding model for inference, we create an `inference.py` Python script that serves as the entry point. This script implements two essential functions, `model_fn` and `predict_fn`, which Amazon SageMaker calls to load the model and to serve predictions.

The `model_fn` function loads the fine-tuned embedding model and its tokenizer. The `predict_fn` function takes the input sentences, tokenizes them with the loaded tokenizer, and computes token embeddings with the fine-tuned model. To obtain a single vector per sentence, it applies mean pooling over the token embeddings (weighted by the attention mask so that padding is ignored) and then L2-normalizes the result. Finally, `predict_fn` returns the normalized embedding as a list under the `vectors` key. Note that, as written, it returns only the embedding of the first input sentence.
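
The pooling and normalization steps can be illustrated without any model. The sketch below is a pure-Python stand-in for the tensor operations, using made-up token embeddings for one sentence that ends in a padding token. The values and helper names are assumptions chosen for illustration.

```python
import math

def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for j in range(dim):
                sums[j] += vec[j]
    count = max(sum(attention_mask), 1e-9)  # guard against division by zero
    return [s / count for s in sums]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length (eps avoids dividing by zero)."""
    norm = max(math.sqrt(sum(x * x for x in vec)), eps)
    return [x / norm for x in vec]

# Hypothetical 2-dimensional embeddings for three token positions; the last is padding
tokens = [[2.0, 0.0], [4.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)   # padding vector is excluded from the average
embedding = l2_normalize(pooled)     # unit-length sentence embedding
print(pooled, embedding)
```

Without the mask, the large padding vector would dominate the average, which is why the attention mask is part of the pooling step.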


In [None]:
%%writefile opt/ml/model/inference.py

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
import os

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


def model_fn(model_dir, context=None):
    # Load the fine-tuned model and tokenizer from the "model" subdirectory of the extracted archive
    tokenizer = AutoTokenizer.from_pretrained(f"{model_dir}/model")
    model = AutoModel.from_pretrained(f"{model_dir}/model")
    return model, tokenizer

def predict_fn(data, model_and_tokenizer, context=None):
    # Unpack the model and tokenizer
    model, tokenizer = model_and_tokenizer
    
    # Tokenize sentences
    sentences = data.pop("inputs", data)
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform pooling
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    
    # Return a JSON-serializable dictionary holding the embedding of the first input
    return {"vectors": sentence_embeddings[0].tolist()}


<a id='upload-the-model'></a>
## Upload the Model

After creating the `inference.py` script, we package it together with the fine-tuned embedding model into a single `model.tar.gz` file. This compressed file can then be uploaded to an Amazon S3 bucket, making it accessible for deployment as a SageMaker endpoint.
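
Because the archive is created with `arcname=os.path.basename(model_dir)`, every file is stored under a top-level `model/` prefix, which matches the `{model_dir}/model` path used in `inference.py`. The following sketch reproduces that layout in a temporary directory with placeholder files (the file contents are stand-ins, not real model artifacts) and lists the archive members.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Recreate the relative directory used in this notebook with placeholder files
    model_dir = os.path.join(tmp, "opt", "ml", "model")
    os.makedirs(model_dir)
    for name in ("config.json", "inference.py"):
        with open(os.path.join(model_dir, name), "w", encoding="utf-8") as f:
            f.write("placeholder")

    tar_path = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(model_dir, arcname=os.path.basename(model_dir))

    # Inspect the member names stored in the archive
    with tarfile.open(tar_path, "r:gz") as tar:
        members = sorted(tar.getnames())

print(members)  # ['model', 'model/config.json', 'model/inference.py']
```

Listing the members of the real `model.tar.gz` in the same way is a quick check that the model files and `inference.py` ended up where the inference script expects them.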

In [None]:
import boto3
import tarfile
import os

model_dir = "opt/ml/model"
model_tar_path = "model.tar.gz"

with tarfile.open(model_tar_path, "w:gz") as tar:
    tar.add(model_dir, arcname=os.path.basename(model_dir))
    
s3 = boto3.client('s3')

# Get the region name
session = boto3.Session()
region_name = session.region_name

# Get the account ID from STS (Security Token Service)
sts_client = session.client("sts")
account_id = sts_client.get_caller_identity()["Account"]

# Assumes the default SageMaker bucket for this account and Region already exists
bucket_name = f"sagemaker-{region_name}-{account_id}"
s3_key = "model_trained_embedding/model.tar.gz"
model_path = f"s3://{bucket_name}/{s3_key}"

with open(model_tar_path, "rb") as f:
    s3.upload_fileobj(f, bucket_name, s3_key)
          

<a id='deploy-model-on-sagemaker'></a>
## Deploy Model on SageMaker

Finally, we can deploy our fine-tuned model to a SageMaker endpoint using the `HuggingFaceModel` class from the SageMaker Python SDK.

In [None]:
from sagemaker.huggingface.model import HuggingFaceModel
import sagemaker


# Create the Hugging Face model object
huggingface_model = HuggingFaceModel(
    model_data=model_path,                    # S3 path to the packaged model.tar.gz
    role=sagemaker.get_execution_role(),      # IAM role with permissions to create an endpoint
    transformers_version="4.26",              # Transformers version of the inference container
    pytorch_version="1.13",                   # PyTorch version of the inference container
    py_version="py39",                        # Python version of the inference container
    entry_point="opt/ml/model/inference.py",  # Local path to the inference script
)

# Deploy the model to a SageMaker real-time endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)



<a id='invoke-the-model'></a>
## Invoke the Model

You can invoke the model using the `predict` function.

In [None]:
# Example request: pass the text to embed under the "inputs" key
data = {
    "inputs": "Are Agents fully managed?"
}

# request
predictor.predict(data)
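
Outside the notebook, the same endpoint can also be called through the SageMaker runtime API. The sketch below assumes a client created with `boto3.client("sagemaker-runtime")` and a hypothetical endpoint name. To keep it runnable without AWS access, it is exercised here with a small stand-in client that mimics the shape of the `invoke_endpoint` response.

```python
import io
import json

def embed(client, endpoint_name, text):
    """Send one text to the endpoint and return the embedding from the "vectors" field."""
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps({"inputs": text}),
    )
    # The response body is a stream containing the JSON returned by predict_fn
    return json.loads(response["Body"].read())["vectors"]

class FakeRuntimeClient:
    """Stand-in for boto3.client("sagemaker-runtime"), used only for this illustration."""

    def invoke_endpoint(self, EndpointName, ContentType, Accept, Body):
        payload = json.loads(Body)
        # Return a fixed dummy vector; the real endpoint returns a 384-dimensional embedding
        dummy = {"vectors": [0.0, 1.0], "echo": payload["inputs"]}
        return {"Body": io.BytesIO(json.dumps(dummy).encode("utf-8"))}

# "my-embedding-endpoint" is a hypothetical name; use the deployed endpoint's actual name
vector = embed(FakeRuntimeClient(), "my-embedding-endpoint", "Are Agents fully managed?")
print(vector)  # [0.0, 1.0]
```

With a real client, replacing `FakeRuntimeClient()` with `boto3.client("sagemaker-runtime")` and passing the deployed endpoint's name should return the embedding produced by `predict_fn`.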

<a id='compare-predictions'></a>
## Compare Predictions

To illustrate the impact of fine-tuning, we can compare the cosine similarity scores between two semantically related sentences using both the original pre-trained model and the fine-tuned model. A higher cosine similarity score indicates that the two sentences are more semantically similar, as their embeddings are closer in the vector space.
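
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. For the unit-length embeddings returned by the endpoint, it reduces to a plain dot product. A minimal pure-Python sketch with toy vectors (illustrative values only):

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # opposite directions

print(round(same_direction, 6), orthogonal, opposite)
```

Scores range from -1 (opposite directions) through 0 (perpendicular) to 1 (same direction), so a higher score after fine-tuning means the question and its answer sit closer together in the embedding space.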

In [None]:
from sentence_transformers import SentenceTransformer, util

pretrained_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


sentences = [
    "What are Agents, and how can they be used?",
    "Agents for Amazon Bedrock are fully managed capabilities that automatically break down tasks, create an orchestration plan, securely connect to company data through APIs, and generate accurate responses for complex tasks like automating inventory management or processing insurance claims.",
]

# Compute an embedding for each sentence with the original pre-trained model
embedding_x = pretrained_model.encode(sentences[0], convert_to_tensor=True)
embedding_y = pretrained_model.encode(sentences[1], convert_to_tensor=True)

# Cosine similarity between the question and the answer
util.pytorch_cos_sim(embedding_x, embedding_y)

In [None]:
from sentence_transformers import SentenceTransformer, util

data1 = {
   "inputs": 
    "What are Agents, and how can they be used?"
}

data2 = {
   "inputs": 
    "Agents for Amazon Bedrock are fully managed capabilities that automatically break down tasks, create an orchestration plan, securely connect to company data through APIs, and generate accurate responses for complex tasks like automating inventory management or processing insurance claims."
}



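# Request embeddings from the fine-tuned model behind the endpoint
el1 = predictor.predict(data1)
el2 = predictor.predict(data2)

# Cosine similarity between the question and the answer
util.pytorch_cos_sim(el1["vectors"], el2["vectors"])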
