# Fine-Tuning Sentence Transformers on SageMaker
This notebook shows how to launch a SageMaker training job using a `trainer.py` script for fine-tuning a Sentence Transformer model on custom data. You will also do some preliminary evaluation in this notebook, and additional evaluation in the `02-embeddings-eval.ipynb` notebook.

## Setup Dependencies
Initialize the SageMaker session and retrieve the execution role.

In [None]:
%pip install sentence-transformers==3.1.1 datasets==2.19.2 transformers==4.40.2

In [None]:
import boto3
import json
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput

from botocore.exceptions import ClientError

from datasets import load_dataset

# Initialize SageMaker session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()

For training data, you'll pull samples from the [PubMedQA dataset](https://huggingface.co/datasets/qiaojin/PubMedQA). It contains prebuilt question/context/answer sets on complex medical topics, which are used here to tune the embeddings to the medical domain.
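
As a rough sketch of the record shape this notebook relies on, the snippet below maps one source item to a question/context pair. The field values are invented for illustration, and `to_pair` is a hypothetical helper name; only the keys `pubid`, `question`, and `context["contexts"]` come from the processing code in this notebook.

```python
# Hypothetical record: values are invented, keys mirror those read by process_dataset.
sample_item = {
    "pubid": 12345,
    "question": "Does drug X reduce symptom Y?",
    "context": {"contexts": ["Background sentence.", "Methods sentence."]},
}


def to_pair(item):
    """Keep the id, the question, and only the first context passage."""
    return {
        "id": item["pubid"],
        "question": item["question"],
        "context": item["context"]["contexts"][0],
    }


pair = to_pair(sample_item)
print(pair)
```

Keeping only the first context passage gives one positive passage per question, which is the pairing the evaluation later assumes.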

In [None]:
source_dataset = load_dataset("qiaojin/PubMedQA", "pqa_artificial")
source_dataset["train"][0]

In [None]:
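import os

def process_dataset(input_dataset, output_filename, max_items=-1):
    """Write id/question/first-context records from input_dataset to a JSON file."""
    output_data = []

    if max_items > -1:
        # Never request more rows than the dataset holds
        max_items = min(max_items, len(input_dataset))
        print(f"max_items set, reducing input to {max_items} items.")
    else:
        max_items = len(input_dataset)

    for idx, item in enumerate(input_dataset.select(range(max_items))):
        data_item = {
            "id": item["pubid"],
            "question": item["question"],
            "context": item["context"]["contexts"][0]
        }
        output_data.append(data_item)

        print(f"item: {idx+1}", end="\r")

    # Create the output directory if needed, then write the records to the output file
    os.makedirs(os.path.dirname(output_filename) or ".", exist_ok=True)
    with open(output_filename, 'w', encoding='utf-8') as f:
        json.dump(output_data, f, ensure_ascii=False, indent=4)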

This will take 9000 items from the source dataset to train the model.

In [None]:
process_dataset(source_dataset["train"],"./data/base_data/base_data.json", max_items=9000)

Shuffle the data and split it 90/10 into train and test sets, then upload both to S3 for use in a SageMaker managed training job.
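
To make the shuffle-then-split step concrete, here is a minimal standard-library sketch. `shuffle_split` is a hypothetical helper, not the `datasets` implementation, and the library's own rounding and random number generator may differ.

```python
import random


def shuffle_split(records, test_size=0.10, seed=42):
    """Shuffle a copy of records, then hold out the last test_size fraction."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[:-n_test], shuffled[-n_test:]


train, test = shuffle_split(range(9000))
print(len(train), len(test))  # 8100 900
```

Fixing the seed makes the split reproducible, which helps when comparing runs against the same held-out questions.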

In [None]:
training_dataset = load_dataset("json", data_files="./data/base_data/base_data.json", split="train")

training_dataset = training_dataset.train_test_split(test_size=0.10, shuffle=True)

prefix = "embedding-finetuning" 
s3_output_path = f"s3://{bucket}/{prefix}/output"

local_data_path = f"./data/{prefix}"
s3_data_path = f"s3://{bucket}/{prefix}"

training_dataset["train"].to_json(f"{local_data_path}/train/train.json", orient="records")
train_dataset_s3_path = f"{s3_data_path}/train/train.json"
training_dataset["train"].to_json(train_dataset_s3_path, orient="records")

training_dataset["test"].to_json(f"{local_data_path}/test/test.json", orient="records")
test_dataset_s3_path = f"{s3_data_path}/test/test.json"
training_dataset["test"].to_json(test_dataset_s3_path, orient="records")

print("Training data uploaded to:")
print(train_dataset_s3_path)
print(test_dataset_s3_path)
print(f"\nYou can view the uploaded dataset in the console here: \nhttps://s3.console.aws.amazon.com/s3/buckets/{sagemaker_session.default_bucket()}/?region={sagemaker_session.boto_region_name}&prefix={s3_data_path.split('/', 3)[-1]}/")

## Configure PyTorch Estimator
Configure the training job to run `trainer.py` with the desired hyperparameters.

Here you configure:
- the path to your training script
- any library upgrades necessary
- the IAM role for the training job to assume, providing it access to the training data and other resources
- the instance type and count to be used in the training job
- the PyTorch and Python versions (since you are using the PyTorch estimator here)
- training hyperparameters (number of training epochs, training batch size, base model to use for training)
- `keep_alive_period_in_seconds` if you want to use SageMaker warm pools between iterative runs.
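
In SageMaker script mode, the estimator's hyperparameters reach the entry point as `--name value` command-line arguments, and the data channels and model directory are exposed through environment variables such as `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR`. The real `trainer.py` is not shown in this notebook, so the following is only a hypothetical sketch of how such a script might read them; the argument names `--train_dir`, `--validation_dir`, and `--model_dir` and the local fallback paths are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided paths (hypothetical layout)."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as --name value pairs on the command line.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--model_name", type=str, required=True)
    # Channel and model directories come from environment variables inside the container;
    # the fallbacks are assumed local paths for running outside SageMaker.
    parser.add_argument("--train_dir", default=os.environ.get("SM_CHANNEL_TRAIN", "./data/train"))
    parser.add_argument("--validation_dir", default=os.environ.get("SM_CHANNEL_VALIDATION", "./data/validation"))
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR", "./model"))
    return parser.parse_args(argv)


args = parse_args(["--epochs", "4", "--batch_size", "16", "--model_name", "Alibaba-NLP/gte-base-en-v1.5"])
print(args.epochs, args.batch_size, args.model_name)
```

Passing `argv` explicitly keeps the parser easy to exercise from a notebook cell without touching `sys.argv`.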

>Note: when configuring the training job, adding additional data seemed to perform better than adding more epochs.

In [None]:
estimator = PyTorch(
    entry_point="scripts/trainer.py",
    source_dir=".",
    requirements_file="requirements.txt",
    role=role,
    instance_count=1,
    instance_type="ml.g5.12xlarge",
    framework_version="2.2.0",     # ✅ Supports SDPA and FlashAttention-2
    py_version="py310",            # ✅ Python 3.10 for modern libraries
    hyperparameters={
        "epochs": 4,
        "batch_size": 16,
        "model_name": "Alibaba-NLP/gte-base-en-v1.5"
    },
    output_path=s3_output_path,
    base_job_name="embedding-finetune",
    keep_alive_period_in_seconds=1800
)


## Launch Training Job
This command will start the SageMaker training job using the uploaded data.
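
Inside the training container, each channel passed to `fit` appears as a directory such as `/opt/ml/input/data/train`. Because `to_json` in the `datasets` library writes JSON Lines by default, the uploaded files hold one record per line. The sketch below shows one way a training script could read such a file with the standard library; `read_json_lines` is a hypothetical helper, and a temporary directory stands in for the channel directory.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Stand-in for a channel directory such as /opt/ml/input/data/train.
channel_dir = tempfile.mkdtemp()
sample_path = os.path.join(channel_dir, "train.json")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write('{"id": 1, "question": "q1", "context": "c1"}\n')
    f.write('{"id": 2, "question": "q2", "context": "c2"}\n')

records = read_json_lines(sample_path)
print(len(records), records[0]["question"])
```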

In [None]:
estimator.fit({
    "train": TrainingInput(train_dataset_s3_path, content_type="application/json"),
    "validation": TrainingInput(test_dataset_s3_path, content_type="application/json")
})

## 🔍 Evaluate Tuned Model from the SageMaker Training Job

In [None]:
import json
import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import (
    InformationRetrievalEvaluator,
    SequentialEvaluator,
)
from sentence_transformers.util import cos_sim
from datasets import load_dataset, concatenate_datasets

Set the `max_items` parameter to choose how large a test to run.

In [None]:
source_dataset = load_dataset("qiaojin/PubMedQA", "pqa_artificial")

process_dataset(source_dataset["train"], "./data/test_full.json", max_items=10)

In [None]:
# load test dataset
dataset = load_dataset("json", data_files="./data/test_full.json", split="train")
# hold out 10% of the rows as the test set
dataset = dataset.train_test_split(test_size=0.1)
 
# save datasets to disk
dataset["train"].to_json("./data/test_train_dataset.json", orient="records")
dataset["test"].to_json("./data/test_test_dataset.json", orient="records")

dataset["train"][0]

Here you combine the test and training splits into a single document corpus that you can use for evaluation. This is a subset of the overall corpus, which you can choose to run against later in the notebook.

In [None]:
# load train dataset again
train_dataset = load_dataset("json", data_files="./data/test_train_dataset.json", split="train")
test_dataset = load_dataset("json", data_files="./data/test_test_dataset.json", split="train")
corpus_dataset = concatenate_datasets([train_dataset, test_dataset])

### Evaluate Base Model first

In [None]:
#base_model_id = "sentence-transformers/all-MiniLM-L6-v2"
base_model_id = "Alibaba-NLP/gte-base-en-v1.5"
base_model_id_safe = base_model_id.replace("/","_")

# Evaluate the BASE model
model = SentenceTransformer(
    base_model_id, 
    model_kwargs={"attn_implementation": "sdpa"},
    trust_remote_code=True,
    device="cuda" if torch.cuda.is_available() else "cpu"
)

Next, set the embedding dimensions to evaluate. The `matryoshka_dimensions` must be listed in descending order, and the largest value cannot exceed the base model's embedding size. The list is passed to the Matryoshka loss defined below and is reused to configure one evaluator per dimension.
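
To illustrate what evaluating at a smaller dimension means, the sketch below truncates toy vectors to their leading components and compares them with cosine similarity. The 8-dimensional values are invented stand-ins for 768-dimensional embeddings, and `truncate_and_normalize` is a hypothetical helper rather than a library function.

```python
import math


def truncate_and_normalize(vector, dim):
    """Keep the first dim components and rescale to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


# Toy 8-dimensional vectors standing in for 768-dimensional embeddings.
query = [0.9, 0.1, 0.3, 0.2, 0.05, 0.02, 0.01, 0.03]
doc = [0.8, 0.2, 0.25, 0.1, 0.01, 0.07, 0.02, 0.01]

for dim in (8, 4, 2):
    q, d = truncate_and_normalize(query, dim), truncate_and_normalize(doc, dim)
    print(dim, round(cosine(q, d), 4))
```

Matryoshka-style training aims to keep the leading components informative, so that rankings computed on the truncated vectors stay close to those from the full vectors.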

In [None]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

# Important: large to small, the max dimension cannot be greater than the embedding model's max
matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

Next, build the evaluation inputs: the full document corpus to search against, keyed by id so that retrieval accuracy can be checked, and a set of queries drawn from the test split.
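
The three structures involved can be sketched with toy rows, under the assumption that each question's only relevant passage is the one it was paired with in the source data. All ids and texts below are invented.

```python
# Toy rows standing in for corpus_dataset / test_dataset (values are invented).
corpus_rows = [
    {"id": 101, "question": "q-a", "context": "passage a"},
    {"id": 102, "question": "q-b", "context": "passage b"},
    {"id": 103, "question": "q-c", "context": "passage c"},
]
test_rows = corpus_rows[2:]  # held-out rows supply the queries

corpus = {row["id"]: row["context"] for row in corpus_rows}   # cid -> document
queries = {row["id"]: row["question"] for row in test_rows}   # qid -> question
# Each question's only relevant passage is the one it was paired with.
relevant_docs = {qid: {qid} for qid in queries}

print(relevant_docs)  # {103: {103}}
```

Because query ids and corpus ids share the same id space, a retrieval counts as correct when the passage with the query's own id is returned.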

In [None]:
# Convert the datasets to dictionaries
corpus = dict(
    zip(corpus_dataset["id"], corpus_dataset["context"])
)  # Our corpus (cid => document)
queries = dict(
    zip(test_dataset["id"], test_dataset["question"])
)  # Our queries (qid => question)

In [None]:
relevant_docs = {}  # Query ID to relevant documents (qid => set of relevant cids)
for q_id in queries:
    # Each question's relevant document is the context it was paired with
    relevant_docs[q_id] = {q_id}

In [None]:
matryoshka_evaluators = []
# Iterate over the different dimensions
for dim in matryoshka_dimensions:
    ir_evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name=f"dim_{dim}",
        truncate_dim=dim,  # Truncate the embeddings to a certain dimension
        score_functions={"cosine": cos_sim},
    )
    matryoshka_evaluators.append(ir_evaluator)
 
# Create a sequential evaluator
evaluator = SequentialEvaluator(matryoshka_evaluators)

In [None]:
# Evaluate the BASE model
model = SentenceTransformer(
    base_model_id, 
    model_kwargs={"attn_implementation": "sdpa"},
    trust_remote_code=True,
    device="cuda" if torch.cuda.is_available() else "cpu"
)
base_results = evaluator(model)

print("===============\nBASE MODEL\n===============")

# Uncomment for full results
# print(base_results)
 
# Print the main score
import pandas as pd
data = {'dimension':[], 'base': []}

for dim in matryoshka_dimensions:
    key = f"dim_{dim}_cosine_ndcg@10"
    data['dimension'].append(key)
    data['base'].append(base_results[key])
    
df = pd.DataFrame(data)
df

### Evaluate Tuned Model

This cell locates the model artifact produced by the training job, downloads it, and unpacks it locally so it can be used for quick evaluation. You can skip this section if you already have the model artifacts downloaded.
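
Downloading the artifact requires splitting its S3 URI into a bucket and a key. One way to do that with the standard library is sketched below; `split_s3_uri` is a hypothetical helper, and the URI is a made-up example rather than a real job output.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split s3://bucket/key/parts into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical artifact location; a real one comes from the training job.
bucket_name, key = split_s3_uri(
    "s3://example-bucket/embedding-finetuning/output/job-1/output/model.tar.gz"
)
print(bucket_name, key)
```

Validating the scheme up front surfaces a clear error if the job has no model artifact yet, instead of a less obvious failure during download.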

In [None]:
import sagemaker
from sagemaker.estimator import Estimator
import tarfile
import os

# Step 1: Attach to the completed training job
job_name = "<<YOUR_TRAINING_JOB_NAME>>"  # <-- replace with your actual job name
estimator = Estimator.attach(job_name)

# Step 2: Download the model tar.gz file
model_tar_path = estimator.model_data
local_model_path = "./downloaded_model"
os.makedirs(local_model_path, exist_ok=True)

# Step 3: Download and extract
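s3 = sagemaker.Session().boto_session.resource("s3")
# Use distinct names so the notebook-level `bucket` variable is not overwritten
model_bucket, model_key = model_tar_path.replace("s3://", "").split("/", 1)
s3.Bucket(model_bucket).download_file(model_key, f"{local_model_path}/model.tar.gz")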

with tarfile.open(f"{local_model_path}/model.tar.gz") as tar:
    tar.extractall(path=local_model_path)

print("✅ Model downloaded and extracted!")

In [None]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
from sentence_transformers.evaluation import InformationRetrievalEvaluator, SequentialEvaluator
import torch

# Adjust path if you're pointing to model artifacts from SageMaker output
tuned_model_path = "downloaded_model"
model = SentenceTransformer(
    tuned_model_path,
    model_kwargs={"attn_implementation": "sdpa"},
    trust_remote_code=True,
    device="cuda" if torch.cuda.is_available() else "cpu"
)


In [None]:
# Evaluate the TUNED model
tuned_results = evaluator(model)

print("===============\nTUNED MODEL\n===============")
 
# Uncomment for full results
# print(tuned_results)
 
# Print the main score
import pandas as pd
data = {'dimension':[], 'tuned': []}

for dim in matryoshka_dimensions:
    key = f"dim_{dim}_cosine_ndcg@10"
    data['dimension'].append(key)
    data['tuned'].append(tuned_results[key])
    
df = pd.DataFrame(data)
df

In [None]:
# Optional: Compare base vs tuned
data = {'dimension':[], 'base': [], 'tuned': [], 'delta': [], 'delta_percent': []}
for dim in matryoshka_dimensions:
    key = f"dim_{dim}_cosine_ndcg@10"
    delta = tuned_results[key] - base_results[key]
    # Avoid dividing by zero when the base score is 0 (possible on very small test sets)
    delta_percent = (delta / base_results[key]) * 100 if base_results[key] else float("nan")
    data['dimension'].append(key)
    data['base'].append(base_results[key])
    data['tuned'].append(tuned_results[key])
    data['delta'].append(delta)
    data['delta_percent'].append(delta_percent)
df = pd.DataFrame(data)
df

In this comparison of the base and tuned models, you can see retrieval improvements in every metric and at every dimension. The higher dimensions showed smaller improvements, while the lower dimensions showed significant gains. Note that the 64-dimension results for the tuned model are as good as or better than the 768-dimension results. If implemented, this could lead to significant improvements in search performance, or enable a two-tier system in which the first pass over the dataset is done at low dimensionality and the result set is then re-evaluated at full dimensionality.
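
The two-tier idea can be sketched as follows: shortlist candidates using only the leading components of each vector, then rerank the shortlist with the full vectors. This is an illustrative pure-Python sketch on invented 4-dimensional vectors, not a production search implementation; `two_tier_search` and its parameters are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_tier_search(query, corpus, low_dim, shortlist_size, top_k):
    """Shortlist with truncated vectors, then rerank the shortlist at full size."""
    coarse = sorted(
        corpus,
        key=lambda cid: cosine(query[:low_dim], corpus[cid][:low_dim]),
        reverse=True,
    )[:shortlist_size]
    return sorted(coarse, key=lambda cid: cosine(query, corpus[cid]), reverse=True)[:top_k]


# Toy 4-dimensional embeddings (invented values) keyed by document id.
corpus = {
    "a": [0.9, 0.1, 0.0, 0.4],
    "b": [0.8, 0.2, 0.5, 0.1],
    "c": [0.1, 0.9, 0.2, 0.3],
}
query = [0.85, 0.15, 0.45, 0.1]
print(two_tier_search(query, corpus, low_dim=2, shortlist_size=2, top_k=1))
```

The first pass touches every document but with short vectors, while the more expensive full-size comparison runs only on the shortlist.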

- NDCG (Normalized Discounted Cumulative Gain) evaluates the quality of a ranked list of retrieved documents. It rewards rankings that place relevant documents near the top, taking into account both each document's relevance and its position in the ranking.

- Accuracy@k measures the proportion of queries for which at least one relevant document appears in the top k results.

- Precision@k measures the proportion of the top k results that are relevant.

- Recall@k measures the proportion of all relevant documents that are retrieved in the top k results.

- Mean Reciprocal Rank (MRR) evaluates how well an embedding model ranks the most relevant items. It focuses on the position of the first relevant item in the ranked list of embeddings.

    The reciprocal rank is calculated as 1 divided by the rank of the first relevant item. For example, if the first relevant item is ranked 3rd, the reciprocal rank would be 1/3. MRR is then calculated as the average of these reciprocal ranks across multiple queries or test cases. 
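
As a worked example of these definitions for a single query, the sketch below computes each metric at k from a ranked list of document ids, assuming binary relevance. `metrics_at_k` is a hypothetical helper; the evaluator used in this notebook averages such per-query values over all queries.

```python
import math


def metrics_at_k(ranked_ids, relevant_ids, k=10):
    """Compute accuracy, precision, recall, MRR and NDCG at k for one query."""
    top_k = ranked_ids[:k]
    hits = [1 if doc_id in relevant_ids else 0 for doc_id in top_k]
    # Rank (1-based) of the first relevant document, or None if there is none.
    first_hit = next((rank for rank, hit in enumerate(hits, start=1) if hit), None)
    dcg = sum(hit / math.log2(rank + 1) for rank, hit in enumerate(hits, start=1))
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return {
        "accuracy": 1.0 if first_hit else 0.0,
        "precision": sum(hits) / k,
        "recall": sum(hits) / len(relevant_ids),
        "mrr": 1 / first_hit if first_hit else 0.0,
        "ndcg": dcg / idcg if idcg else 0.0,
    }


# One relevant document, retrieved at rank 3 (toy ids).
scores = metrics_at_k(["d7", "d2", "d5", "d9"], {"d5"}, k=10)
print(scores)
```

With exactly one relevant document per query, as in this notebook's setup, accuracy@k and recall@k coincide and precision@k equals recall@k divided by k, so their relative improvements move together.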

### Improvements
- Normalized Discounted Cumulative Gain (NDCG): 9.45% to 29.3%
- Accuracy: 5.5% to 21.6%
- Precision: 5.5% to 21.6%
- Recall: 5.5% to 21.6%
- MRR: 10.8% to 32.4%

In [None]:
# Optional: Compare base vs tuned across all metrics
data = {'dimension':[], 'base': [], 'tuned': [], 'delta': [], 'delta_percent': []}
metrics = ["ndcg", "accuracy", "precision", "recall", "mrr"]

for metric in metrics:
    for dim in matryoshka_dimensions:
        key = f"dim_{dim}_cosine_{metric}@10"
        delta = tuned_results[key] - base_results[key]
        # Avoid dividing by zero when the base score is 0 (possible on very small test sets)
        delta_percent = (delta / base_results[key]) * 100 if base_results[key] else float("nan")
        data['dimension'].append(key)
        data['base'].append(base_results[key])
        data['tuned'].append(tuned_results[key])
        data['delta'].append(delta)
        data['delta_percent'].append(delta_percent)
df = pd.DataFrame(data)
df