# Supervised Fine-Tuning (SFT) with Parameter-Efficient Fine-Tuning (PEFT LoRA) of Amazon Nova using Amazon SageMaker Training Job

You can customize Amazon Nova models through base recipes using Amazon SageMaker training jobs. These recipes support Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), with both Full-Rank and Low-Rank Adaptation (LoRA) options.

The end-to-end customization workflow involves stages like model training, model evaluation, and deployment for inference. This model customization approach on SageMaker AI provides greater flexibility and control to fine-tune its supported Amazon Nova models, optimize hyperparameters with precision, and implement techniques including LoRA Parameter-Efficient Fine-Tuning (PEFT), Full-Rank Supervised Fine-Tuning, and Direct Preference Optimization (DPO).

This notebook demonstrates Supervised Fine-Tuning (SFT) with Parameter-Efficient Fine-Tuning (PEFT) of Amazon Nova using Amazon SageMaker Training Job. SFT is a technique that allows fine-tuning language models on specific tasks using labeled examples, while PEFT enables efficient fine-tuning by updating only a small subset of the model's parameters.


> _**Note:** This notebook demonstrates fine-tuning using Nova Lite, but the same techniques can be applied to Nova Pro or Nova Micro models with appropriate adjustments to the configuration._


## Installing Dependencies

The first cell installs the required Python packages for this notebook. For details on other prerequisites, see the [AWS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-general-prerequisites.html).

In [None]:
!pip install -r ./requirements.txt --upgrade

***

## Step 0: Prerequisites

This section sets up the necessary AWS credentials and SageMaker session to run the notebook. You'll need proper IAM permissions to use SageMaker.


If you run SageMaker from a local environment, you need access to an IAM role with the required SageMaker permissions. See the [SageMaker roles documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for details.

The code initializes a SageMaker session, sets up the IAM role, and configures the S3 bucket for storing training data and model artifacts.


In [None]:
import sagemaker
import boto3

sess = sagemaker.Session()
sagemaker_session_bucket = None

if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
bucket_name = sess.default_bucket()
default_prefix = sess.default_bucket_prefix

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

### Low-rank adapter fine tuning


Low-Rank Adaptation (LoRA) fine-tuning is one of the most effective and cost-efficient ways to improve base model performance. The underlying principle is that only a small number of additional weights need to be updated to adapt a model to new tasks or domains. LoRA introduces low-rank trainable weight matrices into specific model layers, which reduces the number of trainable parameters while maintaining model quality. A LoRA adapter augments the base foundation model with lightweight adapter layers that modify the model's weights during inference, while the original model parameters stay frozen. For more information, see Fine-tune models with adapter inference components.
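
To make the parameter savings concrete, here is a minimal sketch that compares the number of trainable weights for a full update of one weight matrix against a low-rank update. The layer size (4096 x 4096) and the rank (8) are illustrative assumptions, not values taken from Amazon Nova or its recipes.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int, float]:
    """Compare trainable parameters for a full update vs. a low-rank update.

    A full update trains the whole d_out x d_in matrix W. A low-rank update
    trains two small factors, B (d_out x rank) and A (rank x d_in), and uses
    W + B @ A as the effective weight while W itself stays frozen.
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora, lora / full


# Hypothetical layer size and rank, chosen only for illustration
full, lora, ratio = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(f"full update: {full:,} trainable weights")
print(f"LoRA update: {lora:,} trainable weights ({ratio:.2%} of full)")
```

For this hypothetical layer, the low-rank factors hold well under 1% of the weights that a full update would train.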

When is LoRA fine-tuning recommended?

* Start with LoRA fine-tuning in general, because it trains quickly.
* Use LoRA when the base model performance is already satisfactory and the goal is to enhance the model's capabilities across multiple related tasks, such as text summarization and language translation. LoRA's regularization properties help prevent overfitting and mitigate "forgetting" of the source domain, so the model remains versatile and adaptable to various applications.
* Consider LoRA for instruction fine-tuning (IFT) scenarios with relatively small datasets. LoRA performs better with smaller, task-specific datasets than with broader, larger datasets.
* Use LoRA on Amazon SageMaker AI when you have a larger labeled dataset that exceeds the Bedrock Customization data limits.
* Use LoRA on SageMaker AI when you have already achieved promising results through Bedrock Customization and want to further optimize hyperparameters.

![lora-arch](imgs/lora_based_arch.png)


***

## Step 1: Prepare the dataset

In this example, we load the [ibm-research/finqa](https://huggingface.co/datasets/ibm-research/finqa) dataset, an open-source financial dataset with 8k Q&A pairs over 2.8k financial reports, designed to study numerical reasoning over structured and unstructured evidence.

### Understanding the Nova Format

Let's format the dataset by using the prompt style for Amazon Nova:

```
{
    "system": [{"text": Content of the System prompt}],
    "messages": [
        {
            "role": "user",
            "content": [{"text": Content of the user prompt}]
        },
        {
            "role": "assistant",
            "content": [{"text": Content of the answer}]
        },
        ...
    ]
}
```
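
As a sanity check on this structure, the sketch below flags records that do not match the shape shown above. The helper name `nova_format_problems` is hypothetical, and treating `system` as optional is an assumption of this sketch rather than a documented rule.

```python
from typing import Any, Dict, List


def nova_format_problems(record: Dict[str, Any]) -> List[str]:
    """Return a list of structural problems; an empty list means the record looks fine."""
    problems: List[str] = []

    def check_text_items(items: Any, where: str) -> None:
        # Each content entry is expected to be a dict with a string "text" field
        if not isinstance(items, list) or not items:
            problems.append(f"{where} must be a non-empty list")
            return
        for idx, item in enumerate(items):
            if not (isinstance(item, dict) and isinstance(item.get("text"), str)):
                problems.append(f"{where}[{idx}] must look like {{'text': '...'}}")

    # "system" is only checked when it is present and non-empty (sketch assumption)
    if record.get("system"):
        check_text_items(record["system"], "system")

    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("messages must be a non-empty list")
        return problems

    for idx, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"messages[{idx}] must be an object")
            continue
        if message.get("role") not in ("user", "assistant"):
            problems.append(f"messages[{idx}].role must be 'user' or 'assistant'")
        check_text_items(message.get("content"), f"messages[{idx}].content")
    return problems


good = {
    "system": [{"text": "You are a helpful assistant."}],
    "messages": [
        {"role": "user", "content": [{"text": "Hi"}]},
        {"role": "assistant", "content": [{"text": "Hello"}]},
    ],
}
bad = {"messages": [{"role": "user", "content": ["text"]}]}

print(nova_format_problems(good))
print(nova_format_problems(bad))
```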

### Step 1.3: Data Preprocessing 

The notebook defines utility functions that clean role prefixes from the dataset content and convert FinQA examples into the Nova formats:

```python
def clean_prefix(content):
    # Removes prefixes like "USER:", "ASSISTANT:", etc.
    ...

def clean_message_list(message_list):
    # Applies clean_prefix to every text item in a Nova-format message list
    ...

def finqa_to_standard_format(example):
    # Builds system/user/assistant messages from a FinQA example
    ...

def convert_to_nova_format(example):
    # Converts the messages into the nested Nova training format
    ...

def convert_to_nova_test_format(example):
    # Converts the messages into flat system/query/response strings
    ...
```

In [None]:
import json
import re
from typing import Dict, Any, List
from datasets import load_dataset, DatasetDict, Dataset
from random import randint

# --- Utility Functions ---

def clean_prefix(content):
    """Remove conversational prefixes from content."""
    prefixes = [ "SYSTEM:", "System:", "USER:", "User:", "ASSISTANT:", "Assistant:", "Bot:", "BOT:", ]
    if isinstance(content, str):
        lines = content.split("\n")
        cleaned_lines = []
        for line in lines:
            cleaned_line = line.strip()
            for prefix in prefixes:
                if cleaned_line.startswith(prefix):
                    cleaned_line = cleaned_line[len(prefix) :].strip()
                    break
            cleaned_lines.append(cleaned_line)
        return "\n".join(cleaned_lines)
    return content

# Final cleanup step for the text items of Nova-format messages
def clean_message_list(message_list):
    """Applies clean_prefix to content text within the Nova format structure."""
    if not isinstance(message_list, list):
        return message_list
    
    cleaned = []
    for item in message_list:
        if item.get("content"):
            new_content = []
            for content_item in item["content"]:
                if isinstance(content_item, dict) and "text" in content_item:
                    # Strip any remaining role prefixes from the text
                    content_item["text"] = clean_prefix(content_item["text"])
                    new_content.append(content_item)
            item["content"] = new_content
            cleaned.append(item)
    return cleaned

# --- FinQA Data Processing Functions ---

def finqa_to_standard_format(example: Dict[str, Any]) -> Dict[str, Any]:
    """Converts a FinQA example into the intermediate standard message list format."""
    
    # 1. Format Table for prompt
    table_data = example.get('table', [])
    table_str = ""
    # Render the table as Markdown: header row, separator row, then data rows
    if table_data and isinstance(table_data, list) and table_data[0]:
        header = table_data[0]
        rows = table_data[1:]
        table_str += "| " + " | ".join(map(str, header)) + " |\n"
        table_str += "| " + " | ".join(["---"] * len(header)) + " |\n"
        for row in rows:
            table_str += "| " + " | ".join(map(str, row)) + " |\n"
    else:
        table_str = "No structured table data provided."
    
    user_prompt = f"""Given the following financial context and table data:

---
CONTEXT: {example['pre_text']}

TABLE:
{table_str}

---
QUESTION: {example['question']}

Please generate the step-by-step calculation program to answer the question."""

    assistant_response = example['program_re']

    messages: List[Dict[str, Any]] = [
        {"role": "system", "content": "You are a specialized financial analysis AI. Your task is to convert financial questions into accurate executable calculation programs based on provided context and data."},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": assistant_response}
    ]
    return {"messages": messages}

def convert_to_nova_format(example: Dict[str, Any]) -> Dict[str, Any]:
    """Converts the standard message list into the Nova format structure."""
    standard_messages = example.get("messages", [])
    if not standard_messages:
        return {"system": [], "messages": []}

    nova_system = []
    nova_messages = []

    for i, msg in enumerate(standard_messages):
        role = msg["role"]
        content = msg["content"]
        cleaned_content = clean_prefix(content)
        
        # Required Nova structure: content is [{"text": ...}]
        nova_content_struct = [{"text": cleaned_content}]
        
        if role == "system" and i == 0:
            nova_system = nova_content_struct
        else:
            nova_messages.append({"role": role, "content": nova_content_struct})
            
    # Return the new keys
    return {"system": nova_system, "messages": nova_messages}


def convert_to_nova_test_format(example: Dict[str, Any]) -> Dict[str, Any]:
    """
    Converts the standard message list into the Nova test/evaluation format.
    Required format: {"system": str, "query": str, "response": str}
    """
    standard_messages = example.get("messages", [])
    if not standard_messages:
        return {"system": "", "query": "", "response": ""}
    
    system_content = ""
    query_content = ""
    response_content = ""
    
    for msg in standard_messages:
        role = msg["role"]
        content = clean_prefix(msg["content"])
        
        if role == "system":
            system_content = content
        elif role == "user":
            query_content = content
        elif role == "assistant":
            response_content = content
    
    return {
        "system": system_content,
        "query": query_content,
        "response": response_content
    }
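
The table-formatting step inside `finqa_to_standard_format` can be checked in isolation. The sketch below mirrors that logic in a standalone helper (the name `table_to_markdown` and the sample rows are made up for illustration and are not taken from FinQA).

```python
from typing import List


def table_to_markdown(table: List[List[str]]) -> str:
    """Render a list-of-rows table as a Markdown table (first row is the header)."""
    if not table or not table[0]:
        return "No structured table data provided."
    header, rows = table[0], table[1:]
    lines = [
        "| " + " | ".join(map(str, header)) + " |",
        "| " + " | ".join(["---"] * len(header)) + " |",
    ]
    lines += ["| " + " | ".join(map(str, row)) + " |" for row in rows]
    return "\n".join(lines) + "\n"


# Made-up rows for illustration
sample = [["year", "revenue"], ["2019", "100"], ["2020", "120"]]
print(table_to_markdown(sample))
```

Rendering the table as Markdown keeps the row and column structure visible to the model inside a plain-text prompt.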

## Loading dataset and applying preprocessing steps

In [None]:
print("1. Loading FinQA DatasetDict (all splits)...")
dataset_dict = load_dataset("ibm-research/finqa", trust_remote_code=True)
final_dataset_dict = DatasetDict()

# Loop through each split (train, validation, test)
for split_name in dataset_dict.keys():
    print(f"\nProcessing split: {split_name}...")
    
    current_dataset = dataset_dict[split_name]
    initial_features = list(current_dataset.features.keys())
    
    # --- Step 1: Convert FinQA structure to standard message list ---
    processed_std = current_dataset.map(
        finqa_to_standard_format, 
        remove_columns=initial_features
    )
    
    # --- Step 2: Convert to appropriate Nova format based on split ---
    if split_name == "test":
        # Test split: Use flat string format (system/query/response)
        processed_nova = processed_std.map(convert_to_nova_test_format)
        
        # Convert to pandas for any final cleanup if needed
        processed_df = processed_nova.to_pandas()
        
        # Convert back to Dataset
        final_dataset = Dataset.from_pandas(processed_df)
        
    else:
        # Train/Validation splits: Use nested Nova format (system array + messages array)
        processed_nova = processed_std.map(convert_to_nova_format)
        
        # --- Step 3: Apply cleanup and finalize structure ---
        # Convert to pandas to apply column-wise function (clean_message_list)
        processed_df = processed_nova.to_pandas()
        
        # Apply the cleaning function to the 'messages' column
        # Note: 'system' is separate in Nova format, so clean_message_list only applies to 'messages'.
        processed_df["messages"] = processed_df["messages"].apply(clean_message_list)
        
        # Convert back to Dataset
        final_dataset = Dataset.from_pandas(processed_df)
    
    # Assign the final processed split to the DatasetDict
    final_dataset_dict[split_name] = final_dataset
    
    # Print example for verification
    if final_dataset.num_rows > 0:
        rand_index = randint(0, final_dataset.num_rows - 1)
        print(f"\nExample from {split_name} split (Row {rand_index}):")
        print(json.dumps(final_dataset[rand_index], indent=2))

print("\n--- FINAL RESULT ---")
print(final_dataset_dict)

### Step 1.4: Splitting data into test, train, and validation sets

In [None]:
from datasets import Dataset, DatasetDict
from random import randint

# 1. Define the datasets using the correct keys from final_dataset_dict
test_dataset = final_dataset_dict["test"]
val_dataset = final_dataset_dict["validation"]
train_dataset = final_dataset_dict["train"]

rand_index = randint(0, len(test_dataset) - 1)
print(f"Sampling a random example from test_dataset (Index {rand_index}):")
print(test_dataset[rand_index])

# choose a small number of samples for the training job
N_SAMPLES = 100

# 2. Get the first N_SAMPLES from each split
# Note: If a split has fewer than N_SAMPLES samples, all available samples are returned.
test_subset = test_dataset.select(range(min(N_SAMPLES, len(test_dataset))))
val_subset = val_dataset.select(range(min(N_SAMPLES, len(val_dataset))))
train_subset = train_dataset.select(range(min(N_SAMPLES, len(train_dataset))))

# 3. Verification and Output
print(f"Original Test Size: {len(test_dataset)}")
print(f"Subset Test Size: {len(test_subset)}")
print(f"Subset Train Size: {len(train_subset)}")
print(f"Subset Validation Size: {len(val_subset)}")
print("-" * 30)
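
The cell above takes the first `N_SAMPLES` rows of each split. If a random subset is preferred, one option is to draw reproducible indices and pass them to `Dataset.select`. This is a sketch under that assumption; the helper name `sample_indices` and the seed are hypothetical.

```python
import random
from typing import List


def sample_indices(n_total: int, n_samples: int, seed: int = 42) -> List[int]:
    """Pick up to n_samples distinct row indices, reproducibly for a given seed."""
    rng = random.Random(seed)
    k = min(n_samples, n_total)
    return sorted(rng.sample(range(n_total), k))


indices = sample_indices(n_total=1000, n_samples=5)
print(indices)

# With the notebook's objects this could be used as (not run here):
# train_subset = train_dataset.select(sample_indices(len(train_dataset), N_SAMPLES))
```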

### Step 1.5: Data Preparation on test data for Offline Evaluation post fine tuning

Let's format the test dataset with the following fields:

Required Fields:

* query: String containing the question or instruction that needs an answer
* response: String containing the expected model output

Optional Fields:

* system: String containing the system prompt that sets the behavior, role, or personality of the AI model before it processes the query

Example Entry
```

{
   "system":"You are an English major with top marks in class who likes to give minimal word responses: ",
   "query":"What is the symbol that ends the sentence as a question",
   "response":"?"
}
{
   "system":"You are a pattern analysis specialist that provides succinct answers: ",
   "query":"What is the next number in this series? 1, 2, 4, 8, 16, ?",
   "response":"32"
}
{
   "system":"You have great attention to detail that follows instructions accurately: ",
   "query":"Repeat only the last two words of the following: I ate a hamburger today and it was kind of dry",
   "response":"of dry"
}
```
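
The field rules above can be checked line by line before a file is submitted for evaluation. The following is a minimal sketch; the helper name `eval_line_problems` is hypothetical, and it only checks field presence and type.

```python
import json
from typing import List


def eval_line_problems(line: str) -> List[str]:
    """Check one JSONL line against the query/response/system field rules."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["line must be a JSON object"]

    problems = []
    for field in ("query", "response"):  # required fields
        if not isinstance(record.get(field), str):
            problems.append(f"missing or non-string required field: {field}")
    if "system" in record and not isinstance(record["system"], str):  # optional field
        problems.append("system must be a string when present")
    return problems


ok_line = '{"query": "What is 2 + 2?", "response": "4"}'
bad_line = '{"query": "What is 2 + 2?"}'
print(eval_line_problems(ok_line))
print(eval_line_problems(bad_line))
```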

In [None]:
from random import randint
from datasets import Dataset

# test_subset already holds flat "system", "query", and "response" string columns
# (added by convert_to_nova_test_format), so keep only those three columns.
test_subset_formatted = Dataset.from_dict(
    {
        "system": list(test_subset["system"]),
        "query": list(test_subset["query"]),
        "response": list(test_subset["response"]),
    }
)

# randint is inclusive on both ends, so subtract 1 to stay in range
print(test_subset_formatted[randint(0, len(test_subset_formatted) - 1)])

### Step 1.6: Upload all 3 curated datasets (train, test, val) to Amazon S3

With the datasets transformed into the required formats, the next cells save them locally and then upload them to Amazon S3 for use in SageMaker training:


In [None]:
import boto3
import shutil

In [None]:
s3_client = boto3.client('s3')

# build the S3 key prefix for the datasets, honoring the session's default prefix if set
if default_prefix:
    input_path = f"{default_prefix}/datasets/nova-sft-peft"
else:
    input_path = "datasets/nova-sft-peft"

train_dataset_s3_path = f"s3://{bucket_name}/{input_path}/train/dataset.jsonl"
val_dataset_s3_path = f"s3://{bucket_name}/{input_path}/val/dataset.jsonl"
test_dataset_s3_path = f"s3://{bucket_name}/{input_path}/test/gen_qa.jsonl"

In [None]:
import os
import shutil

TRAIN_FULL_PATH = "./data/train/subset_train.jsonl"
VAL_FULL_PATH = "./data/val/subset_validation.jsonl"
TEST_FULL_PATH = "./data/test/subset_test.json" 


# 1. Save datasets to local files
print("1. Creating local directories and saving datasets...")
os.makedirs("./data/train", exist_ok=True)
os.makedirs("./data/val", exist_ok=True)
os.makedirs("./data/test", exist_ok=True) 

# Save the splits from the final_dataset_dict subsets
print(f"Saving train split (first {len(train_subset)} samples) to {TRAIN_FULL_PATH}")
# Use the SUBSET dataset object to save
train_subset.to_json(TRAIN_FULL_PATH, orient="records", lines=True)

# Note: The FinQA validation split is stored under the key "validation"
print(f"Saving validation split (first {len(val_subset)} samples) to {VAL_FULL_PATH}")
# Use the SUBSET dataset object to save
val_subset.to_json(VAL_FULL_PATH, orient="records", lines=True)

print(f"Saving test split (first {len(test_subset)} samples) to {TEST_FULL_PATH}")
# Dataset.to_json writes JSON Lines by default, so no extra arguments are needed here
test_subset.to_json(TEST_FULL_PATH)


# 2. Upload local files to S3
print("\n2. Uploading datasets to S3 using s3_client...")

# Use the FULL local paths for upload_file
s3_client.upload_file(
    TRAIN_FULL_PATH, bucket_name, f"{input_path}/train/dataset.jsonl"
)

s3_client.upload_file(
    VAL_FULL_PATH, bucket_name, f"{input_path}/val/dataset.jsonl"
)

s3_client.upload_file(
    TEST_FULL_PATH, bucket_name, f"{input_path}/test/gen_qa.jsonl"
)


# 3. Cleanup local files
print("\n3. Cleaning up local data directory...")
try:
    shutil.rmtree("./data")
except OSError as e:
    # Handle case where directory might not exist or permissions fail
    print(f"Error cleaning up local directory: {e}")


# 4. Print confirmation
print("\nDatasets uploaded successfully:")
print(f"Training data uploaded to: {train_dataset_s3_path}")
print(f"Validation data uploaded to: {val_dataset_s3_path}")
print(f"Test data uploaded to: {test_dataset_s3_path}")
%store test_dataset_s3_path
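
Because the upload cell deletes the local `./data` directory at the end, any local validation has to run before the cleanup step. The sketch below shows one way to confirm that a file is well-formed JSON Lines and to count its records. The helper name `count_jsonl_records` is hypothetical, and the demo uses a throwaway file rather than the notebook's datasets.

```python
import json
import os
import tempfile


def count_jsonl_records(path: str) -> int:
    """Count records in a JSON Lines file, raising ValueError on a malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON ({exc.msg})") from exc
            count += 1
    return count


# Demo on a throwaway file; in the notebook you would pass TRAIN_FULL_PATH and so on
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write('{"query": "q1", "response": "r1"}\n{"query": "q2", "response": "r2"}\n')
    demo_count = count_jsonl_records(demo_path)
    print(demo_count)
```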


***

## Step 2: Model fine-tuning

We now define the PyTorch estimator to run supervised fine-tuning of our Amazon Nova model on the FinQA dataset.

This section sets up and runs the fine-tuning job using SageMaker. It uses Supervised Fine-Tuning (SFT) with Parameter-Efficient Fine-Tuning (PEFT) to efficiently train the model.


#### Instance Type and Count

P5 instances are optimized for deep learning workloads. Each `ml.p5.48xlarge` instance provides 8 NVIDIA H100 GPUs, so the four instances configured below give the job 32 GPUs.


In [None]:
instance_type = "ml.p5.48xlarge"
instance_count = 4

instance_type

#### Image URI

This specifies the pre-built container for SFT fine-tuning, which is different from the DPO container.

The image URIs are available in the documentation [here](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-fine-tuning-training-job.html#nova-model-training-jobs-notebook).

In [None]:
image_uri = f"708977205387.dkr.ecr.{sess.boto_region_name}.amazonaws.com/nova-fine-tune-repo:SM-TJ-SFT-V2-latest"

image_uri

#### Configuring the Model and Recipe

This specifies which model to fine-tune and the recipe to use. The recipe includes "lora" indicating parameter-efficient fine-tuning, and "sft" indicating supervised fine-tuning.


In [None]:
model_id = "nova-lite-2/prod"
recipe="fine-tuning/nova/nova_2_0/nova_lite/SFT/nova_lite_2_0_p5_gpu_lora_sft"


In [None]:
from sagemaker.pytorch import PyTorch

# define Training Job Name
job_name = f"train-{model_id.split('/')[0].replace('.', '-')}-peft-sft"

# define OutputDataConfig path
if default_prefix:
    output_path = f"s3://{bucket_name}/{default_prefix}/{job_name}"
else:
    output_path = f"s3://{bucket_name}/{job_name}"

recipe_overrides = {
    "run": {
        "replicas": instance_count,  # Required
    },
}

estimator = PyTorch(
    output_path=output_path,
    base_job_name=job_name,
    role=role,
    disable_profiler=True,
    debugger_hook_config=False,
    instance_count=instance_count,
    instance_type=instance_type,
    training_recipe=recipe,
    recipe_overrides=recipe_overrides,
    max_run=432000,
    sagemaker_session=sess,
    image_uri=image_uri
)
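
The job name above is derived from `model_id`, and the SageMaker SDK appends a timestamp to `base_job_name` when the job is created. The sketch below reproduces the derivation and checks the result against a name pattern. The pattern and its 63-character limit reflect the training job name constraint as commonly documented for `CreateTrainingJob`; treat both as assumptions and confirm them in the API reference.

```python
import re

# Assumed training job name constraint: alphanumerics and hyphens,
# starting and ending with an alphanumeric, at most 63 characters.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def build_job_name(model_id: str) -> str:
    """Mirror the notebook's naming: keep the part before '/', swap dots for dashes."""
    model_part = model_id.split("/")[0].replace(".", "-")
    return f"train-{model_part}-peft-sft"


job_name = build_job_name("nova-lite-2/prod")
print(job_name)
print(bool(JOB_NAME_PATTERN.match(job_name)))
```

Replacing dots matters for model identifiers that contain a version such as `1.0`, since a dot would not match the assumed pattern.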

#### Configuring the Data Channels

Define the training and validation input channels, each pointing to a JSONL dataset uploaded to S3 in Step 1.6.

In [None]:
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data=train_dataset_s3_path,
    distribution="FullyReplicated",
    s3_data_type="Converse",
)

val_input = TrainingInput(
    s3_data=val_dataset_s3_path,
    distribution="FullyReplicated",
    s3_data_type="Converse",
)

### Starting the Training Job
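This starts the training job with the configured estimator and datasets. The validation dataset is passed as the `validation` channel, and `wait=False` returns control to the notebook immediately instead of blocking until the job finishes.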


In [None]:
# starting the train job with our uploaded datasets as input
estimator.fit(inputs={"train": train_input, "validation": val_input}, wait=False)

In [None]:
training_job_name = estimator.latest_training_job.name
print('Training Job Name:  {}'.format(training_job_name))
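
Because `fit` is called with `wait=False`, the notebook does not block while the job runs. One way to wait programmatically is to poll the job status until it reaches a terminal state. The sketch below keeps the polling loop separate from AWS so that it runs anywhere: the helper name `wait_for_training_job` and its parameters are hypothetical, and the demo uses a stand-in status source.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(
    get_status: Callable[[], str],
    poll_seconds: float = 60.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns a terminal status or max_polls is reached."""
    status = get_status()
    for _ in range(max_polls):
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
        status = get_status()
    return status


# Stand-in status source so the sketch runs without AWS access
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_training_job(lambda: next(fake_statuses), sleep=lambda _: None)
print(final_status)

# Against a real job, the status source could be (not run here):
# sm = boto3.client("sagemaker")
# get_status = lambda: sm.describe_training_job(
#     TrainingJobName=training_job_name
# )["TrainingJobStatus"]
```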

In [None]:
from IPython.display import HTML, display

# Use the session's region and the configured output path instead of hard-coded values
region = sess.boto_region_name
# output_path is "s3://<bucket>/<optional prefix>/<job_name>"; drop the scheme for the console URL
output_console_path = output_path.replace("s3://", "", 1)

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/{}">Training Job</a> After About 5 Minutes</b>'.format(region, training_job_name)))

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/TrainingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, training_job_name)))

display(HTML('<b>Review <a target="blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/output/?region={}&tab=overview">S3 Output Data</a> After The Training Job Has Completed</b>'.format(output_console_path, training_job_name, region)))


---
## _Wait Until the ^^ Training Job ^^ Completes Above! (35-40 mins)_

**While you wait, go to 02-evaluate-fine-tuned-models.ipynb. In the notebook, we will plot the loss curve for a previously fine-tuned model, and view results from previously run evaluation jobs of the base model and fine-tuned models to see how fine-tuning has improved model performance.**

You're welcome to come back to this notebook once the training job is complete and run the section below to understand how to get the model outputs and submit an evaluation job.

---

### Reading the Output Content of the Fine-Tuned Model

After the job completes, the trained model weights are available in an escrow S3 bucket. This secure bucket is controlled by Amazon and uses special access controls. The paths to the weights are shared in manifest files that are saved to your own S3 bucket as part of the training process, and you point to this S3 location when you host the fine-tuned model on Amazon Bedrock. In this section, we download the artifacts (training and validation metrics) and read the escrow S3 bucket location from `manifest.json`.

In [None]:
model_s3_uri = estimator.model_data

%store model_s3_uri

output_s3_uri = "/".join(model_s3_uri.split("/")[:-1])+"/output.tar.gz"
%store output_s3_uri
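
The `output_s3_uri` line above swaps the last path segment of the model URI for `output.tar.gz`. The sketch below shows the same string manipulation on a made-up URI. The bucket and job names are hypothetical, and the path shape is an assumption about where the estimator writes its artifacts.

```python
def sibling_output_uri(model_s3_uri: str, filename: str = "output.tar.gz") -> str:
    """Replace the last path segment of an S3 URI with `filename`."""
    parent = "/".join(model_s3_uri.split("/")[:-1])
    return f"{parent}/{filename}"


# Hypothetical URI, shaped like <output_path>/<training job name>/output/model.tar.gz
example_uri = "s3://my-bucket/train-nova-lite-2-peft-sft/job-123/output/model.tar.gz"
print(sibling_output_uri(example_uri))
```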


### Downloading and Extracting the Artifacts

In [None]:
!mkdir -p ./tmp/train_output/

In [None]:
!aws s3 cp $output_s3_uri ./tmp/train_output/output.tar.gz

In [None]:
!tar -xvzf ./tmp/train_output/output.tar.gz -C ./tmp/train_output/

In [None]:
# Read the escrow checkpoint location from the extracted manifest
with open('./tmp/train_output/manifest.json') as f:
    escrow_model_uri = json.load(f)['checkpoint_s3_bucket']

In [None]:
escrow_model_uri

Store the escrow model URI for deployment

In [None]:
%store escrow_model_uri
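
If the manifest is missing the expected key, the lookup above fails with a bare `KeyError`. A slightly more defensive reader is sketched below. The helper name `read_escrow_uri` is hypothetical, the check that the value starts with `s3://` is an assumption about the manifest contents, and the demo manifest is a placeholder rather than real job output.

```python
import json
import os
import tempfile


def read_escrow_uri(manifest_path: str, key: str = "checkpoint_s3_bucket") -> str:
    """Return the escrow checkpoint location stored under `key` in manifest.json."""
    with open(manifest_path, encoding="utf-8") as f:
        manifest = json.load(f)
    value = manifest.get(key)
    # Assumption: the location is recorded as an s3:// URI
    if not isinstance(value, str) or not value.startswith("s3://"):
        raise ValueError(f"{manifest_path} has no usable '{key}' entry: {value!r}")
    return value


# Demo with a placeholder manifest written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_manifest = os.path.join(tmp_dir, "manifest.json")
    with open(demo_manifest, "w", encoding="utf-8") as f:
        json.dump({"checkpoint_s3_bucket": "s3://example-escrow-bucket/example-job/"}, f)
    demo_uri = read_escrow_uri(demo_manifest)
    print(demo_uri)
```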

### Plotting the Training Loss Curve

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Read the CSV files
train_df = pd.read_csv('./tmp/train_output/step_wise_training_metrics.csv')
#val_df = pd.read_csv('./tmp/train_output/validation_metrics.csv')

# Create the plot
plt.figure(figsize=(10, 6))
plt.plot(train_df['step_number'], train_df['training_loss'], label='Training Loss', color='blue')
#plt.plot(val_df['step_number'], val_df['validation_loss'], label='Validation Loss', color='red')

plt.xlabel('Step Number')
plt.ylabel('Loss')
plt.title('Training Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
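
Step-wise training loss is often noisy, so a smoothed curve can make the trend easier to read. The sketch below computes a trailing moving average in plain Python. The helper name `moving_average`, the window size, and the sample losses are made up for illustration.

```python
from typing import List


def moving_average(values: List[float], window: int) -> List[float]:
    """Trailing moving average; early points average over the values seen so far."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


losses = [4.0, 2.0, 3.0, 1.0]  # made-up numbers
smoothed_losses = moving_average(losses, window=2)
print(smoothed_losses)

# With the notebook's DataFrame this could be plotted as (not run here):
# plt.plot(train_df['step_number'],
#          moving_average(train_df['training_loss'].tolist(), window=10),
#          label='Smoothed Training Loss')
```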

### In this notebook, we covered how to fine-tune Amazon Nova 2.0 with the LoRA SFT recipe on a financial dataset. Move on to the next notebook to learn how to evaluate this model.