# Amazon Nova 2 Lite Fine Tuned Model Evaluation using Amazon SageMaker Training Jobs

The evaluation process assesses trained-model performance against standard benchmarks or a custom dataset. It typically involves creating an evaluation recipe that points to the trained model, specifying evaluation datasets and metrics, and submitting a separate job that runs the evaluation. The job writes performance metrics to your Amazon S3 bucket.
 

## 1. Getting Started
Amazon Nova provides four different types of evaluation recipes. A recipe is a YAML file that configures a SageMaker job to perform the specified tasks. Recipes are the only mechanism to configure the respective SageMaker job.

All recipes are available in the Amazon SageMaker HyperPod recipes GitHub repository. Though located in the HyperPod repository, recipes can be used with SageMaker Training Jobs (SMTJ) as well as SageMaker HyperPod. 

For Amazon Nova, SageMaker recipes include:
- General text benchmark recipes
- General multi-modal benchmark recipes
- Bring your own dataset benchmark recipes
- Nova LLM as a Judge benchmark recipes


Recipes are organized by model, technique, and GPU. A full list of recipes can be found in the Nova recipe reference material:
[Amazon Nova recipes](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-recipes.html)

The GitHub repository of all recipes:

[SageMaker HyperPod recipes GitHub](https://github.com/aws/sagemaker-hyperpod-recipes?tab=readme-ov-file#fine-tuning)

For Nova-specific evaluation recipes, see the folder `recipes_collection -> recipes -> evaluation/nova`.

## 2. Dependencies and Prerequisites

### Dependencies
Several Python packages must be installed to run this notebook. Please review the packages in `requirements.txt`.

`botocore`, `boto3`, and `sagemaker` are required for the training jobs; the other packages are used to help visualize results.

In [None]:
! pip install -r ./requirements.txt --upgrade

### Prerequisite: Data Prep Notebook
The Data Prep notebook walks through preparing and transforming a public dataset into a format and schema acceptable for SMTJ. It creates the training and validation datasets used for training a model, as well as a test dataset for evaluation.

This notebook uses the `test` dataset, as described below.

### Prerequisite: SFT Notebook
The SFT notebook walks through training a custom model. That training produces output model artifacts. We will use those output artifacts to evaluate the model using SMTJ.

**--------------- STOP ---------------** <br><br>To complete this notebook, both the Data Prep notebook and SFT notebook must be completed first. Those notebooks create the necessary datasets, as well as model artifacts.  Specific items that are carried over from those notebooks, to this notebook, are called out below.

In [None]:
# This value is obtained as result of executing the data prep notebook
test_dataset_s3_path = ""
%store -r test_dataset_s3_path 

print(test_dataset_s3_path)

### Credentials, Sessions, Roles, and more!

This section sets up the necessary AWS credentials and SageMaker session to run the notebook. You'll need proper IAM permissions to use SageMaker.


If you are going to use SageMaker in a local environment, you will need access to an IAM role with the required permissions for SageMaker. Learn more in the [AWS Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).

For more details on the other Nova prerequisites, see the [AWS Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-general-prerequisites.html).

The code initializes a SageMaker session, sets up the IAM role, and configures the S3 bucket for storing training data artifacts.

In [None]:
import sagemaker
import boto3

sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name='us-east-1'))

sagemaker_session_bucket = None

if sagemaker_session_bucket is None and sagemaker_session is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sagemaker_session.default_bucket()


try:
    role = sagemaker.get_execution_role()

except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]


bucket_name = sagemaker_session.default_bucket()
default_prefix = sagemaker_session.default_bucket_prefix


print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sagemaker_session.default_bucket()}")
print(f"sagemaker session region: {sagemaker_session.boto_region_name}")

Capture S3 bucket prefix for later use.  After a SageMaker Training Job completes, this value will be used to identify where the SMTJ outputs are stored.

In [None]:
if default_prefix:
    output_path_prefix = f"s3://{bucket_name}/{default_prefix}"

else:
    output_path_prefix= f"s3://{bucket_name}"

## 3. Data Prep - Review
In the data prep notebook, we created our training, validation, and test datasets. We will use the `test` dataset for evaluation. If the data prep notebook has not been completed, do so now so that these datasets and S3 locations are available to complete this notebook.

The test data is stored in the file `gen_qa.jsonl`; the eval SMTJ requires this file name.

Remember, prepare high-quality prompt-response pairs for training. Data should be:
- Consistent in format
- Representative of desired behavior
- Deduplicated and cleaned
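
The checks above can be sketched in a few lines of Python. This is a minimal illustration, not part of the data prep notebook: the function name `check_jsonl_records` is hypothetical, and treating `system`, `query`, and `response` (the fields of the test records used in this notebook) as all required is an assumption.

```python
import json

# Fields of a test record in this notebook; treating all three as required is an assumption.
REQUIRED_KEYS = {"system", "query", "response"}


def check_jsonl_records(lines):
    """Return (valid_records, problems) for an iterable of JSONL lines."""
    seen = set()
    valid, problems = [], []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_no, "record is not a JSON object"))
            continue
        missing = REQUIRED_KEYS - set(record)
        if missing:
            problems.append((line_no, f"missing keys: {sorted(missing)}"))
            continue
        # A canonical serialization makes exact duplicates easy to detect.
        fingerprint = json.dumps(record, sort_keys=True)
        if fingerprint in seen:
            problems.append((line_no, "duplicate record"))
            continue
        seen.add(fingerprint)
        valid.append(record)
    return valid, problems
```

To check a local copy of the test file, pass an open file handle, for example `check_jsonl_records(open("gen_qa.jsonl", encoding="utf-8"))`, and review the returned list of problems before uploading.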


For reference, here is the schema that represents a single record in the test data.  

```
{
    "system": "",
    "query": "",
    "response": ""
}
```


## 4. Evaluation
Before starting, you need a SageMaker AI-trained Amazon Nova model whose performance you want to evaluate.

To evaluate a model, the SageMaker Training Job (SMTJ) uses a PyTorch estimator to run the evaluation. The estimator defines properties of the training job, such as the job name, instance type, instance count, output location for job results, recipe, and more.

More details of the estimator can be found here: [PyTorch Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html)

So we must define:
- the evaluation recipe (passed to the estimator as `training_recipe`)
- properties for the estimator

### Recipes
A Nova recipe is a YAML configuration file that tells SageMaker AI how to run your model evaluation job. It provides the model name or checkpoint and any additional options required to evaluate the model successfully.

These recipe files serve as the blueprint for your evaluation jobs, allowing you to specify the model, the evaluation task, and other settings that determine how the evaluation runs.

Evals offer specific tasks that can be accomplished:
- General text benchmark recipes
- General multi-modal benchmark recipes
- Bring your own dataset benchmark recipes (BYOD)
- Nova LLM as a Judge benchmark recipes (BYOM)

When defining a recipe, there are key components to modify for your use case. For details, see [Evaluation specific configurations](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-evaluation.html#nova-model-evaluation-config).

The recipes can be found at:
- [Amazon Nova recipes](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-recipes.html), or 
- [GitHub - sagemaker-hyperpod-recipes](https://github.com/aws/sagemaker-hyperpod-recipes). Navigate to `recipes_collection -> recipes -> evaluation -> nova -> nova_2_0/nova_lite`  to find all Nova 2 Lite evaluation recipes.

Now, let's look at some text eval recipe examples: General Text Benchmarks, Bring Your Own Dataset, and Bring Your Own Metric.

### Recipe Organization in Notebook 
This notebook illustrates multiple recipes (for example, BYOD or BYOM).

Please review the link above, "GitHub - sagemaker-hyperpod-recipes", to learn more about the properties of a recipe.

Recipes are not executed sequentially. To run a specific recipe, set the `technique` flag in the Python code further below in the notebook. When the notebook is executed, the selected recipe is configured for the SMTJ.

### Example: General Text Benchmarks
These recipes enable you to evaluate the fundamental capabilities of Amazon Nova models across a comprehensive suite of text-only benchmarks.

In [None]:
gtb_recipe = "evaluation/nova/nova_2_0/nova_lite/nova_lite_2_0_p5_48xl_gpu_general_text_benchmark_eval"
gtb_recipe_job_name = "eval-gtb-nova-lite-2-recipe-job" # example
gtb_sm_job_name = "eval-gtb-nova-lite-2"

# this can also be a checkpoint URI of a trained model
gtb_peft_model_name_or_path = "nova-lite-2/prod" # example


gtb_recipe_overrides = {
    "run": {
        "name": gtb_recipe_job_name,
        "model_name_or_path": gtb_peft_model_name_or_path,
        "data_s3_path": "",  # For SMTJ, this is ""
        "output_s3_path": f"{output_path_prefix}/{gtb_sm_job_name}"
    },
    "processor": {
        "lambda_arn": ""
    }
}

### Example: Bring Your Own Dataset Recipe (BYOD)
This recipe enables you to bring your own dataset for benchmarking and compare model outputs to reference answers using different types of metrics (BYOM, explained further below).

Note, as our goal is to eval against a trained checkpoint, the model_name_or_path identifies that checkpoint.

When a model is trained using SMTJ, the output from that job includes a manifest.json.  In that manifest.json is a property, "checkpoint_s3_bucket".  This property holds the value that will be used for the model_name_or_path.
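
As a minimal sketch, the checkpoint location could be read from a local copy of `manifest.json` as shown here. The helper name `read_checkpoint_location` is hypothetical, and the sketch assumes `checkpoint_s3_bucket` is a top-level property of the file.

```python
import json
from pathlib import Path


def read_checkpoint_location(manifest_path):
    """Return the `checkpoint_s3_bucket` value from a training job's manifest.json.

    Assumes the property sits at the top level of the JSON document.
    """
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    checkpoint = manifest.get("checkpoint_s3_bucket")
    if not checkpoint:
        raise KeyError(f"'checkpoint_s3_bucket' not found in {manifest_path}")
    return checkpoint
```

The returned value can then be assigned to the variable that feeds `model_name_or_path` in the recipe overrides.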

In [None]:
byod_recipe = "evaluation/nova/nova_2_0/nova_lite/nova_lite_2_0_p5_48xl_gpu_bring_your_own_dataset_eval"
byod_recipe_job_name = "eval-byod-nova-lite-2-recipe-job" # example
byod_sm_job_name = "eval-byod-nova-lite-2"

# this can also be a checkpoint URI of a trained model
byod_peft_model_name_or_path = "nova-lite-2/prod" # example


byod_recipe_overrides = {
    "run": {
        "name": byod_recipe_job_name,
        "model_name_or_path": byod_peft_model_name_or_path,
        "data_s3_path": "",  # For SMTJ, this is ""
        "output_s3_path": f"{output_path_prefix}/{byod_sm_job_name}"
    },
    "processor": {
        "lambda_arn": ""
    }
}

### Bring Your Own Metric Recipe
This recipe builds on the bring-your-own-dataset recipe: you bring your own dataset for benchmarking and compare model outputs to reference answers using your own metrics.

When a model is trained using SMTJ, the output from that job includes a `manifest.json`. That file contains a property, `checkpoint_s3_bucket`, which holds the value to use for `model_name_or_path`. `checkpoint_s3_bucket` is the source of truth for the location of the trained model checkpoint.

Note: because our goal is to evaluate a trained checkpoint, `model_name_or_path` holds the checkpoint location.

#### Lambda: Preprocessing, Postprocessing
Bring your own metrics is unique in that a Lambda function is executed on each data item, for preprocessing or postprocessing of the eval data.

Preprocessing is used to augment, transform, or apply processing on a data item before the model is prompted with that data item.

Postprocessing is used to augment, transform, calculate a metric, or apply processing after a data item is returned from the model.
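
To make the idea of a per-item metric concrete, here is a minimal sketch of the kind of calculation a postprocessing step might perform. It deliberately does not reproduce the Lambda event or response contract; the function names and the choice of an exact-match metric are illustrative assumptions.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def exact_match(model_output, reference):
    """Return 1.0 when the normalized strings are identical, else 0.0."""
    return 1.0 if normalize(model_output) == normalize(reference) else 0.0


def average(scores):
    """Aggregate per-item scores into one number (0.0 for an empty list)."""
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical model outputs paired with reference answers.
pairs = [("  Paris ", "paris"), ("Berlin", "Munich")]
scores = [exact_match(output, reference) for output, reference in pairs]
print(average(scores))
```

In a real custom metric, the per-item score would be computed inside the Lambda function and returned in whatever response shape the evaluation container expects.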

#### Lambda: Prerequisites
Additional steps are required before the Lambda function can run, beyond granting the appropriate permissions. For details, see [Evaluating your SageMaker AI-trained model](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-evaluation.html).

Essentially, two layers must be added to your Lambda function:
- nova-custom-eval-layer.zip - open-source Nova custom evaluation SDK to validate input and output payloads for your custom function
- AWSLambdaPowertoolsPythonV3-python312-arm64 - required for pydantic dependency

The lambda function can run on x86 or ARM architecture.

An example of the lambda is provided in the example folder (custom-eval-metric.py).

In [None]:
# BYOM reuses the bring-your-own-dataset recipe; the processor overrides below enable custom metrics
byom_recipe = "evaluation/nova/nova_2_0/nova_lite/nova_lite_2_0_p5_48xl_gpu_bring_your_own_dataset_eval"
byom_recipe_job_name = "eval-byom-nova-lite-2-recipe-job" # example
byom_sm_job_name = "eval-byom-nova-lite-2"

# this can also be a checkpoint URI of a trained model
byom_peft_model_name_or_path = "nova-lite-2/prod" # example

lambda_arn = "arn:aws:lambda:us-east-1:111111111111:function:custom-metric-eval"

# See Code Comments immediately following this
byom_recipe_overrides = {
    "run": {
        "name": byom_recipe_job_name,
        "model_name_or_path": byom_peft_model_name_or_path,
        "data_s3_path": "",  # For SMTJ, this is ""
        "output_s3_path": f"{output_path_prefix}/{byom_sm_job_name}"
    },
    "processor": {
        "lambda_arn": lambda_arn,
        "lambda_type": "custom_metrics"
    }
}

#### Code Comments
Take a look at the processor section. As of the writing of this notebook, the base, unmodified GitHub recipe sets this configuration:

```
processor:
  preset_reward_function: "" # Do not modify
  lambda_arn: "" # Provide valid lambda arn to enable custom metrics processing
  lambda_type: "rft" # or "custom_metrics"
  preprocessing:
    enabled: false
  postprocessing:
    enabled: true
  aggregation: average
```

So when this eval job executes, it will not call preprocessing (`enabled: false`), but it will call postprocessing (`enabled: true`). The overrides above change only `lambda_arn` and `lambda_type`; every other value keeps its base setting. This "experiment" is done by design to illustrate overriding values, or keeping values as is, to see outcomes and to be mindful of what needs to be overridden.
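
The effect of partial overrides can be illustrated with a small recursive merge. This is a sketch of the concept only, assuming overrides are applied key by key on top of the base recipe; it is not the SageMaker SDK's implementation, and `apply_overrides` is a hypothetical helper.

```python
import copy


def apply_overrides(base, overrides):
    """Return a new dict with `overrides` merged into `base`, key by key."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend into nested sections so untouched keys keep their base values.
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


# Base processor section as shown in the recipe above.
base_recipe = {
    "processor": {
        "preset_reward_function": "",
        "lambda_arn": "",
        "lambda_type": "rft",
        "preprocessing": {"enabled": False},
        "postprocessing": {"enabled": True},
        "aggregation": "average",
    }
}

# Only the two values set in the BYOM overrides.
overrides = {
    "processor": {
        "lambda_arn": "arn:aws:lambda:us-east-1:111111111111:function:custom-metric-eval",
        "lambda_type": "custom_metrics",
    }
}

effective = apply_overrides(base_recipe, overrides)
print(effective["processor"])
```

Under this assumption, `preprocessing`, `postprocessing`, and `aggregation` retain their base values, while `lambda_arn` and `lambda_type` take the overridden ones.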

### Select the evaluation technique to use
This is a helper for choosing which recipe to run. Change the value of `technique` to one of the keywords indicated in the comment.

This makes it easy to rerun the notebook: change the keyword, run the cells, then change the keyword to another technique and run the cells again.

In [None]:
# technique choices: "GTB", "BYOD", or "BYOM"
technique = "BYOM"

sm_training_job_name = ""
recipe_job_name = ""
training_recipe = ""
recipe_overrides = {}
# Default so later cells never hit an undefined variable
use_training_input_channel = False

if technique == "GTB":
    # General Text Benchmark
    print("General Text Benchmark")
    training_recipe = gtb_recipe
    recipe_overrides = gtb_recipe_overrides
    sm_training_job_name = gtb_sm_job_name
    recipe_job_name = gtb_recipe_job_name
    use_training_input_channel = False
elif technique == "BYOD":
    # Bring your own dataset
    print("Bring your own dataset")
    training_recipe = byod_recipe
    recipe_overrides = byod_recipe_overrides
    sm_training_job_name = byod_sm_job_name
    recipe_job_name = byod_recipe_job_name
    use_training_input_channel = True
elif technique == "BYOM":
    print("Bring your own metric")
    # Bring your own metric
    training_recipe = byom_recipe
    recipe_overrides = byom_recipe_overrides
    sm_training_job_name = byom_sm_job_name
    recipe_job_name = byom_recipe_job_name
    use_training_input_channel = True
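else:
    # Fail fast instead of submitting a job with an empty recipe
    raise ValueError(f"Unknown technique '{technique}': expected 'GTB', 'BYOD', or 'BYOM'")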

#### Instance Type and Count

P5 instances are optimized for deep learning workloads, providing high-performance GPUs.

In [None]:
instance_type = "ml.p5.48xlarge" 
instance_count = 1

### Select Container Image URI
This specifies the pre-built container for eval of Nova 2 Lite.

In [None]:
image_uri = "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-TJ-Eval-Beta-latest"

print(f"image_uri: \n{image_uri}")

### Output path
This path is where the evaluation job writes its results. After the job completes, this location contains an `output.tar.gz` file with the evaluation results.

In [None]:
import json

print(f"recipe_overrides: \n{json.dumps(recipe_overrides, indent=4)}\n")

output_s3_path = recipe_overrides["run"]["output_s3_path"]

print(f"output_path:\n{output_s3_path}")

In [None]:
from sagemaker.pytorch import PyTorch

eval_estimator = PyTorch(
    output_path=output_s3_path,
    base_job_name=sm_training_job_name,
    role=role,
    disable_profiler=True,
    debugger_hook_config=False,
    instance_count=instance_count,
    instance_type=instance_type,
    training_recipe=training_recipe,
    recipe_overrides=recipe_overrides,
    sagemaker_session=sagemaker_session,
    image_uri=image_uri,
)

#### Configuring the Data Channel

In [None]:
from sagemaker.inputs import TrainingInput

eval_input = TrainingInput(
    s3_data=test_dataset_s3_path,
    distribution="FullyReplicated",
    s3_data_type="S3Prefix",
)

#### Start the Training Job
This starts the evaluation as a SageMaker Training Job with the configured estimator and, for BYOD and BYOM, the test dataset as input.

In [None]:
# start the evaluation job; the test dataset is passed as input only when the recipe needs it

if use_training_input_channel:
    print("using data input channel")
    eval_estimator.fit(inputs={"train": eval_input}, wait=False)
else:
    print("NOT using data input channel")
    eval_estimator.fit(wait=False)

In [None]:
training_job_name = eval_estimator.latest_training_job.name

print(f'Training Job Name:  \n{training_job_name}')

In [None]:
from IPython.display import HTML, display

region = sagemaker_session.boto_region_name
# The job writes its output under <output_s3_path>/<training_job_name>/
s3_output_key = output_s3_path.replace(f"s3://{bucket_name}/", "", 1) + f"/{training_job_name}"

display(HTML(f'<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={region}#/jobs/{training_job_name}">Training Job</a> After About 5 Minutes</b>'))
display(HTML(f'<b>Review <a target="_blank" href="https://console.aws.amazon.com/cloudwatch/home?region={region}#logStream:group=/aws/sagemaker/TrainingJobs;prefix={training_job_name};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'))
display(HTML(f'<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/{bucket_name}/{s3_output_key}/?region={region}&tab=overview">S3 Output Data</a> After The Training Job Has Completed</b>'))


**Note:** The evaluation job takes roughly 20-30 minutes to complete.
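
Because the job is started with `wait=False`, the cells in the next section only work once the job has finished. A minimal polling sketch follows; the helper name `wait_for_training_job` is hypothetical, and it accepts any callable that behaves like the boto3 SageMaker client's `describe_training_job` method.

```python
import time

# Statuses after which a SageMaker training job no longer changes state.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe_job, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll until the job reaches a terminal status and return that status.

    `describe_job` is any callable that accepts TrainingJobName=... and returns
    a dict with a "TrainingJobStatus" key.
    """
    while True:
        status = describe_job(TrainingJobName=job_name)["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
```

In this notebook it could be called as `wait_for_training_job(sagemaker_session.sagemaker_client.describe_training_job, training_job_name)`; continue to the next section only if the returned status is `Completed`.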

## 5. Evaluation Artifacts 
Download the artifacts produced by the evaluation job.

In [None]:
estimator_output = eval_estimator.model_data

print(f"estimator_output: \n{estimator_output}\n")

output = '/'.join(estimator_output.split("/")[:-1]) +"/output.tar.gz"

Define consistent local folder names.

In [None]:
folder = "tmp"
output_folder = "eval_output"

In [None]:
print(f"output: \n{output}\n")
print(f"folder: \n{folder}\n")
print(f"output_folder: \n{output_folder}\n")

In [None]:
!mkdir -p ./$folder/
!mkdir -p ./$folder/$output_folder/

In [None]:
!aws s3 cp $output ./$folder/$output_folder/$training_job_name/output.tar.gz
!tar -xvzf ./$folder/$output_folder/$training_job_name/output.tar.gz -C ./$folder/$output_folder/$training_job_name

### Plotting the metrics
Define a function to plot the metrics

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import os


def plot_metrics(results):
    # Extract metrics and their standard errors
    metrics = {}
    for key, value in results.items():
        if not key.endswith("_stderr"):
            metrics[key] = {"value": value, "stderr": results.get(f"{key}_stderr", 0)}

    # Sort metrics by value for better visualization
    sorted_metrics = dict(
        sorted(metrics.items(), key=lambda x: x[1]["value"], reverse=True)
    )

    # Prepare data for plotting
    labels = list(sorted_metrics.keys())
    values = [sorted_metrics[label]["value"] for label in labels]
    errors = [sorted_metrics[label]["stderr"] for label in labels]

    # Normalize BLEU score to be on the same scale as other metrics (0-1)
    bleu_index = labels.index("bleu") if "bleu" in labels else -1
    if bleu_index >= 0:
        values[bleu_index] /= 100
        errors[bleu_index] /= 100

    # Create figure
    fig, ax = plt.subplots(figsize=(12, 8))

    # Create bar chart
    x = np.arange(len(labels))
    bars = ax.bar(
        x,
        values,
        yerr=errors,
        align="center",
        alpha=0.7,
        capsize=5,
        color="skyblue",
        ecolor="black",
    )

    # Add labels and title
    ax.set_ylabel("Score")
    ax.set_title("Evaluation Metrics")
    ax.set_xticks(x)
    ax.set_xticklabels(labels, rotation=45, ha="right")
    ax.set_ylim(0, 1.0)

    # Add value labels on top of bars
    for i, bar in enumerate(bars):
        height = bar.get_height()
        # Convert BLEU back to its original scale for display
        display_value = values[i] * 100 if labels[i] == "bleu" else values[i]
        ax.text(
            bar.get_x() + bar.get_width() / 2.0,
            height + 0.01,
            f"{display_value:.2f}",
            ha="center",
            va="bottom",
        )

    # Add a note about BLEU
    if bleu_index >= 0:
        ax.text(
            0.5,
            -0.15,
            "Note: BLEU score shown as percentage (original: {:.2f})".format(
                values[bleu_index] * 100
            ),
            transform=ax.transAxes,
            ha="center",
            fontsize=9,
        )

    plt.tight_layout()
    return fig

When an eval job completes, an `output.tar.gz` file is created. Above, that file is downloaded and extracted.

This results in the following folder structure:

training-job-name<br>
|--> recipe-job-name<br>
|--> --> eval_results<br>
|--> --> tensorboard_results<br>

Take a moment to review the folders and their content. The `eval_results` folder contains a JSON file (`results_xxx.json`) and a JSONL file (`inference_output.jsonl`).

The file results_xxx.json will contain metric results.

The file inference_output.jsonl will contain all the model inference results of the input eval dataset.
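
A quick way to get oriented in the inference output is to load it and list the fields it contains. The sketch below makes no assumption about the field names in `inference_output.jsonl`; the helper names `load_jsonl` and `summarize_records` are hypothetical.

```python
import json


def load_jsonl(path):
    """Read a JSONL file into a list of parsed records, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize_records(records):
    """Return the record count and the sorted top-level keys seen across records."""
    keys = set()
    for record in records:
        if isinstance(record, dict):
            keys.update(record)
    return len(records), sorted(keys)
```

Once `results_path` is defined in the following cell, calling `summarize_records(load_jsonl(f"{results_path}/inference_output.jsonl"))` shows how many items were evaluated and which fields are available for inspection.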

In [None]:
results_path = f"{folder}/{output_folder}/{training_job_name}/{recipe_job_name}/eval_results"

print(f"results_path: \n{results_path}")

Helper function to find all JSON files in the output.

In [None]:
import glob
import os

def find_json_files(path):
    return glob.glob(os.path.join(path, "*.json"))

In [None]:
json_files = find_json_files(results_path)
if not json_files:
    raise FileNotFoundError(f"No JSON result files found in {results_path}")
evaluation_results_path = json_files[0]

Plot the metrics and write the output plot to local storage.

In [None]:
import json
import os

with open(evaluation_results_path, "r") as f:
    data = json.load(f)

fig = plot_metrics(data["results"]["all"])
training_job_path = f"./{folder}/{output_folder}/{training_job_name}"
os.makedirs(training_job_path, exist_ok=True)

output_file = os.path.join(training_job_path, 'evaluation_metrics.png')

fig.savefig(output_file, bbox_inches='tight')

print(training_job_path)

# 6. Exploring More Eval Use Cases

## Eval BYOM Lambda (Deeper) Dive
It is useful to understand the event dictionary that is passed into the preprocessing and postprocessing methods of the Lambda function, as well as the response that the Lambda function returns.

Using the custom-metric-eval.py Lambda function, the following files are examples of these three data points:

- preprocess.json
- postprocess.json
- result.json

Note: For preprocessing, the Lambda function can change the values contained in `system`, `prompt`, or `gold`. No properties can be added or removed.

The result object of the Lambda function is dynamic and can return a variety of information and metrics.
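
This constraint can be checked locally while developing a preprocessing step. The sketch below treats a data item as a plain dictionary, which is a simplifying assumption (the actual event structure is shown in preprocess.json), and `check_preprocessed_item` is a hypothetical helper.

```python
def check_preprocessed_item(original, transformed):
    """Raise ValueError if preprocessing added or removed properties.

    Values may change freely; only the set of keys must stay the same.
    """
    added = set(transformed) - set(original)
    removed = set(original) - set(transformed)
    if added or removed:
        raise ValueError(f"added={sorted(added)}, removed={sorted(removed)}")
    return transformed


# Illustrative item using the three properties named above.
item = {"system": "", "prompt": "What is 2 + 2?", "gold": "4"}
ok = check_preprocessed_item(item, {**item, "prompt": item["prompt"].strip()})
print(ok)
```

Running a check like this in unit tests for the Lambda code catches accidental schema changes before an evaluation job is submitted.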

## Eval SMTJ for Batch Inference
An eval SMTJ can also be used to run batch inference over a set of prompts. In that case, `model_name_or_path` would reference the base model rather than the trained checkpoint, so inference runs across the test dataset with the base model.

This is helpful in a use case such as obtaining a model's reasoning for a given `query` and `response`, and using that reasoning to populate the data for an SFT training job (the SFT training schema includes a reasonContent block for the purpose of improving training by using reasoning inputs).