# üöÄüî• Custom Nova Model Evaluation using Forge RFT in SageMaker Training Job - Lambda Single Turn üî•üöÄ

# RFT Evaluation with Custom Reward Functions

## Introduction

This notebook demonstrates how to evaluate Amazon Nova models using **Reinforcement Fine-Tuning (RFT) evaluation** with custom reward functions. In this example, an **AWS Lambda function** hosts a custom reward function that scores a model on its ability to solve **algebraic equations**.

Note: Lambda is one of two infrastructure options that Nova RFT evaluation supports. The other workflow is "Bring Your Own Orchestrator" (BYOO), which uses the open source [Verifiers](https://github.com/tang-ti/verifiers/tree/main) library to support custom environments and is better suited to more complex use cases.

**When to use Lambda-based RFT evaluation:**
- Single-turn tasks with custom scoring logic
- Reward computation completes within 15 minutes
- You want AWS to handle the orchestration infrastructure

**When to use BYOO (Bring Your Own Orchestrator) RFT evaluation:**
- Multi-turn agent scenarios (e.g., coding agents that iteratively debug across multiple interactions)
- Complex reward calculations that exceed the 15-minute Lambda timeout
- Custom orchestration logic for simulating realistic environments
- Tasks requiring stateful interactions between model and environment
- Full control over the rollout generation process and conversation flow

**Use Case Example: Math Problem Solving**

We'll evaluate a Nova model on solving algebraic equations. The Lambda function will:
- Parse the model's JSON response to extract the answer
- Compare it against the ground truth
- Return a binary reward (1.0 for correct, 0.0 for incorrect)
- Track additional metrics like format compliance

**Recipe used in this example**
The following recipe starts the example job. The recipe YAML file is included with this notebook.
```
run:
  name: nova-lite-math-eval
  model_type: amazon.nova-2-lite-v1:0:256k
  model_name_or_path: nova-lite-2/prod
  replicas: 1
  data_s3_path: ""  # Leave empty for SageMaker Training job
  output_s3_path: ""  # Leave empty for SageMaker Training job

evaluation:
  task: rft_eval # Must be rft_eval for RFT evaluation. Do not change for this example.
  strategy: rft_eval # Must be rft_eval for RFT evaluation. Do not change for this example.
  metric: all

inference:
  max_new_tokens: 100
  top_k: -1
  top_p: 1.0
  temperature: 0
  top_logprobs: 0
  reasoning_effort: null

rl_env:
  reward_lambda_arn: arn:aws:lambda:us-east-1:123456789123:function:SageMaker-RFT-Math-Evaluator
```
Some important configuration parameters to note:
- **name**: The name of the evaluation run. This will be used when generating the output directory name.
- **model_name_or_path**: For this example, we will be using the Nova 2 Lite model.
- **reasoning_effort**: This can be set to either null, low, or high. This value controls how many tokens the Nova 2 Lite model uses for reasoning.
- **top_logprobs**: The number of tokens for which logprobs are included in the output. These values are written to the output parquet file and can be useful for analyzing model behavior.
- **max_new_tokens**: The maximum number of tokens the model generates before stopping. This example uses a relatively low value because the model only needs to output the answer to a simple math question.
- **reward_lambda_arn**: The ARN of the AWS Lambda function you will create as part of this example.
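
Because the job locates the reward function through `reward_lambda_arn`, a quick check of the ARN can catch typos before a job is launched. The sketch below is illustrative only: `parse_lambda_arn` is a hypothetical helper, not part of any SDK, and the `SageMaker-` prefix check reflects the naming requirement described in Step 1a.

```python
import re

# Hypothetical helper (not part of any SDK): split a Lambda function ARN into parts.
_LAMBDA_ARN = re.compile(
    r"^arn:(?P<partition>aws[\w-]*):lambda:(?P<region>[a-z0-9-]+):"
    r"(?P<account>\d{12}):function:(?P<name>[\w-]+)$"
)

def parse_lambda_arn(arn: str) -> dict:
    """Validate the ARN shape and the 'SageMaker-' function name prefix."""
    match = _LAMBDA_ARN.match(arn)
    if match is None:
        raise ValueError(f"Not a Lambda function ARN: {arn!r}")
    parts = match.groupdict()
    if not parts["name"].startswith("SageMaker-"):
        raise ValueError(f"Function name must start with 'SageMaker-': {parts['name']!r}")
    return parts

parts = parse_lambda_arn(
    "arn:aws:lambda:us-east-1:123456789123:function:SageMaker-RFT-Math-Evaluator"
)
print(parts["region"], parts["name"])
```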


## Setup and Dependencies

These dependencies will be used during the execution and analysis of the evaluation run. 

**IMPORTANT: Ensure that this specific version (2.254.1) of the SageMaker Python SDK is used. Nova Forge does not currently support the latest SageMaker v3 SDK!**

In [1]:
!pip install sagemaker==2.254.1

Collecting sagemaker==2.254.1
  Using cached sagemaker-2.254.1-py3-none-any.whl.metadata (17 kB)
Collecting attrs<26,>=24 (from sagemaker==2.254.1)
  Using cached attrs-25.4.0-py3-none-any.whl.metadata (10 kB)
Collecting boto3<2.0,>=1.39.5 (from sagemaker==2.254.1)
  Downloading boto3-1.42.21-py3-none-any.whl.metadata (6.8 kB)
Collecting botocore<1.43.0,>=1.42.21 (from boto3<2.0,>=1.39.5->sagemaker==2.254.1)
  Downloading botocore-1.42.21-py3-none-any.whl.metadata (5.9 kB)
Collecting s3transfer<0.17.0,>=0.16.0 (from boto3<2.0,>=1.39.5->sagemaker==2.254.1)
  Using cached s3transfer-0.16.0-py3-none-any.whl.metadata (1.7 kB)
Using cached sagemaker-2.254.1-py3-none-any.whl (1.7 MB)
Using cached attrs-25.4.0-py3-none-any.whl (67 kB)
Downloading boto3-1.42.21-py3-none-any.whl (140 kB)
Downloading botocore-1.42.21-py3-none-any.whl (14.6 MB)

In [5]:
import os
import sagemaker, boto3
import json
import tarfile
import pandas as pd
import glob
import ast
from sagemaker.inputs import TrainingInput
from sagemaker.pytorch import PyTorch

# Setup SageMaker session
sagemaker_session = sagemaker.session.Session()
role = sagemaker.get_execution_role()

print(f"Sagemaker version: {sagemaker.__version__}")
print(f"Execution Role: {role}")
print("🚀 All dependencies successfully installed")

Sagemaker version: 2.254.1
Execution Role: arn:aws:iam::618100645563:role/service-role/AmazonSageMaker-ExecutionRole-20251113T000017
🚀 All dependencies successfully installed


## Step 1: Implementing Lambda Function

The following steps cover creating the custom Lambda reward function, granting the required permissions, and the required formats for the function's inputs and outputs.

### Step 1a: Creating AWS Lambda function
1. Go to AWS Lambda Console
2. Click "Create function"
3. Choose "Author from scratch"
4. Configure:
   - **Function name**: `SageMaker-RFT-Math-Evaluator` (the function name must be prefixed with "SageMaker-"; this is a requirement)
   - **Runtime**: Python 3.12
   - **Architecture**: x86_64
5. Click "Create function"
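
The same function could also be created programmatically. The following is a minimal sketch using boto3's `create_function`, not a required part of this workflow. It assumes you already have an IAM execution role for the Lambda function itself (`lambda_role_arn` is a placeholder) and that the handler lives in `lambda_function.py`, which is the console default. The `Timeout` value is an arbitrary choice for illustration.

```python
import io
import zipfile

def build_lambda_zip(source_code: str, filename: str = "lambda_function.py") -> bytes:
    """Package a single source file into an in-memory deployment zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(filename, source_code)
    return buffer.getvalue()

def create_reward_function(source_code: str, lambda_role_arn: str, region: str = "us-east-1"):
    """Create the function with the same settings as the console steps above."""
    import boto3  # imported lazily so the packaging helper works without AWS access

    client = boto3.client("lambda", region_name=region)
    return client.create_function(
        FunctionName="SageMaker-RFT-Math-Evaluator",
        Runtime="python3.12",
        Architectures=["x86_64"],
        Role=lambda_role_arn,  # assumption: an existing Lambda execution role
        Handler="lambda_function.lambda_handler",
        Code={"ZipFile": build_lambda_zip(source_code)},
        Timeout=60,  # assumption: seconds; Lambda allows at most 900 (15 minutes)
    )
```

`build_lambda_zip` can be exercised locally. `create_reward_function` needs AWS credentials with permission to create Lambda functions.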

### Step 1b: Add custom Lambda code


1. In the Lambda console, scroll to "Code source"
2. Replace the default code with the Lambda function code shown below
3. Click "Deploy"

Note: The code below is implemented following the structure recommended in the official [Nova Forge RFT reward function documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-implementing-reward-functions.html#nova-reward-fields). 



In [None]:
import json
import re

def lambda_handler(event, context):
    """AWS Lambda handler for RFT Math Evaluation"""
    # Normalize to a list first so the count is correct for a single-sample (dict) event
    samples = event if isinstance(event, list) else [event]
    print(f"Received {len(samples)} samples for evaluation")
    results = []
    
    for sample in samples:
        sample_id = sample.get("id", "unknown")
        print(f"\nProcessing sample: {sample_id}")
        
        # Extract model response (last assistant message)
        model_response = next(
            (msg["content"] for msg in reversed(sample["messages"]) 
             if msg["role"] == "assistant"), 
            ""
        )
        
        print(f"Raw model response: {model_response[:150]}...")
        
        # Calculate score
        score = lambda_grader(model_response, sample["reference_answer"])
        
        print(f"Score for {sample_id}: {score}")
        
        # Build result
        result = {
            "id": sample_id,
            "aggregate_reward_score": score,
            "metrics_list": [
                {"name": "correctness", "value": score, "type": "Reward"}
            ]
        }
        results.append(result)
    
    print(f"\nReturning {len(results)} results")
    print(f"Full response: {json.dumps(results, indent=2)}")
    return results

def lambda_grader(model_response, reference_answer):
    """Calculates correctness score for math response"""
    try:
        # Remove any special tokens (pattern: <|...|>)
        cleaned = re.sub(r'<\|[^|]+\|>', '', model_response)
        
        # Remove markdown code blocks
        cleaned = re.sub(r'```\w*\s*|\s*```', '', cleaned).strip()
        
        print(f"Cleaned response: {cleaned}")
        
        # Extract JSON object using regex (handles multi-line)
        json_match = re.search(r'\{[^}]*\}', cleaned, re.DOTALL)
        if json_match:
            json_str = json_match.group(0)
            print(f"Extracted JSON: {json_str}")
            
            parsed = json.loads(json_str)
            answer = parsed.get("x")
            expected = reference_answer["x"]
            
            print(f"Parsed answer: {answer}, Expected: {expected}")
            
            return 1.0 if answer == expected else 0.0
        else:
            print("No JSON object found in response")
            return 0.0
            
    except Exception as e:
        print(f"Error parsing response: {e}")
        return 0.0
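
The introduction mentions tracking additional metrics such as format compliance, which the handler above does not yet report. One possible way to add it is sketched below. `format_compliance` and `build_result` are hypothetical helpers, and the `"Metric"` type for a non-reward metric is an assumption that should be checked against the reward function documentation linked above.

```python
import json
import re

def format_compliance(model_response: str) -> float:
    """Return 1.0 only if the response is a bare JSON object with a numeric "x"."""
    # Strip special tokens such as <|begin_of_solution|>, as lambda_grader does.
    cleaned = re.sub(r"<\|[^|]+\|>", "", model_response).strip()
    try:
        parsed = json.loads(cleaned)  # fails on markdown fences or extra text
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict) or set(parsed) != {"x"}:
        return 0.0
    value = parsed["x"]
    # bool is a subclass of int, so exclude it explicitly
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    return 1.0 if is_number else 0.0

def build_result(sample_id: str, correctness: float, model_response: str) -> dict:
    """Build one result entry with an extra, non-reward metric."""
    return {
        "id": sample_id,
        "aggregate_reward_score": correctness,
        "metrics_list": [
            {"name": "correctness", "value": correctness, "type": "Reward"},
            # Assumption: "Metric" marks a value that is reported but not used as the reward.
            {"name": "format_compliance", "value": format_compliance(model_response), "type": "Metric"},
        ],
    }
```

With this design the reward stays binary on correctness, while the extra metric shows how often the model ignores the "no markdown" instruction even when its answer is right.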


### Step 1c: Example Lambda inputs & outputs

#### Example Lambda input
Here is an example input to the Lambda function. Inputs will always follow this same JSON structure:
```
[
  {
    "id": "math_001",
    "messages": [
      {
        "role": "system",
        "content": "You are a math solver. Follow instructions exactly."
      },
      {
        "role": "user",
        "content": "Solve for x: 2x + 5 = 13. Return JSON: {\"x\": <number>}"
      },
      {
        "role": "assistant",
        "content": "{\"x\": 4}"
      }
    ],
    "reference_answer": {"x": 4}
  }
]

```

#### Example Lambda output
Here is an example output from the Lambda function. Outputs must be in this JSON format for the custom Lambda reward function to correctly interact with the evaluation job.
```
[
  {
    "id": "math_001",
    "aggregate_reward_score": 1.0,
    "metrics_list": [
      {
        "name": "correctness",
        "value": 1.0,
        "type": "Reward"
      }
    ]
  }
]

```
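
Because the evaluation job depends on this output shape, it can help to check a handler's return value locally before deploying. The sketch below is a hypothetical validator (`validate_lambda_output` is not part of any SDK) that checks only the fields shown in the example above.

```python
def validate_lambda_output(results, expected_ids):
    """Return a list of problems found in a reward function's output (empty if none)."""
    if not isinstance(results, list):
        return ["output must be a JSON array"]
    problems = []
    seen = set()
    for index, item in enumerate(results):
        if not isinstance(item, dict) or "id" not in item:
            problems.append(f"item {index}: missing 'id'")
            continue
        seen.add(item["id"])
        if not isinstance(item.get("aggregate_reward_score"), (int, float)):
            problems.append(f"{item['id']}: 'aggregate_reward_score' must be a number")
        for metric in item.get("metrics_list", []):
            missing = {"name", "value", "type"} - set(metric)
            if missing:
                problems.append(f"{item['id']}: metric is missing {sorted(missing)}")
    for sample_id in sorted(set(expected_ids) - seen):
        problems.append(f"{sample_id}: no result returned")
    return problems

example_output = [
    {
        "id": "math_001",
        "aggregate_reward_score": 1.0,
        "metrics_list": [{"name": "correctness", "value": 1.0, "type": "Reward"}],
    }
]
print(validate_lambda_output(example_output, expected_ids=["math_001"]))
```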

### Step 1d: Update SageMaker execution role with correct permissions

For the evaluation job to invoke the custom reward function, the SageMaker Training Job execution role must be granted permission to invoke the Lambda function. The role ARN is printed by the setup cell above.

To grant permission:
1. Go to IAM Console
2. Click Roles → Search for your SageMaker Training Job execution role name
3. Click the role name
4. Click Add permissions → Create inline policy
5. Click JSON tab and paste:
   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Action": "lambda:InvokeFunction",
         "Resource": "arn:aws:lambda:<YOUR_AWS_REGION>:<YOUR_AWS_ACC_ID>:function:SageMaker-RFT-Math-Evaluator"
       }
     ]
   }
   ```
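
The same inline policy could be attached with boto3 instead of the console. This is a sketch under assumptions: the helper names and the policy name `AllowRftRewardLambdaInvoke` are made up for illustration, and the caller needs IAM permissions to modify the role.

```python
import json

def build_invoke_policy(region: str, account_id: str, function_name: str) -> dict:
    """Build the policy document shown above for a specific function."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "lambda:InvokeFunction",
                "Resource": f"arn:aws:lambda:{region}:{account_id}:function:{function_name}",
            }
        ],
    }

def role_name_from_arn(role_arn: str) -> str:
    """The role name is the last path segment of the role ARN."""
    return role_arn.rsplit("/", 1)[-1]

def attach_invoke_policy(role_arn: str, policy: dict, policy_name: str = "AllowRftRewardLambdaInvoke"):
    """Attach the policy inline to the execution role (requires AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without AWS access

    boto3.client("iam").put_role_policy(
        RoleName=role_name_from_arn(role_arn),
        PolicyName=policy_name,  # hypothetical name, choose your own
        PolicyDocument=json.dumps(policy),
    )
```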

In [10]:
# Run this to find the execution role ARN used by your SageMaker session.
# The role name is the final segment of the ARN, after "/service-role/".
print(f"Execution Role: {role}")


Execution Role: arn:aws:iam::618100645563:role/service-role/AmazonSageMaker-ExecutionRole-20251113T000017


## Step 2: Upload dataset
The following steps upload a custom dataset in the format that RFT evaluation supports.

#### Example provided dataset
For this example we will use this provided dataset:

```
{"id": "math_001", "messages": [{"role": "system", "content": "You are a math solver. Return ONLY valid JSON. Do not use markdown formatting or code blocks."}, {"role": "user", "content": "Solve for x: 2x + 5 = 13. Return only JSON format: {\"x\": <number>}"}], "reference_answer": {"x": 4}}
{"id": "math_002", "messages": [{"role": "system", "content": "You are a math solver. Return ONLY valid JSON. Do not use markdown formatting or code blocks."}, {"role": "user", "content": "Solve for x: 3x - 7 = 8. Return only JSON format: {\"x\": <number>}"}], "reference_answer": {"x": 5}}
{"id": "math_003", "messages": [{"role": "system", "content": "You are a math solver. Return ONLY valid JSON. Do not use markdown formatting or code blocks."}, {"role": "user", "content": "Solve for x: x/2 + 3 = 7. Return only JSON format: {\"x\": <number>}"}], "reference_answer": {"x": 8}}
{"id": "math_004", "messages": [{"role": "system", "content": "You are a math solver. Return ONLY valid JSON. Do not use markdown formatting or code blocks."}, {"role": "user", "content": "Solve for x: 5x + 10 = 35. Return only JSON format: {\"x\": <number>}"}], "reference_answer": {"x": 5}}
{"id": "math_005", "messages": [{"role": "system", "content": "You are a math solver. Return ONLY valid JSON. Do not use markdown formatting or code blocks."}, {"role": "user", "content": "Solve for x: 4x - 12 = 0. Return only JSON format: {\"x\": <number>}"}], "reference_answer": {"x": 3}}
```

This dataset is in the required format for RFT datasets.

The required schema for RFT datasets is:
```
{
  "messages": [
    {
      "role": "<string>",
      "content": [
        {
          "type": "<string>",
          "text": "<string>"
        }
      ]
    }
  ],
  "reference_answer": {
    "<key>": "<value>"
  }
}
```
**Important current limitations**
- Text only: No multimodal inputs (images, audio, video) are supported
- Single-turn conversations: Only supports single user message (no multi-turn dialogues)
- JSON format: Input data must be in JSONL format (one JSON object per line)
- Model outputs: Evaluation is performed on generated completions from the specified model

For more information on the required dataset format, see https://docs.aws.amazon.com/sagemaker/latest/dg/nova-rft-evaluation.html.
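
Some of the limitations above can be checked locally before uploading. The sketch below is a hypothetical validator, not an official one: it only confirms that each line is valid JSON, contains exactly one user message, and carries a `reference_answer`.

```python
import json

def validate_rft_dataset(lines):
    """Yield (line_number, problem) for each record that breaks the checks above."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            yield number, f"invalid JSON: {error.msg}"
            continue
        if not isinstance(record, dict):
            yield number, "record must be a JSON object"
            continue
        messages = record.get("messages")
        if not isinstance(messages, list) or not messages:
            yield number, "missing or empty 'messages'"
            continue
        user_turns = sum(
            1 for message in messages
            if isinstance(message, dict) and message.get("role") == "user"
        )
        if user_turns != 1:
            yield number, f"expected exactly 1 user message, found {user_turns}"
        if "reference_answer" not in record:
            yield number, "missing 'reference_answer'"

sample = '{"id": "math_001", "messages": [{"role": "user", "content": "Solve for x: 2x + 5 = 13."}], "reference_answer": {"x": 4}}'
print(list(validate_rft_dataset([sample])))
```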


### Step 2a: Upload the dataset to S3
1. Save the provided dataset as a JSONL file and upload it to the S3 location of your choice.
2. Take note of the location of the dataset in S3; this will be used by the evaluation job to locate the dataset during execution.
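
These two steps could also be scripted. The sketch below writes records to a JSONL file and uploads it with boto3's `upload_file`. The helper names are illustrative, and the bucket and key are placeholders you would supply.

```python
import json

def write_jsonl(records, path):
    """Write one JSON object per line, as the JSONL format requires."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")

def upload_dataset(path, bucket, key):
    """Upload the file and return the S3 URI to pass to the evaluation job."""
    import boto3  # imported lazily so write_jsonl works without AWS access

    boto3.client("s3").upload_file(path, bucket, key)
    return f"s3://{bucket}/{key}"
```

The URI returned by `upload_dataset` is the value to use for `input_s3_uri` when configuring the job.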

## Step 3: Create your recipe yaml file

1. The recipe YAML is provided with this example notebook under the filename "rft_Eval_Example.yaml".
2. Modify the Lambda ARN in the recipe file to match the function created in Step 1.
3. Modify the YAML file to match the specifications you'd like (naming, model type, etc.), and make sure `recipe_path` in the next step points to your recipe file.
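
If you prefer to script the ARN change rather than edit the file by hand, a plain text substitution is enough for a recipe shaped like the one in the introduction. This is only a sketch: `set_reward_lambda_arn` is a hypothetical helper, and it assumes the recipe contains exactly one `reward_lambda_arn:` line.

```python
import re

def set_reward_lambda_arn(recipe_text: str, new_arn: str) -> str:
    """Replace the value of the single reward_lambda_arn entry in a recipe."""
    pattern = re.compile(r"^([ \t]*reward_lambda_arn:[ \t]*).*$", re.MULTILINE)
    updated, count = pattern.subn(lambda match: match.group(1) + new_arn, recipe_text)
    if count != 1:
        raise ValueError(f"expected exactly one reward_lambda_arn entry, found {count}")
    return updated

recipe = "rl_env:\n  reward_lambda_arn: arn:aws:lambda:us-east-1:123456789123:function:SageMaker-Old\n"
print(set_reward_lambda_arn(
    recipe,
    "arn:aws:lambda:us-east-1:123456789123:function:SageMaker-RFT-Math-Evaluator",
))
```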

## Step 4: Execute and run SageMaker Training Job using the created resources

In [15]:
# Configuration
input_s3_uri = "S3_URI_FOR_INPUT_DATASET"
output_s3_uri = "S3_PATH_FOR_OUTPUT_LOCATION"
instance_type = "YOUR_INSTANCE_TYPE"
job_name = "nova-lite-math-eval-workbook-example"
recipe_path = "./rft_eval_recipe.yaml"
image_uri = "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-TJ-Eval-V2-latest"

# Create training input
evalInput = TrainingInput(
    s3_data=input_s3_uri,
    distribution='FullyReplicated',
    s3_data_type='S3Prefix'
)

# Create estimator
estimator = PyTorch(
    output_path=output_s3_uri,
    base_job_name=job_name,
    role=role,
    instance_type=instance_type,
    training_recipe=recipe_path,
    sagemaker_session=sagemaker_session,
    image_uri=image_uri
)

# Run evaluation
estimator.fit(inputs={"train": evalInput})

print(f"✅ Evaluation job completed! Job name: {estimator.latest_training_job.name}")


INFO:sagemaker:Remote debugging, profiler and debugger hooks are disabled for Nova recipes.
INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.
INFO:sagemaker:Creating training-job with name: jmoul-nova-lite-math-eval-2025-12-24-00-17-48-051


2025-12-24 00:17:49 Starting - Starting the training job
2025-12-24 00:17:49 Pending - Training job waiting for capacity......
2025-12-24 00:18:42 Pending - Preparing the instances for training...................................................
2025-12-24 00:27:26 Downloading - Downloading the training image............
2025-12-24 00:30:09,567 - INFO - Successfully registered Nova model as AutoModel
[2025-12-24 00:30:10,878] [unknown_run_name] Starting Model Evaluation
[2025-12-24 00:30:10,879] [unknown_run_name] Provided recipe config for evaluation: {'run': {'name': 'nova-lite-math-eval', 'model_type': 'amazon.nova-2-lite-v1:0:256k', 'model_name_or_path': 'nova-lite-2/prod', 'replicas': 1, 'data_s3_path': '', 'output_s3_path': ''}, 'evaluation': {'task': 'rft_eval', 'strategy': 'rft_eval', 'metric': 'all'}, 'inference': {'max_new_tokens': 100, 'top_k': -1, 'top_p': 1.0, 'temperature': 0, 'top_logprobs': 0, 'reasoning_effort': None}, 'rl_env': {'reward_lambda_ar

## Step 5: Results analysis
This step shows how to examine the outputs of the RFT evaluation job, including the model's inference outputs.

#### Code example
Automatically downloads the output of the SageMaker Training Job and prints the values in a human-readable format.

**Make sure to set `s3_output_path` to the output location of your job's results archive!**
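
If you are unsure what the archive URI looks like, the sample run shown at the end of this step stored its results at `<output path>/<training job name>/output/output.tar.gz`. The sketch below builds and splits such a URI. Both helpers are hypothetical, and the layout is an assumption taken from that sample run, so confirm the actual location in the SageMaker console or in S3.

```python
def expected_output_uri(output_s3_uri: str, training_job_name: str) -> str:
    """Build the archive URI, assuming the layout seen in the sample run."""
    return f"{output_s3_uri.rstrip('/')}/{training_job_name}/output/output.tar.gz"

def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/key' into (bucket, key), rejecting malformed input."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len("s3://"):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI needs both a bucket and a key: {uri!r}")
    return bucket, key

uri = expected_output_uri(
    "s3://nova-eval-forge-smoke-test/output",
    "jmoul-nova-lite-math-eval-2025-12-24-00-17-48-051",
)
print(split_s3_uri(uri))
```

Unlike the `replace`/`split` approach, `split_s3_uri` raises a clear error when the placeholder value has not been replaced with a real URI.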

In [17]:
import boto3
import tarfile
import json
import glob
import ast
import pandas as pd
import shutil
import os

# Clean old results
if os.path.exists('results/'):
    shutil.rmtree('results/')
if os.path.exists('results.tar.gz'):
    os.remove('results.tar.gz')

# S3 path to output file
s3_output_path = "S3_OUTPUT_URI_FOR_ZIP_FILE"

# Parse S3 path
s3_parts = s3_output_path.replace("s3://", "").split("/", 1)
bucket = s3_parts[0]
key = s3_parts[1]

print(f"Downloading from: {s3_output_path}")

# Download output tar.gz
s3 = boto3.client('s3')
s3.download_file(bucket, key, 'results.tar.gz')

# Extract results
with tarfile.open('results.tar.gz', 'r:gz') as tar:
    tar.extractall('results/')

print("=" * 50)
print("EVALUATION RESULTS")
print("=" * 50)

# Find all result files
json_files = glob.glob('results/**/results_*.json', recursive=True)
parquet_files = glob.glob('results/**/details_*.parquet', recursive=True)

# Load results JSON (aggregated metrics)
if json_files:
    with open(json_files[0], 'r') as f:
        results = json.load(f)
        print("\n📊 Aggregated Evaluation Metrics:")
        
        for task_name, metrics in results['results'].items():
            print(f"\nTask: {task_name}")
            for metric_name, value in metrics.items():
                print(f"  {metric_name}: {value:.3f}")

# Load and display parquet details
if parquet_files:
    df = pd.read_parquet(parquet_files[0])
    
    print("\n" + "=" * 50)
    print("SAMPLE RESULTS FROM OUTPUT PARQUET")
    print("=" * 50)
    
    sample_print_count = 5
    for i in range(min(sample_print_count, len(df))):
        row = df.iloc[i]
        
        # Parse predictions (stored as string representation of list)
        predictions = ast.literal_eval(row['predictions']) if isinstance(row['predictions'], str) else row['predictions']
        prediction = predictions[0] if predictions else "No prediction"
        
        # Parse string representations to dicts
        metrics = ast.literal_eval(row['metrics']) if isinstance(row['metrics'], str) else row['metrics']
        specifics = ast.literal_eval(row['specifics']) if isinstance(row['specifics'], str) else row['specifics']
        
        lambda_metrics = metrics['rft_eval_lambda_metric']
        
        print(f"\n📝 Sample {i + 1} - {specifics['sample_id']}:")
        print(f"  Question: {specifics['original_line']['messages'][1]['content']}")
        print(f"  Model Response: {repr(prediction)}")
        print(f"  Expected Answer: {specifics['original_line']['reference_answer']}")
        print(f"  ✅ Reward Score: {lambda_metrics['lambda_reward_score']:.2f}")
        print(f"  ✅ Correctness: {lambda_metrics['lambda_correctness']:.2f}")


Downloading from: s3://nova-eval-forge-smoke-test/output/jmoul-nova-lite-math-eval-2025-12-24-00-17-48-051/output/output.tar.gz
EVALUATION RESULTS

📊 Aggregated Evaluation Metrics:

Task: custom|rft_eval_rft_eval|0
  lambda_correctness: 1.000
  lambda_reward_score: 1.000

SAMPLE RESULTS FROM OUTPUT PARQUET

📝 Sample 1 - math_004:
  Question: Solve for x: 5x + 10 = 35. Return only JSON format: {"x": <number>}
  Model Response: '<|begin_of_solution|>```json\n{"x": 5}\n```<|end_of_solution|>'
  Expected Answer: {'x': 5}
  ✅ Reward Score: 1.00
  ✅ Correctness: 1.00

📝 Sample 2 - math_002:
  Question: Solve for x: 3x - 7 = 8. Return only JSON format: {"x": <number>}
  Model Response: '<|begin_of_solution|>```json\n{"x": 5}\n```<|end_of_solution|>'
  Expected Answer: {'x': 5}
  ✅ Reward Score: 1.00
  ✅ Correctness: 1.00

📝 Sample 3 - math_003:
  Question: Solve for x: x/2 + 3 = 7. Return only JSON format: {"x": <number>}
  Model Response: '<|begin_of_solution|>```json\n{"



## Summary