# Amazon Nova LLM-as-a-Judge Evaluation with Amazon SageMaker

This notebook demonstrates how to use Amazon Nova LLM-as-a-Judge methodology to evaluate and compare the outputs of two different large language models using Amazon SageMaker Training Jobs. We'll compare responses from a Qwen2.5 1.5B model deployed on SageMaker against Claude 3.7 Sonnet via Amazon Bedrock.

## Overview

The Amazon Nova LLM-as-a-Judge approach uses a powerful language model to evaluate the quality of responses from other models by comparing them side-by-side. This method provides:

**Objective Comparison**: Systematic evaluation of model outputs

**Scalable Assessment**: Automated evaluation of large datasets

**Detailed Metrics**: Win rates, confidence intervals, and preference distributions

**Cost-Effective**: More efficient than human evaluation for large-scale comparisons

## Prerequisites

* AWS Account with SageMaker and Bedrock access
* Appropriate IAM roles and permissions
* SageMaker Studio or Jupyter environment

## Understanding Amazon Nova LLM-as-a-Judge Evaluation Metrics

When using the Amazon Nova LLM-as-a-Judge framework to compare the outputs of two language models, a set of quantitative metrics is generated. These metrics help you objectively assess which model performs better and how reliable the evaluation is.

---

### Core Preference Metrics

- **a_scores**  
  The number of times Model A's response was preferred by the judge model over Model B.

- **b_scores**  
  The number of times Model B's response was preferred by the judge model over Model A.

- **ties**  
  The number of times the judge found both responses equally good or could not determine a preference.

- **inference_error**  
  The number of evaluation cases where the judge could not provide a valid judgment due to technical issues, malformed outputs, or other errors.

---

### Statistical Confidence Metrics

- **winrate**  
  The proportion of valid judgments in which Model B was preferred.

- **lower_rate**  
  The lower bound of the 95% confidence interval for the winrate. This tells you the minimum likely winrate for Model B, accounting for statistical uncertainty.

- **upper_rate**  
  The upper bound of the 95% confidence interval for the winrate. This tells you the maximum likely winrate for Model B.

- **score**  
  An aggregate performance score. In the example output below it equals the number of Model B wins (`b_scores`), though other evaluation setups may define it differently.
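
To make the relationship between the counts and the interval concrete, here is a minimal sketch that derives a win rate and a 95% interval from raw counts. The method the Nova evaluation container uses to compute `lower_rate` and `upper_rate` is not stated in this notebook, so the Wilson score interval below is an assumption, as is counting ties among the valid judgments. The function name `winrate_with_wilson_ci` is hypothetical, and its bounds may differ slightly from the values the job reports.

```python
import math


def winrate_with_wilson_ci(a_scores, b_scores, ties=0.0, z=1.96):
    """Return (winrate, lower, upper) for Model B.

    Assumption: a Wilson score interval with ties counted as valid
    judgments. The evaluation job may use a different method.
    """
    n = a_scores + b_scores + ties  # number of valid judgments
    if n <= 0:
        raise ValueError("need at least one valid judgment")
    p = b_scores / n  # observed win rate for Model B
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)


# Illustrative counts: 16 wins for Model A, 10 wins for Model B, no ties
winrate, lower, upper = winrate_with_wilson_ci(a_scores=16, b_scores=10)
print(f"winrate={winrate:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

With only 26 judgments the interval spans tens of percentage points, which is why the bounds matter as much as the point estimate.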

---

### Standard Error Metrics

- **a_scores_stderr, b_scores_stderr, ties_stderr, inference_error_stderr, score_stderr**  
  These metrics reflect the standard error (uncertainty) of each corresponding count or score. Smaller values indicate more reliable results, while larger values suggest more variability or a need for a larger sample size.

---

### How to Interpret These Metrics

- **Winrate and Confidence Intervals:**  
  - If the winrate is significantly above 0.5 and the confidence interval does not include 0.5, Model B is statistically favored.
  - If the winrate is below 0.5 and the confidence interval does not include 0.5, Model A is statistically favored.
  - If the interval includes 0.5, results are inconclusive.

- **Error Analysis:**  
  - High inference_error or large standard errors indicate possible issues with the evaluation process or insufficient data.

- **Preference Distribution:**  
  - The balance of a_scores, b_scores, and ties provides a direct picture of model performance differences on your evaluation set.
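
The decision rule above can be sketched as a small helper. The function name `interpret_winrate` and the verdict strings are hypothetical; only the comparison against 0.5 comes from the rules listed here.

```python
def interpret_winrate(lower_rate, upper_rate, threshold=0.5):
    """Classify a result from the bounds of its confidence interval."""
    if lower_rate > threshold:
        return "Model B favored"   # whole interval lies above 0.5
    if upper_rate < threshold:
        return "Model A favored"   # whole interval lies below 0.5
    return "inconclusive"          # interval includes 0.5


# Illustrative metrics: a win rate of 0.38 with an interval of [0.23, 0.56]
metrics = {"winrate": 0.38, "lower_rate": 0.23, "upper_rate": 0.56}
verdict = interpret_winrate(metrics["lower_rate"], metrics["upper_rate"])
print(verdict)  # inconclusive
```

Note that a win rate well below 0.5 can still be inconclusive when the interval is wide enough to reach 0.5.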

---

### Example Metrics Output

```json
{
"a_scores": 16.0,
"a_scores_stderr": 0.03,
"b_scores": 10.0,
"b_scores_stderr": 0.09,
"ties": 0.0,
"ties_stderr": 0.0,
"inference_error": 0.0,
"inference_error_stderr": 0.0,
"score": 10.0,
"score_stderr": 0.09,
"winrate": 0.38,
"lower_rate": 0.23,
"upper_rate": 0.56
}
```
---

These metrics, generated automatically during evaluation, provide a comprehensive, statistically rigorous summary of how two models compare on your chosen dataset. They enable you to make informed decisions about model selection and deployment.


### Setup and Installation

Set up the required dependencies and configure the environment.

In [None]:
# install python SDK
!pip install --upgrade sagemaker 

### Import Libraries
Import the necessary Python packages and set up the SageMaker session.

In [None]:
import json

import boto3
import matplotlib.pyplot as plt
import numpy as np
import sagemaker
import seaborn as sns
from datasets import load_dataset
from sagemaker.huggingface import HuggingFacePredictor
from sagemaker.inputs import TrainingInput
from sagemaker.pytorch import PyTorch


# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()


## Model Setup


### Deploy Qwen2.5 1.5B Instruct Model

In [None]:
!python3 deploy_sm_model.py

### Test out the endpoint

In [None]:

# Initialize the predictor once
predictor = HuggingFacePredictor(endpoint_name="qwen25-<endpoint_name_here>")

def generate_with_qwen25(prompt: str, max_tokens: int = 500, temperature: float = 0.9) -> str:
    """
    Sends a prompt to the deployed Qwen2.5 model on SageMaker and returns the generated response.

    Args:
        prompt (str): The input prompt/question to send to the model.
        max_tokens (int): Maximum number of tokens to generate.
        temperature (float): Sampling temperature for generation.

    Returns:
        str: The model-generated text.
    """
    response = predictor.predict({
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_tokens,
            "temperature": temperature
        }
    })
    return response[0]["generated_text"]

answer = generate_with_qwen25("What is the Grotto at Notre Dame?")
print(answer)

### Bedrock Claude 3.7 Model

In [None]:

# Initialize Bedrock client once
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Claude 3.7 Sonnet model ID (US cross-region inference profile) on Bedrock
MODEL_ID = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"

def generate_with_claude37(prompt: str, max_tokens: int = 512, temperature: float = 0.7, top_p: float = 0.9) -> str:
    """
    Sends a prompt to the Claude 3.7 Sonnet model via Amazon Bedrock and returns the generated response.

    Args:
        prompt (str): The user message or input prompt.
        max_tokens (int): Maximum number of tokens to generate.
        temperature (float): Sampling temperature for generation.
        top_p (float): Top-p nucleus sampling.

    Returns:
        str: The text content generated by Claude.
    """
    payload = {
        "anthropic_version": "bedrock-2023-05-31",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p
    }

    response = bedrock.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps(payload),
        contentType="application/json",
        accept="application/json"
    )

    response_body = json.loads(response['body'].read())
    return response_body["content"][0]["text"]

answer = generate_with_claude37("What is the Grotto at Notre Dame?")
print(answer)


### Data Preparation

In [None]:
from datasets import load_dataset

squad = load_dataset("squad", split="train[:20]")
print(squad[3]["question"])
print(squad[3]["answers"]["text"][0])

questions = [squad[i]["question"] for i in range(6)]



### Generate Evaluation Dataset


In [None]:
import json


output_path = "llm_judge.jsonl"

with open(output_path, "w") as f:
    for q in questions:
        try:
            response_a = generate_with_qwen25(q)
        except Exception as e:
            response_a = f"[Qwen2.5 generation failed: {e}]"
        
        try:
            response_b = generate_with_claude37(q)
        except Exception as e:
            response_b = f"[Claude 3.7 generation failed: {e}]"

        row = {
            "prompt": q,
            "response_A": response_a,
            "response_B": response_b
        }
        f.write(json.dumps(row) + "\n")

print(f"JSONL file created at: {output_path}")
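
Because the loop above writes a placeholder string when a generation call fails, it can help to check the file before uploading it. The following is a minimal sketch, assuming the judge expects exactly the three keys written above. The helper name `validate_judge_rows` is hypothetical, and the sample lines are made up for illustration.

```python
import json

REQUIRED_KEYS = {"prompt", "response_A", "response_B"}


def validate_judge_rows(lines):
    """Return a list of (line_number, problem) tuples; empty means no issues found."""
    problems = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((line_no, "blank line"))
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError as e:
            problems.append((line_no, f"invalid JSON: {e.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((line_no, "row is not a JSON object"))
            continue
        missing = sorted(REQUIRED_KEYS - row.keys())
        if missing:
            problems.append((line_no, f"missing keys: {missing}"))
        # Flag the placeholder text written when a model call raised an exception
        for key in sorted(REQUIRED_KEYS & row.keys()):
            value = row[key]
            if isinstance(value, str) and "generation failed:" in value:
                problems.append((line_no, f"{key} holds a generation-failure placeholder"))
    return problems


# Made-up sample rows: one valid, one with a failure placeholder, one missing a key
sample_lines = [
    json.dumps({"prompt": "Q1", "response_A": "a", "response_B": "b"}),
    json.dumps({"prompt": "Q2", "response_A": "[Qwen2.5 generation failed: timeout]", "response_B": "b"}),
    '{"prompt": "Q3", "response_A": "a"}',
]
problems = validate_judge_rows(sample_lines)
for line_no, problem in problems:
    print(f"line {line_no}: {problem}")
```

To check the real file, pass the open file handle instead of the sample list, for example `with open(output_path) as f: problems = validate_judge_rows(f)`.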


### Configure Training Parameters
Set up the necessary parameters for the training job

In [None]:
# Please populate parameters

your_bucket_name = ""
assert your_bucket_name != "", "PLEASE POPULATE YOUR BUCKET NAME ABOVE"

input_s3_uri = "s3://{}/datasets/byo-datasets-dev/custom-llm-judge/llm_judge.jsonl".format(your_bucket_name)
output_s3_uri = "s3://{}/evaluation-results/mcaiich/llm_judge/".format(your_bucket_name) # Output data s3 location
instance_type = "ml.g5.24xlarge"  
job_name = "nova-micro-gen-qa-eval-job"
recipe_path = "<PATH-TO-RECIPE>/recipe.yaml"
image_uri = "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-TJ-Eval-latest"


### Upload Data to S3

#### Grant S3 Permissions to the SageMaker Execution Role

Before proceeding, make sure to grant the Execution Role direct `s3:PutObject` permissions for your S3 bucket prefix.

**Steps:**
- Go to the Execution Role (e.g., `AmazonSageMaker-ExecutionRole-...`) in the AWS IAM Console.
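- Attach an inline policy such as the following, replacing `my-bucket-east` with your bucket name:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket-east",
        "arn:aws:s3:::my-bucket-east/*"
      ]
    }
  ]
}
```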


In [None]:
import boto3

def parse_s3_uri(s3_uri):
    assert s3_uri.startswith("s3://"), "Invalid S3 URI"
    parts = s3_uri.replace("s3://", "").split("/", 1)
    bucket = parts[0]
    key = parts[1] if len(parts) > 1 else ""
    return bucket, key

def upload_to_s3(local_path, s3_uri):
    """
    Upload evaluation data to S3 bucket using current role credentials.
    """
    bucket, key = parse_s3_uri(s3_uri)
    
    s3 = boto3.client("s3")
    s3.upload_file(Filename=local_path, Bucket=bucket, Key=key)
    print(f"✅ Uploaded {local_path} to {s3_uri}")

# Upload the generated JSONL file to the input location configured earlier
upload_to_s3(output_path, input_s3_uri)
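
The URI parsing step can be checked on its own, without touching S3. This is a sketch of an alternative to the `assert`-based helper above, using `urllib.parse.urlparse` from the standard library. The name `split_s3_uri` and the example bucket are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/key' into (bucket, key), raising ValueError on bad input."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    # The path keeps its leading slash; S3 object keys do not start with one
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://example-bucket/datasets/llm_judge.jsonl")
print(bucket, key)
```

Raising `ValueError` instead of using `assert` keeps the check active even when Python runs with assertions disabled.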


### Set Up Training Input (Optional)
Configure input data source for evaluation

In [None]:
# (Optional) Bring your own dataset for evaluation
evalInput = TrainingInput(
    s3_data=input_s3_uri,
    distribution='FullyReplicated',
    s3_data_type='S3Prefix'
)

### Run Amazon Nova LLM-as-a-Judge Evaluation


In [None]:
estimator = PyTorch(
    output_path=output_s3_uri,
    base_job_name=job_name,
    role=role,
    instance_type=instance_type,
    training_recipe=recipe_path,
    sagemaker_session=sagemaker_session,
    image_uri=image_uri,
    disable_profiler=True,
    debugger_hook_config=False,
)

estimator.fit(inputs={"train": evalInput})

### Download Results From Training Job

In [None]:
import boto3
import os
import tarfile

def download_and_extract_job_output(training_job_name: str, output_s3_uri: str, download_dir: str = "./output"):
    """
    Downloads the output.tar.gz of a SageMaker training job and extracts it locally.

    Args:
        training_job_name (str): Name of the SageMaker training job.
        output_s3_uri (str): Base S3 URI where outputs are stored.
        download_dir (str): Local directory to extract files into.
    """
    # Build the full S3 path
    s3_uri = f"{output_s3_uri.rstrip('/')}/{training_job_name}/output/output.tar.gz"
    print(f"Resolved S3 URI: {s3_uri}")

    # Parse bucket and key
    def parse_s3_uri(s3_uri):
        assert s3_uri.startswith("s3://"), "Invalid S3 URI"
        parts = s3_uri.replace("s3://", "").split("/", 1)
        bucket = parts[0]
        key = parts[1] if len(parts) > 1 else ""
        return bucket, key

    bucket, key = parse_s3_uri(s3_uri)

    # Create S3 client
    s3 = boto3.client("s3")

    # Create output directory if it does not already exist
    os.makedirs(download_dir, exist_ok=True)

    local_tar_path = os.path.join(download_dir, "output.tar.gz")

    # Download file
    print("Downloading...")
    s3.download_file(bucket, key, local_tar_path)
    print(f"Downloaded to {local_tar_path}")

    # Extract tar.gz
    print("Extracting...")
    with tarfile.open(local_tar_path, "r:gz") as tar:
        tar.extractall(path=download_dir)

    print(f"Extracted contents to {download_dir}")


In [None]:
download_and_extract_job_output("nova-micro-gen-qa-eval-job-<DATE>", output_s3_uri)
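
The extracted archive contains a timestamped results file, so its exact path changes from run to run. The sketch below locates the most recent one instead of relying on a hard-coded path. It assumes the files follow a `results_<timestamp>.json` naming pattern whose timestamps sort lexicographically. The helper name `find_latest_results` is hypothetical, and the demo runs against a throwaway directory with made-up file names.

```python
import tempfile
from pathlib import Path


def find_latest_results(download_dir="./output", pattern="results_*.json"):
    """Return the path of the newest results file under download_dir.

    Assumption: file names embed a sortable timestamp, so the
    lexicographically largest name is the most recent run.
    """
    matches = sorted(Path(download_dir).rglob(pattern), key=lambda p: p.name)
    if not matches:
        raise FileNotFoundError(f"No files matching {pattern!r} under {download_dir!r}")
    return matches[-1]


# Demo on a temporary directory that mimics the extracted layout
with tempfile.TemporaryDirectory() as tmp:
    eval_dir = Path(tmp) / "example-job" / "eval_results"
    eval_dir.mkdir(parents=True)
    (eval_dir / "results_2025-06-26T22-02-09.817675.json").write_text("{}")
    (eval_dir / "results_2025-06-27T08-15-00.000000.json").write_text("{}")
    latest_name = find_latest_results(tmp).name

print(latest_name)
```

In the notebook, `find_latest_results("./output")` would return a path that can be opened in place of a manually copied one.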

## Results Visualization

The function below renders the evaluation metrics as a six-panel figure: score distribution, win rate with confidence interval, preference pie chart, A vs B comparison, win rate gauge, and a summary table.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from pathlib import Path


def plot_llm_judge_results(results):
    """
    Plot LLM judge evaluation results as a 2x3 grid of charts.

    Args:
        results (dict): Aggregate metrics from the evaluation output, with the keys
            a_scores, b_scores, ties, inference_error, winrate, lower_rate,
            and upper_rate.

    Returns:
        The plotting handle for the rendered figure; call ``.savefig(...)`` on it to save.
    """

    # Set style
    plt.style.use("default")
    sns.set_palette("husl")

    # Create figure with subplots
    fig = plt.figure(figsize=(16, 12))

    # 1. Score Distribution Bar Chart
    ax1 = plt.subplot(2, 3, 1)
    scores = {
        "A Scores": results["a_scores"],
        "B Scores": results["b_scores"],
        "Ties": results["ties"],
        "Inference Errors": results["inference_error"],
    }

    bars = ax1.bar(
        scores.keys(),
        scores.values(),
        color=["#FF6B6B", "#4ECDC4", "#45B7D1", "#FFA07A"],
    )
    ax1.set_title("Score Distribution", fontsize=14, fontweight="bold")
    ax1.set_ylabel("Count")

    # Add value labels on bars
    for bar, value in zip(bars, scores.values()):
        height = bar.get_height()
        ax1.text(
            bar.get_x() + bar.get_width() / 2.0,
            height + height * 0.01,
            f"{int(value)}",
            ha="center",
            va="bottom",
            fontweight="bold",
        )

    plt.xticks(rotation=45, ha="right")

    # 2. Win Rate with Confidence Interval
    ax2 = plt.subplot(2, 3, 2)
    winrate = results["winrate"]
    lower_rate = results["lower_rate"]
    upper_rate = results["upper_rate"]

    # Create horizontal bar for winrate
    ax2.barh(["Win Rate"], [winrate], color="#4ECDC4", alpha=0.7, height=0.3)

    # Add confidence interval
    ax2.errorbar(
        [winrate],
        ["Win Rate"],
        xerr=[[winrate - lower_rate], [upper_rate - winrate]],
        fmt="o",
        color="black",
        capsize=10,
        capthick=2,
    )

    ax2.set_xlim(0, 1)
    ax2.set_xlabel("Win Rate")
    ax2.set_title("B vs A Win Rate with 95% CI", fontsize=14, fontweight="bold")
    ax2.axvline(
        x=0.5, color="red", linestyle="--", alpha=0.7, label="50% (No preference)"
    )
    ax2.legend()

    # Add text annotation
    ax2.text(
        winrate,
        0,
        f"{winrate:.3f}\n[{lower_rate:.3f}, {upper_rate:.3f}]",
        ha="center",
        va="bottom",
        fontweight="bold",
        bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8),
    )

    # 3. Preference Pie Chart (excluding inference errors)
    ax3 = plt.subplot(2, 3, 3)
    total_valid = results["a_scores"] + results["b_scores"] + results["ties"]

    if total_valid > 0:
        pie_data = [results["a_scores"], results["b_scores"], results["ties"]]
        pie_labels = ["A Preferred", "B Preferred", "Ties"]
        colors = ["#FF6B6B", "#4ECDC4", "#45B7D1"]

        wedges, texts, autotexts = ax3.pie(
            pie_data, labels=pie_labels, colors=colors, autopct="%1.1f%%", startangle=90
        )

        # Make percentage text bold
        for autotext in autotexts:
            autotext.set_fontweight("bold")
            autotext.set_color("white")

    ax3.set_title(
        "Preference Distribution\n(Valid Judgments Only)",
        fontsize=14,
        fontweight="bold",
    )

    # 4. Comparison of A vs B Scores
    ax4 = plt.subplot(2, 3, 4)
    categories = ["A Scores", "B Scores"]
    values = [results["a_scores"], results["b_scores"]]
    colors = ["#FF6B6B", "#4ECDC4"]

    bars = ax4.bar(categories, values, color=colors, alpha=0.8)
    ax4.set_title("A vs B Score Comparison", fontsize=14, fontweight="bold")
    ax4.set_ylabel("Score Count")

    # Add value labels
    for bar, value in zip(bars, values):
        height = bar.get_height()
        ax4.text(
            bar.get_x() + bar.get_width() / 2.0,
            height + height * 0.01,
            f"{int(value)}",
            ha="center",
            va="bottom",
            fontweight="bold",
        )

    # Add difference annotation
    diff = abs(values[0] - values[1])
    winner = "A" if values[0] > values[1] else "B"
    ax4.text(
        0.5,
        max(values) * 0.8,
        f"{winner} leads by {int(diff)}",
        ha="center",
        transform=ax4.transData,
        bbox=dict(boxstyle="round,pad=0.3", facecolor="yellow", alpha=0.7),
        fontweight="bold",
    )

    # 5. Win Rate Visualization (Gauge-like)
    ax5 = plt.subplot(2, 3, 5)

    # Create a semi-circular gauge
    theta = np.linspace(0, np.pi, 100)
    r = 1

    # Background arc
    ax5.plot(r * np.cos(theta), r * np.sin(theta), "lightgray", linewidth=10)

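    # Win rate arc, drawn from the left end (0%) toward the right end (100%)
    # so that it matches the axis labels added below
    gauge_angle = (1 - winrate) * np.pi
    winrate_theta = np.linspace(np.pi, gauge_angle, max(int(winrate * 100), 2))
    ax5.plot(
        r * np.cos(winrate_theta), r * np.sin(winrate_theta), "#4ECDC4", linewidth=10
    )

    # Add needle pointing at the win rate on the same 0%-100% scale
    needle_angle = gauge_angle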
    ax5.arrow(
        0,
        0,
        0.8 * np.cos(needle_angle),
        0.8 * np.sin(needle_angle),
        head_width=0.05,
        head_length=0.05,
        fc="red",
        ec="red",
    )

    ax5.set_xlim(-1.2, 1.2)
    ax5.set_ylim(-0.2, 1.2)
    ax5.set_aspect("equal")
    ax5.axis("off")
    ax5.set_title("Win Rate Gauge", fontsize=14, fontweight="bold")

    # Add labels
    ax5.text(-1, -0.1, "0%", ha="center", fontweight="bold")
    ax5.text(0, -0.1, "50%", ha="center", fontweight="bold")
    ax5.text(1, -0.1, "100%", ha="center", fontweight="bold")
    ax5.text(
        0,
        0.5,
        f"{winrate:.1%}",
        ha="center",
        va="center",
        fontsize=16,
        fontweight="bold",
        bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8),
    )

    # 6. Summary Statistics Table
    ax6 = plt.subplot(2, 3, 6)
    ax6.axis("off")

    # Create summary statistics
    total_evaluations = (
        results["a_scores"]
        + results["b_scores"]
        + results["ties"]
        + results["inference_error"]
    )

    summary_data = [
        ["Total Evaluations", f"{int(total_evaluations)}"],
        ["A Scores", f"{int(results['a_scores'])}"],
        ["B Scores", f"{int(results['b_scores'])}"],
        ["Ties", f"{int(results['ties'])}"],
        ["Inference Errors", f"{int(results['inference_error'])}"],
        ["Win Rate (B vs A)", f"{results['winrate']:.3f}"],
        ["95% CI Lower", f"{results['lower_rate']:.3f}"],
        ["95% CI Upper", f"{results['upper_rate']:.3f}"],
        [
            "Error Rate",
            (
                f"{results['inference_error']/total_evaluations:.1%}"
                if total_evaluations > 0
                else "0%"
            ),
        ],
    ]

    # Create table
    table = ax6.table(
        cellText=summary_data,
        colLabels=["Metric", "Value"],
        cellLoc="left",
        loc="center",
        colWidths=[0.6, 0.4],
    )

    table.auto_set_font_size(False)
    table.set_fontsize(10)
    table.scale(1, 2)

    # Style the table
    for i in range(len(summary_data) + 1):
        for j in range(2):
            cell = table[(i, j)]
            if i == 0:  # Header
                cell.set_facecolor("#4ECDC4")
                cell.set_text_props(weight="bold", color="white")
            else:
                cell.set_facecolor("#f0f0f0" if i % 2 == 0 else "white")
                if j == 1:  # Value column
                    cell.set_text_props(weight="bold")

    ax6.set_title("Summary Statistics", fontsize=14, fontweight="bold", pad=20)

    plt.tight_layout()

    return fig

### Run Visualization

In [None]:
evaluation_results_path = "./output/nova-micro-llm-judge-eval-job/eval_results/results_2025-06-26T22-02-09.817675.json"

In [None]:

with open(evaluation_results_path, "r") as f:
    data = json.load(f)

fig = plot_llm_judge_results(data["results"]["all"])

output_file = os.path.join("./", "evaluation_metrics.png")
fig.savefig(output_file, bbox_inches="tight")

### Results

**Based on the evaluation results:**

**Performance Metrics** 

* **Total Evaluations:** 12 judgments recorded over the 6 questions

* **Model B (Claude 3.7) Performance:** 9 wins (75%)

* **Model A (Qwen2.5) Performance:** 3 wins (25%)

* **Ties:** 0

* **Error Rate:** 0%

**Statistical Confidence**  
* **Win Rate:** 75% in favor of Model B

* **95% Confidence Interval:** [9.1%, 91.7%]

**Statistical Significance:** The reported interval includes 50%, so this small sample does not establish with statistical confidence that Model B outperforms Model A.




### Key Insights

**Observed Winner:** Claude 3.7 Sonnet was preferred over the Qwen2.5 1.5B model in 9 of 12 judgments.

This was expected because Qwen2.5 1.5B is a significantly smaller model than Claude 3.7 Sonnet.

**Consistent Performance:** The absence of ties suggests the judge saw clear quality differences between the responses.

**Reliable Evaluation:** Zero inference errors indicate that the evaluation pipeline ran cleanly.

**Statistical Validity:** The confidence interval is wide because the sample is small; a larger evaluation set is needed before drawing firm conclusions.
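
To get a feel for how many judgments a tighter interval requires, here is a rough sketch based on the normal approximation for a proportion, n = z² · p(1 − p) / m², where m is the desired margin of error. This formula is a standard planning heuristic, not necessarily the method the evaluation job uses for its intervals, and the function name `judgments_needed` is hypothetical.

```python
import math


def judgments_needed(margin, expected_winrate=0.5, z=1.96):
    """Estimate the number of valid judgments needed for a given margin of error.

    Assumption: normal approximation to the binomial. Using
    expected_winrate=0.5 gives the most conservative (largest) estimate.
    """
    if not 0 < margin < 1:
        raise ValueError("margin must be between 0 and 1")
    p = expected_winrate
    return math.ceil(z**2 * p * (1 - p) / margin**2)


n_needed = judgments_needed(margin=0.10)
print(f"Judgments needed for a +/-10 point margin: {n_needed}")
```

Under these assumptions, a margin of ten percentage points already calls for roughly a hundred judgments, far more than the 12 collected here.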

## Conclusion

This notebook demonstrated a complete Amazon Nova LLM-as-a-Judge evaluation pipeline using Amazon SageMaker AI. The methodology provides:

**Scalable Evaluation:** Automated comparison of multiple models

**Statistical Rigor:** Confidence intervals and significance testing

**Cost Efficiency:** Reduced need for human evaluation

**Actionable Insights:** Clear metrics for model selection

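The results showed Claude 3.7 Sonnet preferred over Qwen2.5 with a 75% win rate. Because the sample is small and the confidence interval is wide, treat this as a directional signal and rerun the evaluation on a larger dataset before making model selection and deployment decisions.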