# Amazon Nova Rubric-Based LLM-as-a-Judge Evaluation with Amazon SageMaker

This notebook demonstrates how to use Amazon Nova's rubric-based LLM-as-a-Judge methodology to evaluate and compare the outputs of two different large language models using Amazon SageMaker Training Jobs. We'll compare responses from a **Qwen2.5 1.5B Instruct model (Model A)** against a **Qwen2.5 7B Instruct model (Model B)**, both deployed on SageMaker AI.

**Overview**

The Amazon Nova rubric-based LLM-as-a-Judge approach uses a powerful language model to evaluate the quality of responses from other models by dynamically generating custom evaluation rubrics for each comparison. This method provides:

**Objective Comparison**: Systematic evaluation using automatically generated, context-specific rubrics with weighted criteria (accuracy, completeness, clarity, etc.)

**Scalable Assessment**: Automated rubric generation and criterion-based scoring across large datasets without manual rubric design

**Detailed Metrics**: Win rates, confidence intervals, preference distributions, weighted scores per criterion, and detailed justifications for each evaluation dimension

**Cost-Effective**: More efficient than human evaluation for large-scale comparisons while maintaining evaluation transparency through explicit rubric criteria

**Adaptive Evaluation**: Rubrics are tailored to each specific question-answer pair, ensuring relevant and context-appropriate assessment criteria

## Prerequisites

* AWS Account with SageMaker and Bedrock access
* Appropriate IAM roles and permissions
* SageMaker Studio or Jupyter environment

## Understanding Amazon Nova LLM-as-a-Judge Evaluation Metrics 

When using the Amazon Nova LLM-as-a-Judge framework to compare the outputs of two language models, a set of quantitative metrics is generated. These metrics help you objectively assess which model performs better and how reliable the evaluation is.

---

### Core Preference Metrics

- **a_scores**  
  The number of times Model A's response was preferred by the judge model over Model B.

- **b_scores**  
  The number of times Model B's response was preferred by the judge model over Model A.

- **ties**  
  The number of times the judge found both responses equally good or could not determine a preference.

- **inference_error**  
  The number of evaluation cases where the judge could not provide a valid judgment due to technical issues, malformed outputs, or other errors.

---

### Rubric-Specific Metrics
- **weighted_a / weighted_b**  
  Aggregate weighted scores for each model, calculated by multiplying each criterion-specific score by its weight and summing the results. These provide a more nuanced performance measure than simple win counts.

- **margin**  
  The difference between the weighted scores (weighted_a - weighted_b), indicating the magnitude of preference. Negative margins favor Model B; positive margins favor Model A.

- **criteria_breakdown**  
  Detailed performance across individual rubric dimensions (e.g., accuracy: 0.7 weight, completeness: 0.2 weight, clarity: 0.1 weight), with justifications for each criterion.
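
As a rough illustration of how a weighted score and a margin could be derived from a criteria breakdown, the sketch below multiplies each criterion score by its weight and sums the results. The criterion scores are hypothetical, the weights are borrowed from the example in the bullet above, and the margin is taken as A minus B so that negative values favor Model B. The judge's internal computation may differ.

```python
# Hypothetical per-criterion scores for a single comparison (not real judge output).
criteria = {
    "accuracy":     {"weight": 0.7, "score_A": 0.4, "score_B": 0.8},
    "completeness": {"weight": 0.2, "score_A": 0.6, "score_B": 0.6},
    "clarity":      {"weight": 0.1, "score_A": 0.9, "score_B": 0.5},
}


def weighted_score(criteria: dict, key: str) -> float:
    """Return the sum of weight * score over all criteria."""
    return sum(c["weight"] * c[key] for c in criteria.values())


weighted_a = weighted_score(criteria, "score_A")
weighted_b = weighted_score(criteria, "score_B")

# Sign convention assumed here: negative margin favors Model B.
margin = weighted_a - weighted_b

print(f"weighted_a={weighted_a:.3f} weighted_b={weighted_b:.3f} margin={margin:.3f}")
```

In this made-up case Model A wins on clarity, but the heavier accuracy weight pulls the overall margin toward Model B.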

---
### Statistical Confidence Metrics

- **winrate**  
  The proportion of valid judgments in which Model B was preferred.

- **lower_rate**  
  The lower bound of the 95% confidence interval for the winrate. This tells you the minimum likely winrate for Model B, accounting for statistical uncertainty.

- **upper_rate**  
  The upper bound of the 95% confidence interval for the winrate. This tells you the maximum likely winrate for Model B.

- **average_weighted_score_a / average_weighted_score_b**  
  Mean weighted scores across all valid evaluations, providing aggregate performance measures that account for rubric criterion weights.

- **average_margin**  
  Mean margin across all evaluations, indicating the typical magnitude of preference between models.
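
To build intuition for how a 95% confidence interval around a win rate behaves at small sample sizes, here is a minimal sketch using the Wilson score interval. This is one common method for binomial proportions and is an assumption for illustration only: the notebook does not state which method the evaluation job uses, so the values it prints need not match the reported `lower_rate` and `upper_rate`.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 7 wins for Model B out of 10 valid judgments.
lower, upper = wilson_interval(wins=7, n=10)
print(f"winrate=0.700, interval=[{lower:.3f}, {upper:.3f}]")
```

With only 10 valid judgments the interval is wide, which is why a larger evaluation set matters when the two models are close.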

---

### Standard Error Metrics

- **a_scores_stderr, b_scores_stderr, ties_stderr, inference_error_stderr, score_stderr**  
  These metrics reflect the standard error (uncertainty) of each corresponding count or score. Smaller values indicate more reliable results, while larger values suggest more variability or a need for a larger sample size.

---

### How to Interpret These Metrics

- **Winrate and Confidence Intervals**
  - If the winrate is above 0.5 and the confidence interval does not include 0.5, Model B is statistically favored across rubric criteria.
  - If the winrate is below 0.5 and the confidence interval does not include 0.5, Model A is statistically favored.
  - If the interval includes 0.5, results are inconclusive and may require additional evaluations.

- **Weighted Score Analysis**
  - Compare average_weighted_score_a vs average_weighted_score_b to understand overall performance accounting for criterion importance.
  - Examine average_margin to assess the typical magnitude of performance differences.
  - Review criteria_breakdown to identify specific strengths and weaknesses (e.g., Model B excels in accuracy but Model A is better at clarity).

- **Error Analysis**
  - High inference_error or large standard errors indicate possible issues with the evaluation process, rubric generation, or insufficient data.
  - Review individual evaluation justifications to understand why errors occurred.

- **Preference Distribution**
  - The balance of a_scores, b_scores, and ties provides a direct picture of model performance differences on your evaluation set.
  - Weighted scores provide deeper insight than raw counts by accounting for criterion importance.
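
The confidence-interval rule above can be written as a small helper. This is a minimal sketch; the function name and the returned labels are hypothetical and are not part of the evaluation output.

```python
def interpret_winrate(lower_rate: float, upper_rate: float) -> str:
    """Classify a win-rate confidence interval relative to the 0.5 threshold."""
    if lower_rate > 0.5:
        return "Model B favored"
    if upper_rate < 0.5:
        return "Model A favored"
    # The interval contains 0.5, so neither model is clearly preferred.
    return "inconclusive"


# An interval of [0.40, 0.909] straddles 0.5.
verdict = interpret_winrate(lower_rate=0.40, upper_rate=0.909)
print(verdict)
```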

---

### Example Metrics Output

```json
{
  "a_scores": 3.0,
  "a_scores_stderr": 0.02,
  "b_scores": 7.0,
  "b_scores_stderr": 0.05,
  "ties": 0.0,
  "ties_stderr": 0.0,
  "inference_error": 1.0,
  "inference_error_stderr": 0.01,
  "winrate": 0.70,
  "lower_rate": 0.40,
  "upper_rate": 0.909,
  "average_weighted_score_a": 0.495,
  "average_weighted_score_b": 0.630,
  "average_margin": -0.135,
  "error_rate": 0.091,
  "criteria_weights": {
    "accuracy": 0.6,
    "completeness": 0.25,
    "clarity": 0.15
  }
}
```
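
As a quick sanity check, several of the aggregate values in this example can be recomputed from the raw counts. The sketch below assumes that ties count as valid judgments in the win-rate denominator and that the margin is A minus B; both are assumptions that happen to agree with the numbers shown here.

```python
# Counts and averages copied from the example output above.
metrics = {
    "a_scores": 3.0,
    "b_scores": 7.0,
    "ties": 0.0,
    "inference_error": 1.0,
    "average_weighted_score_a": 0.495,
    "average_weighted_score_b": 0.630,
}

valid = metrics["a_scores"] + metrics["b_scores"] + metrics["ties"]
total = valid + metrics["inference_error"]

# Assumption: ties are part of the valid-judgment denominator.
b_share = metrics["b_scores"] / valid
error_rate = metrics["inference_error"] / total

# Assumption: margin is A minus B, so a negative value favors Model B.
margin = metrics["average_weighted_score_a"] - metrics["average_weighted_score_b"]

print(f"B share of valid judgments: {b_share:.3f}")
print(f"Error rate: {error_rate:.3f}")
print(f"Margin (A - B): {margin:.3f}")
```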
---

These metrics, generated automatically during rubric-based evaluation, provide a comprehensive, statistically rigorous, and interpretable summary of how two models compare on your chosen dataset. They enable you to make informed decisions about model selection, deployment, and improvement priorities.

### Setup and Installation

Set up the required dependencies and configure the environment.

#### IMPORTANT: Use this specific version (2.254.1) of the SageMaker Python SDK. Nova Customization does not currently support the latest SageMaker Python SDK v3!

In [None]:
!pip install sagemaker==2.254.1

### Import Libraries
Import the necessary Python packages and set up the SageMaker session.

In [None]:
import json

import boto3
import matplotlib.pyplot as plt
import numpy as np
import sagemaker
import seaborn as sns
from datasets import load_dataset
from sagemaker.huggingface import HuggingFacePredictor
from sagemaker.inputs import TrainingInput
from sagemaker.pytorch import PyTorch


# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()


## Model Setup


### Deploy Qwen2.5 1.5B Instruct Model

In [None]:
!python3 deploy_model_arg.py Qwen/Qwen2.5-1.5B-Instruct

### Deploy Qwen2.5 7B Instruct Model

In [None]:
!python3 deploy_model_arg.py Qwen/Qwen2.5-7B-Instruct

### Test out the endpoint

In [None]:
from sagemaker.huggingface import HuggingFacePredictor

def generate_response(endpoint_name: str, prompt: str, max_tokens: int = 500, temperature: float = 0.9) -> str:
    predictor = HuggingFacePredictor(endpoint_name=endpoint_name)
    response = predictor.predict({
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_tokens,
            "temperature": temperature
        }
    })
    return response[0]["generated_text"]

# Example usage
if __name__ == "__main__":
    answer_1_5b = generate_response("qwen25-15b-instruct-2025-11-24-22-28-11-675", "What is the Grotto at Notre Dame?")
    print(answer_1_5b)

    print("*******************")
    print("*******************")
    answer_7b = generate_response("qwen25-7b-instruct-2025-11-24-22-27-28-607", "What is the Grotto at Notre Dame?")
    print(answer_7b)

### Data Preparation

In [None]:
from datasets import load_dataset

squad = load_dataset("squad", split="train[:20]")
print(squad[3]["question"])
print(squad[3]["answers"]["text"][0])

questions = [squad[i]["question"] for i in range(7, 14)]

print(len(questions))

### Generate Evaluation Dataset


In [None]:
import json


output_path = "llm_judge.jsonl"

with open(output_path, "w") as f:
    for q in questions:
        try:
            response_a = generate_response("qwen25-15b-instruct-2025-11-24-22-28-11-675", q)
        except Exception as e:
            response_a = f"[Qwen2.5 1.5B generation failed: {e}]"
        try:
            response_b = generate_response("qwen25-7b-instruct-2025-11-24-22-27-28-607", q)
        except Exception as e:
            response_b = f"[Qwen2.5 7B generation failed: {e}]"

        row = {
            "prompt": q,
            "response_A": response_a,
            "response_B": response_b
        }
        f.write(json.dumps(row) + "\n")

print(f"JSONL file created at: {output_path}")
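
Before uploading, it can help to confirm that every row of the JSONL file has the three fields written above (`prompt`, `response_A`, `response_B`). The following is a minimal sketch of such a check; `validate_jsonl` is a hypothetical helper, and the demo runs against a throwaway file rather than `llm_judge.jsonl`.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"prompt", "response_A", "response_B"}


def validate_jsonl(path: str) -> int:
    """Return the number of valid rows; raise ValueError on a malformed row."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            missing = REQUIRED_KEYS - row.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            empty = sorted(k for k in REQUIRED_KEYS if not str(row[k]).strip())
            if empty:
                raise ValueError(f"line {line_no}: empty values for {empty}")
            count += 1
    return count


# Demo on a temporary file with a single well-formed row.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        row = {"prompt": "q", "response_A": "a", "response_B": "b"}
        f.write(json.dumps(row) + "\n")
    n_rows = validate_jsonl(demo_path)

print(f"Validated {n_rows} row(s)")
```

To check the real dataset, call `validate_jsonl("llm_judge.jsonl")` after the generation cell has finished.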



### Configure Training Parameters
Set up the necessary parameters for the training job. The recipe YAML is provided with this example notebook as `eval_rubric_judge_recipe.yaml`.

Enter your bucket name in the cell below.

In [None]:
# Please populate the parameters below.

your_bucket_name = "<Your Bucket Name>"
assert your_bucket_name not in ("", "<Your Bucket Name>"), "PLEASE POPULATE YOUR BUCKET NAME ABOVE"

# Replace the placeholders with your S3 paths. Include "{}" where the bucket name
# should be substituted; without it, .format() leaves the string unchanged.
input_s3_uri = "<INPUT_S3_DATA_PATH>".format(your_bucket_name)  # Input data S3 location
output_s3_uri = "<OUTPUT_S3_DATA_PATH>".format(your_bucket_name)  # Output data S3 location
instance_type = "ml.p5.48xlarge"  # This has to be run on a P5 instance
job_name = "rubric-judge-demo"
recipe_path = "./eval_rubric_judge_recipe.yaml"  # Ensure this is the correct recipe for rubric judge
image_uri = "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-TJ-Eval-V2-latest"



### Upload Data to S3

#### Grant S3 Permissions to the SageMaker Execution Role

Before proceeding, grant the execution role direct S3 permissions (`s3:PutObject`, `s3:GetObject`, and `s3:ListBucket`) on your bucket and prefix.

**Steps:**
- Go to the Execution Role (e.g., `AmazonSageMaker-ExecutionRole-...`) in the AWS IAM Console.
- Attach the following inline policy, replacing `my-bucket-east` with your bucket name:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket-east",
        "arn:aws:s3:::my-bucket-east/*"
      ]
    }
  ]
}
```


In [None]:
import boto3

def parse_s3_uri(s3_uri):
    assert s3_uri.startswith("s3://"), "Invalid S3 URI"
    parts = s3_uri.replace("s3://", "").split("/", 1)
    bucket = parts[0]
    key = parts[1] if len(parts) > 1 else ""
    return bucket, key

def upload_to_s3(local_path, s3_uri):
    """
    Upload evaluation data to S3 bucket using current role credentials.
    """
    bucket, key = parse_s3_uri(s3_uri)
    
    s3 = boto3.client("s3")
    s3.upload_file(Filename=local_path, Bucket=bucket, Key=key)
    print(f"✅ Uploaded {local_path} to {s3_uri}")

# Example usage
upload_to_s3(
    "llm_judge.jsonl",
    "s3://{}/datasets/byo-datasets-dev/custom-llm-judge/llm_judge.jsonl".format(your_bucket_name)
)


### Set Up Training Input (Optional)
Configure input data source for evaluation

In [None]:

evalInput = TrainingInput(
 s3_data=input_s3_uri,
 distribution='FullyReplicated',
 s3_data_type='S3Prefix'
)

### Run Amazon Nova LLM-as-a-Judge Evaluation


In [None]:
estimator = PyTorch(
    output_path=output_s3_uri,
    base_job_name=job_name,
    role=role,
    instance_type=instance_type,
    training_recipe=recipe_path,
    sagemaker_session=sagemaker_session,
    image_uri=image_uri,
    instance_count=1
)

estimator.fit({'train': evalInput})

### Download Results From Training Job

In [None]:
import boto3
import os
import tarfile

def download_and_extract_job_output(training_job_name: str, output_s3_uri: str, download_dir: str = "./output"):
    """
    Downloads the output.tar.gz of a SageMaker training job and extracts it locally.

    Args:
        training_job_name (str): Name of the SageMaker training job.
        output_s3_uri (str): Base S3 URI where outputs are stored.
        download_dir (str): Local directory to extract files into.
    """
    # Build the full S3 path
    s3_uri = f"{output_s3_uri.rstrip('/')}/{training_job_name}/output/output.tar.gz"
    print(f"Resolved S3 URI: {s3_uri}")

    # Parse bucket and key
    def parse_s3_uri(s3_uri):
        assert s3_uri.startswith("s3://"), "Invalid S3 URI"
        parts = s3_uri.replace("s3://", "").split("/", 1)
        bucket = parts[0]
        key = parts[1] if len(parts) > 1 else ""
        return bucket, key

    bucket, key = parse_s3_uri(s3_uri)

    # Create S3 client
    s3 = boto3.client("s3")

    # Create output directory
    if not os.path.exists(download_dir):
        os.makedirs(download_dir)

    local_tar_path = os.path.join(download_dir, "output.tar.gz")

    # Download file
    print("Downloading...")
    s3.download_file(bucket, key, local_tar_path)
    print(f"Downloaded to {local_tar_path}")

    # Extract tar.gz
    print("Extracting...")
    with tarfile.open(local_tar_path, "r:gz") as tar:
        tar.extractall(path=download_dir)

    print(f"Extracted contents to {download_dir}")


In [None]:
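# Use the full training job name, i.e. the base job name plus the generated
# timestamp suffix (for example, the value of estimator.latest_training_job.name).
training_job_name = "rubric-judge-demo"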
assert training_job_name != "", "PLEASE POPULATE YOUR TRAINING JOB NAME ABOVE"

download_and_extract_job_output(training_job_name, output_s3_uri)
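
The extracted output nests the results under job-specific, timestamped paths. Rather than copying those paths by hand, one option is to search the download directory for the newest matching file. The sketch below assumes the file names embed a sortable timestamp, as in the paths used later in this notebook; `find_latest` is a hypothetical helper, and the demo builds a temporary directory tree instead of reading `./output`.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_latest(download_dir: str, pattern: str) -> Optional[Path]:
    """Return the match with the lexicographically greatest file name, if any."""
    matches = sorted(Path(download_dir).rglob(pattern), key=lambda p: p.name)
    return matches[-1] if matches else None


# Demo: two results files with different timestamps in a temporary tree.
with tempfile.TemporaryDirectory() as tmp:
    results_dir = Path(tmp) / "demo-job" / "eval_results"
    results_dir.mkdir(parents=True)
    for stamp in ("2025-12-09T01-39-55.025383+00-00", "2025-12-15T18-04-27.835638+00-00"):
        (results_dir / f"results_{stamp}.json").write_text("{}")

    latest = find_latest(tmp, "results_*.json")
    latest_name = latest.name if latest else None

print(latest_name)
```

Against the real output, `find_latest("./output", "results_*.json")` and `find_latest("./output", "*.parquet")` would locate the aggregate results and the per-row details, assuming the directory layout shown in the later cells.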


## Rubric Inspection

In [None]:
import pandas as pd
import ast

#example_parquet_file_output = 'output/jmoul-rubric-demo/eval_results/details/rubric_judge_model/2025-12-09T01-39-55.025383+00-00/details_custom|rubric_llm_judge_judge|0_2025-12-09T01-39-55.025383+00-00.parquet'

parquet_file_output = "output/nova-lite-v2-rubric-llm-judge-eval-job/eval_results/details/rubric_judge_model/2025-12-15T18-04-27.835638+00-00/details_custom|rubric_llm_judge_judge|0_2025-12-15T18-04-27.835638+00-00.parquet"
assert parquet_file_output != "", "PLEASE POPULATE YOUR PARQUET FILE LOCATION, LOOK AT EXAMPLE ABOVE"

df = pd.read_parquet(parquet_file_output)

for idx, row in df.iterrows():
    metrics = ast.literal_eval(row['metrics'])
    
    print(f"\n{'='*80}")
    print(f"Row {idx}:")
    print(f"  Preference: {metrics['predictions']}")
    print(f"  A wins: {metrics['a_scores']}")
    print(f"  B wins: {metrics['b_scores']}")
    print(f"  Weighted A: {metrics['weighted_score_A']:.3f}")
    print(f"  Weighted B: {metrics['weighted_score_B']:.3f}")
    print(f"  Margin: {metrics['score_margin']:.3f}")
    
    # Overall justification
    if metrics.get('overall_justification'):
        print(f"\n  Overall Justification:")
        print(f"    {metrics['overall_justification']}")
    
    # Per-criterion breakdown with justifications
    if metrics.get('criteria_breakdown'):
        print(f"\n  Criteria:")
        for crit_name, crit_data in metrics['criteria_breakdown'].items():
            print(f"\n    {crit_name}:")
            print(f"      Score A: {crit_data['score_A']}, Score B: {crit_data['score_B']}")
            print(f"      Weight: {crit_data['weight']}, Type: {crit_data['type']}")
            print(f"      Description: {crit_data['description']}")
            if crit_data.get('justification_A'):
                print(f"      Justification A: {crit_data['justification_A']}")
            if crit_data.get('justification_B'):
                print(f"      Justification B: {crit_data['justification_B']}")
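
Beyond reading individual rows, the per-criterion scores can be averaged across rows to see where each model tends to be stronger. The following sketch assumes each row's parsed metrics carry a `criteria_breakdown` dict with numeric `score_A` and `score_B` fields, as in the loop above. The sample rows are hypothetical, and because rubrics are generated per prompt, criterion names may not repeat across rows in real output.

```python
from collections import defaultdict

# Hypothetical parsed metrics for two rows (not real judge output).
rows = [
    {"criteria_breakdown": {
        "accuracy": {"score_A": 0.4, "score_B": 0.8, "weight": 0.7},
        "clarity": {"score_A": 1.0, "score_B": 0.5, "weight": 0.3},
    }},
    {"criteria_breakdown": {
        "accuracy": {"score_A": 0.8, "score_B": 0.6, "weight": 0.6},
    }},
]


def mean_scores_by_criterion(rows: list) -> dict:
    """Average score_A and score_B for each criterion name across rows."""
    totals = defaultdict(lambda: {"score_A": 0.0, "score_B": 0.0, "count": 0})
    for metrics in rows:
        for name, crit in (metrics.get("criteria_breakdown") or {}).items():
            totals[name]["score_A"] += crit["score_A"]
            totals[name]["score_B"] += crit["score_B"]
            totals[name]["count"] += 1
    return {
        name: {
            "mean_A": t["score_A"] / t["count"],
            "mean_B": t["score_B"] / t["count"],
            "count": t["count"],
        }
        for name, t in totals.items()
    }


summary = mean_scores_by_criterion(rows)
for name, stats in summary.items():
    print(f"{name}: A={stats['mean_A']:.2f} B={stats['mean_B']:.2f} (n={stats['count']})")
```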

## Results Visualization

The function below builds a six-panel summary figure from the aggregate evaluation results:

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from pathlib import Path


def plot_llm_judge_results(results):
    """
    Plot LLM judge evaluation results as a six-panel summary figure.

    Args:
        results (dict): Aggregate metrics from the evaluation results JSON.
            Must contain a_scores, b_scores, ties, inference_error, winrate,
            lower_rate, and upper_rate.

    Returns:
        module: The matplotlib.pyplot module with the figure drawn as the
            current figure, so the caller can call savefig() or show() on it.
    """

    # Set style
    plt.style.use("default")
    sns.set_palette("husl")

    # Create figure with subplots
    fig = plt.figure(figsize=(16, 12))

    # 1. Score Distribution Bar Chart
    ax1 = plt.subplot(2, 3, 1)
    scores = {
        "A Scores": results["a_scores"],
        "B Scores": results["b_scores"],
        "Ties": results["ties"],
        "Inference Errors": results["inference_error"],
    }

    bars = ax1.bar(
        scores.keys(),
        scores.values(),
        color=["#FF6B6B", "#4ECDC4", "#45B7D1", "#FFA07A"],
    )
    ax1.set_title("Score Distribution", fontsize=14, fontweight="bold")
    ax1.set_ylabel("Count")

    # Add value labels on bars
    for bar, value in zip(bars, scores.values()):
        height = bar.get_height()
        ax1.text(
            bar.get_x() + bar.get_width() / 2.0,
            height + height * 0.01,
            f"{int(value)}",
            ha="center",
            va="bottom",
            fontweight="bold",
        )

    plt.xticks(rotation=45, ha="right")

    # 2. Win Rate with Confidence Interval
    ax2 = plt.subplot(2, 3, 2)
    winrate = results["winrate"]
    lower_rate = results["lower_rate"]
    upper_rate = results["upper_rate"]

    # Create horizontal bar for winrate
    ax2.barh(["Win Rate"], [winrate], color="#4ECDC4", alpha=0.7, height=0.3)

    # Add confidence interval
    ax2.errorbar(
        [winrate],
        ["Win Rate"],
        xerr=[[winrate - lower_rate], [upper_rate - winrate]],
        fmt="o",
        color="black",
        capsize=10,
        capthick=2,
    )

    ax2.set_xlim(0, 1)
    ax2.set_xlabel("Win Rate")
    ax2.set_title("B vs A Win Rate with 95% CI", fontsize=14, fontweight="bold")
    ax2.axvline(
        x=0.5, color="red", linestyle="--", alpha=0.7, label="50% (No preference)"
    )
    ax2.legend()

    # Add text annotation
    ax2.text(
        winrate,
        0,
        f"{winrate:.3f}\n[{lower_rate:.3f}, {upper_rate:.3f}]",
        ha="center",
        va="bottom",
        fontweight="bold",
        bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8),
    )

    # 3. Preference Pie Chart (excluding inference errors)
    ax3 = plt.subplot(2, 3, 3)
    total_valid = results["a_scores"] + results["b_scores"] + results["ties"]

    if total_valid > 0:
        pie_data = [results["a_scores"], results["b_scores"], results["ties"]]
        pie_labels = ["A Preferred", "B Preferred", "Ties"]
        colors = ["#FF6B6B", "#4ECDC4", "#45B7D1"]

        wedges, texts, autotexts = ax3.pie(
            pie_data, labels=pie_labels, colors=colors, autopct="%1.1f%%", startangle=90
        )

        # Make percentage text bold
        for autotext in autotexts:
            autotext.set_fontweight("bold")
            autotext.set_color("white")

    ax3.set_title(
        "Preference Distribution\n(Valid Judgments Only)",
        fontsize=14,
        fontweight="bold",
    )

    # 4. Comparison of A vs B Scores
    ax4 = plt.subplot(2, 3, 4)
    categories = ["A Scores", "B Scores"]
    values = [results["a_scores"], results["b_scores"]]
    colors = ["#FF6B6B", "#4ECDC4"]

    bars = ax4.bar(categories, values, color=colors, alpha=0.8)
    ax4.set_title("A vs B Score Comparison", fontsize=14, fontweight="bold")
    ax4.set_ylabel("Score Count")

    # Add value labels
    for bar, value in zip(bars, values):
        height = bar.get_height()
        ax4.text(
            bar.get_x() + bar.get_width() / 2.0,
            height + height * 0.01,
            f"{int(value)}",
            ha="center",
            va="bottom",
            fontweight="bold",
        )

    # Add difference annotation
    diff = abs(values[0] - values[1])
    winner = "A" if values[0] > values[1] else "B"
    ax4.text(
        0.5,
        max(values) * 0.8,
        f"{winner} leads by {int(diff)}",
        ha="center",
        transform=ax4.transData,
        bbox=dict(boxstyle="round,pad=0.3", facecolor="yellow", alpha=0.7),
        fontweight="bold",
    )

    # 5. Win Rate Visualization (Gauge-like)
    ax5 = plt.subplot(2, 3, 5)

    # Create a semi-circular gauge
    theta = np.linspace(0, np.pi, 100)
    r = 1

    # Background arc
    ax5.plot(r * np.cos(theta), r * np.sin(theta), "lightgray", linewidth=10)

    # Win rate arc
    winrate_theta = np.linspace(0, winrate * np.pi, int(winrate * 100))
    ax5.plot(
        r * np.cos(winrate_theta), r * np.sin(winrate_theta), "#4ECDC4", linewidth=10
    )

    # Add needle
    needle_angle = winrate * np.pi
    ax5.arrow(
        0,
        0,
        0.8 * np.cos(needle_angle),
        0.8 * np.sin(needle_angle),
        head_width=0.05,
        head_length=0.05,
        fc="red",
        ec="red",
    )

    ax5.set_xlim(-1.2, 1.2)
    ax5.set_ylim(-0.2, 1.2)
    ax5.set_aspect("equal")
    ax5.axis("off")
    ax5.set_title("Win Rate Gauge", fontsize=14, fontweight="bold")

    # Add labels
    ax5.text(-1, -0.1, "0%", ha="center", fontweight="bold")
    ax5.text(0, -0.1, "50%", ha="center", fontweight="bold")
    ax5.text(1, -0.1, "100%", ha="center", fontweight="bold")
    ax5.text(
        0,
        0.5,
        f"{winrate:.1%}",
        ha="center",
        va="center",
        fontsize=16,
        fontweight="bold",
        bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8),
    )

    # 6. Summary Statistics Table
    ax6 = plt.subplot(2, 3, 6)
    ax6.axis("off")

    # Create summary statistics
    total_evaluations = (
        results["a_scores"]
        + results["b_scores"]
        + results["ties"]
        + results["inference_error"]
    )

    summary_data = [
        ["Total Evaluations", f"{int(total_evaluations)}"],
        ["A Scores", f"{int(results['a_scores'])}"],
        ["B Scores", f"{int(results['b_scores'])}"],
        ["Ties", f"{int(results['ties'])}"],
        ["Inference Errors", f"{int(results['inference_error'])}"],
        ["Win Rate (B vs A)", f"{results['winrate']:.3f}"],
        ["95% CI Lower", f"{results['lower_rate']:.3f}"],
        ["95% CI Upper", f"{results['upper_rate']:.3f}"],
        [
            "Error Rate",
            (
                f"{results['inference_error']/total_evaluations:.1%}"
                if total_evaluations > 0
                else "0%"
            ),
        ],
    ]

    # Create table
    table = ax6.table(
        cellText=summary_data,
        colLabels=["Metric", "Value"],
        cellLoc="left",
        loc="center",
        colWidths=[0.6, 0.4],
    )

    table.auto_set_font_size(False)
    table.set_fontsize(10)
    table.scale(1, 2)

    # Style the table
    for i in range(len(summary_data) + 1):
        for j in range(2):
            cell = table[(i, j)]
            if i == 0:  # Header
                cell.set_facecolor("#4ECDC4")
                cell.set_text_props(weight="bold", color="white")
            else:
                cell.set_facecolor("#f0f0f0" if i % 2 == 0 else "white")
                if j == 1:  # Value column
                    cell.set_text_props(weight="bold")

    ax6.set_title("Summary Statistics", fontsize=14, fontweight="bold", pad=20)

    plt.tight_layout()

    return plt

### Run Visualization

In [None]:
# Example evaluations path below
# evaluation_results_path = "./output/jmoul-rubric-demo/eval_results/results_2025-12-09T01-39-55.025383+00-00.json"

evaluation_results_path = "output/nova-lite-v2-rubric-llm-judge-eval-job/eval_results/results_2025-12-15T18-04-27.835638+00-00.json"
assert evaluation_results_path != "", "PLEASE POPULATE YOUR EVALUATIONS RESULTS PATH ABOVE"


In [None]:
import os

with open(evaluation_results_path, "r") as f:
    data = json.load(f)

fig = plot_llm_judge_results(data["results"]["all"])

output_file = os.path.join("./", "evaluation_metrics.png")
fig.savefig(output_file, bbox_inches="tight")

### Results

- **Model B (Qwen2.5 7B Instruct)** achieves a 70% win rate against **Model A (Qwen2.5 1.5B Instruct)**, though results vary notably across question types.

- **Aggregate Statistics**  
  - Total Evaluations: 11 (7 detailed datapoints provided)
  - Valid Judgments: 10 (excluding 1 inference error)
  - Win Distribution: B scored 7 wins vs A's 3 wins
  - Preference Rate: 70% preferred B, 30% preferred A
  - 95% Confidence Interval: [0.400, 0.909]. Because the interval includes 0.5, B's advantage is not statistically conclusive at this sample size
  - Error Rate: 9.1% (1 inference error out of 11 evaluations)

**Detailed Datapoint Analysis**   

From the 5 valid evaluations in the detailed data:

- **Model B (Qwen2.5 7B Instruct) Wins (3 cases - 60%)**
  - Row 0: Notre Dame student newspapers - B provided more accurate, complete information with specific examples vs A's unsupported number
  - Row 3: Common Sense publication year - B correctly identified 1916 vs A's incorrect 1879
  - Row 6: Notre Dame's oldest structure - B accurately identified western facade and southern spire vs A's confused response
- **Model A (Qwen2.5 1.5B Instruct) Wins (2 cases - 40%)**
  - Row 1: Congregation of Holy Cross headquarters - A correctly explained decentralized structure vs B's inaccurate single location claim
  - Row 2: Primary seminary information - A provided detailed, accurate explanation vs B's completely unrelated content

**Weighted Performance Metrics**
- Average Weighted Score A: 0.495
- Average Weighted Score B: 0.630
- Average Margin (A-B): -0.135 (negative indicates B's advantage)

## Conclusion

This notebook demonstrated a complete **Amazon Nova Rubric-Based LLM-as-a-Judge** evaluation pipeline using Amazon SageMaker AI. The methodology provides:

**Scalable Evaluation**: Automated rubric-based comparison of multiple models with dynamically generated evaluation criteria

**Statistical Rigor**: Confidence intervals and significance testing with weighted scoring across multiple rubric dimensions (accuracy, completeness, clarity)

**Cost Efficiency**: Reduced need for human evaluation through automated rubric generation and criterion-specific scoring

**Actionable Insights**: Clear metrics for model selection with detailed justifications per evaluation criterion and weighted performance analysis

The 70% win rate favors the 7B model, but the 95% confidence interval [0.400, 0.909] includes the 50% threshold, so the result is suggestive rather than statistically conclusive. Given the relatively small sample size (11 evaluations), evaluating on a larger dataset would strengthen these conclusions.