# Homework 3: LLM-as-Judge for Recipe Bot Evaluation

In this notebook, we will cover:

- Creating train, validation, and test datasets from labeled traces
- Fetching experiment data to calculate TPR and TNR on those datasets
- Running evals on unlabeled traces using an LLM judge we create in a Braintrust playground
- Using judgy to calibrate our LLM judge predictions
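
To preview where the notebook is heading: once the judge's TPR and TNR have been measured on labeled data, its raw pass rate on unlabeled traces can be corrected for those error rates. The sketch below shows the standard point-estimate correction in plain Python. The function name and the numbers are hypothetical, and this is an illustration of the idea rather than judgy's actual implementation.

```python
def corrected_success_rate(observed_pass_rate: float, tpr: float, tnr: float) -> float:
    """Correct a judge's raw pass rate for its known error rates.

    The judge says PASS for true passes at rate TPR and for true failures
    at rate (1 - TNR), so:
        observed = theta * TPR + (1 - theta) * (1 - TNR)
    Solving for the true success rate theta gives:
        theta = (observed + TNR - 1) / (TPR + TNR - 1)
    """
    denominator = tpr + tnr - 1.0
    if denominator <= 0:
        raise ValueError("TPR + TNR must exceed 1 for the correction to be meaningful.")
    theta = (observed_pass_rate + tnr - 1.0) / denominator
    # Clip to the valid probability range
    return min(1.0, max(0.0, theta))


# Hypothetical numbers: the judge passes 80% of unlabeled traces
estimate = corrected_success_rate(observed_pass_rate=0.80, tpr=0.90, tnr=0.85)
print(f"Corrected success rate: {estimate:.3f}")
```

The closer TPR + TNR is to 1, the less information the judge carries and the more the correction amplifies noise, which is why measuring both rates on held-out labeled data matters.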

Let's start by importing the required libraries.


## Imports


In [None]:
import os
import sys

import braintrust as bt
import numpy as np
import openai as oai
import pandas as pd
import requests
from dotenv import load_dotenv
from judgy import estimate_success_rate
from pydantic import BaseModel, Field

# Make the project root importable before importing local modules
sys.path.append(os.path.abspath(".."))
from backend.utils import get_agent_response

# Configuration - make percentages configurable
TRAIN_SIZE = 0.15  # 15%
VALIDATION_SIZE = 0.40  # 40%
TEST_SIZE = 0.45  # 45%

# Verify percentages sum to 1
assert abs(TRAIN_SIZE + VALIDATION_SIZE + TEST_SIZE - 1.0) < 1e-10, "Percentages must sum to 100%"

print(f"Split configuration:")
print(f"Train: {TRAIN_SIZE:.1%}")
print(f"Validation: {VALIDATION_SIZE:.1%}")
print(f"Test: {TEST_SIZE:.1%}")

Split configuration:
Train: 15.0%
Validation: 40.0%
Test: 45.0%


## Setup


In [45]:
# Load the labeled traces data
data_path = "data/labeled_traces.csv"
df = pd.read_csv(data_path)

print(f"Loaded dataset with {len(df)} rows and {len(df.columns)} columns")
print(f"Columns: {list(df.columns)}")

# Check the distribution of labels
label_counts = df["label"].value_counts()
print(f"\nOriginal label distribution:")
print(label_counts)
print(f"\nOriginal label proportions:")
print(label_counts / len(df))


Loaded dataset with 101 rows and 11 columns
Columns: ['query', 'dietary_restriction', 'response', 'success', 'error', 'trace_id', 'query_id', 'label', 'reasoning', 'confidence', 'labeled']

Original label distribution:
label
PASS    75
FAIL    26
Name: count, dtype: int64

Original label proportions:
label
PASS    0.742574
FAIL    0.257426
Name: count, dtype: float64


## Step 1: Get & Label Your Data


For the sake of time, we'll assume this is already done.

A great way to accomplish this within Braintrust is to use the following workflow that we've demonstrated in previous lessons:

1. Use an LLM to help curate relevant traces for the failure mode you are targeting.

2. Generate traces by calling `get_agent_response`

3. In Braintrust:

- Move those logs to a dataset
- Create a "Human Review" score called "label" that can be set to "PASS" or "FAIL" (have it write to `metadata.label`)
- Review the traces
- Move the reviewed traces to a dataset

4. Download the dataset created at the end of step 3 as a .csv named `labeled_traces.csv`

There are other ways to do this if you want to try your own approach; the workflow above is simply a recommended one that works :)
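
Before splitting, it can help to sanity-check the downloaded file. The sketch below validates a labeled CSV using only the standard library. The required column set and the helper name `check_labeled_csv` are assumptions for illustration; adjust them to match your own export.

```python
import csv
import io
from collections import Counter

# Assumed minimum set of columns for this homework (hypothetical choice)
REQUIRED_COLUMNS = {"query", "response", "label"}
ALLOWED_LABELS = {"PASS", "FAIL"}


def check_labeled_csv(csv_file) -> Counter:
    """Validate an open CSV file object and return the label counts."""
    reader = csv.DictReader(csv_file)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")

    counts = Counter()
    # Record numbers count the header as record 1
    for record_number, row in enumerate(reader, start=2):
        label = (row["label"] or "").strip()
        if label not in ALLOWED_LABELS:
            raise ValueError(f"Record {record_number}: unexpected label {label!r}")
        counts[label] += 1
    return counts


# Tiny in-memory example standing in for labeled_traces.csv
sample = io.StringIO("query,response,label\nq1,r1,PASS\nq2,r2,FAIL\nq3,r3,PASS\n")
label_counts = check_labeled_csv(sample)
print(dict(label_counts))
```

For the real file, pass `open("data/labeled_traces.csv", newline="")` instead of the in-memory sample.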


## Step 2: Split Your Labeled Data


In [46]:
def load_and_analyze_data(csv_path):
    """Load the labeled traces data and return basic statistics."""
    df = pd.read_csv(csv_path)
    pass_df = df[df["label"] == "PASS"].reset_index(drop=True)
    fail_df = df[df["label"] == "FAIL"].reset_index(drop=True)

    print(f"Loaded dataset with {len(df)} rows and {len(df.columns)} columns")
    print(f"Total PASS: {len(pass_df)}, Total FAIL: {len(fail_df)}")

    original_ratio = len(pass_df) / len(fail_df) if len(fail_df) > 0 else float("inf")
    print(f"Original PASS:FAIL ratio = {original_ratio:.2f}:1")

    return df, pass_df, fail_df, original_ratio


In [47]:
def find_optimal_allocation(pass_df, fail_df, train_pct, val_pct, test_pct, original_ratio):
    """Find the optimal allocation that satisfies all constraints."""
    print(f"\nFinding optimal allocation for {train_pct:.1f}%/{val_pct:.1f}%/{test_pct:.1f}% split...")

    best_solution = None
    best_score = float("inf")

    # Try different balanced sizes for validation and test (must be even for perfect balance)
    max_search = min(50, len(fail_df) * 2)  # Reasonable search limit

    for val_size in range(2, max_search, 2):  # Even numbers only
        val_n_fail = val_size // 2
        val_n_pass = val_size // 2

        # Try test sizes from 2 to remaining capacity
        max_test_fail = len(fail_df) - val_n_fail
        for test_size in range(2, min(max_search, 2 * max_test_fail + 1), 2):  # Even numbers only
            test_n_fail = test_size // 2
            test_n_pass = test_size // 2

            # Check if we have enough FAIL examples
            if val_n_fail + test_n_fail > len(fail_df):
                continue

            # Calculate remaining for train
            train_n_fail = len(fail_df) - val_n_fail - test_n_fail

            # For train to be representative, use same ratio as original
            train_n_pass_ideal = int(train_n_fail * original_ratio)

            # Check if we have enough PASS examples
            total_pass_needed = val_n_pass + test_n_pass + train_n_pass_ideal
            if total_pass_needed > len(pass_df):
                # Adjust train_n_pass to what's available
                train_n_pass_actual = len(pass_df) - val_n_pass - test_n_pass
                if train_n_pass_actual < 0:
                    continue  # Not enough PASS examples
            else:
                train_n_pass_actual = train_n_pass_ideal

            train_size = train_n_pass_actual + train_n_fail

            # Calculate total and percentages
            total_size = val_size + test_size + train_size
            if total_size == 0:
                continue

            actual_train_pct = (train_size / total_size) * 100
            actual_val_pct = (val_size / total_size) * 100
            actual_test_pct = (test_size / total_size) * 100

            # Calculate how far we are from target percentages
            pct_diff = abs(actual_train_pct - train_pct) + abs(actual_val_pct - val_pct) + abs(actual_test_pct - test_pct)

            # Check train representativeness
            if train_n_fail > 0:
                train_ratio = train_n_pass_actual / train_n_fail
                ratio_preservation = abs(train_ratio - original_ratio) / original_ratio
            else:
                ratio_preservation = 1.0  # Bad if no FAIL in train

            # Combined score: prioritize ratio preservation and target percentages
            score = pct_diff + (ratio_preservation * 50)  # Weight ratio preservation heavily

            if score < best_score and train_n_fail > 0:  # Must have some FAIL in train
                best_score = score
                best_solution = {
                    "val_n_pass": val_n_pass,
                    "val_n_fail": val_n_fail,
                    "test_n_pass": test_n_pass,
                    "test_n_fail": test_n_fail,
                    "train_n_pass": train_n_pass_actual,
                    "train_n_fail": train_n_fail,
                    "train_pct": actual_train_pct,
                    "val_pct": actual_val_pct,
                    "test_pct": actual_test_pct,
                    "total_size": total_size,
                }

    if best_solution is None:
        print("⚠️  Could not find optimal solution, using fallback...")
        # Fallback: use a reasonable split
        val_n_fail = test_n_fail = min(6, len(fail_df) // 3)
        val_n_pass = test_n_pass = val_n_fail
        train_n_fail = len(fail_df) - val_n_fail - test_n_fail
        train_n_pass = min(len(pass_df) - val_n_pass - test_n_pass, int(train_n_fail * original_ratio))

        best_solution = {
            "val_n_pass": val_n_pass,
            "val_n_fail": val_n_fail,
            "test_n_pass": test_n_pass,
            "test_n_fail": test_n_fail,
            "train_n_pass": train_n_pass,
            "train_n_fail": train_n_fail,
        }
    else:
        print(f"✅ Found optimal solution:")
        print(
            f"   Target vs actual: {train_pct:.1f}%→{best_solution['train_pct']:.1f}%, {val_pct:.1f}%→{best_solution['val_pct']:.1f}%, {test_pct:.1f}%→{best_solution['test_pct']:.1f}%"
        )

    return best_solution


In [48]:
def allocate_examples(pass_df, fail_df, allocation, random_seed=None):
    """Allocate PASS and FAIL examples to train/validation/test splits."""
    # Extract allocation numbers
    val_n_pass = allocation["val_n_pass"]
    val_n_fail = allocation["val_n_fail"]
    test_n_pass = allocation["test_n_pass"]
    test_n_fail = allocation["test_n_fail"]
    train_n_pass = allocation["train_n_pass"]
    train_n_fail = allocation["train_n_fail"]

    # Generate random seed if not provided
    if random_seed is None:
        random_seed = np.random.randint(0, 2**31)
        print(f"Using random seed: {random_seed}")

    # Shuffle examples for random distribution
    pass_shuffled = pass_df.sample(n=len(pass_df), random_state=random_seed).reset_index(drop=True)
    fail_shuffled = fail_df.sample(n=len(fail_df), random_state=random_seed).reset_index(drop=True)

    # Allocate FAIL examples
    fail_idx = 0
    val_fail = fail_shuffled.iloc[fail_idx : fail_idx + val_n_fail] if val_n_fail > 0 else pd.DataFrame()
    fail_idx += val_n_fail
    test_fail = fail_shuffled.iloc[fail_idx : fail_idx + test_n_fail] if test_n_fail > 0 else pd.DataFrame()
    fail_idx += test_n_fail
    train_fail = fail_shuffled.iloc[fail_idx : fail_idx + train_n_fail] if train_n_fail > 0 else pd.DataFrame()

    # Allocate PASS examples
    pass_idx = 0
    val_pass = pass_shuffled.iloc[pass_idx : pass_idx + val_n_pass] if val_n_pass > 0 else pd.DataFrame()
    pass_idx += val_n_pass
    test_pass = pass_shuffled.iloc[pass_idx : pass_idx + test_n_pass] if test_n_pass > 0 else pd.DataFrame()
    pass_idx += test_n_pass
    train_pass = pass_shuffled.iloc[pass_idx : pass_idx + train_n_pass] if train_n_pass > 0 else pd.DataFrame()

    # Combine PASS and FAIL for each split
    validation_df = pd.concat([val_pass, val_fail], ignore_index=True) if len(val_pass) > 0 or len(val_fail) > 0 else pd.DataFrame()
    test_df = pd.concat([test_pass, test_fail], ignore_index=True) if len(test_pass) > 0 or len(test_fail) > 0 else pd.DataFrame()
    train_df = pd.concat([train_pass, train_fail], ignore_index=True) if len(train_pass) > 0 or len(train_fail) > 0 else pd.DataFrame()

    return train_df, validation_df, test_df


In [49]:
def verify_splits(train_df, validation_df, test_df, original_ratio):
    """Verify the quality of the splits."""
    print(f"\nFinal splits created:")
    print(f"Train: {len(train_df)} examples")
    print(f"Validation: {len(validation_df)} examples")
    print(f"Test: {len(test_df)} examples")

    total_used = len(train_df) + len(validation_df) + len(test_df)
    print(f"Total used: {total_used}")

    # Calculate final percentages
    if total_used > 0:
        train_pct = (len(train_df) / total_used) * 100
        val_pct = (len(validation_df) / total_used) * 100
        test_pct = (len(test_df) / total_used) * 100

        print(f"\nFinal percentages:")
        print(f"Train: {train_pct:.1f}%")
        print(f"Validation: {val_pct:.1f}%")
        print(f"Test: {test_pct:.1f}%")

    # Verify balance
    print(f"\nBalance verification:")

    if len(validation_df) > 0:
        val_pass = len(validation_df[validation_df["label"] == "PASS"])
        val_fail = len(validation_df[validation_df["label"] == "FAIL"])
        # Only show the check mark when the counts really are equal
        print(f"Validation: {val_pass} PASS, {val_fail} FAIL {'✓' if val_pass == val_fail else '✗ (not balanced)'}")

    if len(test_df) > 0:
        test_pass = len(test_df[test_df["label"] == "PASS"])
        test_fail = len(test_df[test_df["label"] == "FAIL"])
        print(f"Test: {test_pass} PASS, {test_fail} FAIL {'✓' if test_pass == test_fail else '✗ (not balanced)'}")

    if len(train_df) > 0:
        train_pass = len(train_df[train_df["label"] == "PASS"])
        train_fail = len(train_df[train_df["label"] == "FAIL"])
        train_ratio = train_pass / train_fail if train_fail > 0 else float("inf")

        print(f"Train: {train_pass} PASS + {train_fail} FAIL (ratio: {train_ratio:.2f}:1)")
        print(f"   Original ratio: {original_ratio:.2f}:1")
        print(f"   Ratio preservation: {(train_ratio / original_ratio) * 100:.1f}%")


In [50]:
def create_proportional_splits(csv_path, train_pct=15.0, val_pct=40.0, test_pct=45.0, output_path=None, random_seed=None):
    """
    Create proportional train/validation/test splits from labeled traces data using ALL the data.

    Args:
        csv_path (str): Path to the CSV file with labeled traces
        train_pct (float): Target percentage for training set (default: 15.0)
        val_pct (float): Target percentage for validation set (default: 40.0)
        test_pct (float): Target percentage for test set (default: 45.0)
        output_path (str): Path to save the combined dataset (optional)
        random_seed (int): Random seed for reproducible splits (default: None for random)

    Returns:
        tuple: (train_df, validation_df, test_df, combined_df)

    Requirements:
        - Uses ALL the data from the CSV file
        - Maintains the same PASS/FAIL proportion across all three datasets
        - Splits according to specified percentages
        - Randomized allocation for reproducible results when seed is provided
    """

    # Validate percentages
    if abs(train_pct + val_pct + test_pct - 100.0) > 1e-10:
        raise ValueError("Percentages must sum to 100%")

    print(f"Creating proportional splits with {train_pct:.1f}%/{val_pct:.1f}%/{test_pct:.1f}% distribution using ALL data")

    # Load data
    df = pd.read_csv(csv_path)
    print(f"Loaded dataset with {len(df)} rows and {len(df.columns)} columns")

    # Check the distribution of labels
    label_counts = df["label"].value_counts()
    print(f"\nOriginal label distribution:")
    print(label_counts)

    pass_df = df[df["label"] == "PASS"].reset_index(drop=True)
    fail_df = df[df["label"] == "FAIL"].reset_index(drop=True)

    total_pass = len(pass_df)
    total_fail = len(fail_df)
    original_ratio = total_pass / total_fail if total_fail > 0 else float("inf")

    print(f"Total PASS: {total_pass}, Total FAIL: {total_fail}")
    print(f"Original PASS:FAIL ratio = {original_ratio:.2f}:1")

    # Set random seed if provided
    if random_seed is not None:
        np.random.seed(random_seed)
        print(f"Using random seed: {random_seed}")
    else:
        random_seed = np.random.randint(0, 2**31)
        print(f"Using random seed: {random_seed}")

    # Shuffle the data
    pass_shuffled = pass_df.sample(frac=1, random_state=random_seed).reset_index(drop=True)
    fail_shuffled = fail_df.sample(frac=1, random_state=random_seed).reset_index(drop=True)

    # Calculate split sizes for PASS examples
    train_pass_size = int(np.round(total_pass * train_pct / 100))
    val_pass_size = int(np.round(total_pass * val_pct / 100))
    test_pass_size = total_pass - train_pass_size - val_pass_size  # Use remaining to ensure we use all data

    # Calculate split sizes for FAIL examples
    train_fail_size = int(np.round(total_fail * train_pct / 100))
    val_fail_size = int(np.round(total_fail * val_pct / 100))
    test_fail_size = total_fail - train_fail_size - val_fail_size  # Use remaining to ensure we use all data

    print(f"\nSplit allocation:")
    print(f"Train: {train_pass_size} PASS + {train_fail_size} FAIL = {train_pass_size + train_fail_size} total")
    print(f"Validation: {val_pass_size} PASS + {val_fail_size} FAIL = {val_pass_size + val_fail_size} total")
    print(f"Test: {test_pass_size} PASS + {test_fail_size} FAIL = {test_pass_size + test_fail_size} total")
    print(
        f"Total: {train_pass_size + val_pass_size + test_pass_size} PASS + {train_fail_size + val_fail_size + test_fail_size} FAIL = {len(df)} total"
    )

    # Split PASS examples
    train_pass = pass_shuffled.iloc[:train_pass_size].copy()
    val_pass = pass_shuffled.iloc[train_pass_size : train_pass_size + val_pass_size].copy()
    test_pass = pass_shuffled.iloc[train_pass_size + val_pass_size :].copy()

    # Split FAIL examples
    train_fail = fail_shuffled.iloc[:train_fail_size].copy()
    val_fail = fail_shuffled.iloc[train_fail_size : train_fail_size + val_fail_size].copy()
    test_fail = fail_shuffled.iloc[train_fail_size + val_fail_size :].copy()

    # Combine PASS and FAIL for each split
    train_df = pd.concat([train_pass, train_fail], ignore_index=True)
    validation_df = pd.concat([val_pass, val_fail], ignore_index=True)
    test_df = pd.concat([test_pass, test_fail], ignore_index=True)

    # Add dataset labels
    train_df = train_df.copy()
    validation_df = validation_df.copy()
    test_df = test_df.copy()

    train_df["dataset"] = "train"
    validation_df["dataset"] = "validation"
    test_df["dataset"] = "test"

    # Verify proportions
    print(f"\nFinal proportions verification:")
    for name, dataset in [("Train", train_df), ("Validation", validation_df), ("Test", test_df)]:
        pass_count = len(dataset[dataset["label"] == "PASS"])
        fail_count = len(dataset[dataset["label"] == "FAIL"])
        total_count = len(dataset)
        pass_prop = pass_count / total_count * 100 if total_count > 0 else 0
        fail_prop = fail_count / total_count * 100 if total_count > 0 else 0
        ratio = pass_count / fail_count if fail_count > 0 else float("inf")

        print(f"{name}: {pass_count} PASS ({pass_prop:.1f}%) + {fail_count} FAIL ({fail_prop:.1f}%) = {total_count} total")
        print(f"  {name} PASS:FAIL ratio = {ratio:.2f}:1")

    # Check that proportions are maintained
    train_ratio = len(train_df[train_df["label"] == "PASS"]) / len(train_df[train_df["label"] == "FAIL"])
    val_ratio = len(validation_df[validation_df["label"] == "PASS"]) / len(validation_df[validation_df["label"] == "FAIL"])
    test_ratio = len(test_df[test_df["label"] == "PASS"]) / len(test_df[test_df["label"] == "FAIL"])

    print(f"\nRatio consistency check:")
    print(f"Original ratio: {original_ratio:.3f}:1")
    print(f"Train ratio: {train_ratio:.3f}:1 (diff: {abs(train_ratio - original_ratio) / original_ratio * 100:.1f}%)")
    print(f"Validation ratio: {val_ratio:.3f}:1 (diff: {abs(val_ratio - original_ratio) / original_ratio * 100:.1f}%)")
    print(f"Test ratio: {test_ratio:.3f}:1 (diff: {abs(test_ratio - original_ratio) / original_ratio * 100:.1f}%)")

    # Combine datasets
    combined_df = pd.concat([train_df, validation_df, test_df], ignore_index=True)

    # Save if output path provided
    if output_path:
        combined_df.to_csv(output_path, index=False)
        print(f"\n💾 Saved combined dataset to: {output_path}")

    print(f"\n🎉 Successfully created proportional splits using ALL {len(df)} examples!")

    return train_df, validation_df, test_df, combined_df


In [None]:
# Example usage with the current data using ALL the data
train_df, validation_df, test_df, combined_df = create_proportional_splits(
    csv_path="data/labeled_traces.csv",
    train_pct=20.0,
    val_pct=40.0,
    test_pct=40.0,
    output_path="data/labeled_traces_dataset.csv",
    random_seed=42,  # Fixed seed for reproducible results
)
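
The split sizes produced by `create_proportional_splits` come from simple rounding: the train and validation sizes are rounded per class, and the test split receives whatever remains so that every example is used. The sketch below reproduces that arithmetic in plain Python for the 75 PASS / 26 FAIL dataset with a 20/40/40 target. The helper name `split_sizes` is hypothetical.

```python
def split_sizes(total: int, train_pct: float, val_pct: float) -> tuple[int, int, int]:
    """Round the train and validation sizes; give the remainder to test."""
    train = round(total * train_pct / 100)
    val = round(total * val_pct / 100)
    test = total - train - val  # remainder, so all examples are used
    return train, val, test


pass_sizes = split_sizes(75, train_pct=20.0, val_pct=40.0)
fail_sizes = split_sizes(26, train_pct=20.0, val_pct=40.0)

print(f"PASS (train, val, test): {pass_sizes}")  # (15, 30, 30)
print(f"FAIL (train, val, test): {fail_sizes}")  # (5, 10, 11)
```

Because the remainder goes to the test split, the test proportions can drift slightly from the target (here 41 of 101 examples instead of exactly 40%).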


In [None]:
# Example usage variations:
print("=" * 60)
print("USAGE EXAMPLES:")
print("=" * 60)

# 1. Reproducible splits (same examples each time)
print("\n1. Reproducible split with fixed seed:")
train1, val1, test1, _ = create_proportional_splits(
    csv_path="data/labeled_traces.csv",
    train_pct=20.0,
    val_pct=35.0,
    test_pct=45.0,
    random_seed=123,  # Fixed seed for reproducibility
)

# 2. A different fixed seed (different examples than above, but still reproducible)
print("\n" + "=" * 40)
print("2. Split with a different seed (456):")
train2, val2, test2, _ = create_proportional_splits(
    csv_path="data/labeled_traces.csv",
    train_pct=20.0,
    val_pct=35.0,
    test_pct=45.0,
    random_seed=456,  # Different seed
)

# 3. A third seed to show that the split changes with the seed
print("\n" + "=" * 40)
print("3. Split with another seed (789):")
train3, val3, test3, _ = create_proportional_splits(
    csv_path="data/labeled_traces.csv",
    train_pct=20.0,
    val_pct=35.0,
    test_pct=45.0,
    random_seed=789,  # Different random seed
)

# Show that the examples are different across random runs
print("\n" + "=" * 40)
print("RANDOMIZATION VERIFICATION:")
print("=" * 40)

if len(train2) > 0 and len(train3) > 0:
    # Check if train sets have different examples by comparing trace_ids
    train2_ids = set(train2["trace_id"].tolist()) if "trace_id" in train2.columns else set(train2.index.tolist())
    train3_ids = set(train3["trace_id"].tolist()) if "trace_id" in train3.columns else set(train3.index.tolist())
    overlap = len(train2_ids.intersection(train3_ids))
    total = len(train2_ids)

    print(f"Train set overlap between random runs: {overlap}/{total} ({overlap / total * 100:.1f}%)")
    print(f"✅ Randomization working!" if overlap < total else "⚠️  Same examples - check randomization")

print(f"\n💡 TIP: Use random_seed=42 (or any number) for reproducible splits")
print(f"💡 TIP: Use random_seed=None for different splits each time")


USAGE EXAMPLES:

1. Reproducible split with fixed seed:
Creating proportional splits with 20.0%/35.0%/45.0% distribution using ALL data
Loaded dataset with 101 rows and 11 columns

Original label distribution:
label
PASS    75
FAIL    26
Name: count, dtype: int64
Total PASS: 75, Total FAIL: 26
Original PASS:FAIL ratio = 2.88:1
Using random seed: 123

Split allocation:
Train: 15 PASS + 5 FAIL = 20 total
Validation: 26 PASS + 9 FAIL = 35 total
Test: 34 PASS + 12 FAIL = 46 total
Total: 75 PASS + 26 FAIL = 101 total

Final proportions verification:
Train: 15 PASS (75.0%) + 5 FAIL (25.0%) = 20 total
  Train PASS:FAIL ratio = 3.00:1
Validation: 26 PASS (74.3%) + 9 FAIL (25.7%) = 35 total
  Validation PASS:FAIL ratio = 2.89:1
Test: 34 PASS (73.9%) + 12 FAIL (26.1%) = 46 total
  Test PASS:FAIL ratio = 2.83:1

Ratio consistency check:
Original ratio: 2.885:1
Train ratio: 3.000:1 (diff: 4.0%)
Validation ratio: 2.889:1 (diff: 0.1%)
Test ratio: 2.833:1 (diff: 1.8%)

🎉 Successfully created proporti

In [52]:
# Verify balance - check label proportions in each split
def check_balance(dataset_name, dataset_df):
    if len(dataset_df) == 0:
        print(f"\n{dataset_name} - EMPTY DATASET")
        return

    label_counts = dataset_df["label"].value_counts()
    label_props = label_counts / len(dataset_df)

    pass_count = label_counts.get("PASS", 0)
    fail_count = label_counts.get("FAIL", 0)
    pass_prop = label_props.get("PASS", 0)
    fail_prop = label_props.get("FAIL", 0)

    print(f"\n{dataset_name} - Label distribution:")
    print(f"PASS: {pass_count} ({pass_prop:.1%})")
    print(f"FAIL: {fail_count} ({fail_prop:.1%})")
    print(f"Balance ratio (PASS:FAIL): {pass_count}:{fail_count}")

    if pass_count > 0 and fail_count > 0:
        ratio = max(pass_count, fail_count) / min(pass_count, fail_count)
        print(f"Balance score: {ratio:.2f} (1.0 = perfect balance)")
        if ratio == 1.0:
            print("🎯 PERFECTLY BALANCED!")
    elif pass_count > 0 and fail_count == 0:
        print("⚠️  Only PASS examples (imbalanced)")
    elif fail_count > 0 and pass_count == 0:
        print("⚠️  Only FAIL examples (imbalanced)")

    return label_props


print("🔍 Checking balance across splits:")
original_props = check_balance("Original", df)
train_props = check_balance("Train", train_df)
val_props = check_balance("Validation", validation_df)
test_props = check_balance("Test", test_df)


🔍 Checking balance across splits:

Original - Label distribution:
PASS: 75 (74.3%)
FAIL: 26 (25.7%)
Balance ratio (PASS:FAIL): 75:26
Balance score: 2.88 (1.0 = perfect balance)

Train - Label distribution:
PASS: 15 (75.0%)
FAIL: 5 (25.0%)
Balance ratio (PASS:FAIL): 15:5
Balance score: 3.00 (1.0 = perfect balance)

Validation - Label distribution:
PASS: 30 (75.0%)
FAIL: 10 (25.0%)
Balance ratio (PASS:FAIL): 30:10
Balance score: 3.00 (1.0 = perfect balance)

Test - Label distribution:
PASS: 30 (73.2%)
FAIL: 11 (26.8%)
Balance ratio (PASS:FAIL): 30:11
Balance score: 2.73 (1.0 = perfect balance)


In [6]:
# Add dataset column to each split
train_df = train_df.copy()
validation_df = validation_df.copy()
test_df = test_df.copy()

train_df["dataset"] = "train"
validation_df["dataset"] = "validation"
test_df["dataset"] = "test"

print("✅ Added 'dataset' column to each split:")
print(f"Train dataset column: {train_df['dataset'].iloc[0] if len(train_df) > 0 else 'N/A'}")
print(f"Validation dataset column: {validation_df['dataset'].iloc[0] if len(validation_df) > 0 else 'N/A'}")
print(f"Test dataset column: {test_df['dataset'].iloc[0] if len(test_df) > 0 else 'N/A'}")


✅ Added 'dataset' column to each split:
Train dataset column: train
Validation dataset column: validation
Test dataset column: test


In [7]:
# Combine the 3 datasets back together
combined_df = pd.concat([train_df, validation_df, test_df], ignore_index=True)

print(f"📊 Combined dataset shape: {combined_df.shape}")
print(f"📊 Original dataset shape: {df.shape}")

# Verify the counts by grouping on "dataset"
dataset_counts = combined_df.groupby("dataset").size()
print(f"\n📈 Dataset counts:")
print(dataset_counts)

# Show label distribution within each dataset
dataset_label_counts = combined_df.groupby(["dataset", "label"]).size().unstack(fill_value=0)
print(f"\n🏷️ Label distribution by dataset:")
print(dataset_label_counts)

# Calculate percentages
total_rows = len(combined_df)
print(f"\n📊 Dataset percentages:")
for dataset, count in dataset_counts.items():
    print(f"{dataset}: {count} ({count / total_rows:.1%})")

# Summarize PASS/FAIL counts for the validation and test splits
print(f"\n🎯 Balance Summary:")
val_pass = len(combined_df[(combined_df["dataset"] == "validation") & (combined_df["label"] == "PASS")])
val_fail = len(combined_df[(combined_df["dataset"] == "validation") & (combined_df["label"] == "FAIL")])
test_pass = len(combined_df[(combined_df["dataset"] == "test") & (combined_df["label"] == "PASS")])
test_fail = len(combined_df[(combined_df["dataset"] == "test") & (combined_df["label"] == "FAIL")])

# Only claim perfect balance when the counts are actually equal
for name, n_pass, n_fail in [("Validation", val_pass, val_fail), ("Test", test_pass, test_fail)]:
    status = "✓ Perfect Balance!" if n_pass == n_fail else "✗ Not 1:1 (class proportions preserved instead)"
    print(f"{name}: {n_pass} PASS, {n_fail} FAIL {status}")


📊 Combined dataset shape: (101, 12)
📊 Original dataset shape: (101, 11)

📈 Dataset counts:
dataset
test          41
train         20
validation    40
dtype: int64

🏷️ Label distribution by dataset:
label       FAIL  PASS
dataset               
test          11    30
train          5    15
validation    10    30

📊 Dataset percentages:
test: 41 (40.6%)
train: 20 (19.8%)
validation: 40 (39.6%)

🎯 Balance Summary:
Validation: 30 PASS, 10 FAIL ✗ Not 1:1 (class proportions preserved instead)
Test: 30 PASS, 11 FAIL ✗ Not 1:1 (class proportions preserved instead)


In [None]:
# Save the combined dataset
output_path = "data/labeled_traces_dataset.csv"
combined_df.to_csv(output_path, index=False)

print(f"💾 Saved combined dataset to: {output_path}")

# Verify the saved file
saved_df = pd.read_csv(output_path)
print(f"✅ Verification - saved file shape: {saved_df.shape}")
print(f"✅ Verification - dataset column unique values: {sorted(saved_df['dataset'].unique())}")


print(f"📁 Saved to: {output_path}")
print("")
print(f"📊 Total rows: {len(saved_df)}")

# Final summary with balance status
for dataset_name in ["train", "validation", "test"]:
    subset = saved_df[saved_df["dataset"] == dataset_name]
    pass_count = len(subset[subset["label"] == "PASS"])
    fail_count = len(subset[subset["label"] == "FAIL"])
    percentage = len(subset) / len(saved_df) * 100

    balance_status = "🎯 PERFECT BALANCE" if pass_count == fail_count else "❌ IMBALANCED"

    print(f"🔄 {dataset_name.title()}: {len(subset)} rows ({percentage:.1f}%) - {pass_count} PASS:{fail_count} FAIL - {balance_status}")


💾 Saved combined dataset to: data/labeled_traces_dataset.csv
✅ Verification - saved file shape: (101, 12)
✅ Verification - dataset column unique values: ['test', 'train', 'validation']
📁 Saved to: data/labeled_traces_dataset.csv

📊 Total rows: 101
🔄 Train: 20 rows (19.8%) - 15 PASS:5 FAIL - ❌ IMBALANCED
🔄 Validation: 40 rows (39.6%) - 30 PASS:10 FAIL - ❌ IMBALANCED
🔄 Test: 41 rows (40.6%) - 30 PASS:11 FAIL - ❌ IMBALANCED


## Step 3: Develop Your LLM-as-Judge Prompt


Watch the walk-through video for HW3, where we go over how to use Loop to align our LLM judge with human annotations.
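
For readers who want a concrete picture of what a judge prompt can look like, the sketch below assembles instructions, a few labeled examples from the train split, and the trace to be judged. The instruction wording, the `LabeledExample` type, and `build_judge_prompt` are all hypothetical, and the criterion (respecting the `dietary_restriction` field) is an assumption based on the dataset columns. It is not the prompt developed in the video.

```python
from dataclasses import dataclass


@dataclass
class LabeledExample:
    query: str
    dietary_restriction: str
    response: str
    label: str = ""  # "PASS" or "FAIL"; empty for the trace being judged


# Hypothetical instructions; replace with the criterion your judge targets.
JUDGE_INSTRUCTIONS = (
    "You are evaluating a recipe bot. Decide whether the response respects "
    "the user's dietary restriction. Answer with PASS or FAIL."
)


def build_judge_prompt(few_shot: list[LabeledExample], candidate: LabeledExample) -> str:
    """Assemble instructions, labeled examples, and the unlabeled candidate."""
    parts = [JUDGE_INSTRUCTIONS, ""]
    for number, example in enumerate(few_shot, start=1):
        parts += [
            f"Example {number}",
            f"Query: {example.query}",
            f"Dietary restriction: {example.dietary_restriction}",
            f"Response: {example.response}",
            f"Label: {example.label}",
            "",
        ]
    parts += [
        "Now evaluate this trace.",
        f"Query: {candidate.query}",
        f"Dietary restriction: {candidate.dietary_restriction}",
        f"Response: {candidate.response}",
        "Label:",
    ]
    return "\n".join(parts)


# Made-up examples standing in for rows of the train split
train_examples = [
    LabeledExample("Quick vegan dinner?", "vegan", "Try a chickpea curry with coconut milk.", "PASS"),
    LabeledExample("Vegan dessert idea?", "vegan", "Make a cheesecake with cream cheese.", "FAIL"),
]
candidate = LabeledExample("Gluten-free breakfast?", "gluten-free", "Oatmeal made with certified gluten-free oats.")
prompt = build_judge_prompt(train_examples, candidate)
print(prompt)
```

Drawing few-shot examples only from the train split keeps the validation and test splits untouched for measuring the judge.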


## Step 4: Refine & Validate Your Judge
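
This step measures agreement between the judge and the human labels with two rates: TPR, the fraction of actual PASS traces the judge also marks PASS, and TNR, the fraction of actual FAIL traces the judge also marks FAIL. The sketch below computes both from two plain lists. The helper name `tpr_tnr` and the toy labels are hypothetical.

```python
def tpr_tnr(actual: list[str], predicted: list[str]) -> tuple[float, float]:
    """Return (TPR, TNR), treating PASS as the positive class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    pairs = list(zip(actual, predicted))
    total_pass = sum(a == "PASS" for a, _ in pairs)
    total_fail = sum(a == "FAIL" for a, _ in pairs)
    correct_pass = sum(a == "PASS" and p == "PASS" for a, p in pairs)
    correct_fail = sum(a == "FAIL" and p == "FAIL" for a, p in pairs)

    # Report 0.0 when a class is absent instead of dividing by zero
    tpr = correct_pass / total_pass if total_pass else 0.0
    tnr = correct_fail / total_fail if total_fail else 0.0
    return tpr, tnr


# Toy data: 4 actual PASS (3 caught), 2 actual FAIL (1 caught)
actual = ["PASS", "PASS", "PASS", "PASS", "FAIL", "FAIL"]
predicted = ["PASS", "PASS", "PASS", "FAIL", "FAIL", "PASS"]
tpr, tnr = tpr_tnr(actual, predicted)
print(f"TPR = {tpr:.2f}, TNR = {tnr:.2f}")  # TPR = 0.75, TNR = 0.50
```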


In [None]:
def get_experiment_results(experiment_id: str):
    """Fetch all scored rows for an experiment via BTQL, following the pagination cursor."""
    results = []
    cursor = None
    while True:
        response = requests.post(
            "https://api.braintrust.dev/btql",
            json={
                "query": f"""select: input, output, metadata
from:experiment ('{experiment_id}')
filter: scores is not null"""
                + (f" | cursor: '{cursor}'" if cursor else ""),
                "use_brainstore": True,
                "brainstore_realtime": True,  # Include the latest realtime data, but a bit slower.
            },
            headers={
                # Reads the API key from the BRAINTRUST_API_KEY environment variable
                "Authorization": "Bearer " + os.environ["BRAINTRUST_API_KEY"],
            },
        )
        response.raise_for_status()
        response_json = response.json()
        data = response_json.get("data", [])
        results.extend(data)
        cursor = response_json.get("cursor")

        # Stop once a page is empty or no further cursor is returned
        if not data or not cursor:
            break

    return results


In [36]:
experiment_results = get_experiment_results("39ea08f0-bd26-4b59-9190-76f78ac2731b")

print(len(experiment_results))
experiment_results[0]

101


{'input': {'expected': "Certainly! Here's a delicious and portable Paleo breakfast recipe perfect for your commute: **Sweet Potato and Egg Muffin Cups**. These are easy to prepare ahead of time, packed with nutrients, and convenient to eat on the go.\n\n### Sweet Potato and Egg Muffin Cups (Serves 2)\n\n#### Ingredients:\n- 2 medium sweet potatoes\n- 4 large eggs\n- 1/2 teaspoon salt\n- 1/4 teaspoon black pepper\n- Optional: chopped herbs (like parsley or chives), chopped veggies (like bell peppers or spinach), or cooked bacon bits for extra flavor\n\n#### Instructions:\n\n1. **Preheat the oven:** Set your oven to 375°F (190°C). Line a muffin tin with silicone liners or lightly grease it with coconut oil to prevent sticking.\n\n2. **Prepare sweet potatoes:** Wash and peel the sweet potatoes. Use a grater or a food processor with a grating attachment to shred the sweet potatoes into small, fine pieces.\n\n3. **Squeeze out moisture:** Place the grated sweet potatoes in a clean kitchen to

In [None]:
print(f"Number of experiment results: {len(experiment_results)}")
print("Sample result structure:")
if experiment_results:
    sample = experiment_results[0]
    print(f"Keys: {list(sample.keys())}")
    if "metadata" in sample:
        print(f"Metadata keys: {list(sample['metadata'].keys()) if sample['metadata'] is not None else 'None'}")
    if "output" in sample:
        print(f"Output keys: {list(sample['output'].keys()) if sample['output'] is not None else 'None'}")
    print(f"Sample result: {sample}")


Number of experiment results: 101
Sample result structure:
Keys: ['input', 'metadata', 'output']
Metadata keys: ['choice', 'rationale']
Output keys: ['score']
Sample result: {'input': {'expected': "Certainly! Here's a delicious and portable Paleo breakfast recipe perfect for your commute: **Sweet Potato and Egg Muffin Cups**. These are easy to prepare ahead of time, packed with nutrients, and convenient to eat on the go.\n\n### Sweet Potato and Egg Muffin Cups (Serves 2)\n\n#### Ingredients:\n- 2 medium sweet potatoes\n- 4 large eggs\n- 1/2 teaspoon salt\n- 1/4 teaspoon black pepper\n- Optional: chopped herbs (like parsley or chives), chopped veggies (like bell peppers or spinach), or cooked bacon bits for extra flavor\n\n#### Instructions:\n\n1. **Preheat the oven:** Set your oven to 375°F (190°C). Line a muffin tin with silicone liners or lightly grease it with coconut oil to prevent sticking.\n\n2. **Prepare sweet potatoes:** Wash and peel the sweet potatoes. Use a grater or a food 

In [None]:
def calculate_tpr_tnr(experiment_results, threshold=0.5):
    """
    Calculate TPR and TNR for each dataset according to the formula:
    TPR = p/P (true positive rate) = correctly predicted PASS / total actual PASS
    TNR = f/F (true negative rate) = correctly predicted FAIL / total actual FAIL

    Where PASS = 1, FAIL = 0
    """

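    # Extract data from experiment results, skipping rows the judge did not score
    data = []
    for result in experiment_results:
        actual_label = result["input"]["metadata"]["label"]
        predicted_score = (result.get("output") or {}).get("score")
        dataset = result["input"]["metadata"]["dataset"]
        if predicted_score is None:
            continue  # no judge score, so this row is not a valid prediction

        # Binarize with `threshold` so the == 1 / == 0 comparisons below hold for any score in [0, 1]
        predicted = int(predicted_score >= threshold)
        data.append({"dataset": dataset, "target": 1 if actual_label == "PASS" else 0, "predicted": predicted})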

    df = pd.DataFrame(data)
    print(f"Total valid predictions: {len(df)}")
    print(f"Datasets found: {sorted(df['dataset'].unique())}")
    print()

    # Calculate TPR and TNR for each dataset
    results = {}

    for dataset in sorted(df["dataset"].unique()):
        dataset_df = df[df["dataset"] == dataset]

        # Total actual PASS (P) and FAIL (F)
        P = len(dataset_df[dataset_df["target"] == 1])  # Total actual PASS
        F = len(dataset_df[dataset_df["target"] == 0])  # Total actual FAIL

        # Correctly predicted PASS (p) and FAIL (f)
        p = len(dataset_df[(dataset_df["target"] == 1) & (dataset_df["predicted"] == 1)])
        f = len(dataset_df[(dataset_df["target"] == 0) & (dataset_df["predicted"] == 0)])

        # Calculate TPR and TNR
        TPR = p / P if P > 0 else 0
        TNR = f / F if F > 0 else 0

        results[dataset] = {
            "P": P,  # Total actual PASS
            "F": F,  # Total actual FAIL
            "p": p,  # Correctly predicted PASS
            "f": f,  # Correctly predicted FAIL
            "TPR": TPR,
            "TNR": TNR,
        }

        print(f"Dataset: {dataset}")
        print(f"  Total actual PASS (P): {P}")
        print(f"  Total actual FAIL (F): {F}")
        print(f"  Correctly predicted PASS (p): {p}")
        print(f"  Correctly predicted FAIL (f): {f}")
        print(f"  TPR = p/P = {p}/{P} = {TPR:.3f}")
        print(f"  TNR = f/F = {f}/{F} = {TNR:.3f}")
        print()

    return results, df


# Calculate metrics
results, experiment_df = calculate_tpr_tnr(experiment_results)


Total valid predictions: 101
Datasets found: ['test', 'train', 'validation']

Dataset: test
  Total actual PASS (P): 30
  Total actual FAIL (F): 11
  Correctly predicted PASS (p): 28
  Correctly predicted FAIL (f): 8
  TPR = p/P = 28/30 = 0.933
  TNR = f/F = 8/11 = 0.727

Dataset: train
  Total actual PASS (P): 15
  Total actual FAIL (F): 5
  Correctly predicted PASS (p): 12
  Correctly predicted FAIL (f): 3
  TPR = p/P = 12/15 = 0.800
  TNR = f/F = 3/5 = 0.600

Dataset: validation
  Total actual PASS (P): 30
  Total actual FAIL (F): 10
  Correctly predicted PASS (p): 26
  Correctly predicted FAIL (f): 10
  TPR = p/P = 26/30 = 0.867
  TNR = f/F = 10/10 = 1.000



In [16]:
# Create a summary table of the results
if "results" in locals() and results:
    print("=" * 60)
    print("TPR AND TNR SUMMARY TABLE")
    print("=" * 60)

    # Create a DataFrame for better formatting
    summary_data = []
    for dataset, metrics in results.items():
        summary_data.append(
            {
                "Dataset": dataset,
                "Total PASS (P)": metrics["P"],
                "Total FAIL (F)": metrics["F"],
                "Correct PASS (p)": metrics["p"],
                "Correct FAIL (f)": metrics["f"],
                "TPR (p/P)": f"{metrics['TPR']:.3f}",
                "TNR (f/F)": f"{metrics['TNR']:.3f}",
            }
        )

    summary_df = pd.DataFrame(summary_data)
    print(summary_df.to_string(index=False))

    print("\n" + "=" * 60)
    print("INTERPRETATION:")
    print("=" * 60)
    print("TPR (True Positive Rate) = Sensitivity = P(predicted PASS | actual PASS)")
    print("TNR (True Negative Rate) = Specificity = P(predicted FAIL | actual FAIL)")
    print("Higher values (closer to 1.0) indicate better performance")

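    # Check whether PASS and FAIL counts are equal in the validation and test sets.
    # The splits here keep roughly the original 3:1 PASS/FAIL ratio, so False is the expected outcome.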
    if "validation" in results and "test" in results:
        val_balanced = results["validation"]["P"] == results["validation"]["F"]
        test_balanced = results["test"]["P"] == results["test"]["F"]
        print(f"\nValidation set balanced: {val_balanced} (P={results['validation']['P']}, F={results['validation']['F']})")
        print(f"Test set balanced: {test_balanced} (P={results['test']['P']}, F={results['test']['F']})")

        if val_balanced and test_balanced:
            print("✅ Perfect balance confirmed in validation and test sets!")
        else:
            print("⚠️  Balance issue detected in validation/test sets")
else:
    print("No results to display. Run the calculation cell first.")


TPR AND TNR SUMMARY TABLE
   Dataset  Total PASS (P)  Total FAIL (F)  Correct PASS (p)  Correct FAIL (f) TPR (p/P) TNR (f/F)
      test              30              11                28                 8     0.933     0.727
     train              15               5                12                 3     0.800     0.600
validation              30              10                26                10     0.867     1.000

INTERPRETATION:
TPR (True Positive Rate) = Sensitivity = P(predicted PASS | actual PASS)
TNR (True Negative Rate) = Specificity = P(predicted FAIL | actual FAIL)
Higher values (closer to 1.0) indicate better performance

Validation set balanced: False (P=30, F=10)
Test set balanced: False (P=30, F=11)
⚠️  Balance issue detected in validation/test sets
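
With only 10 or 11 FAIL examples per split, a single flipped prediction moves TNR by roughly nine or ten percentage points, so the point values in the table carry a lot of uncertainty. One way to make that visible is a Wilson score interval for each rate. The sketch below is an illustration using the test-set counts printed above; the `wilson_interval` helper is hypothetical and is not part of judgy or Braintrust.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts from the test-set row of the summary table above
tpr_low, tpr_high = wilson_interval(28, 30)  # TPR = 28/30
tnr_low, tnr_high = wilson_interval(8, 11)   # TNR = 8/11

print(f"Test TPR 28/30: [{tpr_low:.3f}, {tpr_high:.3f}]")
print(f"Test TNR  8/11: [{tnr_low:.3f}, {tnr_high:.3f}]")
```

The TNR interval is much wider than the TPR interval because it rests on far fewer examples, which is worth keeping in mind when comparing the 0.727 test TNR with the 1.000 validation TNR.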


In [17]:
# Additional analysis: Confusion matrices and other metrics
if "experiment_df" in locals() and len(experiment_df) > 0:
    print("=" * 60)
    print("DETAILED ANALYSIS BY DATASET")
    print("=" * 60)

    for dataset in sorted(experiment_df["dataset"].unique()):
        dataset_df = experiment_df[experiment_df["dataset"] == dataset]

        print(f"\n{dataset.upper()} DATASET:")
        print("-" * 40)

        # Confusion matrix components
        TP = len(dataset_df[(dataset_df["target"] == 1) & (dataset_df["predicted"] == 1)])  # True Positive
        TN = len(dataset_df[(dataset_df["target"] == 0) & (dataset_df["predicted"] == 0)])  # True Negative
        FP = len(dataset_df[(dataset_df["target"] == 0) & (dataset_df["predicted"] == 1)])  # False Positive
        FN = len(dataset_df[(dataset_df["target"] == 1) & (dataset_df["predicted"] == 0)])  # False Negative

        total = TP + TN + FP + FN

        print("Confusion Matrix:")
        print("                 Predicted")
        print("                FAIL  PASS")
        print(f"Actual FAIL      {TN:2d}    {FP:2d}")
        print(f"Actual PASS      {FN:2d}    {TP:2d}")

        # Calculate additional metrics
        accuracy = (TP + TN) / total if total > 0 else 0
        precision = TP / (TP + FP) if (TP + FP) > 0 else 0
        recall = TP / (TP + FN) if (TP + FN) > 0 else 0  # Same as TPR
        f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        print("\nMetrics:")
        print(f"  TPR (Sensitivity/Recall): {recall:.3f}")
        print(f"  TNR (Specificity):        {TN / (TN + FP) if (TN + FP) > 0 else 0:.3f}")
        print(f"  Precision:                {precision:.3f}")
        print(f"  Accuracy:                 {accuracy:.3f}")
        print(f"  F1-Score:                 {f1_score:.3f}")

        # Score distribution
        print("\nScore Distribution:")
        print(f"  Mean score: {dataset_df['predicted'].mean():.3f}")
        print(f"  Score range: [{dataset_df['predicted'].min():.3f}, {dataset_df['predicted'].max():.3f}]")

        # Show a few examples
        print("\nSample predictions:")
        for _, row in dataset_df.head(3).iterrows():
            pred_label = "PASS" if row["predicted"] == 1 else "FAIL"
            print(f"  Actual: {row['target']}, Score: {row['predicted']:.3f}, Predicted: {pred_label}")

        print("\n" + "=" * 40)
else:
    print("No experiment data available for detailed analysis.")


DETAILED ANALYSIS BY DATASET

TEST DATASET:
----------------------------------------
Confusion Matrix:
                 Predicted
                FAIL  PASS
Actual FAIL       8     3
Actual PASS       2    28

Metrics:
  TPR (Sensitivity/Recall): 0.933
  TNR (Specificity):        0.727
  Precision:                0.903
  Accuracy:                 0.878
  F1-Score:                 0.918

Score Distribution:
  Mean score: 0.756
  Score range: [0.000, 1.000]

Sample predictions:
  Actual: 1, Score: 1.000, Predicted: PASS
  Actual: 1, Score: 1.000, Predicted: PASS
  Actual: 1, Score: 1.000, Predicted: PASS


TRAIN DATASET:
----------------------------------------
Confusion Matrix:
                 Predicted
                FAIL  PASS
Actual FAIL       3     2
Actual PASS       3    12

Metrics:
  TPR (Sensitivity/Recall): 0.800
  TNR (Specificity):        0.600
  Precision:                0.857
  Accuracy:                 0.750
  F1-Score:                 0.828

Score Distribution:
  Mean 

## Step 5: Measure on "New" Traces


In [None]:
# Load the raw traces data
raw_traces_df = pd.read_csv("data/raw_traces.csv")
print(f"Loaded raw traces data: {raw_traces_df.shape[0]} rows, {raw_traces_df.shape[1]} columns")
print(f"Columns: {list(raw_traces_df.columns)}")


Loaded raw traces data: 2400 rows, 7 columns
Columns: ['query', 'dietary_restriction', 'response', 'success', 'error', 'trace_id', 'query_id']


In [19]:
# Take a random sample of 500 records
sampled_df = raw_traces_df.sample(n=500, random_state=42)
print(f"Sampled {sampled_df.shape[0]} records from the raw traces data")


Sampled 500 records from the raw traces data


In [24]:
sampled_df["query"].head(5).tolist()

['I eat pretty clean most of the time',
 "Something that feels indulgent but isn't terrible for me",
 "I'm diabetic but I want to make a fruit cake for Christmas. My grandmother's recipe has tons of sugar and dried fruit but it's tradition and I can't just skip it this year. Is there any way to modify it so I can still participate in the family tradition without spiking my blood sugar?",
 "I'm mostly vegetarian but I eat fish sometimes",
 'I eat pretty clean most of the time']

In [26]:
await bt.EvalAsync(
    name="recipe-bot",
    experiment_name="fm_dietary_pref_adherence_unlabeled_traces",
    data=[{"input": [{"role": "user", "content": query}]} for query in sampled_df["query"]],  # type: ignore
    task=get_agent_response,
    scores=[bt.init_function(project_name="recipe-bot", slug="dietary-preference-adherence-5fce")],
)

Experiment fm_dietary_pref_adherence_unlabeled_traces is running at https://www.braintrust.dev/app/aie-course-2025/p/recipe-bot/experiments/fm_dietary_pref_adherence_unlabeled_traces
recipe-bot [experiment_name=fm_dietary_pref_adherence_unlabeled_traces] (data): 500it [00:00, 27598.83it/s]
recipe-bot [experiment_name=fm_dietary_pref_adherence_unlabeled_traces] (tasks): 100%|██████████| 500/500 [12:57<00:00,  1.55s/it]  
Retrying request after error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Sleeping for 0.5 seconds



fm_dietary_pref_adherence_unlabeled_traces compared to add_queries_it_20250728_2025:
75.40% 'Dietary Preference Adherence' score

1754427581.11s start
1754428177.42s end
214.36s duration
8.40s llm_duration
1960.42tok prompt_tokens
410.98tok completion_tokens
2371.40tok total_tokens
0.00$ estimated_cost
1912.32tok prompt_cached_tokens
0tok prompt_cache_creation_tokens

See results for fm_dietary_pref_adherence_unlabeled_traces at https://www.braintrust.dev/app/aie-course-2025/p/recipe-bot/experiments/fm_dietary_pref_adherence_unlabeled_traces


EvalResultWithSummary(summary="...", results=[...])

## Step 6: Report Results with judgy


In [31]:
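# Note: this collects labels and judge scores from all 101 labeled rows (train, validation, and test),
# not only the held-out test split. Filter on result["input"]["metadata"]["dataset"] == "test" to restrict it.
test_labels, test_preds = [], []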
for result in experiment_results:
    test_labels.append(int(result["input"]["metadata"]["label"] == "PASS"))
    test_preds.append(result["output"]["score"])

print(len(test_labels), len(test_preds))

print(test_labels[:5])
print(test_preds[:5])

101 101
[1, 1, 1, 1, 1]
[1, 0, 1, 1, 1]


In [37]:
unlabled_experiment_results = get_experiment_results("14be32da-4165-4f38-8a3d-0673ab2a7430")

print(len(unlabled_experiment_results))
unlabled_experiment_results[0]

500


{'input': {'expected': None,
  'input': [{'content': "Gluten-light recipe - I'm not celiac just sensitive",
    'role': 'user'}],
  'metadata': {},
  'output': [{'content': '# Role and Objective\nYou are a friendly, health-conscious culinary assistant. Your primary goal is to provide **easy-to-follow, high-protein, macro-friendly, and meal-prep-friendly recipes**. **You should ALWAYS attempt to provide a recipe recommendation, even if the user\'s request is vague, health-related, or nutrition-focused.**\nAvoid recommending recipes with harmful chemicals (e.g., red dye #5) or known to contain microplastics.\n\n# Instructions\n\n## Core Recipe Principles\n1.  **Protein-Focused**: Prioritize protein-rich, satisfying meals aiming for at least 25-30g protein per serving. Utilize accessible proteins such as chicken, beef, turkey, pork, fish, tofu, eggs, or legumes.\n2.  **Convenience & Meal Prep**: Favor recipes that are quick (prep under 30 minutes), can be batch-cooked, and store well for 

In [38]:
unlabeled_preds = [result["output"]["score"] for result in unlabled_experiment_results]

print(len(unlabeled_preds))
print(unlabeled_preds[:5])


500
[0, 1, 1, 0, 1]


In [None]:
# Estimate true pass rate with 95% confidence interval
theta_hat, lower_bound, upper_bound = estimate_success_rate(
    test_labels=test_labels,
    test_preds=test_preds,
    unlabeled_preds=unlabeled_preds,
)

print(f"Estimated true pass rate: {theta_hat:.3f}")
print(f"95% Confidence interval: [{lower_bound:.3f}, {upper_bound:.3f}]")

Estimated true pass rate: 0.817
95% Confidence interval: [0.718, 0.924]
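
The point estimate above is consistent with a standard misclassification correction of the form θ = (p_obs + TNR − 1) / (TPR + TNR − 1), where p_obs is the judge's raw pass rate on the unlabeled traces. The sketch below assumes that form and plugs in numbers already printed in this notebook: the confusion counts summed over all 101 labeled rows and the 75.40% judge score on the 500 unlabeled traces. It is a hand calculation for intuition, not judgy's implementation, and `corrected_rate` is a hypothetical helper; it does not attempt to reproduce the confidence interval.

```python
# Counts summed over the test, train, and validation outputs above (101 labeled rows)
true_pass, total_pass = 28 + 12 + 26, 30 + 15 + 30  # 66 of 75 actual PASS judged PASS
true_fail, total_fail = 8 + 3 + 10, 11 + 5 + 10     # 21 of 26 actual FAIL judged FAIL

# Raw judge pass rate on the 500 unlabeled traces (75.40% in the experiment summary)
observed_pass_rate = 0.754


def corrected_rate(p_obs: float, tpr: float, tnr: float) -> float:
    """Correct an observed pass rate for judge errors, clipped to [0, 1]."""
    denom = tpr + tnr - 1
    if denom <= 0:
        raise ValueError("TPR + TNR must exceed 1 for the correction to be meaningful")
    return min(1.0, max(0.0, (p_obs + tnr - 1) / denom))


tpr = true_pass / total_pass
tnr = true_fail / total_fail
theta = corrected_rate(observed_pass_rate, tpr, tnr)

print(f"TPR={tpr:.3f}, TNR={tnr:.3f}, corrected pass rate={theta:.3f}")
# TPR=0.880, TNR=0.808, corrected pass rate=0.817
```

The corrected rate is higher than the raw 75.4% because, on the labeled data, the judge fails good responses (12% of actual PASS) more often in absolute terms than it passes bad ones, so the raw score understates the true pass rate.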
