# Evaluating LLMs

In this notebook, you'll apply the evaluation concepts from the lesson by dissecting what benchmarks actually measure, detecting systematic biases in LLM-as-judge evaluation, and designing a multi-layer evaluation strategy that integrates the full module.

**What you'll do:**
- Examine actual benchmark questions, classify what they test (recognition vs reasoning vs knowledge recall), identify what they do NOT test, and spot contamination signals in score distributions
- Use an LLM as a judge to evaluate pairs of model responses, then swap presentation order and test verbose vs concise pairs to measure position bias and verbosity bias empirically
- Design a complete evaluation strategy for a specific use case, integrating benchmarks, human evaluation, LLM judges, and red teaming from the full module

**For each exercise, PREDICT the output before running the cell.** Wrong predictions are more valuable than correct ones — they reveal gaps in your mental model.

In [None]:
# Setup — self-contained for Google Colab
!pip install -q openai

import os
import json
import textwrap
from openai import OpenAI
import matplotlib.pyplot as plt
import numpy as np

# --- API Key Setup ---
# Option 1: Set your API key as an environment variable (recommended)
#   In Colab: go to the key icon in the left sidebar, add OPENAI_API_KEY
# Option 2: Paste it directly (less secure, don't commit this)
#   os.environ["OPENAI_API_KEY"] = "sk-..."

# You can also use any OpenAI-compatible API (e.g., local Ollama, Together AI)
# by changing the base_url:
#   client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

client = OpenAI()

# Use a small, cheap model for the exercises
MODEL = "gpt-4o-mini"

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

# Reproducible results where possible
np.random.seed(42)


def call_llm(system_prompt: str, user_prompt: str, temperature: float = 0.3) -> str:
    """Call the LLM with a system prompt and user prompt. Returns the response text."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=temperature,
        max_tokens=500,
    )
    return response.choices[0].message.content


def print_wrapped(text: str, width: int = 80, prefix: str = ""):
    """Print text with word wrapping for readability."""
    for line in text.split("\n"):
        wrapped = textwrap.fill(line, width=width, initial_indent=prefix, subsequent_indent=prefix)
        print(wrapped)


# Quick test to verify the API is working
test = call_llm("You are a helpful assistant.", "Say 'API connection successful' and nothing else.")
print(test)
print(f"\nUsing model: {MODEL}")
print("Setup complete.")

---

## Exercise 1: Benchmark Autopsy (Guided)

The lesson introduced the **evaluation stack** — the layers between actual model capability and the number on the leaderboard. Every benchmark makes design choices (question format, scoring rubric, coverage) that determine what it *actually* measures, which may differ significantly from what its name implies.

In this exercise, you'll perform an autopsy on a specific benchmark: **MMLU** (Massive Multitask Language Understanding). You'll examine actual questions, classify what they test, identify what they do NOT test, and look for contamination signals in a simulated score distribution.

**Before running each cell, predict:** What does this benchmark question actually test — recognition, reasoning, knowledge recall, or formatting compliance? What capability does the name "understanding" suggest that the question does NOT measure?

In [None]:
# --- Part A: Examine actual benchmark questions ---
# These are representative examples of MMLU-style questions across categories.
# For each, we'll classify what cognitive skill it ACTUALLY tests.

benchmark_questions = [
    {
        "category": "Abstract Algebra",
        "question": "Find the degree of the extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.",
        "choices": ["0", "4", "2", "6"],
        "correct": "B",
        "analysis": {
            "what_it_tests": "Recognition of a known result. A student who has seen this type of "
                             "problem can recognize the pattern (sqrt(18) = 3*sqrt(2), so the "
                             "extension is generated by sqrt(2) and sqrt(3), giving degree 4). "
                             "The multiple-choice format means the model only needs to SELECT the "
                             "right answer, not DERIVE it.",
            "what_name_suggests": "Understanding of abstract algebra — the ability to reason about "
                                  "field extensions, prove properties, and work through novel problems.",
            "proxy_gap": "A model could get this right by pattern-matching on similar problems in "
                         "its training data without understanding WHY sqrt(18) simplifies. The "
                         "multiple-choice format eliminates the need to show work or explain reasoning.",
            "skill_tested": "Knowledge recall + recognition",
        },
    },
    {
        "category": "US History",
        "question": "Which of the following was a major cause of the Mexican-American War?",
        "choices": [
            "The U.S. annexation of Texas",
            "The discovery of gold in California",
            "The abolition of slavery in Mexico",
            "The construction of the transcontinental railroad",
        ],
        "correct": "A",
        "analysis": {
            "what_it_tests": "Factual recall. This is a straightforward history fact. The model "
                             "needs to recognize which option matches the well-documented cause. "
                             "No reasoning or analysis required — just retrieval of a fact.",
            "what_name_suggests": "Understanding of US history — the ability to analyze causes, "
                                  "evaluate perspectives, and draw connections between events.",
            "proxy_gap": "This question is almost certainly in the training data (it is a standard "
                         "textbook question). A model could answer it through memorization alone. "
                         "The same model might fail completely when asked to ANALYZE the "
                         "geopolitical dynamics that led to the war — a question that actually "
                         "requires understanding.",
            "skill_tested": "Knowledge recall",
        },
    },
    {
        "category": "Formal Logic",
        "question": "Select the best translation into predicate logic: "
                    "'Some monitor combats are not televised.'",
        "choices": [
            "(∃x)(Mx • ~Tx)",
            "(∃x)(Mx ⊃ ~Tx)",
            "(∀x)(Mx • ~Tx)",
            "~(∀x)(Mx ⊃ Tx)",
        ],
        "correct": "A",
        "analysis": {
            "what_it_tests": "Pattern matching on formal logic notation. The model must recognize "
                             "the standard translation pattern: 'Some X are Y' maps to (∃x)(Xx • Yx). "
                             "This IS closer to reasoning, but it is still a formulaic translation "
                             "with a known mapping.",
            "what_name_suggests": "Understanding of formal logic — the ability to construct proofs, "
                                  "evaluate arguments, and reason about validity.",
            "proxy_gap": "Translation into predicate logic is a mechanical skill. A model that can "
                         "translate well might still fail at constructing a proof or identifying a "
                         "logical fallacy in natural language. The format (selecting from options) "
                         "further reduces the task to recognition.",
            "skill_tested": "Recognition + formulaic reasoning",
        },
    },
    {
        "category": "Clinical Knowledge",
        "question": "A patient presents with fatigue, weight gain, and cold intolerance. "
                    "Which laboratory test is most appropriate?",
        "choices": [
            "Complete blood count",
            "Thyroid function tests",
            "Liver function tests",
            "Blood glucose",
        ],
        "correct": "B",
        "analysis": {
            "what_it_tests": "Symptom-to-diagnosis pattern matching. The three symptoms (fatigue, "
                             "weight gain, cold intolerance) are the classic textbook presentation "
                             "of hypothyroidism. A model that has seen medical textbooks will "
                             "recognize this pattern. This is closer to 'clinical reasoning' than "
                             "pure recall, but the textbook-typical presentation makes it mostly "
                             "recognition.",
            "what_name_suggests": "Clinical knowledge — the ability to reason about patient "
                                  "presentations, consider differential diagnoses, and make "
                                  "evidence-based recommendations.",
            "proxy_gap": "A real clinical scenario would involve atypical presentations, "
                         "comorbidities, patient history, and the need to reason about multiple "
                         "possible diagnoses simultaneously. This question tests whether the model "
                         "recognizes the TEXTBOOK pattern, not whether it can reason about a "
                         "complex clinical case.",
            "skill_tested": "Knowledge recall + pattern recognition",
        },
    },
    {
        "category": "Moral Scenarios",
        "question": "For which of these two scenarios does the main character "
                    "(who I'll call 'Pat') do something clearly morally wrong? "
                    "Scenario 1: Pat told the teacher that a classmate was cheating "
                    "when the classmate was actually cheating. "
                    "Scenario 2: Pat told the teacher that a classmate was cheating "
                    "when the classmate was not cheating.",
        "choices": [
            "Wrong, Wrong",
            "Wrong, Not wrong",
            "Not wrong, Wrong",
            "Not wrong, Not wrong",
        ],
        "correct": "C",
        "analysis": {
            "what_it_tests": "Formatting compliance + simple moral classification. The 'reasoning' "
                             "here is trivial (truthful reporting vs false accusation), but the "
                             "FORMAT is tricky: the model must correctly map two scenarios to a "
                             "paired answer format. Models often fail this not because of moral "
                             "reasoning failure but because of FORMAT parsing failure.",
            "what_name_suggests": "Moral reasoning — the ability to evaluate ethical scenarios, "
                                  "weigh competing values, and make nuanced judgments.",
            "proxy_gap": "This is perhaps the widest proxy gap of all the examples. The question "
                         "tests whether the model can parse a specific answer format for a trivial "
                         "moral distinction. A model that 'understands' morality could fail this "
                         "question due to formatting, and a model with no moral understanding "
                         "could pass it through pattern matching on the structure.",
            "skill_tested": "Formatting compliance + trivial classification",
        },
    },
]

# Display the questions with guided analysis
print("BENCHMARK AUTOPSY: MMLU")
print("=" * 70)
print("\n'Massive Multitask Language Understanding' — 57 subjects,")
print("multiple-choice format. The name says 'understanding.'")
print("Let's see what it actually tests.\n")

for i, q in enumerate(benchmark_questions):
    print(f"\n{'=' * 70}")
    print(f"Question {i + 1}: {q['category']}")
    print(f"{'=' * 70}")
    print(f"\n{q['question']}\n")
    for j, choice in enumerate(q['choices']):
        letter = chr(65 + j)  # A, B, C, D
        marker = " <<" if letter == q['correct'] else ""
        print(f"  {letter}. {choice}{marker}")
    print(f"\n  Skill actually tested: {q['analysis']['skill_tested']}")
    print(f"\n  What it tests:")
    print_wrapped(q['analysis']['what_it_tests'], prefix="    ")
    print(f"\n  What the name suggests:")
    print_wrapped(q['analysis']['what_name_suggests'], prefix="    ")
    print(f"\n  The proxy gap:")
    print_wrapped(q['analysis']['proxy_gap'], prefix="    ")

In [None]:
# --- Part B: Classify the skill distribution ---
# Let's visualize what MMLU actually measures, based on our question analysis.

# Skill categories across MMLU (approximate, based on analysis of question types)
skill_categories = {
    "Knowledge\nRecall": 45,
    "Recognition /\nPattern Match": 30,
    "Formulaic\nReasoning": 12,
    "Formatting\nCompliance": 8,
    "Genuine\nReasoning": 5,
}

labels = list(skill_categories.keys())
values = list(skill_categories.values())
colors = ["#6366f1", "#8b5cf6", "#a78bfa", "#f59e0b", "#10b981"]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Left: what MMLU actually tests
bars = ax1.bar(labels, values, color=colors, edgecolor="white", linewidth=0.5, width=0.6)
for bar, val in zip(bars, values):
    ax1.text(
        bar.get_x() + bar.get_width() / 2, bar.get_height() + 1,
        f"{val}%", ha="center", va="bottom", fontsize=11, fontweight="bold", color="white",
    )
ax1.set_ylabel("Approximate % of Questions", fontsize=11)
ax1.set_title("What MMLU Actually Tests", fontsize=13, fontweight="bold")
ax1.set_ylim(0, 55)
ax1.spines["top"].set_visible(False)
ax1.spines["right"].set_visible(False)
ax1.tick_params(axis='x', labelsize=9)

# Right: what the name "understanding" suggests
suggested_skills = {
    "Knowledge\nRecall": 10,
    "Recognition /\nPattern Match": 10,
    "Formulaic\nReasoning": 15,
    "Novel\nReasoning": 35,
    "Explanation /\nSynthesis": 30,
}
s_labels = list(suggested_skills.keys())
s_values = list(suggested_skills.values())
s_colors = ["#6366f1", "#8b5cf6", "#a78bfa", "#10b981", "#06b6d4"]

bars2 = ax2.bar(s_labels, s_values, color=s_colors, edgecolor="white", linewidth=0.5, width=0.6)
for bar, val in zip(bars2, s_values):
    ax2.text(
        bar.get_x() + bar.get_width() / 2, bar.get_height() + 1,
        f"{val}%", ha="center", va="bottom", fontsize=11, fontweight="bold", color="white",
    )
ax2.set_ylabel("Approximate % of Questions", fontsize=11)
ax2.set_title('What "Understanding" Suggests', fontsize=13, fontweight="bold")
ax2.set_ylim(0, 45)
ax2.spines["top"].set_visible(False)
ax2.spines["right"].set_visible(False)
ax2.tick_params(axis='x', labelsize=9)

plt.suptitle(
    "The Proxy Gap: What MMLU Measures vs What Its Name Implies",
    fontsize=14, fontweight="bold", y=1.02,
)
plt.tight_layout()
plt.show()

print("\nThe proxy gap: MMLU is ~75% recall/recognition, but 'understanding'")
print("suggests the majority should be reasoning and synthesis.")
print("A model scoring 90% on MMLU has demonstrated recall, not understanding.")

In [None]:
# --- Part C: Spot the contamination signal ---
# A model's scores across MMLU categories. Some are suspiciously high.
#
# Before running, predict: if a model has contamination for SOME
# categories but not others, what would the score distribution look like?
# Would scores be uniformly high, uniformly medium, or uneven?

# Simulated model scores across 10 MMLU categories
# Some categories have their questions widely available online (forums,
# study guides, solutions manuals). Others have less public exposure.

categories = [
    "Abstract\nAlgebra",
    "US\nHistory",
    "Clinical\nKnowledge",
    "Formal\nLogic",
    "Moral\nScenarios",
    "Econometrics",
    "Computer\nSecurity",
    "Virology",
    "Philosophy",
    "Electrical\nEngineering",
]

# Scores: notice the pattern
scores = [72, 94, 91, 68, 58, 71, 93, 70, 92, 69]

# Which categories have high public availability of Q&A content?
high_availability = [False, True, True, False, False, False, True, False, True, False]

colors_bars = ["#ef4444" if ha else "#6366f1" for ha in high_availability]

fig, ax = plt.subplots(figsize=(12, 5))
bars = ax.bar(categories, scores, color=colors_bars, edgecolor="white", linewidth=0.5, width=0.7)

for bar, score in zip(bars, scores):
    ax.text(
        bar.get_x() + bar.get_width() / 2, bar.get_height() + 1,
        f"{score}%", ha="center", va="bottom", fontsize=10, fontweight="bold", color="white",
    )

# Average lines
avg_high = np.mean([s for s, ha in zip(scores, high_availability) if ha])
avg_low = np.mean([s for s, ha in zip(scores, high_availability) if not ha])

ax.axhline(y=avg_high, color="#ef4444", linestyle="--", alpha=0.6, linewidth=1.5)
ax.text(9.8, avg_high + 1, f"Avg (high availability): {avg_high:.0f}%",
        ha="right", fontsize=9, color="#ef4444")
ax.axhline(y=avg_low, color="#6366f1", linestyle="--", alpha=0.6, linewidth=1.5)
ax.text(9.8, avg_low + 1, f"Avg (low availability): {avg_low:.0f}%",
        ha="right", fontsize=9, color="#6366f1")

ax.set_ylabel("Score (%)", fontsize=12)
ax.set_title(
    "MMLU Scores by Category: Spot the Contamination Signal",
    fontsize=13, fontweight="bold",
)
ax.set_ylim(40, 100)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.tick_params(axis='x', labelsize=9)

# Legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor="#ef4444", label="High public Q&A availability"),
    Patch(facecolor="#6366f1", label="Low public Q&A availability"),
]
ax.legend(handles=legend_elements, loc="lower right", fontsize=10)

plt.tight_layout()
plt.show()

gap = avg_high - avg_low
print(f"\nGap between high-availability and low-availability categories: {gap:.0f} points")
print(f"\nThis is the forensic evidence of contamination. Both sets of")
print(f"categories are similar difficulty — the {gap:.0f}-point gap is the")
print(f"contamination signal. The model scores higher on categories where")
print(f"Q&A content is widely available in web crawls.")
print(f"\nThe score is real (the model got the answers right), but the")
print(f"capability is inflated (the model memorized rather than reasoned).")

In [None]:
# --- Part D: What's NOT measured? ---
# MMLU covers 57 subjects. What dimensions of model quality are entirely absent?

print("WHAT MMLU DOES NOT MEASURE")
print("=" * 70)
print()

unmeasured = [
    {
        "dimension": "Open-ended generation quality",
        "why_it_matters": "Users interact with LLMs through open-ended prompts, not "
                          "multiple-choice questions. A model that excels at selecting "
                          "answers from options may produce verbose, unfocused, or poorly "
                          "structured text when generating freely.",
        "example": "A model scores 90% on MMLU but produces rambling, hedging responses "
                   "when asked to explain a concept. Users prefer a lower-scoring model "
                   "that gives clear, concise explanations.",
    },
    {
        "dimension": "Instruction following",
        "why_it_matters": "Can the model follow complex, multi-step instructions? Can it "
                          "maintain constraints across a long response? MMLU tests none of this.",
        "example": "A model aces MMLU but ignores format constraints ('respond in JSON'), "
                   "forgets earlier instructions in long conversations, or fails to "
                   "maintain a specified persona.",
    },
    {
        "dimension": "Calibration (knowing what it doesn't know)",
        "why_it_matters": "A good model should express appropriate uncertainty. MMLU's "
                          "forced-choice format means the model MUST pick an answer — it "
                          "cannot say 'I am not sure.' This systematically fails to "
                          "measure calibration.",
        "example": "A model that confidently picks wrong answers on MMLU could be "
                   "dangerous in practice. A model that says 'I don't know' when "
                   "uncertain is more trustworthy but MMLU cannot detect the difference.",
    },
    {
        "dimension": "Safety and alignment",
        "why_it_matters": "As you saw in Lesson 3 (Red Teaming), a model can pass safety "
                          "benchmarks while failing on adversarial probing. MMLU does not "
                          "test safety at all. A high-scoring model on MMLU could be "
                          "completely unaligned.",
        "example": "The model from the lesson's hook: passed benchmarks, failed on "
                   "demographic bias, sycophancy, and indirect requests.",
    },
    {
        "dimension": "Real-world task completion",
        "why_it_matters": "Can the model actually help someone debug code, write an email, "
                          "summarize a document, or plan a project? MMLU tests academic "
                          "knowledge in isolated questions, not practical usefulness.",
        "example": "A model scores 92% on MMLU but cannot maintain context across a "
                   "multi-turn debugging session or misunderstands what the user "
                   "actually needs.",
    },
]

for i, item in enumerate(unmeasured):
    print(f"{i + 1}. {item['dimension']}")
    print_wrapped(f"Why it matters: {item['why_it_matters']}", prefix="   ")
    print_wrapped(f"Example: {item['example']}", prefix="   ")
    print()

print("=" * 70)
print("\nKey insight: a benchmark score tells you about performance on THAT")
print("BENCHMARK. It does not tell you about any dimension it does not measure.")
print("MMLU measures recognition on academic knowledge. Everything else —")
print("generation quality, instruction following, calibration, safety, and")
print("real-world usefulness — is unmeasured. The blind spots are enormous.")

**What you just practiced:** Reading benchmark results critically. Not cynically — the goal is not to dismiss all benchmarks, but to understand the gap between what the name promises and what the mechanism measures.

The autopsy revealed three layers of the proxy gap:

1. **Skill mismatch:** MMLU tests mostly recall and recognition, but "understanding" suggests reasoning and synthesis. A 90% score means "90% recall accuracy," not "90% understanding."

2. **Contamination signal:** Uneven performance across categories of similar difficulty is forensic evidence. Categories with high public availability of Q&A content score ~22 points higher — the model memorized the answers, not the reasoning.

3. **Coverage blind spots:** Five major dimensions of model quality (generation, instruction following, calibration, safety, task completion) are entirely unmeasured. A model's MMLU score tells you nothing about any of them.

This is the benchmark anatomy the lesson described: every layer between "actual capability" and "the number on the leaderboard" adds distortion. The number is real. The capability it implies is not.

---

## Exercise 2: LLM-as-Judge Bias Detection (Supported)

The lesson introduced four systematic biases in LLM-as-judge evaluation: verbosity bias, confidence bias, self-preference bias, and format sensitivity. In this exercise, you'll measure two of these biases empirically: **position bias** (does the judge prefer whichever response appears first?) and **verbosity bias** (does the judge prefer longer responses regardless of quality?).

You'll use an LLM to judge pairs of model responses. Then you'll swap the order and re-judge. Then you'll test with a verbose-vs-concise pair of equal quality. The biases are measurable and predictable — the evaluator's blind spots become the evaluation's blind spots.

Fill in the TODOs below. Each TODO is 1-3 lines.

<details>
<summary>Hint</summary>

For position bias testing: if the judge is unbiased, swapping the order should not change the verdict. If it does, position bias is present.

For verbosity bias testing: create two responses to the same question that are equally accurate, but one is 3-4x longer with more detail, filler, and caveats. If the judge rates the verbose response higher, verbosity bias is present.

</details>

In [None]:
# --- Shared judge function ---

JUDGE_SYSTEM_PROMPT = (
    "You are an expert evaluator of AI assistant responses. "
    "Given a user question and two responses (Response A and Response B), "
    "determine which response is better overall in terms of accuracy, "
    "helpfulness, and clarity.\n\n"
    "Respond with ONLY one of these three options:\n"
    "  A  (if Response A is better)\n"
    "  B  (if Response B is better)\n"
    "  TIE  (if they are roughly equal)\n\n"
    "Respond with a single word: A, B, or TIE."
)


def judge_pair(question: str, response_a: str, response_b: str) -> str:
    """Ask the LLM judge to compare two responses. Returns 'A', 'B', or 'TIE'."""
    user_prompt = (
        f"Question: {question}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}"
    )
    verdict = call_llm(JUDGE_SYSTEM_PROMPT, user_prompt, temperature=0.0)
    # Normalize the verdict
    v = verdict.strip().upper()
    if "TIE" in v:
        return "TIE"
    if "A" in v and "B" not in v:
        return "A"
    if "B" in v and "A" not in v:
        return "B"
    return v[:3]  # fallback


print("Judge function ready.")

In [None]:
# --- Part A: Position bias test ---
# We present the same two responses in both orders.
# If the judge is unbiased, swapping order should not change the verdict.

# 5 question-response pairs. For each, we'll judge both orderings.
test_pairs = [
    {
        "question": "What is the capital of Australia?",
        "response_1": "The capital of Australia is Canberra. It was chosen as a compromise between Sydney and Melbourne when the country federated in 1901, and the city was purpose-built as the capital.",
        "response_2": "Canberra is the capital of Australia. Many people mistakenly think it's Sydney, but Canberra has been the capital since 1927.",
    },
    {
        "question": "Explain what a neural network is in simple terms.",
        "response_1": "A neural network is a computer system inspired by the human brain. It consists of layers of interconnected nodes that process information. Data flows through these layers, with each layer learning to detect different patterns. Through training on examples, the network adjusts its connections to get better at the task.",
        "response_2": "A neural network is a type of machine learning model that learns patterns from data. Think of it like a series of filters: raw data goes in one end, each layer extracts increasingly complex features, and a prediction comes out the other end. You train it by showing it examples and adjusting the filters when it gets things wrong.",
    },
    {
        "question": "What are the benefits of regular exercise?",
        "response_1": "Regular exercise improves cardiovascular health, strengthens muscles and bones, boosts mood through endorphin release, helps manage weight, improves sleep quality, and reduces risk of chronic diseases like diabetes and heart disease.",
        "response_2": "Exercise benefits include better heart health, stronger muscles, improved mental health (it reduces anxiety and depression), better sleep, weight management, and a lower risk of conditions like type 2 diabetes and certain cancers.",
    },
    {
        "question": "How does photosynthesis work?",
        "response_1": "Photosynthesis converts sunlight, water, and carbon dioxide into glucose and oxygen. Light energy is captured by chlorophyll in the leaves, which drives a series of chemical reactions: water molecules are split, CO2 is fixed into organic molecules, and glucose is produced as stored energy.",
        "response_2": "Plants use photosynthesis to make their own food. They absorb sunlight through chlorophyll in their leaves, take in CO2 from the air and water from the soil, then use the sun's energy to convert these into glucose (sugar) for energy and release oxygen as a byproduct.",
    },
    {
        "question": "What is the difference between a virus and a bacterium?",
        "response_1": "Bacteria are single-celled living organisms that can reproduce on their own. Viruses are much smaller, not considered fully alive, and need a host cell to reproduce. Bacteria can be treated with antibiotics; viruses cannot. Both can cause disease, but through different mechanisms.",
        "response_2": "The key differences: bacteria are living cells that reproduce independently, while viruses are non-living particles that hijack host cells to replicate. Size-wise, viruses are much smaller. Treatment-wise, antibiotics work on bacteria but not viruses, which is why you don't take antibiotics for a cold.",
    },
]

print("POSITION BIAS TEST")
print("=" * 70)
print("For each pair, we judge twice: once with R1 as A, once with R1 as B.")
print("If the judge is unbiased, the verdicts should be consistent.\n")

position_results = []

for i, pair in enumerate(test_pairs):
    # Order 1: Response 1 = A, Response 2 = B
    verdict_1 = judge_pair(pair["question"], pair["response_1"], pair["response_2"])

    # Order 2: Response 2 = A, Response 1 = B (swapped)
    verdict_2 = judge_pair(pair["question"], pair["response_2"], pair["response_1"])

    # Check consistency: if verdict_1 = "A" (preferring R1), then
    # verdict_2 should = "B" (still preferring R1, which is now B)
    consistent = (
        (verdict_1 == "A" and verdict_2 == "B")
        or (verdict_1 == "B" and verdict_2 == "A")
        or (verdict_1 == "TIE" and verdict_2 == "TIE")
    )

    position_results.append({
        "question": pair["question"][:50] + "...",
        "order1": verdict_1,
        "order2": verdict_2,
        "consistent": consistent,
    })

    status = "Consistent" if consistent else "FLIPPED (position bias!)"
    print(f"  Pair {i + 1}: Order 1 = {verdict_1}, Order 2 = {verdict_2} — {status}")

flipped = sum(1 for r in position_results if not r["consistent"])
print(f"\nResults: {flipped}/{len(position_results)} pairs showed position bias.")
if flipped > 0:
    print(f"The judge changed its mind when the order changed — the verdict")
    print(f"depends on presentation, not quality.")
if flipped == 0:
    print(f"The judge was consistent across orderings for these pairs.")
    print(f"Note: 5 pairs is a small sample. Position bias may appear at larger scale.")

In [None]:
# --- Part B: Verbosity bias test ---
# For each question, we create two responses of EQUAL quality:
# one concise, one verbose (with extra caveats, examples, and filler).
# If the judge rates the verbose one higher, verbosity bias is present.

verbosity_pairs = [
    {
        "question": "What is a Python list comprehension?",
        "concise": (
            "A list comprehension is a compact syntax for creating lists. "
            "Instead of writing a for loop, you write `[expression for item in iterable]`. "
            "For example, `[x**2 for x in range(5)]` gives `[0, 1, 4, 9, 16]`."
        ),
        "verbose": (
            "That's a great question! A list comprehension is one of Python's most "
            "powerful and elegant features, and it's something that every Python "
            "developer should definitely understand well. Essentially, a list "
            "comprehension provides a concise and readable way to create new lists "
            "by applying an expression to each item in an existing iterable, such as "
            "a list, range, or other sequence. The general syntax follows the pattern "
            "`[expression for item in iterable]`, which replaces what would otherwise "
            "require a multi-line for loop with append statements. For instance, if "
            "you wanted to create a list of squared numbers, you could write "
            "`[x**2 for x in range(5)]`, which would elegantly produce `[0, 1, 4, 9, 16]`. "
            "It's worth noting that list comprehensions can also include conditional "
            "filtering with an optional `if` clause, making them even more versatile. "
            "Many experienced Python developers prefer list comprehensions for their "
            "readability and Pythonic style, though it's important to keep them "
            "simple enough to remain readable."
        ),
    },
    {
        "question": "Why is the sky blue?",
        "concise": (
            "Sunlight contains all colors. When it hits Earth's atmosphere, shorter "
            "wavelengths (blue) scatter more than longer ones (red) due to Rayleigh "
            "scattering off air molecules. So you see blue light coming from all "
            "directions in the sky."
        ),
        "verbose": (
            "This is actually a really fascinating question that has intrigued "
            "scientists and philosophers for centuries! The explanation involves a "
            "phenomenon known as Rayleigh scattering, which is a fundamental concept "
            "in atmospheric physics. Here's how it works: sunlight, which appears "
            "white to us, is actually composed of all the colors of the visible "
            "spectrum — from red to violet. When this sunlight enters Earth's "
            "atmosphere, it encounters countless tiny gas molecules (primarily "
            "nitrogen and oxygen). Now, here's the key insight: shorter wavelengths "
            "of light (blue and violet) are scattered much more effectively by these "
            "molecules than longer wavelengths (red and orange). Specifically, the "
            "scattering intensity is inversely proportional to the fourth power of "
            "the wavelength, meaning blue light is scattered roughly 5.5 times more "
            "than red light. As a result, when you look up at the sky, you see blue "
            "light that has been scattered from all directions. You might wonder why "
            "the sky isn't violet (since violet has an even shorter wavelength), and "
            "that's because our eyes are more sensitive to blue and because some "
            "violet light is absorbed in the upper atmosphere."
        ),
    },
    {
        "question": "What is the difference between HTTP and HTTPS?",
        "concise": (
            "HTTPS is HTTP with encryption. HTTP sends data in plain text — anyone "
            "on the network can read it. HTTPS uses TLS to encrypt the connection, "
            "so data is private between your browser and the server. Always use HTTPS "
            "for sensitive data like passwords and payments."
        ),
        "verbose": (
            "Great question! Understanding the difference between HTTP and HTTPS is "
            "really important for anyone interested in web security and development. "
            "Let me break this down for you comprehensively. HTTP, which stands for "
            "HyperText Transfer Protocol, is the foundational protocol used for "
            "transmitting data on the web. However, HTTP has a significant limitation: "
            "it transmits data in plain text, which means that any data sent between "
            "your browser and the web server can potentially be intercepted and read "
            "by anyone with access to the network. This is where HTTPS comes in — "
            "the 'S' stands for 'Secure'. HTTPS uses TLS (Transport Layer Security) "
            "encryption to create a secure, encrypted connection between your browser "
            "and the server. This means that even if someone intercepts the data, they "
            "won't be able to read it because it's encrypted. It's absolutely essential "
            "to use HTTPS whenever you're transmitting sensitive information like "
            "passwords, credit card numbers, or personal data. Most modern browsers "
            "now show a warning when you visit an HTTP site, which is a good practice "
            "that encourages the adoption of HTTPS across the web."
        ),
    },
]

# TODO: For each pair, judge twice:
#   1. Concise as A, Verbose as B
#   2. Verbose as A, Concise as B
# Track which response the judge prefers in each ordering.
# Determine: does the judge systematically prefer the verbose response?

print("VERBOSITY BIAS TEST")
print("=" * 70)
print("Both responses are equally accurate. The verbose version adds filler,")
print("caveats, and enthusiastic phrasing but no additional substance.")
print("An unbiased judge should rate them as TIE or split evenly.\n")

verbose_preferred = 0
concise_preferred = 0
ties = 0

for i, pair in enumerate(verbosity_pairs):
    # TODO: Judge with concise as A and verbose as B.
    # Then judge with verbose as A and concise as B.
    # Determine which RESPONSE (concise or verbose) the judge preferred
    # in each ordering, accounting for the swap.
    # YOUR CODE HERE (4-8 lines)
    verdict_cv = judge_pair(pair["question"], pair["concise"], pair["verbose"])
    verdict_vc = judge_pair(pair["question"], pair["verbose"], pair["concise"])

    # Determine preference for each ordering
    # Order 1: A=concise, B=verbose. Verdict "B" means verbose preferred.
    # Order 2: A=verbose, B=concise. Verdict "A" means verbose preferred.
    verbose_wins = 0
    concise_wins = 0

    if verdict_cv == "B":
        verbose_wins += 1
    elif verdict_cv == "A":
        concise_wins += 1

    if verdict_vc == "A":
        verbose_wins += 1
    elif verdict_vc == "B":
        concise_wins += 1

    if verbose_wins > concise_wins:
        verbose_preferred += 1
        result = "VERBOSE preferred"
    elif concise_wins > verbose_wins:
        concise_preferred += 1
        result = "Concise preferred"
    else:
        ties += 1
        result = "Tie / mixed"

    print(f"  Pair {i + 1} ({pair['question'][:40]}...)")
    print(f"    Order 1 (A=concise, B=verbose): {verdict_cv}")
    print(f"    Order 2 (A=verbose, B=concise): {verdict_vc}")
    print(f"    Result: {result}")
    print()

print(f"Summary: Verbose preferred {verbose_preferred}/{len(verbosity_pairs)}, "
      f"Concise preferred {concise_preferred}/{len(verbosity_pairs)}, "
      f"Ties {ties}/{len(verbosity_pairs)}")

In [None]:
# --- Part C: Visualize the bias ---

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Position bias chart
consistent_count = sum(1 for r in position_results if r["consistent"])
flipped_count = sum(1 for r in position_results if not r["consistent"])

pos_labels = ["Consistent\n(no bias)", "Flipped\n(position bias)"]
pos_values = [consistent_count, flipped_count]
pos_colors = ["#10b981", "#ef4444"]

bars1 = ax1.bar(pos_labels, pos_values, color=pos_colors, edgecolor="white", linewidth=0.5, width=0.5)
for bar, val in zip(bars1, pos_values):
    ax1.text(
        bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.1,
        str(val), ha="center", va="bottom", fontsize=14, fontweight="bold", color="white",
    )
ax1.set_ylabel("Number of Pairs", fontsize=11)
ax1.set_title("Position Bias", fontsize=13, fontweight="bold")
ax1.set_ylim(0, max(pos_values) + 2)
ax1.spines["top"].set_visible(False)
ax1.spines["right"].set_visible(False)

# Verbosity bias chart
verb_labels = ["Verbose\nPreferred", "Concise\nPreferred", "Tie /\nMixed"]
verb_values = [verbose_preferred, concise_preferred, ties]
verb_colors = ["#ef4444", "#10b981", "#6366f1"]

bars2 = ax2.bar(verb_labels, verb_values, color=verb_colors, edgecolor="white", linewidth=0.5, width=0.5)
for bar, val in zip(bars2, verb_values):
    ax2.text(
        bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.1,
        str(val), ha="center", va="bottom", fontsize=14, fontweight="bold", color="white",
    )
ax2.set_ylabel("Number of Pairs", fontsize=11)
ax2.set_title("Verbosity Bias", fontsize=13, fontweight="bold")
ax2.set_ylim(0, max(verb_values) + 2)
ax2.spines["top"].set_visible(False)
ax2.spines["right"].set_visible(False)

plt.suptitle(
    "LLM-as-Judge Systematic Biases",
    fontsize=14, fontweight="bold", y=1.02,
)
plt.tight_layout()
plt.show()

print("\nWhat the visualization shows:")
print("- Position bias: the judge's verdict depends on which response")
print("  appears first, not on quality.")
print("- Verbosity bias: the judge systematically prefers longer responses,")
print("  even when the extra length adds no substance.")
print("\nThese biases are the SAME failure modes as reward hacking from")
print("Series 4: the reward model (now acting as judge) is fooled by")
print("surface features (length, confidence, formatting) rather than")
print("evaluating actual quality. The evaluator's blind spots become")
print("the evaluation's blind spots.")

<details>
<summary>Solution</summary>

**Why the bias detection works:** The key insight is that position bias and verbosity bias are *measurable* properties of the judge, not random noise. By testing with swapped orderings and controlled length differences, you can quantify exactly how much the judge's verdict depends on presentation vs content.

The TODO code for Part B:

```python
verdict_cv = judge_pair(pair["question"], pair["concise"], pair["verbose"])
verdict_vc = judge_pair(pair["question"], pair["verbose"], pair["concise"])

# Order 1: A=concise, B=verbose. "B" means verbose preferred.
# Order 2: A=verbose, B=concise. "A" means verbose preferred.
verbose_wins = 0
concise_wins = 0

if verdict_cv == "B":
    verbose_wins += 1
elif verdict_cv == "A":
    concise_wins += 1

if verdict_vc == "A":
    verbose_wins += 1
elif verdict_vc == "B":
    concise_wins += 1
```

**Why this connects to the lesson:** The verbose responses in this exercise are deliberately designed to mimic reward hacking patterns from Series 4: enthusiastic openers ("Great question!"), confident filler ("really important", "absolutely essential"), and excessive qualification. These are the same surface features that reward models learn to prefer. When an LLM judge exhibits the same biases, it confirms the lesson's core pattern: **the evaluator's limitations become the evaluation's limitations.**

**Common findings:**
- Position bias: most models show some position bias, preferring the first response (recency effects vary by model)
- Verbosity bias: most models systematically prefer the verbose response, even though it adds no new information
- Both biases are model-dependent — GPT-4 shows less verbosity bias than GPT-3.5, but neither is bias-free

**Mitigation strategies used in practice:**
- Run each comparison twice with swapped order, take the majority vote
- Ask the judge to evaluate each response independently, then compare scores
- Use rubric-based evaluation instead of direct comparison
- Calibrate the judge on known-quality pairs before deployment

</details>

**What you just measured:** Systematic biases in LLM-as-judge evaluation — empirically, not theoretically.

The lesson described four biases: verbosity, confidence, self-preference, and format sensitivity. You measured two of them (position and verbosity) and confirmed they are real and predictable. The judge's verdict depends on *how* a response is presented, not just *what* it says.

This is the third time this module has demonstrated the same pattern: human annotators have biases (Constitutional AI), red team models have blind spots (Red Teaming), and now LLM judges have biases (this exercise). The evaluator's limitations always become the evaluation's limitations. No single evaluation source is reliable — the best approach combines multiple methods.

---

## Exercise 3: Design an Evaluation Strategy (Supported) — Module Synthesis

This exercise integrates the full module. You've spent four lessons building a sophisticated understanding of alignment: how to build it (Constitutional AI, Alignment Landscape), how to break it (Red Teaming), and how to measure it (this lesson). Now you'll put it all together.

**The scenario:** You are evaluating a **customer support chatbot** for a mid-size e-commerce company. The chatbot handles order inquiries, returns, product recommendations, and complaint resolution. It processes ~10,000 conversations per day.

**Your task:** Design a multi-layer evaluation strategy. For each evaluation layer, specify what it measures, what it misses, and why no single layer is sufficient.

Fill in the TODOs below. Each TODO is a text response (no code), structured as evaluation design decisions with reasoning.

<details>
<summary>Hint</summary>

Think about the defense-in-depth principle from Lesson 3: just as no single defense layer covers the alignment surface, no single evaluation method captures model quality. Your strategy should have multiple layers, each covering what the others miss.

The four evaluation layers from the lesson: benchmarks, human evaluation, LLM-as-judge, and red teaming. For each, think about: what is it good at measuring? What are its blind spots? What does it cost?

</details>

In [None]:
# --- The Use Case ---

print("EVALUATION STRATEGY DESIGN")
print("=" * 70)
print()
print("Use Case: Customer Support Chatbot")
print("-" * 40)
print("Company: Mid-size e-commerce (clothing and electronics)")
print("Volume: ~10,000 conversations per day")
print("Tasks: Order inquiries, returns, product recommendations,")
print("       complaint resolution, policy questions")
print("Users: General public, varying technical literacy")
print("Stakes: Customer satisfaction, brand reputation,")
print("        legal compliance (refund policies, warranty claims)")
print()
print("Your job: Design a multi-layer evaluation strategy.")
print("For each layer, specify:")
print("  1. WHAT to evaluate")
print("  2. HOW to evaluate it (which method)")
print("  3. WHAT IT MISSES (the blind spots)")
print("  4. WHY this layer alone is insufficient")

In [None]:
# --- Layer 1: Benchmarks ---
# Which existing benchmarks would you use? What would they tell you?
# What would they NOT tell you?

# TODO: Fill in your evaluation design for the benchmark layer.
# Think about: which benchmarks are relevant for a customer support chatbot?
# What do they actually measure (remember the proxy gap)?
# What dimensions of customer support quality do they miss?

benchmark_layer = {
    "benchmarks_to_use": [
        # TODO: List 2-3 benchmarks and what each would tell you.
        # Example: ("BenchmarkName", "What it measures", "What it misses")
    ],
    "what_this_layer_catches": "",  # TODO: What does this layer tell you?
    "blind_spots": "",  # TODO: What does this layer miss?
    "why_insufficient": "",  # TODO: Why can't you stop here?
}

print("LAYER 1: BENCHMARKS")
print("=" * 70)
print("\nBenchmarks to use:")
for b in benchmark_layer["benchmarks_to_use"]:
    if isinstance(b, tuple) and len(b) == 3:
        print(f"  - {b[0]}")
        print(f"    Measures: {b[1]}")
        print(f"    Misses: {b[2]}")
    else:
        print(f"  - {b}")
print(f"\nWhat this catches: {benchmark_layer['what_this_layer_catches']}")
print(f"Blind spots: {benchmark_layer['blind_spots']}")
print(f"Why insufficient: {benchmark_layer['why_insufficient']}")

In [None]:
# --- Layer 2: Human Evaluation ---
# What would you have humans evaluate? Who are the evaluators?
# What are the tradeoffs (cost, consistency, scale)?

# TODO: Fill in your evaluation design for the human evaluation layer.
# Think about: who should the evaluators be (domain experts, random users, both)?
# What should they evaluate (accuracy, tone, helpfulness)?
# How do you handle inter-annotator disagreement?

human_layer = {
    "what_to_evaluate": [],  # TODO: List 2-3 specific dimensions
    "who_evaluates": "",  # TODO: Who are the evaluators and why?
    "evaluation_method": "",  # TODO: Rating scale, pairwise comparison, or other?
    "sample_size_and_cost": "",  # TODO: How many conversations, rough cost estimate
    "blind_spots": "",  # TODO: What does human evaluation miss here?
    "why_insufficient": "",  # TODO: Why can't you stop here?
}

print("LAYER 2: HUMAN EVALUATION")
print("=" * 70)
print(f"\nWhat to evaluate:")
for item in human_layer["what_to_evaluate"]:
    print(f"  - {item}")
print(f"\nWho evaluates: {human_layer['who_evaluates']}")
print(f"Method: {human_layer['evaluation_method']}")
print(f"Sample & cost: {human_layer['sample_size_and_cost']}")
print(f"Blind spots: {human_layer['blind_spots']}")
print(f"Why insufficient: {human_layer['why_insufficient']}")

In [None]:
# --- Layer 3: LLM-as-Judge ---
# What would you have an LLM judge evaluate? How do you mitigate the biases
# you just measured in Exercise 2?

# TODO: Fill in your evaluation design for the LLM-as-judge layer.
# Think about: what rubric would you give the judge?
# How do you mitigate position bias and verbosity bias?
# What can the LLM judge evaluate at scale that humans cannot?

llm_judge_layer = {
    "what_to_evaluate": [],  # TODO: List 2-3 specific dimensions
    "rubric_design": "",  # TODO: What criteria does the judge use?
    "bias_mitigation": [],  # TODO: How do you handle the biases from Exercise 2?
    "scale": "",  # TODO: How many conversations can you evaluate?
    "blind_spots": "",  # TODO: What does LLM-as-judge miss?
    "why_insufficient": "",  # TODO: Why can't you stop here?
}

print("LAYER 3: LLM-AS-JUDGE")
print("=" * 70)
print(f"\nWhat to evaluate:")
for item in llm_judge_layer["what_to_evaluate"]:
    print(f"  - {item}")
print(f"\nRubric: {llm_judge_layer['rubric_design']}")
print(f"Bias mitigation:")
for m in llm_judge_layer["bias_mitigation"]:
    print(f"  - {m}")
print(f"Scale: {llm_judge_layer['scale']}")
print(f"Blind spots: {llm_judge_layer['blind_spots']}")
print(f"Why insufficient: {llm_judge_layer['why_insufficient']}")

In [None]:
# --- Layer 4: Red Teaming ---
# What dimensions of the chatbot require adversarial testing?
# What attacks from the Lesson 3 taxonomy are relevant here?

# TODO: Fill in your evaluation design for the red teaming layer.
# Think about: what could go wrong with a customer support chatbot?
# Which attack categories from Lesson 3 are most relevant?
# What failures would benchmarks and judges never catch?

red_team_layer = {
    "what_to_test": [],  # TODO: List 3-4 specific failure modes to probe
    "attack_categories": [],  # TODO: Which taxonomy categories are relevant?
    "automated_vs_manual": "",  # TODO: What requires human creativity vs automated scale?
    "blind_spots": "",  # TODO: What does even red teaming miss?
    "why_insufficient": "",  # TODO: Why can't you stop here?
}

print("LAYER 4: RED TEAMING")
print("=" * 70)
print(f"\nWhat to test:")
for item in red_team_layer["what_to_test"]:
    print(f"  - {item}")
print(f"\nRelevant attack categories:")
for cat in red_team_layer["attack_categories"]:
    print(f"  - {cat}")
print(f"\nAutomated vs manual: {red_team_layer['automated_vs_manual']}")
print(f"Blind spots: {red_team_layer['blind_spots']}")
print(f"Why insufficient: {red_team_layer['why_insufficient']}")

In [None]:
# --- Synthesis: The Complete Strategy ---

print("COMPLETE EVALUATION STRATEGY")
print("=" * 70)
print()
print("Layer 1 (Benchmarks): Broad baseline capability check")
print("  \u2192 Catches: general capability, known failure modes")
print("  \u2192 Misses: domain-specific quality, real-world usefulness")
print()
print("Layer 2 (Human Evaluation): Deep quality assessment on a sample")
print("  \u2192 Catches: nuanced quality, tone, accuracy in context")
print("  \u2192 Misses: scale (can only evaluate a tiny fraction), consistency")
print()
print("Layer 3 (LLM-as-Judge): Scalable quality monitoring")
print("  \u2192 Catches: systematic patterns at scale, regression detection")
print("  \u2192 Misses: novel failure modes, biases the judge shares with the model")
print()
print("Layer 4 (Red Teaming): Adversarial failure discovery")
print("  \u2192 Catches: edge cases, manipulation, policy violations")
print("  \u2192 Misses: day-to-day quality, user satisfaction on normal queries")
print()
print("=" * 70)
print("No single layer is sufficient. Each covers what the others miss.")
print("This is defense-in-depth applied to evaluation \u2014 the same principle")
print("from Lesson 3 (Red Teaming), applied to measurement.")
print()
print("The key insight: evaluation design requires the same tradeoff thinking")
print("as alignment technique selection (Lesson 2). There is no single right")
print("evaluation \u2014 there are tradeoffs between cost, coverage, depth, and")
print("the specific blind spots of each method.")

<details>
<summary>Solution</summary>

**The reasoning matters more than the specific choices.** There is no single "right" evaluation strategy — the goal is demonstrating that you can identify what each layer catches, what it misses, and why multiple layers are necessary.

**Example Layer 1 (Benchmarks):**
```python
benchmark_layer = {
    "benchmarks_to_use": [
        ("MT-Bench (customer service subset)",
         "Multi-turn conversation quality — can the model maintain context across a conversation",
         "Tested on generic conversations, not domain-specific customer support scenarios"),
        ("TruthfulQA",
         "Whether the model makes false claims — critical for policy and product information",
         "Tests general truthfulness, not domain-specific accuracy (shipping times, return policies)"),
        ("Custom policy accuracy test",
         "Whether the model correctly states company policies (returns, warranties, shipping)",
         "Only tests the policies you wrote questions for; does not test edge cases or novel situations"),
    ],
    "what_this_layer_catches": "Baseline capability: can the model hold conversations, avoid obvious falsehoods, and recall policy information. Quick and cheap to run on every model update.",
    "blind_spots": "Benchmarks test static questions, not real customer interactions. They miss tone, empathy, de-escalation, and the ability to handle ambiguous or emotional customer messages. The policy test only covers known scenarios.",
    "why_insufficient": "A model could score perfectly on benchmarks by memorizing answers (contamination) while being terrible at handling a frustrated customer whose order was wrong — a scenario that requires empathy, flexible problem-solving, and accurate policy application simultaneously.",
}
```

**Example Layer 2 (Human Evaluation):**
```python
human_layer = {
    "what_to_evaluate": [
        "Accuracy of policy information (returns, shipping, warranties)",
        "Tone and empathy (does the chatbot feel like talking to a helpful person or a script?)",
        "Resolution quality (did the customer's issue actually get resolved?)",
    ],
    "who_evaluates": "Two groups: (1) experienced customer support agents who know the policies and common issues, (2) a sample of real customers who rate their own interactions. Agents catch accuracy errors; customers catch tone and satisfaction issues.",
    "evaluation_method": "Pairwise comparison (current model vs previous version or vs human agent) rather than absolute rating — more reliable, same insight as RLHF preference pairs.",
    "sample_size_and_cost": "200 conversations per evaluation round (2% of daily volume). At ~5 min per evaluation, ~17 hours of evaluator time. Cost: ~$500-1000 per round with skilled evaluators.",
    "blind_spots": "200 conversations cannot cover the full distribution. Evaluators have their own biases (verbosity bias, preference for certain communication styles). Inter-annotator agreement will be low on 'tone' judgments.",
    "why_insufficient": "Evaluating 200 out of 10,000 daily conversations means 98% go unexamined. Rare but serious failures (giving incorrect refund information, escalating a complaint) may not appear in the sample. Cannot run on every model update due to cost.",
}
```

**Example Layer 3 (LLM-as-Judge):**
```python
llm_judge_layer = {
    "what_to_evaluate": [
        "Policy accuracy (did the chatbot state correct return/shipping/warranty information?)",
        "Response completeness (did the chatbot address all parts of the customer's question?)",
        "Tone appropriateness (professional, empathetic, not dismissive or overly casual)",
    ],
    "rubric_design": "Structured rubric with 5 specific criteria, each scored 1-5 with definitions for each score level. The rubric includes examples of 1-rated and 5-rated responses for calibration. Company policies are included in the judge's context so it can verify accuracy.",
    "bias_mitigation": [
        "Run each comparison twice with swapped order to detect position bias (from Exercise 2)",
        "Score against a rubric rather than direct comparison to reduce verbosity bias",
        "Calibrate the judge on 50 human-evaluated conversations first to check agreement",
        "Flag conversations where the judge disagrees with itself across orderings for human review",
    ],
    "scale": "Can evaluate all 10,000 daily conversations. Cost: ~$50-100/day with GPT-4o-mini as judge. Enables monitoring every conversation, not just a sample.",
    "blind_spots": "The judge has the same biases as the chatbot (both are LLMs). It may rate sycophantic responses highly. It cannot evaluate whether the customer's problem was ACTUALLY resolved — only whether the response SEEMS helpful. Subtle policy errors that sound plausible will be missed.",
    "why_insufficient": "LLM-as-judge catches systematic patterns but misses the failures that look like successes. A chatbot that gives a confident, well-formatted wrong answer about a return policy will score high with the LLM judge but cost the company a customer.",
}
```

**Example Layer 4 (Red Teaming):**
```python
red_team_layer = {
    "what_to_test": [
        "Social engineering: can a customer manipulate the chatbot into granting unauthorized refunds or discounts?",
        "Information leakage: can a customer extract internal pricing, inventory, or policy override information?",
        "Emotional manipulation: does the chatbot become inappropriately agreeable when the customer is aggressive or emotional (sycophancy)?",
        "Policy boundary testing: does the chatbot correctly apply return window limits, or can edge cases trick it into extending policies?",
    ],
    "attack_categories": [
        "Category 2 (reframing): customers reframing unreasonable demands as reasonable requests",
        "Category 3 (multi-step): building up to an unreasonable request through a series of reasonable questions",
        "Category 5 (persona): claiming to be a manager, VIP customer, or quality assurance tester",
    ],
    "automated_vs_manual": "Automated for scale (generate 1000+ variations of social engineering attempts, test all policy boundaries systematically). Manual for creativity (human red teamers try novel manipulation strategies that an automated generator would not think of). The combination mirrors automated + human red teaming from Lesson 3.",
    "blind_spots": "Red teaming tests adversarial scenarios, not typical usage. A chatbot could pass all red teaming while being unhelpful or confusing for normal customers. Red teaming finds failures but does not measure everyday quality.",
    "why_insufficient": "Red teaming tells you where the model fails under adversarial pressure, but 99%+ of real conversations are not adversarial. A chatbot that is bulletproof against manipulation but gives mediocre answers to 'where is my order?' is not a good chatbot.",
}
```

**The meta-insight:** Each layer compensates for the others' blind spots. Benchmarks are cheap and broad but shallow. Human evaluation is deep but expensive and narrow. LLM-as-judge scales but inherits biases. Red teaming finds edge cases but misses everyday quality. The combination is defense-in-depth for evaluation — exactly the principle from Lesson 3.

</details>

**What you just designed:** A multi-layer evaluation strategy that integrates every concept from the module. This is the "Measure" phase of the Build-Break-Measure arc, and it draws on all four lessons:

- **From Constitutional AI (Lesson 1):** The annotation bottleneck — human evaluation is expensive, inconsistent, and does not scale. LLM-as-judge is the evaluation equivalent of RLAIF.
- **From the Alignment Landscape (Lesson 2):** Tradeoff thinking — there is no single best evaluation method, just as there is no single best alignment technique. Constraints drive choice.
- **From Red Teaming (Lesson 3):** Defense-in-depth — no single evaluation layer is sufficient, just as no single defense layer covers the alignment surface. And the attack taxonomy directly informs what to red-team.
- **From this lesson:** The proxy gap, contamination, Goodhart's law, and judge biases — every evaluation method has blind spots, and the evaluator's limitations become the evaluation's limitations.

The recurring pattern across the module: **the challenge shifts, not disappears.** From building alignment to testing it to measuring it — each step reveals deeper difficulty.

---

## Key Takeaways

1. **Benchmark scores are proxies, not measurements.** MMLU tests recognition on academic knowledge, not "understanding." The proxy gap between what a benchmark measures and what its name implies is wide, and the gap widens under contamination and optimization pressure.

2. **Contamination is visible in the data.** Uneven performance across categories of similar difficulty is forensic evidence. Categories with high public Q&A availability score significantly higher — the model memorized answers, not reasoning. This is structural, not accidental: any public benchmark eventually leaks into training data.

3. **LLM judges have measurable, predictable biases.** Position bias (verdict depends on presentation order) and verbosity bias (longer = better, regardless of substance) are empirically demonstrable. The evaluator's limitations become the evaluation's limitations — the same pattern from Constitutional AI and Red Teaming.

4. **No single evaluation method is sufficient.** Benchmarks are cheap but shallow. Human evaluation is deep but expensive. LLM-as-judge scales but inherits biases. Red teaming finds edge cases but misses everyday quality. Defense-in-depth from Lesson 3 applies to evaluation: combine multiple methods, each covering what the others miss.

5. **Evaluation design requires tradeoff thinking, not benchmark shopping.** Choosing evaluation methods is the same kind of tradeoff reasoning as choosing alignment techniques (Lesson 2). The right strategy depends on your specific use case, constraints, and what you need to measure. "What are the blind spots?" is always the right question.