# 01 - Metrics for LLM Assessment

## Context

Assessment is how you know if your model actually works. Without
rigorous measurement, you are guessing -- and in legal AI, guessing
has consequences.

**CoCounsel context:** For legal AI, "works" means factually accurate,
properly cited, and appropriately uncertain. A model that generates
fluent, well-structured legal text is useless if it fabricates
citations, misstates holdings, or expresses false confidence.
Standard NLP metrics measure surface-level text overlap -- they
tell you whether the generated text *looks like* the reference,
not whether it is *correct*.

This notebook covers:
1. **Perplexity** -- how well the model predicts the next token.
2. **BLEU and ROUGE** -- standard overlap-based metrics.
3. **Where metrics break down** -- concrete examples of misleading scores.
4. **Citation accuracy** -- a legal-specific metric.
5. **Hallucination detection** -- measuring ungrounded claims.

## Perplexity

Perplexity measures how well a language model predicts the next token
in a sequence. It is the exponential of the average cross-entropy loss:

$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right) = \exp(\text{loss})$$

**Lower perplexity = better language modeling.** A perplexity of 1
means the model perfectly predicts every token. A perplexity of 10
means the model is, on average, as uncertain as choosing uniformly
among 10 equally likely tokens.
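
To make the formula concrete, here is a minimal sketch that computes perplexity directly from a list of per-token probabilities. The helper name `perplexity_from_probs` and the example probabilities are illustrative assumptions, not part of this notebook's pipeline.

```python
import math


def perplexity_from_probs(token_probs):
    """Perplexity = exp of the mean negative log-probability per token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that assigns probability 0.1 to every correct token behaves like
# a uniform guess among 10 options, so its perplexity is 10.
uniform_ppl = perplexity_from_probs([0.1] * 5)

# Mostly confident predictions with one badly predicted token.
mixed_ppl = perplexity_from_probs([0.9, 0.8, 0.95, 0.05])

print(f"uniform: {uniform_ppl:.2f}")
print(f"mixed:   {mixed_ppl:.2f}")
```

The single low-probability token in the second example pulls the perplexity well above 1 even though the other three tokens are predicted confidently. This is one reason rare names and citations tend to raise perplexity on legal text.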

Perplexity is useful for:
- Comparing model quality on held-out data (domain adaptation).
- Detecting distribution shift (high perplexity = unfamiliar text).
- Sanity-checking training (perplexity should decrease over training).

Perplexity is **not** useful for:
- Measuring factual accuracy.
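- Comparing models with different tokenizers or sizes (per-token perplexity depends on the vocabulary, and larger models usually score lower regardless of task quality).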
- Judging generation quality (a model can be fluent but wrong).

In [None]:
import json
import math
import re
from pathlib import Path

import numpy as np
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(42)
np.random.seed(42)

In [None]:
# Load a small model for perplexity computation
model_name = "HuggingFaceTB/SmolLM-135M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Model: {model_name}")
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# Load held-out legal text from court opinions
data_path = Path("../../datasets/sample/court_opinions.jsonl")
opinions = []
with open(data_path) as f:
    for line in f:
        opinions.append(json.loads(line))

print(f"Loaded {len(opinions)} court opinions")
print(f"First opinion: {opinions[0]['case_name']}")

In [None]:
def compute_perplexity(text, model, tokenizer, max_length=512):
    """Compute perplexity of a model on a text string.

    Perplexity = exp(average cross-entropy loss).

    Args:
        text: The input text to measure.
        model: A HuggingFace causal language model.
        tokenizer: The corresponding tokenizer.
        max_length: Maximum number of tokens. Longer texts are truncated,
            so the score covers only the first max_length tokens.

    Returns:
        Tuple of (perplexity, loss).
    """
    encodings = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
    )
    input_ids = encodings["input_ids"]

    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss.item()

    perplexity = math.exp(loss)
    return perplexity, loss


# Compute perplexity on each court opinion
print("Perplexity on held-out legal text:")
print("=" * 65)
print(f"{'Case':>45}  {'PPL':>8}  {'Loss':>6}")
print("-" * 65)

perplexities = []
for opinion in opinions:
    ppl, loss = compute_perplexity(opinion["text"], model, tokenizer)
    perplexities.append(ppl)
    name = opinion["case_name"][:45]
    print(f"{name:>45}  {ppl:>8.1f}  {loss:>6.3f}")

print("-" * 65)
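avg_ppl = np.mean(perplexities)
# Mean of the per-document losses. exp(avg_loss) is the geometric mean of
# the perplexities, so it will not equal avg_ppl in general.
avg_loss = np.mean([math.log(p) for p in perplexities])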
print(f"{'Mean':>45}  {avg_ppl:>8.1f}  {avg_loss:>6.3f}")
print()
print("Lower perplexity means the model finds the text more predictable.")
print("Legal text tends to have moderate perplexity: formal but domain-specific.")

In [None]:
# Compare perplexity across different text types
comparison_texts = {
    "Legal (court opinion)": opinions[0]["text"][:500],
    "Casual English": (
        "Hey, so I was thinking about going to the store later today. "
        "Do you want me to pick up anything? I heard they have a sale "
        "on those chips you like. Also, did you see the game last night? "
        "It was pretty wild. The team really came through in the fourth "
        "quarter. Anyway, let me know if you need anything."
    ),
    "Python code": (
        "def fibonacci(n):\n"
        "    if n <= 1:\n"
        "        return n\n"
        "    a, b = 0, 1\n"
        "    for _ in range(2, n + 1):\n"
        "        a, b = b, a + b\n"
        "    return b\n\n"
        "for i in range(10):\n"
        "    print(f'fib({i}) = {fibonacci(i)}')"
    ),
    "Random tokens": "glorp xyzzy fleem quux bazinga wibble fnord plugh",
}

print("Perplexity across text types:")
print("=" * 50)
print(f"{'Text type':<25}  {'PPL':>8}  {'Loss':>6}")
print("-" * 50)

for label, text in comparison_texts.items():
    ppl, loss = compute_perplexity(text, model, tokenizer)
    print(f"{label:<25}  {ppl:>8.1f}  {loss:>6.3f}")

print()
print("Random tokens have very high perplexity -- the model has no idea")
print("what comes next. Domain-specific text (legal, code) will vary based")
print("on how much of that domain appeared in the training data.")

## BLEU and ROUGE

**BLEU** (Bilingual Evaluation Understudy) measures the **precision**
of n-gram overlap between generated text and a reference, with a
brevity penalty for outputs shorter than the reference. Originally
designed for machine translation, it asks: "Of the n-grams in the
generated text, how many also appear in the reference?"

**ROUGE** (Recall-Oriented Understudy for Gisting Evaluation) measures
the **recall** of n-gram overlap. Designed for summarization, it asks:
"Of the n-grams in the reference, how many also appear in the generated
text?"

Key variants:
- **ROUGE-1**: unigram overlap (individual words)
- **ROUGE-2**: bigram overlap (word pairs)
- **ROUGE-L**: longest common subsequence

Both metrics produce scores between 0 and 1, where 1 means perfect
overlap with the reference.
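
The precision/recall distinction is easiest to see at the unigram level. The following is a minimal sketch, not the implementation used by the `evaluate` library: the function name `unigram_overlap`, the whitespace tokenization, and the example sentences are assumptions made for illustration.

```python
from collections import Counter


def unigram_overlap(generated, reference):
    """Return (precision, recall) of clipped unigram overlap."""
    gen_counts = Counter(generated.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter & Counter keeps the minimum count per word ("clipping"),
    # so repeating a word cannot inflate the overlap.
    overlap = sum((gen_counts & ref_counts).values())
    precision = overlap / max(sum(gen_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    return precision, recall


reference = "the court reversed the grant of summary judgment"
generated = "the court reversed"

precision, recall = unigram_overlap(generated, reference)
print(f"precision (BLEU-style):  {precision:.3f}")
print(f"recall (ROUGE-1-style):  {recall:.3f}")
```

Every generated word appears in the reference, so precision is 1.0, but only 3 of the 8 reference words are covered, so recall is 0.375. Full BLEU combines clipped precisions over several n-gram orders with the brevity penalty, and ROUGE implementations commonly report an F-measure that combines recall with precision, so library scores will differ from this sketch.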

In [None]:
import evaluate

# Load BLEU and ROUGE metrics
bleu_metric = evaluate.load("bleu")
rouge_metric = evaluate.load("rouge")

# Legal summarization examples: reference summaries and generated summaries
summarization_examples = [
    {
        "reference": (
            "The Seventh Circuit reversed the grant of summary judgment "
            "in an ADA employment discrimination case, finding genuine "
            "issues of material fact regarding whether the plaintiff "
            "could perform essential job functions with a modified "
            "travel schedule as a reasonable accommodation."
        ),
        "generated": (
            "The appellate court reversed summary judgment in the ADA "
            "discrimination case. The court found disputed facts about "
            "whether the employee could perform essential functions with "
            "a reasonable accommodation of a modified travel schedule."
        ),
        "label": "Good summary (accurate, covers key facts)",
    },
    {
        "reference": (
            "The court granted a preliminary injunction against hydraulic "
            "fracturing operations under the Clean Water Act, finding "
            "likelihood of success on the merits and irreparable harm "
            "from potential groundwater contamination."
        ),
        "generated": (
            "The court issued an injunction stopping fracking operations "
            "because the company was discharging pollutants without "
            "proper permits under the Clean Water Act, and contamination "
            "of drinking water supplies constituted irreparable harm."
        ),
        "label": "Good summary (accurate, different wording)",
    },
    {
        "reference": (
            "The DC Circuit denied review of SEC sanctions against an "
            "investment adviser for failing to disclose conflicts of "
            "interest in IPO share allocation and lacking adequate "
            "compliance procedures."
        ),
        "generated": (
            "The appellate court upheld SEC penalties against Westbrook "
            "Capital for prioritizing proprietary trading over client "
            "interests and making misrepresentations in regulatory "
            "filings, finding the sanctions proportionate."
        ),
        "label": "Good summary (accurate, emphasizes different details)",
    },
]

# Compute BLEU and ROUGE for each example
print("BLEU and ROUGE scores on legal summarization:")
print("=" * 70)

for i, example in enumerate(summarization_examples):
    bleu_result = bleu_metric.compute(
        predictions=[example["generated"]],
        references=[[example["reference"]]],
    )
    rouge_result = rouge_metric.compute(
        predictions=[example["generated"]],
        references=[example["reference"]],
    )

    print(f"\nExample {i + 1}: {example['label']}")
    print(f"  BLEU:    {bleu_result['bleu']:.4f}")
    print(f"  ROUGE-1: {rouge_result['rouge1']:.4f}")
    print(f"  ROUGE-2: {rouge_result['rouge2']:.4f}")
    print(f"  ROUGE-L: {rouge_result['rougeL']:.4f}")

print()
print("These scores reflect word overlap with the reference.")
print("Scores below 0.5 are common -- legal text uses varied phrasing.")

## Where Metrics Break Down

This is the critical section. Traditional metrics can be **actively
misleading** for legal AI. A model that scores well on BLEU and ROUGE
may still be dangerous to use.

We demonstrate two failure modes:
1. A summary with **fabricated citations** that scores high on ROUGE.
2. A **factually wrong** answer with good word overlap that scores well on BLEU.

In [None]:
# Failure mode 1: Fabricated citations score high on ROUGE
#
# The generated summary matches the reference structure and most of the
# key terms, but cites a completely fabricated case. ROUGE does not know
# whether citations are real -- it only counts word overlap.

reference_summary = (
    "The Seventh Circuit reversed the grant of summary judgment, "
    "holding that genuine issues of material fact exist regarding "
    "whether the plaintiff could perform essential job functions with "
    "reasonable accommodation. The court applied the de novo standard "
    "of review and cited Anderson v. Liberty Lobby, Inc., 477 U.S. 242 "
    "(1986) for the summary judgment framework."
)

# This summary is WRONG -- it cites a fabricated case -- but structurally
# matches the reference very well.
fabricated_citation_summary = (
    "The Seventh Circuit reversed the grant of summary judgment, "
    "holding that genuine issues of material fact exist regarding "
    "whether the plaintiff could perform essential job functions with "
    "reasonable accommodation. The court applied the de novo standard "
    "of review and cited Richardson v. National Labor Board, 485 U.S. 117 "
    "(1988) for the summary judgment framework."
)

# This summary is CORRECT but uses very different wording.
correct_but_different = (
    "An appeals court overturned the lower court's decision dismissing "
    "an ADA disability discrimination claim. The employer failed to show "
    "that modified travel requirements could not serve as an accommodation. "
    "The panel reviewed the record independently, following the precedent "
    "set in Anderson v. Liberty Lobby."
)

print("FAILURE MODE 1: Fabricated citations score high on ROUGE")
print("=" * 70)

for label, generated in [
    ("Fabricated citation (WRONG)", fabricated_citation_summary),
    ("Correct but different wording", correct_but_different),
]:
    rouge_result = rouge_metric.compute(
        predictions=[generated],
        references=[reference_summary],
    )
    print(f"\n  {label}:")
    print(f"    ROUGE-1: {rouge_result['rouge1']:.4f}")
    print(f"    ROUGE-2: {rouge_result['rouge2']:.4f}")
    print(f"    ROUGE-L: {rouge_result['rougeL']:.4f}")

print()
print("The fabricated-citation summary scores HIGHER because it copies the")
print("reference structure almost verbatim. The correct summary scores LOWER")
print("because it paraphrases. ROUGE rewards copying, not correctness.")

In [None]:
# Failure mode 2: Factually wrong answer with high BLEU
#
# The generated answer reuses many of the same words and phrases as the
# reference, but gets the critical legal conclusion wrong.

reference_answer = (
    "The court granted the preliminary injunction because the state "
    "demonstrated a likelihood of success on the merits of its Clean "
    "Water Act claims and showed that potential contamination of drinking "
    "water constitutes irreparable harm that outweighs economic burden "
    "on the defendant."
)

# WRONG: gets the outcome backwards (denied vs granted)
wrong_outcome = (
    "The court denied the preliminary injunction because the state "
    "failed to demonstrate a likelihood of success on the merits of its "
    "Clean Water Act claims. The court found that potential contamination "
    "of drinking water does not constitute irreparable harm sufficient "
    "to outweigh the economic burden on the defendant."
)

# CORRECT: right outcome, different words
correct_different_words = (
    "The judge issued an emergency order halting fracking operations. "
    "California proved it would likely win its pollution case and that "
    "groundwater poisoning posed an irreversible threat exceeding any "
    "financial harm to Pacific Coast Energy."
)

print("FAILURE MODE 2: Factually wrong answer with high BLEU")
print("=" * 70)

for label, generated in [
    ("Wrong outcome (INCORRECT)", wrong_outcome),
    ("Correct but different wording", correct_different_words),
]:
    bleu_result = bleu_metric.compute(
        predictions=[generated],
        references=[[reference_answer]],
    )
    rouge_result = rouge_metric.compute(
        predictions=[generated],
        references=[reference_answer],
    )
    print(f"\n  {label}:")
    print(f"    BLEU:    {bleu_result['bleu']:.4f}")
    print(f"    ROUGE-1: {rouge_result['rouge1']:.4f}")
    print(f"    ROUGE-L: {rouge_result['rougeL']:.4f}")

print()
print("The factually WRONG answer scores higher on both BLEU and ROUGE.")
print("It reuses the same sentence structure and vocabulary but reverses")
print("the legal outcome. Standard metrics cannot detect this -- they")
print("measure surface overlap, not semantic correctness.")
print()
print("This is why legal AI needs domain-specific metrics.")

## Citation Accuracy

Legal text makes verifiable factual claims in the form of case
citations. Unlike general text, where "correctness" is hard to define
programmatically, we can check whether a cited case actually exists.

The metric:
1. Extract all case citations from the generated text using regex.
2. Check each citation against a known corpus of real cases.
3. Compute: `citation_accuracy = verified / total`.

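A score of 1.0 means every citation matches a known case. A score of
0.0 means none do. This is a **verifiable** metric -- it checks
citations against ground truth rather than surface similarity. It
confirms only that a cited case exists, not that the case supports the
proposition it is cited for.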
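
The three steps above can be sketched with exact matching on a (volume, reporter, page) key, which is stricter than matching on the case name alone. This is a minimal sketch under stated assumptions: the names `CITE_KEY`, `citation_keys`, and `strict_citation_accuracy`, the small reporter list, and the toy corpus are all hypothetical and chosen for illustration.

```python
import re

# Illustrative pattern: volume, reporter, first page (e.g. "477 U.S. 242").
# Only a few reporters are covered in this sketch.
CITE_KEY = re.compile(r"(\d+)\s+(U\.S\.|S\.\s?Ct\.|F\.\d[a-z]{1,2})\s+(\d+)")


def citation_keys(text):
    """Return the set of (volume, reporter, page) keys found in text."""
    return {
        (int(vol), reporter.replace(" ", ""), int(page))
        for vol, reporter, page in CITE_KEY.findall(text)
    }


def strict_citation_accuracy(generated_text, known_citations):
    """Fraction of distinct cited keys that appear in the known corpus."""
    known = set()
    for cite in known_citations:
        known |= citation_keys(cite)
    found = citation_keys(generated_text)
    if not found:
        return None  # nothing to score; the caller decides how to treat this
    return len(found & known) / len(found)


toy_corpus = ["Katz v. United States, 389 U.S. 347 (1967)"]
text = (
    "See Katz v. United States, 389 U.S. 347 (1967). "
    "But see Katz v. United States, 512 U.S. 445 (1999)."
)
score = strict_citation_accuracy(text, toy_corpus)
print(score)  # 0.5: the second citation reuses a real name with a made-up volume and page
```

Keying on volume, reporter, and page catches a fabricated citation that borrows a real case name, which a name-only check would accept. Returning `None` when no citations are found is a design choice in this sketch; it keeps citation-free outputs from being counted as perfect.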

In [None]:
# Build a citation corpus from our court opinions dataset.
# In production, this would be a comprehensive legal database.

citation_corpus = set()
for opinion in opinions:
    for cite in opinion.get("citations", []):
        citation_corpus.add(cite)

# Add some well-known cases for a richer corpus
additional_citations = [
    "Marbury v. Madison, 5 U.S. 137 (1803)",
    "Brown v. Board of Education, 347 U.S. 483 (1954)",
    "Miranda v. Arizona, 384 U.S. 436 (1966)",
    "Roe v. Wade, 410 U.S. 113 (1973)",
    "Chevron U.S.A., Inc. v. NRDC, 467 U.S. 837 (1984)",
    "Celotex Corp. v. Catrett, 477 U.S. 317 (1986)",
    "Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993)",
    "Katz v. United States, 389 U.S. 347 (1967)",
    "Terry v. Ohio, 392 U.S. 1 (1968)",
    "Gideon v. Wainwright, 372 U.S. 335 (1963)",
    "Bell Atlantic Corp. v. Twombly, 550 U.S. 544 (2007)",
    "Ashcroft v. Iqbal, 556 U.S. 662 (2009)",
    "McDonnell Douglas Corp. v. Green, 411 U.S. 792 (1973)",
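    # Also cited in the test texts later in this notebook; added here so the
    # demo does not depend on the sample dataset containing them.
    "Anderson v. Liberty Lobby, Inc., 477 U.S. 242 (1986)",
    "Mapp v. Ohio, 367 U.S. 643 (1961)",
]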
citation_corpus.update(additional_citations)

print(f"Citation corpus: {len(citation_corpus)} known cases")
print()
for cite in sorted(citation_corpus)[:10]:
    print(f"  {cite}")
print(f"  ... and {len(citation_corpus) - 10} more")

In [None]:
def extract_citations(text):
    """Extract legal case citations from text using regex.

    Matches standard US case citation formats:
    - Case Name v. Case Name, VOL REPORTER PAGE (COURT YEAR)
    - Case Name v. Case Name, VOL REPORTER PAGE (YEAR)

    Supported reporters: U.S., S.Ct., F.2d, F.3d, F.4th,
    F.Supp., F.Supp.2d, F.Supp.3d, F.App'x.

    Args:
        text: Input text to search for citations.

    Returns:
        List of extracted citation strings.
    """
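    # Pattern for standard case citations:
    # Name v. Name, ### Reporter ### (Court/Year)
    # A party name is a run of capitalized tokens (plus a few lowercase
    # connectors), so a match cannot swallow the preceding sentence.
    party = (
        r"[A-Z][A-Za-z.'\-]*"
        r"(?:,?\s+(?:of|the|and|for|&|[A-Z][A-Za-z.'\-]*))*"
    )
    pattern = (
        rf"({party}\s+v\.\s+{party},\s+"
        r"\d+\s+"
        r"(?:U\.S\.|S\.\s*Ct\.|F\.\d+[a-z]*|F\.\s*(?:Supp|App)[.'\s]*\d*[a-z]*)"
        r"\s+\d+"
        r"\s*\([^)]+\))"
    )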
    matches = re.findall(pattern, text)
    # Clean up whitespace in each match
    cleaned = [" ".join(m.split()) for m in matches]
    return cleaned


# Test extraction on court opinions
print("Citation extraction tests:")
print("=" * 70)

for opinion in opinions[:3]:
    extracted = extract_citations(opinion["text"])
    print(f"\n  {opinion['case_name']}")
    print(f"  Known:     {opinion['citations']}")
    print(f"  Extracted: {extracted}")

In [None]:
def citation_accuracy(generated_text, corpus):
    """Compute citation accuracy: fraction of citations that are verified.

    Extracts all case citations from the generated text, then checks each
    against the known corpus by substring match on the case name. Volume,
    page, and year are not compared.

    Args:
        generated_text: The model output text to check.
        corpus: A set of known real citation strings.

    Returns:
        Tuple of (accuracy_score, details_dict).
        accuracy_score is in [0, 1]. Returns 1.0 if no citations found.
    """
    citations = extract_citations(generated_text)

    if not citations:
        return 1.0, {"total": 0, "verified": [], "unverified": []}

    verified = []
    unverified = []

    for cite in citations:
        # Extract the case name portion (everything before the volume number).
        # Party names may contain commas ("Liberty Lobby, Inc."), so split on
        # the ", <volume> " boundary rather than on the first comma.
        name_match = re.match(r"(.+?),\s+\d+\s", cite)
        if name_match:
            case_name = name_match.group(1).strip().rstrip(",")
            # Check if any citation in the corpus contains this case name
            found = any(case_name in known for known in corpus)
            if found:
                verified.append(cite)
            else:
                unverified.append(cite)
        else:
            unverified.append(cite)

    accuracy = len(verified) / len(citations) if citations else 1.0

    return accuracy, {
        "total": len(citations),
        "verified": verified,
        "unverified": unverified,
    }


# Test on text with real citations
real_citation_text = (
    "The summary judgment standard requires demonstrating that no genuine "
    "dispute of material fact exists. Anderson v. Liberty Lobby, Inc., "
    "477 U.S. 242 (1986). The moving party bears the initial burden. "
    "Celotex Corp. v. Catrett, 477 U.S. 317 (1986). In the Fourth "
    "Amendment context, Katz v. United States, 389 U.S. 347 (1967) "
    "established the reasonable expectation of privacy test."
)

# Test on text with fabricated citations
fabricated_citation_text = (
    "Employment discrimination claims must satisfy the burden-shifting "
    "framework. Williams v. National Employment Board, 523 U.S. 891 "
    "(2001). The employer must then articulate a legitimate reason. "
    "Roberts v. Federal Labor Commission, 498 U.S. 332 (1995). "
    "Failure to do so results in automatic liability per Davidson v. "
    "Interstate Commerce Authority, 512 U.S. 445 (1999)."
)

# Test on text with a mix
mixed_citation_text = (
    "The Fourth Amendment protects against unreasonable searches. "
    "Katz v. United States, 389 U.S. 347 (1967). The exclusionary "
    "rule was applied to the states in Mapp v. Ohio, 367 U.S. 643 "
    "(1961). Additionally, the good-faith exception was established "
    "in Patterson v. State Police Authority, 501 U.S. 223 (1990)."
)

print("Citation Accuracy Metric:")
print("=" * 70)

for label, text in [
    ("All real citations", real_citation_text),
    ("All fabricated citations", fabricated_citation_text),
    ("Mixed (2 real, 1 fabricated)", mixed_citation_text),
]:
    score, details = citation_accuracy(text, citation_corpus)
    print(f"\n  {label}:")
    print(f"    Score: {score:.2f} ({details['total']} citations found)")
    if details["verified"]:
        print(f"    Verified: {details['verified']}")
    if details["unverified"]:
        print(f"    Unverified: {details['unverified']}")

print()
print("Unlike ROUGE, this metric catches fabricated citations directly.")
print("A model that invents plausible-sounding cases scores 0.0, not high.")

## Hallucination Detection

Hallucination in legal AI means generating claims that are not grounded
in the source material. This is distinct from citation fabrication --
a model can hallucinate facts even without citing cases.

Our approach:
1. Extract factual claims from the generated text (sentences that make
   assertions about specific entities, dates, or legal conclusions).
2. Check each claim against the source context using token overlap.
3. Compute: `hallucination_rate = ungrounded / total_claims`.

This is a simple, heuristic approach. Production systems use NLI models
or LLM-based verification for more accurate grounding checks.
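
Token overlap alone can miss invented figures, because a sentence may share most of its words with the source while changing a dollar amount or a deadline. As a complement to step 2, here is a minimal sketch of a numeric check. The function name `ungrounded_numbers`, the `NUMBER` pattern, and the example strings are assumptions made for illustration.

```python
import re

# Matches integers and simple decimals or grouped numbers ("90", "2.5", "1,000").
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def ungrounded_numbers(claim, source):
    """Return numbers that appear in the claim but nowhere in the source."""
    source_numbers = set(NUMBER.findall(source))
    return [n for n in NUMBER.findall(claim) if n not in source_numbers]


source = "The plaintiff was terminated in 2019 after 12 years of employment."
claim_ok = "He worked there for 12 years before the 2019 termination."
claim_bad = "The court awarded $2.5 million and set a 90 day deadline."

print(ungrounded_numbers(claim_ok, source))   # []
print(ungrounded_numbers(claim_bad, source))  # ['2.5', '90']
```

This check compares number strings literally, so "12" and "twelve" do not match. It is a narrow supplement to the overlap heuristic, not a replacement for NLI-based verification.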

In [None]:
def extract_claims(text):
    """Extract factual claims from text.

    A simple heuristic: split into sentences, keep those that contain
    specific entities (proper nouns, dates, numbers, legal terms).

    Args:
        text: Input text.

    Returns:
        List of claim strings.
    """
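    # Split into sentences (simple heuristic). The negative lookbehind keeps
    # "v." in case names from being treated as a sentence end; other
    # abbreviations ("Inc.", "Corp.") can still cause spurious splits.
    sentences = re.split(r"(?<=[.!?])(?<!\sv\.)\s+", text.strip())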

    claims = []
    for sent in sentences:
        sent = sent.strip()
        if not sent:
            continue
        # Keep sentences that contain specific factual indicators:
        # proper nouns, numbers, dates, legal keywords
        has_proper_noun = bool(re.search(r"[A-Z][a-z]+\s+[A-Z]", sent))
        has_number = bool(re.search(r"\d+", sent))
        has_legal_term = bool(
            re.search(
                r"\b(held|ruled|found|granted|denied|reversed|affirmed|court|statute|amendment)\b",
                sent,
                re.IGNORECASE,
            )
        )
        if has_proper_noun or has_number or has_legal_term:
            claims.append(sent)

    return claims


def check_grounding(claim, source, threshold=0.5):
    """Check if a claim is grounded in source text using token overlap.

    Computes the fraction of content words in the claim that appear
    somewhere in the source text. A claim is grounded if the overlap
    ratio is at least the threshold.

    Args:
        claim: A factual claim string.
        source: The source/context text.
        threshold: Minimum overlap ratio to consider grounded.

    Returns:
        Tuple of (is_grounded, overlap_ratio).
    """
    # Tokenize into lowercase words, removing stopwords
    stopwords = {
        "the", "a", "an", "is", "are", "was", "were", "be", "been",
        "being", "have", "has", "had", "do", "does", "did", "will",
        "would", "could", "should", "may", "might", "shall", "can",
        "to", "of", "in", "for", "on", "with", "at", "by", "from",
        "as", "into", "through", "during", "before", "after", "above",
        "below", "and", "but", "or", "nor", "not", "so", "yet",
        "both", "either", "neither", "each", "every", "all", "any",
        "few", "more", "most", "other", "some", "such", "no",
        "than", "too", "very", "just", "because", "if", "when",
        "where", "how", "what", "which", "who", "whom", "this",
        "that", "these", "those", "it", "its", "he", "she", "they",
        "them", "their", "we", "our", "your", "my",
    }

    def content_words(text):
        words = re.findall(r"\b\w+\b", text.lower())
        return [w for w in words if w not in stopwords and len(w) > 2]

    claim_words = content_words(claim)
    source_words = set(content_words(source))

    if not claim_words:
        return True, 1.0

    overlap = sum(1 for w in claim_words if w in source_words)
    ratio = overlap / len(claim_words)

    return ratio >= threshold, ratio


def hallucination_rate(generated_text, source_context, threshold=0.5):
    """Compute the hallucination rate of generated text.

    Hallucination rate = (ungrounded claims) / (total claims).
    Lower is better: 0.0 means all claims are grounded in the source.

    Args:
        generated_text: The model output to check.
        source_context: The source material the model should draw from.
        threshold: Overlap threshold for grounding check.

    Returns:
        Tuple of (hallucination_rate, details_dict).
    """
    claims = extract_claims(generated_text)

    if not claims:
        return 0.0, {"total": 0, "grounded": [], "ungrounded": []}

    grounded = []
    ungrounded = []

    for claim in claims:
        is_grounded, overlap = check_grounding(claim, source_context, threshold)
        if is_grounded:
            grounded.append((claim, overlap))
        else:
            ungrounded.append((claim, overlap))

    rate = len(ungrounded) / len(claims)

    return rate, {
        "total": len(claims),
        "grounded": grounded,
        "ungrounded": ungrounded,
    }


# Demonstrate with a court opinion as source context
source = opinions[0]["text"]

# Grounded response: draws facts from the source
grounded_response = (
    "The Seventh Circuit reversed the district court's grant of summary "
    "judgment in Henderson v. Meridian Health Systems. Henderson alleged "
    "that Meridian violated the ADA by terminating his employment after "
    "he disclosed his diagnosis of multiple sclerosis. The court found "
    "genuine issues of material fact regarding whether Henderson could "
    "perform essential functions with a modified travel schedule."
)

# Hallucinated response: invents facts not in the source
hallucinated_response = (
    "The Seventh Circuit reversed the district court's grant of summary "
    "judgment in Henderson v. Meridian Health Systems. The court awarded "
    "$2.5 million in compensatory damages to Henderson. The jury found "
    "that Meridian's CEO personally ordered the termination after learning "
    "of the diagnosis. The court also imposed punitive damages of $10 million "
    "and ordered the company to implement a new disability accommodation "
    "program within 90 days."
)

print("Hallucination Detection:")
print("=" * 70)

for label, response in [
    ("Grounded response", grounded_response),
    ("Hallucinated response", hallucinated_response),
]:
    rate, details = hallucination_rate(response, source)
    print(f"\n  {label}:")
    print(f"    Hallucination rate: {rate:.2f} ({details['total']} claims)")
    print(f"    Grounded claims: {len(details['grounded'])}")
    print(f"    Ungrounded claims: {len(details['ungrounded'])}")
    if details["ungrounded"]:
        print("    Examples of ungrounded claims:")
        for claim, overlap in details["ungrounded"][:3]:
            truncated = claim[:80] + "..." if len(claim) > 80 else claim
            print(f"      - [{overlap:.2f}] {truncated}")

print()
print("The grounded response draws from the source text and has low")
print("hallucination rate. The hallucinated response invents specific")
print("facts (dollar amounts, jury findings, timelines) not in the source.")

## Exercises

### Exercise (a): Appropriate Uncertainty Metric

Design a metric for "appropriate uncertainty" -- does the model say
"I'm not sure" when it should?

Consider:
- Create a test set of questions where the correct answer involves
  uncertainty (e.g., "What will the Supreme Court rule on X?", "Is
  this contract enforceable in all states?").
- Define hedging indicators: phrases like "may", "could", "it depends",
  "in some jurisdictions", "consult an attorney".
- Define overconfidence indicators: "always", "never", "certainly",
  "guaranteed", "will definitely".
- Compute: `uncertainty_score = hedging_count / (hedging_count + overconfidence_count)`,
  and decide what to return when both counts are zero.

```python
HEDGING_PHRASES = [
    "may", "could", "might", "it depends", "varies by jurisdiction",
    "consult an attorney", "generally", "typically", "in most cases",
]
OVERCONFIDENCE_PHRASES = [
    "always", "never", "certainly", "guaranteed", "will definitely",
    "absolutely", "without exception", "in all cases",
]

def uncertainty_score(text):
    """Compute appropriate uncertainty score."""
    # Your implementation here
    pass
```

### Exercise (b): Cross-Model Metric Comparison

If you have trained models from Modules 06 and 07 (base, SFT, DPO),
generate responses from each model for the same set of legal questions,
then compare:

1. Perplexity on held-out legal text.
2. Citation accuracy on generated responses.
3. Hallucination rate when given source context.
4. ROUGE scores against reference answers.

Create a comparison table:

```python
import pandas as pd

results = {
    "Model": ["Base", "SFT", "DPO"],
    "Perplexity": [...],
    "Citation Accuracy": [...],
    "Hallucination Rate": [...],
    "ROUGE-L": [...],
}
df = pd.DataFrame(results)
print(df.to_string(index=False))
```

Which model improves most on which metric? Does alignment (DPO)
help with citation accuracy? Does SFT reduce hallucination rate?