# 02 - Building an Assessment Harness

## Harness Design

An assessment harness is a systematic way to test model capabilities.
Instead of ad-hoc manual testing, a harness provides:

- **Reproducibility**: same test cases, same scoring, every time.
- **Coverage**: structured test cases that cover different capabilities.
- **Automation**: scoring runs without human intervention.
- **Tracking**: results are stored and compared across model versions.

Components of a harness:
1. **Test cases**: structured inputs with expected outputs.
2. **Automated scoring**: metrics applied to each test case.
3. **Reporting**: aggregated results in tables and charts.

This follows the pattern used by projects like EleutherAI's
lm-evaluation-harness, adapted for legal domain testing.
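
The three components above can be sketched as a small loop. This is a minimal illustration rather than the notebook's implementation: `run_harness`, `summarize`, `echo_model`, and `exact_match` are hypothetical names, and the toy model and scorer stand in for a real model and real metrics.

```python
from statistics import mean


def run_harness(test_cases, generate, scorers):
    """Score every test case with every scorer and return one row per case."""
    rows = []
    for case in test_cases:
        output = generate(case["question"], case["context"])
        row = {"id": case["id"]}
        for name, scorer in scorers.items():
            row[name] = scorer(output, case)
        rows.append(row)
    return rows


def summarize(rows, metric_names):
    """Aggregate per-case rows into a mean score per metric (the report)."""
    return {name: mean(row[name] for row in rows) for name in metric_names}


# Hypothetical stand-ins for a model and a metric.
def echo_model(question, context):
    return context


def exact_match(output, case):
    return 1.0 if output == case["ground_truth"] else 0.0


cases = [
    {"id": "demo-1", "question": "q1", "context": "a", "ground_truth": "a"},
    {"id": "demo-2", "question": "q2", "context": "b", "ground_truth": "c"},
]
rows = run_harness(cases, echo_model, {"exact_match": exact_match})
report = summarize(rows, ["exact_match"])
print(rows)
print(report)
```

Keeping the model call and the scorers as plain callables means the same loop can run against simulated outputs, a local model, or an API without changing the test cases.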

**CoCounsel context:** For a legal AI product, the harness must test
citation accuracy, factual grounding, and appropriate uncertainty --
not just fluency or generic helpfulness.

In [None]:
import json
import re
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(42)

## Building Test Cases

Each test case has:
- `question`: a legal question the model should answer.
- `context`: relevant source material (from court opinions).
- `ground_truth`: the expected answer or key facts.
- `expected_citations`: list of citations the answer should include.

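We create 18 test cases spanning different legal domains: employment
law, environmental law, securities regulation, education law, patent
law, civil procedure, constitutional law, and more. The first eight are
derived from the court opinions dataset; the rest either supply their
own short context or reuse the text of an opinion.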
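
Because every scorer relies on these fields, it can help to check the schema before running anything. The sketch below is one possible check, not part of the notebook: `REQUIRED_FIELDS` and `validate_test_cases` are hypothetical names, and the example cases are made up for illustration.

```python
# Expected type for each field described above.
REQUIRED_FIELDS = {
    "id": str,
    "question": str,
    "context": str,
    "ground_truth": str,
    "expected_citations": list,
}


def validate_test_cases(cases):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        label = case.get("id", f"<index {index}>")
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in case:
                problems.append(f"{label}: missing field '{field}'")
            elif not isinstance(case[field], expected_type):
                problems.append(f"{label}: '{field}' should be {expected_type.__name__}")
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
    return problems


# Made-up cases: one valid, one missing a field, one reusing an id.
good = {
    "id": "TC-X01",
    "question": "What standard applies?",
    "context": "Some source text.",
    "ground_truth": "De novo.",
    "expected_citations": [],
}
incomplete = {
    "id": "TC-X02",
    "question": "What standard applies?",
    "context": "Some source text.",
    "ground_truth": "De novo.",
}
problems = validate_test_cases([good, incomplete, dict(good)])
for problem in problems:
    print(problem)
```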

In [None]:
# Load court opinions for context
data_path = Path("../../datasets/sample/court_opinions.jsonl")
opinions = []
with open(data_path, encoding="utf-8") as f:
    for line in f:
        if line.strip():  # skip blank lines so json.loads does not fail on them
            opinions.append(json.loads(line))

print(f"Loaded {len(opinions)} court opinions for test case context")
for op in opinions:
    print(f"  - {op['case_name']}")

In [None]:
# 18 legal test cases for the harness
test_cases = [
    # --- Cases derived from court opinions dataset ---
    {
        "id": "TC-001",
        "question": "What standard of review applies when an appellate court reviews a grant of summary judgment?",
        "context": opinions[0]["text"],
        "ground_truth": (
            "An appellate court reviews a grant of summary judgment de novo, "
            "construing all facts and drawing all reasonable inferences in "
            "favor of the nonmoving party."
        ),
        "expected_citations": ["Anderson v. Liberty Lobby, Inc., 477 U.S. 242 (1986)"],
    },
    {
        "id": "TC-002",
        "question": "What elements must a plaintiff establish in an ADA employment discrimination case?",
        "context": opinions[0]["text"],
        "ground_truth": (
            "Under the ADA, a plaintiff must show: (1) he is disabled within the "
            "meaning of the Act, (2) he is qualified to perform the essential "
            "functions of the job with or without reasonable accommodation, and "
            "(3) he suffered an adverse employment action because of his disability."
        ),
        "expected_citations": ["Hoffman v. Caterpillar, Inc., 256 F.3d 568 (7th Cir. 2001)"],
    },
    {
        "id": "TC-003",
        "question": "What factors must a court consider when deciding a motion for preliminary injunction?",
        "context": opinions[1]["text"],
        "ground_truth": (
            "A preliminary injunction requires demonstrating: (1) likelihood of "
            "success on the merits, (2) likelihood of irreparable harm absent "
            "relief, (3) balance of equities favoring the movant, and (4) an "
            "injunction serves the public interest."
        ),
        "expected_citations": ["Winter v. Natural Resources Defense Council, Inc., 555 U.S. 7 (2008)"],
    },
    {
        "id": "TC-004",
        "question": "Can potential groundwater contamination constitute irreparable harm in environmental cases?",
        "context": opinions[1]["text"],
        "ground_truth": (
            "Yes. Courts have recognized that potential contamination of "
            "groundwater supplies serving as drinking water sources constitutes "
            "irreparable harm in environmental cases."
        ),
        "expected_citations": ["Amoco Production Co. v. Village of Gambell, 480 U.S. 531 (1987)"],
    },
    {
        "id": "TC-005",
        "question": "What standard governs judicial review of SEC sanctions?",
        "context": opinions[2]["text"],
        "ground_truth": (
            "Review of SEC sanctions is deferential. Courts uphold findings of "
            "fact if supported by substantial evidence and defer to the "
            "Commission's choice of sanction unless unwarranted in law or "
            "without justification in fact."
        ),
        "expected_citations": ["Steadman v. SEC, 603 F.2d 1126 (5th Cir. 1979)"],
    },
    {
        "id": "TC-006",
        "question": "What fiduciary duties do investment advisers owe their clients regarding IPO allocations?",
        "context": opinions[2]["text"],
        "ground_truth": (
            "Investment advisers must disclose material conflicts of interest. "
            "Allocating IPO shares to proprietary accounts before satisfying "
            "client orders, combined with misrepresentations in Form ADV, "
            "demonstrates conduct incompatible with fiduciary duties."
        ),
        "expected_citations": ["SEC v. Capital Gains Research Bureau, Inc., 375 U.S. 180 (1963)"],
    },
    {
        "id": "TC-007",
        "question": "What standard of review applies to IDEA eligibility decisions?",
        "context": opinions[3]["text"],
        "ground_truth": (
            "A reviewing court applies a modified de novo standard, giving "
            "due weight to the determinations of the administrative hearing "
            "officer while independently weighing the evidence. The party "
            "challenging the administrative decision bears the burden of persuasion."
        ),
        "expected_citations": [
            "Board of Education v. Rowley, 458 U.S. 176 (1982)",
            "Schaffer ex rel. Schaffer v. Weast, 546 U.S. 49 (2005)",
        ],
    },
    {
        "id": "TC-008",
        "question": "What must a patent infringement complaint allege to survive a motion to dismiss?",
        "context": opinions[4]["text"],
        "ground_truth": (
            "A plaintiff must allege facts that plausibly establish ownership "
            "of a valid patent and infringement by the defendant. Detailed "
            "claim-by-claim analysis is not required at the pleading stage, but "
            "sufficient factual content must allow inference of infringement."
        ),
        "expected_citations": [
            "Nalco Co. v. Chem-Mod, LLC, 883 F.3d 1337 (Fed. Cir. 2018)",
            "Ashcroft v. Iqbal, 556 U.S. 662 (2009)",
        ],
    },
    # --- General legal knowledge test cases (TC-009 to TC-014 use inline context;
    # --- TC-015 to TC-018 reuse opinion text) ---
    {
        "id": "TC-009",
        "question": "What is the summary judgment standard under Federal Rule of Civil Procedure 56?",
        "context": (
            "Federal Rule of Civil Procedure 56 provides that a court shall "
            "grant summary judgment if the movant shows that there is no genuine "
            "dispute as to any material fact and the movant is entitled to "
            "judgment as a matter of law. The Supreme Court's trilogy of cases "
            "in 1986 clarified the standard."
        ),
        "ground_truth": (
            "Summary judgment is appropriate when there is no genuine dispute "
            "of material fact and the movant is entitled to judgment as a matter "
            "of law. The moving party bears the initial burden of demonstrating "
            "the absence of a genuine issue."
        ),
        "expected_citations": ["Celotex Corp. v. Catrett, 477 U.S. 317 (1986)"],
    },
    {
        "id": "TC-010",
        "question": "What constitutional amendment protects against unreasonable searches and seizures?",
        "context": (
            "The Bill of Rights contains several amendments protecting individual "
            "liberties against government intrusion. The Fourth Amendment "
            "specifically addresses the right of the people to be secure in "
            "their persons, houses, papers, and effects against unreasonable "
            "searches and seizures."
        ),
        "ground_truth": (
            "The Fourth Amendment protects against unreasonable searches and "
            "seizures. Under Katz v. United States, a search occurs when the "
            "government violates a reasonable expectation of privacy."
        ),
        "expected_citations": ["Katz v. United States, 389 U.S. 347 (1967)"],
    },
    {
        "id": "TC-011",
        "question": "What is the burden-shifting framework for employment discrimination cases?",
        "context": (
            "Employment discrimination claims under Title VII of the Civil Rights "
            "Act of 1964 often rely on circumstantial evidence. The Supreme Court "
            "established a framework for analyzing such claims that allocates "
            "burdens of production and proof between the parties."
        ),
        "ground_truth": (
            "The McDonnell Douglas framework requires the plaintiff to establish "
            "a prima facie case, then shifts the burden to the employer to "
            "articulate a legitimate nondiscriminatory reason, and finally "
            "allows the plaintiff to show pretext."
        ),
        "expected_citations": ["McDonnell Douglas Corp. v. Green, 411 U.S. 792 (1973)"],
    },
    {
        "id": "TC-012",
        "question": "What is the plausibility standard for federal complaints?",
        "context": (
            "Federal pleading standards require that a complaint contain a short "
            "and plain statement of the claim showing the pleader is entitled to "
            "relief. The Supreme Court has interpreted this requirement in recent "
            "landmark decisions that moved away from the old notice pleading standard."
        ),
        "ground_truth": (
            "A complaint must contain factual content that allows the court to "
            "draw a reasonable inference that the defendant is liable. Threadbare "
            "recitals of elements supported by conclusory statements do not suffice."
        ),
        "expected_citations": [
            "Ashcroft v. Iqbal, 556 U.S. 662 (2009)",
            "Bell Atlantic Corp. v. Twombly, 550 U.S. 544 (2007)",
        ],
    },
    {
        "id": "TC-013",
        "question": "When does the statute of limitations begin to run for latent injuries?",
        "context": (
            "Statutes of limitations set deadlines for filing lawsuits. For most "
            "torts, the clock begins at the time of injury. However, some injuries "
            "are not immediately discoverable, creating a tension between the "
            "policy of repose and the right to seek redress."
        ),
        "ground_truth": (
            "Under the discovery rule, the statute of limitations begins to run "
            "when the plaintiff knew or should have known of the injury and its "
            "cause. This tolls the limitations period for latent injuries that "
            "are not immediately apparent."
        ),
        "expected_citations": [],
    },
    {
        "id": "TC-014",
        "question": "What is qualified immunity and how does it protect government officials?",
        "context": (
            "Section 1983 of Title 42 allows individuals to sue state officials "
            "for constitutional violations. However, officials may assert "
            "affirmative defenses based on their official status. The doctrine "
            "of immunity has evolved significantly through Supreme Court decisions."
        ),
        "ground_truth": (
            "Qualified immunity shields government officials from civil liability "
            "unless their conduct violates clearly established statutory or "
            "constitutional rights of which a reasonable person would have known."
        ),
        "expected_citations": ["Harlow v. Fitzgerald, 457 U.S. 800 (1982)"],
    },
    {
        "id": "TC-015",
        "question": "What does the Clean Water Act require regarding discharge of pollutants?",
        "context": opinions[1]["text"],
        "ground_truth": (
            "The Clean Water Act prohibits the discharge of pollutants into "
            "waters of the United States without the required National Pollutant "
            "Discharge Elimination System (NPDES) permits."
        ),
        "expected_citations": [],
    },
    {
        "id": "TC-016",
        "question": "What must a school district do with independent educational evaluations under IDEA?",
        "context": opinions[3]["text"],
        "ground_truth": (
            "Under 34 C.F.R. section 300.502(c), a school district's evaluation "
            "team must consider independent evaluations submitted by parents. "
            "Failure to do so constitutes a procedural violation of the IDEA."
        ),
        "expected_citations": ["Board of Education v. Rowley, 458 U.S. 176 (1982)"],
    },
    {
        "id": "TC-017",
        "question": "Can the SEC revoke an investment adviser's registration for conflicts of interest?",
        "context": opinions[2]["text"],
        "ground_truth": (
            "Yes. The SEC has broad discretion in selecting sanctions to protect "
            "the investing public. Systematic prioritization of proprietary "
            "trading over client interests, combined with misrepresentations, "
            "can justify revocation even without actual client financial loss."
        ),
        "expected_citations": ["Steadman v. SEC, 603 F.2d 1126 (5th Cir. 1979)"],
    },
    {
        "id": "TC-018",
        "question": "What is the difference between essential and marginal job functions under the ADA?",
        "context": opinions[0]["text"],
        "ground_truth": (
            "Essential functions are fundamental duties of a position. Courts "
            "examine whether other employees in the role perform the same "
            "function, the employer's judgment, and the consequences of not "
            "requiring the function. A disputed characterization of a function "
            "as essential can defeat summary judgment."
        ),
        "expected_citations": ["Hoffman v. Caterpillar, Inc., 256 F.3d 568 (7th Cir. 2001)"],
    },
]

print(f"Created {len(test_cases)} legal test cases")
print()
for tc in test_cases:
    n_cites = len(tc["expected_citations"])
    print(f"  {tc['id']}: {tc['question'][:65]}... [{n_cites} citations]")

## Automated Scoring

We build a scoring pipeline that measures each test case on multiple
dimensions. For this notebook, we simulate model outputs rather than
requiring a running model -- the focus is on the harness design, not
the model.

Scoring dimensions:
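1. **Citation accuracy**: fraction of extracted citations whose case
   name appears in the known-citation corpus.
2. **ROUGE-L**: longest-common-subsequence overlap with the ground-truth answer.
3. **Hallucination rate**: fraction of claims whose content words are
   mostly absent from the context.
4. **Overall**: weighted combination of the three scores above.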
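
The weighted combination can be written out as a small function. The weights are the ones used by this notebook's scoring loop; the names `WEIGHTS` and `overall_score` are illustrative only.

```python
# Weights taken from the notebook's scoring loop; they sum to 1.0.
WEIGHTS = {"citation_acc": 0.35, "rouge_l": 0.25, "grounding": 0.40}


def overall_score(citation_acc, rouge_l, hallucination):
    """Combine the three metrics; hallucination is inverted so higher is better."""
    grounding = 1.0 - hallucination
    return (
        WEIGHTS["citation_acc"] * citation_acc
        + WEIGHTS["rouge_l"] * rouge_l
        + WEIGHTS["grounding"] * grounding
    )


# All citations verified, half ROUGE-L overlap, a quarter of claims ungrounded.
example = overall_score(citation_acc=1.0, rouge_l=0.5, hallucination=0.25)
print(f"{example:.3f}")

# No citations and no detected claims, with zero ROUGE-L overlap.
floor = overall_score(citation_acc=1.0, rouge_l=0.0, hallucination=0.0)
print(f"{floor:.3f}")
```

One consequence worth keeping in mind: the metric functions in this notebook return a citation accuracy of 1.0 when a response contains no citations and a hallucination rate of 0.0 when it contains no detected claims. Such a response therefore scores at least 0.75 overall regardless of ROUGE-L, so the overall score alone may not catch answers that say very little.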

In [None]:
# Re-implement the metrics from notebook 01 so this notebook is self-contained.

import evaluate

rouge_metric = evaluate.load("rouge")


# -- Citation corpus --
citation_corpus = set()
for opinion in opinions:
    for cite in opinion.get("citations", []):
        citation_corpus.add(cite)

citation_corpus.update([
    "Marbury v. Madison, 5 U.S. 137 (1803)",
    "Brown v. Board of Education, 347 U.S. 483 (1954)",
    "Miranda v. Arizona, 384 U.S. 436 (1966)",
    "Chevron U.S.A., Inc. v. NRDC, 467 U.S. 837 (1984)",
    "Celotex Corp. v. Catrett, 477 U.S. 317 (1986)",
    "Katz v. United States, 389 U.S. 347 (1967)",
    "Bell Atlantic Corp. v. Twombly, 550 U.S. 544 (2007)",
    "Ashcroft v. Iqbal, 556 U.S. 662 (2009)",
    "McDonnell Douglas Corp. v. Green, 411 U.S. 792 (1973)",
    "Harlow v. Fitzgerald, 457 U.S. 800 (1982)",
])


# -- Citation extraction --
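def extract_citations(text):
    """Extract legal case citations from text using regex (heuristic)."""
    # A party-name token is a capitalized word or a short lowercase connector.
    # Limiting the first party to such tokens stops the match from swallowing
    # the preceding sentence; a capitalized word right before the name can
    # still leak in.
    token = r"(?:[A-Z][A-Za-z.'\-]*|of|the|and|ex|rel\.|&)"
    pattern = (
        rf"\b[A-Z][A-Za-z.'\-]*(?:\s+{token})*"  # first party
        r"\s+v\."
        r"\s+[A-Z][A-Za-z.'\-\s,]+"              # second party, may contain ", Inc.,"
        r"\d+\s+"                                 # reporter volume
        r"(?:U\.S\.|S\.\s*Ct\.|F\.\d+[a-z]*|F\.\s*(?:Supp|App)[.'\s]*\d*[a-z]*)"
        r"\s+\d+"                                 # first page
        r"\s*\([^)]+\)"                           # court and year
    )
    # Collapse internal whitespace so citations split across lines compare equal.
    return [" ".join(m.group(0).split()) for m in re.finditer(pattern, text)]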


def citation_accuracy(generated_text, corpus):
    """Compute citation accuracy: verified / total citations."""
    citations = extract_citations(generated_text)
    if not citations:
        return 1.0, {"total": 0, "verified": [], "unverified": []}

    verified, unverified = [], []
    for cite in citations:
        # The case name is everything before the reporter volume, so names
        # containing commas (e.g. "Caterpillar, Inc.") are captured whole.
        name_match = re.match(r"(.+?),\s*\d+\s", cite)
        if name_match:
            case_name = name_match.group(1).strip().rstrip(",")
            if any(case_name in known for known in corpus):
                verified.append(cite)
            else:
                unverified.append(cite)
        else:
            unverified.append(cite)

    return len(verified) / len(citations), {
        "total": len(citations), "verified": verified, "unverified": unverified,
    }


# -- Hallucination detection --
def extract_claims(text):
    """Extract factual claims from text (sentences with entities/numbers/legal terms)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    claims = []
    for sent in sentences:
        sent = sent.strip()
        if not sent:
            continue
        has_indicator = (
            bool(re.search(r"[A-Z][a-z]+\s+[A-Z]", sent))
            or bool(re.search(r"\d+", sent))
            or bool(re.search(
                r"\b(held|ruled|found|granted|denied|reversed|affirmed|court|statute|amendment)\b",
                sent, re.IGNORECASE,
            ))
        )
        if has_indicator:
            claims.append(sent)
    return claims


def hallucination_rate(generated_text, source_context, threshold=0.5):
    """Compute hallucination rate: ungrounded claims / total claims."""
    stopwords = {
        "the", "a", "an", "is", "are", "was", "were", "be", "been",
        "being", "have", "has", "had", "do", "does", "did", "will",
        "would", "could", "should", "may", "might", "shall", "can",
        "to", "of", "in", "for", "on", "with", "at", "by", "from",
        "as", "into", "through", "during", "before", "after",
        "and", "but", "or", "nor", "not", "so", "yet",
        "than", "too", "very", "just", "because", "if", "when",
        "where", "how", "what", "which", "who", "whom", "this",
        "that", "these", "those", "it", "its",
    }

    def content_words(text):
        words = re.findall(r"\b\w+\b", text.lower())
        return [w for w in words if w not in stopwords and len(w) > 2]

    claims = extract_claims(generated_text)
    if not claims:
        return 0.0, {"total": 0, "grounded": [], "ungrounded": []}

    source_words = set(content_words(source_context))
    grounded, ungrounded = [], []

    for claim in claims:
        claim_words = content_words(claim)
        if not claim_words:
            grounded.append(claim)
            continue
        overlap = sum(1 for w in claim_words if w in source_words) / len(claim_words)
        if overlap >= threshold:
            grounded.append(claim)
        else:
            ungrounded.append(claim)

    return len(ungrounded) / len(claims), {
        "total": len(claims), "grounded": grounded, "ungrounded": ungrounded,
    }


print("Metrics loaded: citation_accuracy, hallucination_rate, ROUGE")
print(f"Citation corpus: {len(citation_corpus)} known cases")

In [None]:
# Simulated model outputs for each test case.
# In production, these would come from model.generate().
# We simulate a mix of good and bad outputs to demonstrate the harness.

simulated_outputs = {
    "TC-001": (
        "A grant of summary judgment is reviewed de novo on appeal. The "
        "appellate court construes all facts and draws all reasonable "
        "inferences in favor of the nonmoving party. Anderson v. Liberty "
        "Lobby, Inc., 477 U.S. 242 (1986)."
    ),
    "TC-002": (
        "Under the ADA, the plaintiff must show three elements: (1) disability "
        "within the meaning of the Act, (2) qualification to perform essential "
        "job functions with or without reasonable accommodation, and (3) adverse "
        "employment action because of the disability. Hoffman v. Caterpillar, "
        "Inc., 256 F.3d 568 (7th Cir. 2001)."
    ),
    "TC-003": (
        "A preliminary injunction requires showing: (1) likelihood of success, "
        "(2) irreparable harm, (3) balance of equities, and (4) public interest. "
        "Winter v. Natural Resources Defense Council, Inc., 555 U.S. 7 (2008)."
    ),
    "TC-004": (
        "Yes, courts have recognized groundwater contamination as irreparable "
        "harm. Amoco Production Co. v. Village of Gambell, 480 U.S. 531 (1987). "
        "The potential for drinking water contamination is sufficient."
    ),
    # TC-005: Good answer but fabricated citation
    "TC-005": (
        "SEC sanctions are reviewed under a deferential standard. Findings of "
        "fact must be supported by substantial evidence. The Commission has "
        "broad discretion in selecting sanctions. Morrison v. Securities "
        "Regulatory Board, 588 U.S. 201 (2015)."
    ),
    "TC-006": (
        "Investment advisers owe fiduciary duties to clients. Allocating IPO "
        "shares to proprietary accounts before satisfying client orders violates "
        "these duties. SEC v. Capital Gains Research Bureau, Inc., 375 U.S. 180 "
        "(1963)."
    ),
    # TC-007: Partially correct, missing one citation
    "TC-007": (
        "IDEA eligibility decisions are reviewed under a modified de novo standard. "
        "The court gives due weight to administrative determinations while "
        "independently reviewing the evidence. Board of Education v. Rowley, "
        "458 U.S. 176 (1982)."
    ),
    "TC-008": (
        "A patent infringement complaint must allege facts plausibly establishing "
        "ownership of a valid patent and infringement. Detailed claim-by-claim "
        "analysis is not required at the pleading stage. Ashcroft v. Iqbal, "
        "556 U.S. 662 (2009)."
    ),
    # TC-009: Hallucinated facts mixed with correct info
    "TC-009": (
        "Summary judgment requires no genuine dispute of material fact. The "
        "Supreme Court's 1986 trilogy established the modern standard. The "
        "movant must file within 30 days of discovery closing. Celotex Corp. "
        "v. Catrett, 477 U.S. 317 (1986). Courts grant summary judgment in "
        "approximately 40% of federal cases."
    ),
    "TC-010": (
        "The Fourth Amendment protects against unreasonable searches and seizures. "
        "Katz v. United States, 389 U.S. 347 (1967) established the reasonable "
        "expectation of privacy test."
    ),
    # TC-011: Overconfident and partially wrong
    "TC-011": (
        "The McDonnell Douglas framework always results in employer liability if "
        "the plaintiff shows any evidence of discrimination. The employer can never "
        "overcome the presumption once established. McDonnell Douglas Corp. v. "
        "Green, 411 U.S. 792 (1973)."
    ),
    "TC-012": (
        "Under Iqbal and Twombly, a complaint must contain factual content "
        "allowing the court to draw a reasonable inference of liability. "
        "Conclusory statements do not suffice. Ashcroft v. Iqbal, 556 U.S. 662 "
        "(2009). Bell Atlantic Corp. v. Twombly, 550 U.S. 544 (2007)."
    ),
    "TC-013": (
        "Under the discovery rule, the statute of limitations begins when the "
        "plaintiff knew or should have known of the injury. This tolls the "
        "limitations period for latent injuries not immediately apparent."
    ),
    # TC-014: All fabricated citations
    "TC-014": (
        "Qualified immunity protects officials from liability unless they "
        "violated clearly established rights. Thompson v. Federal Bureau of "
        "Investigations, 534 U.S. 289 (2003). The standard requires a "
        "reasonable person to have known the conduct was unlawful. Richardson "
        "v. State Law Enforcement Agency, 521 U.S. 445 (1998)."
    ),
    "TC-015": (
        "The Clean Water Act prohibits discharge of pollutants into waters "
        "of the United States without NPDES permits. Violations can result "
        "in injunctive relief and civil penalties."
    ),
    "TC-016": (
        "Under IDEA regulations, school districts must consider independent "
        "evaluations submitted by parents. Failure to do so is a procedural "
        "violation. Board of Education v. Rowley, 458 U.S. 176 (1982)."
    ),
    # TC-017: Good content, one real and one fabricated citation
    "TC-017": (
        "The SEC can revoke registration even without actual client losses. "
        "Systematic conflicts of interest and misrepresentations justify "
        "severe sanctions. Steadman v. SEC, 603 F.2d 1126 (5th Cir. 1979). "
        "See also Porter v. Investment Advisory Commission, 478 U.S. 331 (1988)."
    ),
    "TC-018": (
        "Essential functions are fundamental job duties. Courts examine whether "
        "other employees perform the function and the employer's judgment. A "
        "disputed characterization can create genuine issues of material fact "
        "defeating summary judgment."
    ),
}

print(f"Simulated outputs for {len(simulated_outputs)} test cases")

In [None]:
# Score every test case on all dimensions

results = []

for tc in test_cases:
    tc_id = tc["id"]
    output = simulated_outputs[tc_id]

    # 1. Citation accuracy
    cite_score, cite_details = citation_accuracy(output, citation_corpus)

    # 2. ROUGE-L against ground truth
    rouge_result = rouge_metric.compute(
        predictions=[output],
        references=[tc["ground_truth"]],
    )
    rouge_l = rouge_result["rougeL"]

    # 3. Hallucination rate against context
    hall_rate, hall_details = hallucination_rate(output, tc["context"])

    # 4. Overall score (weighted average)
    # Citation accuracy and grounding weighted higher for legal domain
    overall = (
        0.35 * cite_score
        + 0.25 * rouge_l
        + 0.40 * (1.0 - hall_rate)  # invert: lower hallucination = better
    )

    results.append({
        "id": tc_id,
        "question": tc["question"][:50] + "...",
        "citation_acc": cite_score,
        "rouge_l": rouge_l,
        "hallucination": hall_rate,
        "overall": overall,
        "n_citations": cite_details["total"],
        "n_unverified": len(cite_details["unverified"]),
    })

df = pd.DataFrame(results)

print("Assessment Results:")
print("=" * 95)
print(
    df[["id", "citation_acc", "rouge_l", "hallucination", "overall"]]
    .to_string(index=False, float_format="{:.3f}".format)
)
print("=" * 95)
print()
print("Summary statistics:")
for col in ["citation_acc", "rouge_l", "hallucination", "overall"]:
    print(f"  {col:>15}: mean={df[col].mean():.3f}, min={df[col].min():.3f}, max={df[col].max():.3f}")
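
The scoring loop above does not read `expected_citations`, so a response that omits a required authority (as TC-007 is annotated to do) is not penalized for the omission. One possible complement, sketched here as an assumption rather than as part of the notebook's metrics, is citation recall: the fraction of expected citations whose case name appears in the output. The helper names `case_name` and `citation_recall` are hypothetical.

```python
import re


def case_name(citation):
    """Return the case-name part of a citation (text before the reporter volume)."""
    match = re.match(r"^(.*?),\s*\d+\s", citation)
    return match.group(1) if match else citation


def citation_recall(generated_text, expected_citations):
    """Return (recall, missing) for the expected citations.

    With nothing expected, recall is 1.0, mirroring how citation_accuracy
    treats a response that contains no citations.
    """
    if not expected_citations:
        return 1.0, []
    normalized = " ".join(generated_text.split())  # collapse line breaks
    missing = [c for c in expected_citations if case_name(c) not in normalized]
    return 1.0 - len(missing) / len(expected_citations), missing


# Expected citations and simulated output for TC-007, copied from this notebook.
expected = [
    "Board of Education v. Rowley, 458 U.S. 176 (1982)",
    "Schaffer ex rel. Schaffer v. Weast, 546 U.S. 49 (2005)",
]
output = (
    "IDEA eligibility decisions are reviewed under a modified de novo standard. "
    "The court gives due weight to administrative determinations while "
    "independently reviewing the evidence. Board of Education v. Rowley, "
    "458 U.S. 176 (1982)."
)
recall, missing = citation_recall(output, expected)
print(recall, missing)
```

Matching on the case name alone is lenient: it does not check the reporter, page, or year, so a stricter variant would compare the full citation string.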

In [None]:
# Flag test cases that fail quality thresholds

THRESHOLDS = {
    "citation_acc": 0.8,   # At least 80% of citations must be verified
    "hallucination": 0.3,  # At most 30% hallucination rate
    "overall": 0.6,        # At least 0.6 overall score
}

print("Flagged test cases (below quality thresholds):")
print("=" * 80)

flagged = []
for _, row in df.iterrows():
    reasons = []
    if row["citation_acc"] < THRESHOLDS["citation_acc"] and row["n_citations"] > 0:
        reasons.append(f"citation_acc={row['citation_acc']:.2f} < {THRESHOLDS['citation_acc']}")
    if row["hallucination"] > THRESHOLDS["hallucination"]:
        reasons.append(f"hallucination={row['hallucination']:.2f} > {THRESHOLDS['hallucination']}")
    if row["overall"] < THRESHOLDS["overall"]:
        reasons.append(f"overall={row['overall']:.2f} < {THRESHOLDS['overall']}")

    if reasons:
        flagged.append(row["id"])
        print(f"\n  {row['id']}: {row['question']}")
        for reason in reasons:
            print(f"    - {reason}")

print(f"\n{len(flagged)}/{len(df)} test cases flagged")
print(f"Pass rate: {(len(df) - len(flagged)) / len(df):.0%}")

## LLM-as-Judge

Automated metrics like ROUGE and citation accuracy capture specific
dimensions of quality. But some aspects -- helpfulness, coherence,
tone -- are hard to measure with rules or word overlap.

**LLM-as-judge** uses a larger, more capable model to score a smaller
model's outputs. The judge model receives the question, the response,
and a rubric, then produces structured ratings.

Advantages:
- Captures subjective quality dimensions.
- Scales better than human annotation.
- Can be customized with domain-specific rubrics.

Limitations:
- Judge bias (a judge model may favor responses written in its own style).
- Cost (requires API calls for each judgment).
- Not a substitute for ground-truth metrics like citation accuracy.

In [None]:
# Judge prompt template for legal response quality

JUDGE_PROMPT = """Rate the following legal response on:
1. Accuracy (1-5): Are the legal citations and facts correct?
2. Helpfulness (1-5): Does it answer the question?
3. Citation quality (1-5): Are sources properly cited?

Question: {question}
Response: {response}

Provide ratings as JSON: {{"accuracy": N, "helpfulness": N, "citation_quality": N}}"""


def format_judge_prompt(question, response):
    """Format the judge prompt with a specific question and response."""
    return JUDGE_PROMPT.format(question=question, response=response)


def call_llm_judge(question, response, api_key=None):
    """Call an LLM API to judge a response.

    Requires an API key for the LLM provider (e.g., OpenAI, Anthropic).
    Returns parsed JSON ratings.

    NOTE: This function requires an API key. For offline testing,
    use mock_judge() below.
    """
    prompt = format_judge_prompt(question, response)

    if api_key is None:
        print("No API key provided. Use mock_judge() for offline testing.")
        return None

    # Example with OpenAI (uncomment and configure as needed):
    # import openai
    # client = openai.OpenAI(api_key=api_key)
    # completion = client.chat.completions.create(
    #     model="gpt-4",
    #     messages=[{"role": "user", "content": prompt}],
    #     temperature=0.0,
    # )
    # return json.loads(completion.choices[0].message.content)

    return None


def mock_judge(question, response, citation_corpus=citation_corpus):
    """A deterministic mock judge for offline testing.

    Scores based on heuristics:
    - Accuracy: based on citation verification.
    - Helpfulness: based on response length and question-word overlap.
    - Citation quality: based on number of properly formatted citations.
    """
    # Accuracy: citation-based
    cite_score, cite_details = citation_accuracy(response, citation_corpus)
    if cite_details["total"] == 0:
        accuracy = 3  # neutral if no citations
    else:
        accuracy = max(1, min(5, round(cite_score * 5)))

    # Helpfulness: does the response address the question?
    q_words = set(re.findall(r"\b\w+\b", question.lower()))
    r_words = set(re.findall(r"\b\w+\b", response.lower()))
    overlap = len(q_words & r_words) / max(len(q_words), 1)
    length_factor = min(1.0, len(response) / 200)  # reward adequate length
    helpfulness = max(1, min(5, round((overlap * 0.5 + length_factor * 0.5) * 5)))

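    # Citation quality: 0 citations -> 2, 1 -> 3, 2 -> 4, 3 or more -> 5.
    # (With a base of 3, any response with two or more citations scored 5,
    # so the min() cap never distinguished between them.)
    citations = extract_citations(response)
    if len(citations) == 0:
        citation_quality = 2
    else:
        citation_quality = min(5, 2 + len(citations))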

    return {
        "accuracy": accuracy,
        "helpfulness": helpfulness,
        "citation_quality": citation_quality,
    }


# Demonstrate the judge on a few test cases
print("LLM-as-Judge (mock) results:")
print("=" * 70)
print(f"{'ID':>7}  {'Accuracy':>8}  {'Helpful':>8}  {'Citations':>9}  {'Avg':>6}")
print("-" * 70)

judge_results = []
for tc in test_cases:
    output = simulated_outputs[tc["id"]]
    scores = mock_judge(tc["question"], output)
    avg = np.mean(list(scores.values()))
    judge_results.append({"id": tc["id"], **scores, "avg": avg})
    print(
        f"{tc['id']:>7}  {scores['accuracy']:>8}  "
        f"{scores['helpfulness']:>8}  {scores['citation_quality']:>9}  {avg:>6.1f}"
    )

print()
print("Note: The mock judge uses heuristics. A real LLM judge would")
print("provide more nuanced assessments of response quality.")
print()

# Show what the prompt looks like for one example
example_prompt = format_judge_prompt(
    test_cases[0]["question"], simulated_outputs["TC-001"]
)
print("Example judge prompt:")
print("-" * 70)
print(example_prompt)
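
A real judge model does not always reply with bare JSON; it may wrap the ratings in a sentence or return a score outside the 1-5 range, in which case a direct `json.loads` call fails or silently accepts bad data. The sketch below shows one way to guard against that. The helper name `parse_judge_reply` is hypothetical, and it assumes the ratings arrive as a single flat JSON object with the three keys used by the mock judge.

```python
import json
import re

EXPECTED_KEYS = ("accuracy", "helpfulness", "citation_quality")


def parse_judge_reply(raw_text, expected_keys=EXPECTED_KEYS):
    """Extract and validate 1-5 ratings from a judge reply (hypothetical helper).

    Returns a dict of int ratings, or None if the reply cannot be trusted.
    """
    # Take the first {...} span; assumes a flat object with no nested braces.
    match = re.search(r"\{.*?\}", raw_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        ratings = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

    cleaned = {}
    for key in expected_keys:
        value = ratings.get(key)
        # Reject missing keys, non-integers (bool is an int subclass),
        # and scores outside the 1-5 scale.
        if isinstance(value, bool) or not isinstance(value, int):
            return None
        if not 1 <= value <= 5:
            return None
        cleaned[key] = value
    return cleaned


reply = 'Sure. {"accuracy": 4, "helpfulness": 5, "citation_quality": 3} Hope this helps.'
print(parse_judge_reply(reply))
print(parse_judge_reply("I cannot rate this response."))
```

Returning `None` instead of raising lets the harness count unparseable judge replies separately rather than aborting the whole run.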

## Benchmark Gaming

**Goodhart's Law**: "When a measure becomes a target, it ceases to be
a good measure."

This applies directly to LLM benchmarks. Common gaming strategies:

### 1. Training on Benchmark Data
If a model is trained (or fine-tuned) on the exact questions from a
benchmark, it memorizes the answers rather than learning the skill.
Scores go up, but real-world performance does not.

### 2. Prompt Engineering for Benchmarks
Models and their prompts can be tuned to the specific format of
benchmark questions (e.g., multiple choice with options A-D) without
improving general capability. A model that scores well on MMLU-style
questions may still fail when asked the same question in free-form text.
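
One way to probe for format overfitting is to ask the same item in several surface formats and compare scores. The sketch below is illustrative only: `make_format_variants` is a hypothetical helper, and the sample item is not one of the notebook's test cases.

```python
import random
import string


def make_format_variants(question, options, answer, seed=0):
    """Render one item as multiple choice, shuffled multiple choice, and free-form.

    Hypothetical helper: `options` is a list of answer strings and `answer`
    is the correct one.
    """
    rng = random.Random(seed)

    def as_multiple_choice(opts):
        letters = string.ascii_uppercase[: len(opts)]
        lines = [f"{letter}. {opt}" for letter, opt in zip(letters, opts)]
        return {
            "prompt": question + "\n" + "\n".join(lines),
            "expected": letters[opts.index(answer)],
        }

    # Shuffle a copy so the caller's list is left untouched.
    shuffled = list(options)
    rng.shuffle(shuffled)

    return {
        "multiple_choice": as_multiple_choice(list(options)),
        "shuffled_choice": as_multiple_choice(shuffled),
        "free_form": {"prompt": question, "expected": answer},
    }


variants = make_format_variants(
    question=(
        "Which federal statute prohibits employment discrimination based on "
        "race, color, religion, sex, or national origin?"
    ),
    options=[
        "Title VII of the Civil Rights Act of 1964",
        "The Clean Air Act",
        "The Securities Exchange Act of 1934",
        "The Lanham Act",
    ],
    answer="Title VII of the Civil Rights Act of 1964",
)
for name, item in variants.items():
    print(f"--- {name} (expected: {item['expected']})")
    print(item["prompt"])
```

A large score gap between the variants suggests the model has learned the format rather than the underlying skill.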

### 3. Leaderboard Contamination
Even without deliberate training on a benchmark, public benchmarks tend
to end up in training data. As web-scraped datasets grow, so does the
chance that benchmark questions appeared in training, which inflates
scores across the board.
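
When a sample of the training corpus is available, verbatim leakage can be estimated with word n-gram overlap. The following is a minimal sketch under that assumption; the helper names and the choice of `n=8` are illustrative, and paraphrased copies of a question would not be caught.

```python
import re


def word_ngrams(text, n=8):
    """Return the set of lowercase word n-grams in `text`."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_overlap(test_text, corpus_text, n=8):
    """Fraction of the test text's n-grams that also occur in the corpus text."""
    test_grams = word_ngrams(test_text, n)
    if not test_grams:
        return 0.0  # text shorter than n words: nothing to compare
    return len(test_grams & word_ngrams(corpus_text, n)) / len(test_grams)


question = (
    "Under Title VII, what must a plaintiff show to establish "
    "a prima facie case of discrimination?"
)
leaked_page = "Study guide. " + question + " See the answer key below."
clean_page = "The court granted summary judgment on the patent claims."

print(ngram_overlap(question, leaked_page))  # 1.0
print(ngram_overlap(question, clean_page))   # 0.0
```

An overlap near 1.0 means the question appears almost verbatim in the corpus sample; intermediate values call for manual inspection.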

### Why This Matters for Product Teams

For a legal AI product like CoCounsel, the implications are:

- **Internal test sets must stay private.** If your assessment questions
  are public, models will eventually train on them.
- **Rotate test sets periodically.** Even private sets lose value if
  the same questions are used for too many iterations.
- **Use held-out test cases.** Never optimize directly against your
  assessment set -- use a separate validation set for model selection.
- **Combine automated and human review.** No single metric captures
  everything. Human lawyers reviewing model outputs remain essential.
- **Measure what matters in production.** Citation accuracy on real
  user queries is more informative than any benchmark score.
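
The held-out and rotation practices above can be sketched with a deterministic, hash-based split. This is an assumption-laden illustration rather than the notebook's method: `assign_split` and the `salt` parameter are hypothetical, and the IDs are assumed to follow the `TC-001` pattern for all 18 cases.

```python
import hashlib


def assign_split(case_id, salt="rotation-1", held_out_fraction=0.5):
    """Deterministically assign a test case to 'validation' or 'held_out'.

    Changing `salt` reshuffles the assignment, which gives a simple
    rotation mechanism (hypothetical scheme).
    """
    digest = hashlib.sha256(f"{salt}:{case_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "held_out" if bucket < held_out_fraction else "validation"


case_ids = [f"TC-{i:03d}" for i in range(1, 19)]  # assumed ID pattern
splits = {cid: assign_split(cid) for cid in case_ids}

validation_ids = [cid for cid, split in splits.items() if split == "validation"]
held_out_ids = [cid for cid, split in splits.items() if split == "held_out"]
print(f"validation: {len(validation_ids)}  held_out: {len(held_out_ids)}")
```

Because the assignment depends only on the ID and the salt, the split stays stable as new test cases are added, and model selection can be restricted to `validation_ids` while `held_out_ids` is reserved for final reporting.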

## Results Visualization

Good assessment reporting makes results actionable. We create:
1. A radar chart showing average scores across dimensions.
2. A per-question breakdown.
3. A comparison table summarizing pass/fail status.

In [None]:
# Radar chart of average metric scores

categories = ["Citation\nAccuracy", "ROUGE-L", "Grounding\n(1 - Halluc.)", "Overall"]
values = [
    df["citation_acc"].mean(),
    df["rouge_l"].mean(),
    1.0 - df["hallucination"].mean(),
    df["overall"].mean(),
]

# Close the radar chart
angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
values_closed = values + [values[0]]
angles_closed = angles + [angles[0]]

fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))

ax.fill(angles_closed, values_closed, color="steelblue", alpha=0.25)
ax.plot(angles_closed, values_closed, color="steelblue", linewidth=2, marker="o", markersize=8)

ax.set_xticks(angles)
ax.set_xticklabels(categories, size=12)
ax.set_ylim(0, 1)
ax.set_yticks([0.2, 0.4, 0.6, 0.8, 1.0])
ax.set_yticklabels(["0.2", "0.4", "0.6", "0.8", "1.0"], size=9, color="gray")
ax.set_title("Model Quality Radar Chart", size=14, fontweight="bold", pad=20)

# Annotate each vertex with its value
for angle, value in zip(angles, values):
    ax.annotate(
        f"{value:.2f}",
        xy=(angle, value),
        xytext=(5, 10),
        textcoords="offset points",
        fontsize=10,
        fontweight="bold",
        color="darkblue",
    )

plt.tight_layout()
plt.show()

print("The radar chart shows average scores across all test cases.")
print("Weak dimensions indicate areas for model improvement.")

In [None]:
# Per-question breakdown: horizontal bar chart

fig, axes = plt.subplots(1, 3, figsize=(18, 8))

metrics_to_plot = [
    ("citation_acc", "Citation Accuracy", "steelblue"),
    ("rouge_l", "ROUGE-L", "#2ecc71"),
    ("hallucination", "Hallucination Rate", "#e74c3c"),
]

for ax, (col, title, color) in zip(axes, metrics_to_plot):
    y_pos = range(len(df))
    ax.barh(y_pos, df[col], color=color, alpha=0.7)
    ax.set_yticks(y_pos)
    ax.set_yticklabels(df["id"], fontsize=9)
    ax.set_xlabel(title)
    ax.set_title(title)
    ax.set_xlim(0, 1)
    ax.invert_yaxis()

    # Add a threshold line where one is defined. ROUGE-L has none, so the
    # legend is only drawn for panels that actually have a labeled line.
    thresholds = {"hallucination": 0.3, "citation_acc": 0.8}
    if col in thresholds:
        ax.axvline(x=thresholds[col], color="black", linestyle="--", alpha=0.5, label="Threshold")
        ax.legend(loc="lower right", fontsize=9)

    ax.grid(axis="x", alpha=0.3)

plt.suptitle("Per-Question Score Breakdown", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()

print("Each row is one test case. Dashed lines mark the quality thresholds")
print("for citation accuracy (0.8) and hallucination rate (0.3).")
print("For hallucination rate, LOWER is better (bars ending left of the dashed line).")
print("For citation accuracy, HIGHER is better (bars reaching the dashed line or beyond).")

In [None]:
# Final summary table combining automated metrics and judge scores

judge_df = pd.DataFrame(judge_results)
combined = df.merge(judge_df[["id", "accuracy", "helpfulness", "citation_quality"]], on="id")

# Add a PASS / WARN / FAIL status column
def assess_status(row):
    if row["citation_acc"] < 0.8 and row["n_citations"] > 0:
        return "FAIL"
    if row["hallucination"] > 0.3:
        return "WARN"
    if row["overall"] < 0.6:
        return "WARN"
    return "PASS"

combined["status"] = combined.apply(assess_status, axis=1)

print("Combined Results: Automated Metrics + Judge Scores")
print("=" * 100)
display_cols = [
    "id", "citation_acc", "rouge_l", "hallucination",
    "accuracy", "helpfulness", "citation_quality", "status",
]
print(
    combined[display_cols].to_string(
        index=False,
        float_format="{:.2f}".format,
    )
)
print("=" * 100)

status_counts = combined["status"].value_counts()
print("\nStatus breakdown:")
for status in ["PASS", "WARN", "FAIL"]:
    count = status_counts.get(status, 0)
    print(f"  {status}: {count}/{len(combined)}")

print(f"\nOverall pass rate: {status_counts.get('PASS', 0) / len(combined):.0%}")
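
With only 18 test cases, a single pass rate is a noisy estimate, so a small difference between two model versions may not be meaningful. One common way to express that uncertainty is a Wilson score interval. The sketch below is an optional addition; `wilson_interval` is a hypothetical helper and the counts in the usage line are made up for illustration.

```python
import math


def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    if total == 0:
        return (0.0, 1.0)  # no data: the rate is unconstrained
    p = passes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Illustrative counts only, not results from this notebook.
low, high = wilson_interval(passes=12, total=18)
print(f"pass rate 12/18 = {12 / 18:.0%}, interval {low:.0%} to {high:.0%}")
```

In the notebook, `status_counts.get("PASS", 0)` and `len(combined)` could be passed in place of the illustrative counts.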

## Exercises

### Exercise (a): Add a Conciseness Dimension

Add a new scoring dimension: "conciseness". A good legal response
is thorough but not unnecessarily verbose.

1. Modify the judge prompt to include a conciseness rating (1-5).
2. Implement a heuristic conciseness metric: penalize responses that
   are more than 2x the length of the ground truth.
3. Add it to the scoring pipeline and results table.

```python
JUDGE_PROMPT_V2 = """Rate the following legal response on:
1. Accuracy (1-5): Are the legal citations and facts correct?
2. Helpfulness (1-5): Does it answer the question?
3. Citation quality (1-5): Are sources properly cited?
4. Conciseness (1-5): Is the response appropriately concise?

Question: {question}
Response: {response}

Provide ratings as JSON:
{{"accuracy": N, "helpfulness": N, "citation_quality": N, "conciseness": N}}"""

def conciseness_score(response, ground_truth):
    """Score conciseness from the response/ground-truth word-count ratio.

    Mild penalty above 1.5x, steeper penalties above 2x and 3x.
    """
    ratio = len(response.split()) / max(len(ground_truth.split()), 1)
    if ratio <= 1.5:
        return 1.0
    elif ratio <= 2.0:
        return 0.8
    elif ratio <= 3.0:
        return 0.5
    else:
        return 0.2
```

### Exercise (b): Preventing Benchmark Contamination

Discuss: If the test cases in this notebook were made public (e.g.,
published in a paper or open-source repository), how would you
prevent benchmark contamination? Consider:

1. **Detection**: How would you check if a model has seen these
   test cases during training? (Hint: canary strings, perplexity
   analysis on exact test case text.)

2. **Prevention**: What operational practices would you implement?
   (Hint: private test sets, periodic rotation, held-out splits.)

3. **Mitigation**: If contamination is suspected, how would you
   adjust scores? (Hint: compare performance on contaminated vs.
   fresh test cases, decontamination benchmarks.)

4. **Design**: How would you design test cases that are harder to
   contaminate? (Hint: procedurally generated, parameterized
   questions, adversarial variations.)
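
As a starting point for item 4, the sketch below generates parameterized questions from a seed, so fresh variants can be produced for each assessment round while the ground truth is computed rather than stored. Everything here is hypothetical: the template, the `GEN-` ID scheme, and the overtime rule, which is stated as an assumption inside the question itself.

```python
import random

TEMPLATE = (
    "An employee worked {hours} hours in one week at a regular rate of "
    "${rate:.2f} per hour. Assuming overtime is paid at 1.5x the regular "
    "rate for hours over 40, what is the employee's gross pay for the week?"
)


def generate_case(seed):
    """Build one parameterized test case; the same seed yields the same case."""
    rng = random.Random(seed)
    hours = rng.randint(41, 60)
    rate = rng.choice([15.0, 18.5, 22.0, 30.0])
    # Ground truth is derived from the parameters, not written by hand.
    gross = 40 * rate + (hours - 40) * rate * 1.5
    return {
        "id": f"GEN-{seed:04d}",
        "question": TEMPLATE.format(hours=hours, rate=rate),
        "ground_truth": f"${gross:,.2f}",
        "params": {"hours": hours, "rate": rate},
    }


for seed in range(3):
    case = generate_case(seed)
    print(case["id"], "|", case["question"])
    print("   ground truth:", case["ground_truth"])
```

Because the surface text changes with every seed, memorizing one published variant does not help on the next, although a model could still learn the template itself.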