# Building an Agentic Evaluation System

The `impact-engine-evaluate` package automates evidence assessment as the bridge between the MEASURE and ALLOCATE stages of the decision pipeline. This lecture develops both the principles and the engineering patterns that make automated evidence assessment trustworthy. The principles — failure modes, defensibility pillars, escalation, and the Assess vs. Improve discipline — apply to any agentic evaluation system. The design patterns — registry dispatch, prompt engineering as software, layered specialization, and structured output — show how these principles are instantiated in production code.

In Part I we develop the principles for trustworthy automated assessment and then examine the design patterns that implement them. In Part II we read the source code of the `impact-engine-evaluate` package to see each pattern in practice.

---

## Part I: Principles and Design Patterns

Agentic systems use LLMs as reasoning components within a structured software pipeline — not open-ended chat, but constrained evaluation with typed inputs and outputs. We develop the material in two layers. The first establishes the principles that make automated assessment trustworthy — what can go wrong and what must hold. The second examines the software design patterns that implement those principles in the `impact-engine-evaluate` package.

## 1. Principles for Trustworthy Automated Assessment

The diagnostic framework in Lecture 1 applies when a human analyst reviews a study. When we automate this review — using an LLM to assess measurement artifacts and assign a confidence score — four additional failure modes emerge that must be addressed by design.

### Four Failure Modes of Automated Confidence Scoring

| Failure Mode | What Goes Wrong | Example |
|---|---|---|
| **Ungroundedness** | The score is not traceable to any observable artifact | The system produces "confidence: 0.73" without citing any diagnostic |
| **Incorrectness** | Artifacts are present but misread | An unbalanced covariate table is narrated as supportive |
| **Opacity** | No audit trail — the score cannot be challenged or inspected | "Why 0.73 and not 0.85?" yields a plausible but unfalsifiable answer |
| **Instability** | Same evidence described slightly differently yields a different score | Confidence becomes a function of prompt phrasing, not measurement quality |

A confidence score that cannot be defended is worse than no confidence score at all. In enterprise settings where decisions are audited and revisited, these failure modes are fatal.

### Four Pillars of Defensible Confidence

Each pillar directly addresses one failure mode:

| Pillar | Principle | Failure Mode Addressed | Mechanism |
|---|---|---|---|
| **Groundedness** | Every confidence claim traces to an observable statistical artifact | Ungroundedness | Common support checks, assumption tests, robustness diagnostics |
| **Correctness** | The system interprets those artifacts accurately | Incorrectness | Evals on synthetic artifacts with known ground truth |
| **Traceability** | Full audit trail from score to source data | Opacity | Per-dimension scores linked to specific diagnostics |
| **Reproducibility** | Same pipeline + same data = same assessment | Instability | Fixed prompts, structured schemas, version-pinned backends |

The pillars have a natural dependency order. Groundedness is the precondition: without observable artifacts, there is nothing to be correct about. Correctness builds on groundedness: the system must read the evidence accurately. Traceability makes correctness inspectable: when interpretations are wrong, the audit trail reveals where. Reproducibility ensures all three hold across runs, not just on a single evaluation.

### The LLM as Narrator, Not Oracle

The key architectural implication of these pillars is a precise division of labor. The LLM's role is to *contextualize* diagnostics that the measurement engine already produces — it does not generate confidence from its own internal probabilities or invent evidence. It is a narrator reading a score sheet, not an oracle pronouncing judgment.

This bounds the LLM's contribution to interpretation within a constrained evidence set, which is what makes the output auditable. The correctness of that interpretation is then the empirical question that the evaluation framework in Lecture 3 assesses. The sections that follow examine how these four pillars are instantiated: first through the escalation patterns and the Assess vs. Improve discipline, then through the software design patterns — registry dispatch, prompt engineering, layered specialization, and structured output — that enforce the pillars in the `impact-engine-evaluate` codebase.
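The narrator constraint can be checked mechanically. The following is a minimal sketch, not the package's implementation: `DimensionScore` and `ungrounded_dimensions` are hypothetical names, and the diagnostic keys and values are illustrative. It flags any scored dimension that cites nothing, or cites a diagnostic the measurement engine never produced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionScore:
    """One scored dimension plus the diagnostics it cites (hypothetical type)."""

    name: str
    score: float
    cited_diagnostics: tuple[str, ...]


def ungrounded_dimensions(
    scores: list[DimensionScore], evidence: dict[str, float]
) -> list[str]:
    """Return dimensions that cite nothing, or cite a diagnostic not in the evidence set."""
    flagged = []
    for dim in scores:
        cites_nothing = len(dim.cited_diagnostics) == 0
        cites_unknown = any(key not in evidence for key in dim.cited_diagnostics)
        if cites_nothing or cites_unknown:
            flagged.append(dim.name)
    return flagged


# Diagnostics produced by the measurement engine (illustrative values).
evidence = {"covariate_balance_smd": 0.04, "attrition_rate": 0.02}

scores = [
    DimensionScore("randomization_integrity", 0.9, ("covariate_balance_smd",)),
    DimensionScore("statistical_inference", 0.73, ()),  # no citation: ungrounded
    DimensionScore("data_quality", 0.8, ("placebo_test_p",)),  # cites invented evidence
]

print(ungrounded_dimensions(scores, evidence))
```

A check of this kind addresses groundedness only; whether a cited diagnostic was read correctly remains a separate question.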

## 2. Evaluation Escalation: Judge, Jury, Reviewer, Debate

**The problem**: A single-pass LLM evaluation may carry systematic biases that are invisible without a second perspective. High-stakes decisions require stronger guarantees than a single-pass review provides. But adding complexity indiscriminately raises cost without commensurate benefit.

**The pattern**: Four evaluation configurations form a complexity ladder. The decision to move up the ladder is always driven by evidence from the evaluation suite — not by default.

| Pattern | Structure | What It Adds | When to Use |
|---|---|---|---|
| **Judge** | One LLM, one pass | Baseline | Default for all evaluations |
| **Jury** | Multiple LLMs, parallel | Robustness through independence | Backend sensitivity is high on specific diagnostic types |
| **Reviewer** | Two passes, sequential | Depth through self-correction | Subtle flaws are systematically missed |
| **Debate** | Two LLMs, adversarial | Rigor through opposition | High-stakes estimates that need stress-testing |

**Judge** is the right default. It is simple, fast, and fully covered by the four pillars developed above. Its limitation is that a single LLM may carry systematic biases — consistently lenient on robustness, consistently harsh on data quality — that are invisible without a second perspective.

**Jury** runs multiple LLMs independently on the same artifact with the same prompt (the Panel of LLMs pattern). Aggregating across backends dampens the idiosyncratic bias of any single model. Importantly, *disagreement* is a direct signal of rubric under-specification: the same diagnostic described the same way should not produce divergent scores. Promote to Jury when internal validity tests reveal high backend sensitivity.

**Reviewer** adds a sequential critique pass: a second LLM call evaluates whether the first pass's justification adequately addresses the diagnostic evidence and challenges any gaps. The revised output is more thorough. Promote to Reviewer when external validity tests show the system misses subtle diagnostic features.

**Debate** places two LLMs in opposing positions on the same artifact — one argues for high confidence, the other against — and a structured resolution produces the final score. This forces explicit engagement with the strongest counterargument. Reserve for high-stakes evaluations where a wrong score drives large allocation decisions.
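The Jury pattern's disagreement signal can be sketched in a few lines. This is an illustration under stated assumptions: `jury_verdict`, the backend names, and the `max_spread` threshold of 0.10 are all hypothetical, and the mean is only one of several reasonable aggregation rules.

```python
from statistics import mean


def jury_verdict(scores_by_backend: dict[str, float], max_spread: float = 0.10) -> dict:
    """Aggregate independent backend scores and flag disagreement."""
    if not scores_by_backend:
        raise ValueError("Jury needs at least one backend score")
    values = list(scores_by_backend.values())
    spread = max(values) - min(values)
    return {
        "score": mean(values),
        "spread": spread,
        # A wide spread points at the rubric, not at any single backend.
        "rubric_review_needed": spread > max_spread,
    }


agree = jury_verdict({"backend_a": 0.80, "backend_b": 0.82, "backend_c": 0.78})
disagree = jury_verdict({"backend_a": 0.90, "backend_b": 0.60})
print(agree["rubric_review_needed"], disagree["rubric_review_needed"])
```

Returning the spread alongside the aggregate keeps the disagreement visible in the audit trail instead of averaging it away.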

### Pattern Selection

The patterns form a ladder; you climb it only when the eval suite tells you to:

| Signal from Evaluation | Pattern Response |
|---|---|
| Scores are stable and correct | Stay at **Judge** |
| Backend sensitivity is high | Move to **Jury** |
| Subtle flaws are missed | Add **Reviewer** |
| High-stakes edge cases need stress-testing | Deploy **Debate** selectively |

## 3. Assess vs. Improve

**The problem**: When an evaluation system produces wrong answers, the instinct is to fix the prompts against the artifacts that revealed the problem. This overfits the fix to specific cases while potentially introducing new failures elsewhere — the same mistake made when tuning a model against its test set.

**The pattern**: Measuring how the system performs (Assess mode) and acting on what you learn (Improve mode) must be strictly separated.

| | Assess | Improve |
|---|---|---|
| **Purpose** | Measure current performance | Act on identified failure patterns |
| **Activity** | Run internal and external validity test suites; produce a scorecard | Modify prompts, refine rubrics, tighten output schemas |
| **Output** | Eval matrix: artifacts × configurations, pass/fail per cell | Updated prompt templates, rubrics, or schemas |
| **Constraint** | **Read-only.** No changes to the system under test. | Validate fixes on **held-out** artifacts — never the ones that revealed the problem. |

### Internal vs. External Validity of the Evaluation System

The Assess mode applies the same internal/external validity distinction introduced in Lecture 1 — now to the automated system itself.

**Internal validity** tests whether the system behaves coherently under variation in its own components, without requiring ground truth:
- *Run-to-run stability*: same artifact + same config → same score
- *Prompt sensitivity*: semantically equivalent prompts → consistent scores
- *Backend sensitivity*: same artifact across different LLM backends → low divergence
- *Score distribution*: does the system use the full [0, 1] range, or cluster around a narrow band?

**External validity** tests whether the system gets the right answer, using synthetic artifacts with known properties:
- *Known-flaw detection*: deliberately flawed diagnostics (poor balance, high attrition) must be flagged
- *Known-clean scoring*: a well-powered RCT with perfect diagnostics must score highly without invented concerns
- *Severity calibration*: scores must decrease monotonically as flaw severity increases

Internal validity is the precondition for external validity: a system that gives different answers on every run cannot be meaningfully tested against ground truth.
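Two of these checks reduce to simple predicates over recorded scores. The sketch below assumes hypothetical helper names and an arbitrary stability tolerance of 0.02; it treats "decrease monotonically" as non-increasing, so ties pass.

```python
def is_stable(scores: list[float], tolerance: float = 0.02) -> bool:
    """Run-to-run stability: repeated scores for one artifact stay within tolerance.

    Expects a non-empty list of scores from repeated runs.
    """
    return max(scores) - min(scores) <= tolerance


def is_monotone_decreasing(scores_by_severity: list[float]) -> bool:
    """Severity calibration: scores never increase as flaw severity increases.

    Scores must be ordered from least to most severe flaw.
    """
    return all(a >= b for a, b in zip(scores_by_severity, scores_by_severity[1:]))


# Internal validity: the same artifact scored on repeated runs.
print(is_stable([0.80, 0.81]))

# External validity: mild, moderate, severe versions of the same flaw.
print(is_monotone_decreasing([0.9, 0.7, 0.4]))
print(is_monotone_decreasing([0.9, 0.95, 0.4]))
```

Running the stability check first reflects the dependency stated above: calibration results are only interpretable once repeated runs agree.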

### Failure Pattern → Intervention

When Assess mode reveals a pattern, Improve mode responds with a targeted fix:

| Failure Pattern | Root Cause | Intervention |
|---|---|---|
| All backends misread a diagnostic type | Rubric under-specified for that artifact | Add explicit scoring criteria to the dimension prompt |
| One backend consistently misses an issue | Model lacks required capability | Simplify prompt or add backend-specific variant |
| Scores shift across equivalent prompts | Evaluation task not well-defined | Tighten rubric to leave less interpretive room |

Every intervention is validated on held-out artifacts — artifacts that were not part of the diagnosis. The Assess–Improve cycle governs how the evaluation framework matures over time without access to downstream outcome data.
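One way to keep the held-out set honest is to fix each artifact's split before any diagnosis happens. The sketch below is an assumption about how that could be done, not a description of the package: `assign_split`, the artifact IDs, and the 30% holdout fraction are hypothetical. Hashing the ID makes the assignment deterministic and independent of run order.

```python
import hashlib


def assign_split(artifact_id: str, holdout_fraction: float = 0.3) -> str:
    """Deterministically assign an artifact to 'diagnosis' or 'held_out' by its ID."""
    digest = hashlib.sha256(artifact_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "held_out" if bucket < holdout_fraction else "diagnosis"


artifact_ids = [f"synthetic-{i:03d}" for i in range(10)]
splits = {aid: assign_split(aid) for aid in artifact_ids}

for aid, split in splits.items():
    print(aid, split)
```

Because the split depends only on the ID, an artifact that revealed a problem cannot drift into the validation set later.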

## 4. Registry + Dispatch

**The problem**: A system needs to route different inputs to different handlers — but hardcoding that routing creates fragile, hard-to-extend code.

**The pattern**: Each handler registers itself with a central registry, keyed by an identifier. At runtime, the system reads the identifier from the input (e.g., from a configuration file or manifest) and dispatches to the correct handler automatically.

```python
@MethodReviewerRegistry.register("experiment")
class ExperimentReviewer(MethodReviewer):
    confidence_range = (0.85, 1.0)
    ...
```

The registry pattern separates *what handlers exist* from *how they are selected*. Adding support for a new methodology means implementing a new class and registering it — the dispatch logic remains unchanged. The same pattern works for LLM backends: a `BackendRegistry` routes to Anthropic, OpenAI, or LiteLLM based on configuration.
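To make the mechanics concrete, here is a minimal, self-contained registry. It is a simplified stand-in, not the source of `MethodReviewerRegistry` (which Part II reads directly); `HandlerRegistry` and `ExperimentHandler` are hypothetical names.

```python
from typing import Callable


class HandlerRegistry:
    """Maps string keys to handler classes (simplified stand-in)."""

    _handlers: dict[str, type] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[type], type]:
        """Class decorator: store the decorated class under `name`."""

        def decorator(handler_cls: type) -> type:
            cls._handlers[name] = handler_cls
            return handler_cls  # the class itself is returned unchanged

        return decorator

    @classmethod
    def create(cls, name: str):
        """Look up the class registered under `name` and instantiate it."""
        try:
            handler_cls = cls._handlers[name]
        except KeyError:
            known = ", ".join(sorted(cls._handlers)) or "none"
            raise KeyError(f"Unknown method {name!r}; registered: {known}") from None
        return handler_cls()


@HandlerRegistry.register("experiment")
class ExperimentHandler:
    confidence_range = (0.85, 1.0)


# The identifier would normally come from a manifest or config file.
reviewer = HandlerRegistry.create("experiment")
print(type(reviewer).__name__, reviewer.confidence_range)
```

Note that `create` contains no `if`/`elif` chain over method names: supporting a new method touches only the new class.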

## 5. Prompt Engineering as Software

**The problem**: Prompts written inline as strings become unmaintainable — they mix concerns, lack versioning, and cannot be reviewed like code.

**The pattern**: Prompts are versioned artifacts stored in structured files (YAML, JSON) with explicit metadata. Templates use a dedicated engine (e.g., Jinja2) to inject context at runtime, separating *what to evaluate* from *how to evaluate*.

```yaml
name: experiment_review
version: "1.0"
dimensions:
  - randomization_integrity
  - statistical_inference
  ...

system: |
  You are a methodological reviewer...
  {{ knowledge_context }}

user: |
  Review the following artifact:
  {{ artifact }}
```

**Knowledge injection** is a key element: domain expertise files — markdown documents encoding design principles, common pitfalls, and diagnostic standards — are loaded from disk and injected into the prompt. This grounds the LLM's assessment in documented domain knowledge rather than relying solely on its training data.
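The injection step can be sketched without any dependencies. The `render` function below is a deliberately tiny stand-in for a real template engine such as Jinja2 (it handles only `{{ name }}` placeholders), and the knowledge file names and contents are invented for illustration; in the package they are loaded from disk.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, context: dict[str, str]) -> str:
    """Replace each {{ name }} placeholder; fail loudly on a missing variable."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in context:
            raise KeyError(f"Template variable {key!r} was not provided")
        return context[key]

    return PLACEHOLDER.sub(substitute, template)


system_template = "You are a methodological reviewer.\n{{ knowledge_context }}"
user_template = "Review the following artifact:\n{{ artifact }}"

# Stand-in for markdown knowledge files read from a directory (illustrative content).
knowledge_files = {
    "randomization.md": "Check covariate balance before interpreting effects.",
    "attrition.md": "Differential attrition can bias the estimate.",
}
# Sorting the file names keeps the rendered prompt identical across runs.
knowledge_context = "\n\n".join(knowledge_files[name] for name in sorted(knowledge_files))

system_prompt = render(system_template, {"knowledge_context": knowledge_context})
user_prompt = render(user_template, {"artifact": "effect=0.12, p=0.03, n=4000"})
print(system_prompt)
print(user_prompt)
```

Failing on a missing variable, instead of silently rendering an empty string, prevents a review from running without its knowledge context.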

## 6. Layered Specialization

**The problem**: Multiple reviewers share common orchestration logic but differ in method-specific details. Duplicating the orchestration code across reviewers creates maintenance burden.

**The pattern**: An abstract base class defines the interface — the contract that all reviewers must satisfy. Concrete subclasses supply only the method-specific details (prompt templates, knowledge files, confidence ranges). The orchestration layer operates against the interface, unaware of which concrete class it is using.

| Layer | Responsibility |
|-------|----------------|
| `MethodReviewer` (ABC) | Interface: `load_artifact()`, `prompt_template_dir()`, `knowledge_content_dir()`, `confidence_range` |
| `ExperimentReviewer` | Experiment-specific: prompt directory, knowledge directory, `(0.85, 1.0)` range |

Adding a new methodology (e.g., difference-in-differences) requires implementing one class with the method-specific details — the `evaluate_confidence` function and `ReviewEngine` orchestration logic remain unchanged.
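As a sketch of what such an addition might look like, the example below defines a reduced interface and one new subclass. Everything here is hypothetical: `Reviewer` mirrors only part of the `MethodReviewer` interface, and the `DiffInDiffReviewer` paths and confidence range are illustrative values, not taken from the package.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class Reviewer(ABC):
    """Interface the orchestration layer depends on (simplified stand-in)."""

    confidence_range: tuple[float, float]

    @abstractmethod
    def prompt_template_dir(self) -> Path:
        """Directory holding this method's prompt templates."""

    @abstractmethod
    def knowledge_content_dir(self) -> Path:
        """Directory holding this method's knowledge files."""


class DiffInDiffReviewer(Reviewer):
    confidence_range = (0.60, 0.85)  # illustrative value, not from the package

    def prompt_template_dir(self) -> Path:
        return Path("did") / "templates"  # hypothetical path

    def knowledge_content_dir(self) -> Path:
        return Path("did") / "knowledge"  # hypothetical path


def describe(reviewer: Reviewer) -> str:
    """Orchestration code: written once against the interface only."""
    lo, hi = reviewer.confidence_range
    templates = reviewer.prompt_template_dir().as_posix()
    return f"{type(reviewer).__name__}: templates={templates}, range=({lo:.2f}, {hi:.2f})"


print(describe(DiffInDiffReviewer()))
```

The abstract methods also act as a safety net: a subclass that forgets one of them cannot be instantiated, so the omission surfaces at construction time and not in the middle of a review.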

## 7. Structured Output

**The problem**: LLMs produce free-form text, but downstream systems need typed, machine-readable data. Free-form output cannot be consumed reliably by code.

**The pattern**: The system constrains the LLM's output format in the prompt, then parses the response into typed objects. The pipeline runs in four steps, with a fallback for format deviations:

1. The prompt specifies an exact format: `DIMENSION: / SCORE: / JUSTIFICATION:` blocks
2. A regex parser extracts per-dimension scores and justifications
3. A JSON fallback handles alternative response formats
4. All scores are clamped to `[0.0, 1.0]` and assembled into a typed `ReviewResult`

This **constrain → parse → validate** cycle is fundamental to agentic systems. The LLM provides reasoning and judgment; the surrounding code ensures the output is reliable and machine-readable.
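The four steps above can be sketched as a single parsing function. This is an illustration, not the package's parser: the regex, the `Dimension` dataclass, and the JSON schema (a list of objects with `dimension`, `score`, and `justification` keys) are assumptions made for the example.

```python
import json
import re
from dataclasses import dataclass

# One DIMENSION / SCORE / JUSTIFICATION block; the justification runs until
# the next block starts or the text ends.
BLOCK = re.compile(
    r"DIMENSION:\s*(?P<name>.+?)\s*\n"
    r"SCORE:\s*(?P<score>-?\d+(?:\.\d+)?)\s*\n"
    r"JUSTIFICATION:\s*(?P<why>.+?)(?=\nDIMENSION:|\Z)",
    re.DOTALL,
)


@dataclass
class Dimension:
    name: str
    score: float
    justification: str


def clamp(value: float) -> float:
    """Validate step: force a score into [0.0, 1.0]."""
    return max(0.0, min(1.0, value))


def parse_response(text: str) -> list[Dimension]:
    """Try the block format first, then fall back to JSON (hypothetical schema)."""
    dims = [
        Dimension(m["name"].strip(), clamp(float(m["score"])), m["why"].strip())
        for m in BLOCK.finditer(text)
    ]
    if dims:
        return dims
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError("Response matched neither the block format nor JSON") from exc
    return [
        Dimension(str(item["dimension"]), clamp(float(item["score"])), str(item["justification"]))
        for item in payload
    ]


response = (
    "DIMENSION: randomization_integrity\nSCORE: 0.9\n"
    "JUSTIFICATION: Balance table shows SMD below 0.1.\n"
    "DIMENSION: statistical_inference\nSCORE: 1.4\n"
    "JUSTIFICATION: Intervals reported."
)
parsed = parse_response(response)
for dim in parsed:
    print(dim)
```

The out-of-range score of 1.4 in the second block is clamped to 1.0, and a response that matches neither format raises an error instead of yielding a partial result.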

---

## Part II: Implementation

We now read the source code of the `impact-engine-evaluate` package to see each pattern in practice. The goal is not to memorize the implementation but to recognize the design choices that each pattern reflects.

```python
import inspect

from impact_engine_evaluate import evaluate_confidence
from impact_engine_evaluate.api import EvaluationRouter
from impact_engine_evaluate.review.methods.base import MethodReviewer, MethodReviewerRegistry
from impact_engine_evaluate.review.methods.experiment.reviewer import ExperimentReviewer
from impact_engine_evaluate.review.models import ReviewDimension, ReviewResult
from IPython.display import Code
```

## Registry + Dispatch in Practice

The `MethodReviewerRegistry` class implements the registry pattern. It maintains a dictionary mapping method names to reviewer classes, and exposes `register()` as a class decorator:

```python
Code(inspect.getsource(MethodReviewerRegistry), language="python")
```

The registry is populated when reviewer modules are imported. We can inspect which methods are currently registered and what confidence range each carries:

```python
for name, (lo, hi) in MethodReviewerRegistry.confidence_map().items():
    print(f"  {name}: confidence_range = ({lo:.2f}, {hi:.2f})")
```

The `ExperimentReviewer` shows how registration works in practice — a single decorator line registers the class with the registry:

```python
Code(inspect.getsource(ExperimentReviewer), language="python")
```

## Prompt Engineering as Software in Practice

The prompt template for experiment review is a YAML file with explicit metadata, structured dimensions, and Jinja2 template variables. We read it directly from the package:

```python
reviewer = ExperimentReviewer()
template_path = reviewer.prompt_template_dir() / "experiment_review.yaml"
print(template_path.read_text())
```

Knowledge files ground the LLM's assessment in documented domain expertise. Each file encodes a specific aspect of experimental design:

```python
knowledge_dir = reviewer.knowledge_content_dir()
for path in sorted(knowledge_dir.iterdir()):
    if path.suffix in (".md", ".txt"):
        lines = path.read_text().splitlines()
        print(f"--- {path.name} ({len(lines)} lines) ---")
        print("\n".join(lines[:8]))
        print("...")
        print()
```

## Layered Specialization in Practice

The `MethodReviewer` abstract base class defines the interface that all reviewers must satisfy. Concrete subclasses provide method-specific implementations:

```python
Code(inspect.getsource(MethodReviewer), language="python")
```

## Structured Output in Practice

The `ReviewResult` and `ReviewDimension` dataclasses define the typed output structure that the parsing logic assembles from the LLM's free-form response:

```python
Code(inspect.getsource(ReviewResult), language="python")
```

```python
Code(inspect.getsource(ReviewDimension), language="python")
```

## Connecting the Patterns: evaluate_confidence

The `evaluate_confidence` function shows how the patterns compose into the full pipeline:

1. **Registry dispatch**: `EvaluationRouter.route()` calls `MethodReviewerRegistry.create(manifest.model_type)` to select the reviewer
2. **Strategy dispatch**: branches on `"score"` vs `"review"` — only the confidence source differs
3. **Prompt templates + knowledge injection**: handled inside `review()`, using the reviewer's template and knowledge directories
4. **Structured output**: `review()` returns a typed `ReviewResult`; the function returns a typed `EvaluateResult` with guaranteed fields

```python
Code(inspect.getsource(evaluate_confidence), language="python")
```

## Escalation and Assess vs. Improve in Practice

The `evaluate_confidence` function currently stays on the first rung of the ladder. The `evaluate_strategy` field in the manifest selects between deterministic scoring (`"score"`) and the full agentic reviewer (`"review"`). The `"review"` strategy is a **Judge**, a single LLM pass constrained to structured output, and neither branch escalates beyond a single pass.

The escalation ladder and the Assess vs. Improve discipline are implemented *around* this function, not within it:

- **Jury** would call `evaluate_confidence` with multiple different backend configurations and aggregate or compare their `ReviewResult` outputs.
- **Reviewer** would call `evaluate_confidence` once, then pass the `ReviewResult` back to a critique pass that checks whether the justifications adequately address each diagnostic.
- **Debate** would call `evaluate_confidence` twice with adversarial system prompts, then run a resolution step.

The `confidence_map()` output seen above is relevant here: the confidence ranges reflect the *hierarchy of evidence* (a structural prior about methodology quality), while the escalation pattern choice reflects *how thoroughly to interrogate a specific estimate*. These are orthogonal decisions — you can run a Debate-level review on a quasi-experimental study, or stay at Judge for a well-powered RCT.

The Assess vs. Improve discipline likewise operates around `evaluate_confidence`: Assess mode calls it repeatedly on synthetic artifacts with known properties (clean vs. flawed), records pass/fail per dimension, and produces a scorecard. Improve mode then modifies the prompt templates and rubrics based on patterns in that scorecard — and validates every change on held-out artifacts that were not part of the diagnosis. Lecture 3 demonstrates the Assess mode concretely: we run known-clean and known-flaw artifacts through the reviewer and verify that the system discriminates appropriately.
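A read-only Assess loop can be sketched as follows. This is an illustration under stated assumptions: `build_scorecard`, the `flag_below` threshold, and the synthetic cases are hypothetical, and `stub_score` stands in for a real reviewer call because the signature of `evaluate_confidence` is not reproduced here.

```python
from typing import Callable


def build_scorecard(
    score_fn: Callable[[dict], float], cases: list[dict], flag_below: float = 0.5
) -> list[dict]:
    """Assess mode: run every synthetic case, record pass/fail, change nothing."""
    rows = []
    for case in cases:
        score = score_fn(case["artifact"])
        flagged = score < flag_below
        # A case passes when flawed artifacts are flagged and clean ones are not.
        rows.append({"id": case["id"], "score": score, "passed": flagged == case["flawed"]})
    return rows


def stub_score(artifact: dict) -> float:
    """Hypothetical stand-in for a real reviewer call; it looks at balance only."""
    return 0.9 if artifact["max_smd"] < 0.1 else 0.3


cases = [
    {"id": "clean-rct", "flawed": False, "artifact": {"max_smd": 0.03, "attrition": 0.01}},
    {"id": "poor-balance", "flawed": True, "artifact": {"max_smd": 0.40, "attrition": 0.01}},
    {"id": "high-attrition", "flawed": True, "artifact": {"max_smd": 0.02, "attrition": 0.35}},
]

for row in build_scorecard(stub_score, cases):
    print(row)
```

The stub ignores attrition, so the `high-attrition` case fails. That failing cell is exactly the kind of pattern Improve mode would pick up, with the fix then validated on artifacts other than this one.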

## Additional resources

- [Anthropic: Building Effective Agents](https://www.anthropic.com/engineering/building-effective-agents) — Design patterns for LLM-powered systems
- [Anthropic: Claude Agent SDK](https://docs.anthropic.com/en/docs/agents-and-tools/claude-agent-sdk/overview) — SDK for building agents with Claude