# Building an Agentic Evaluation System

The `impact-engine-evaluate` package automates evidence assessment as the bridge between the MEASURE and ALLOCATE stages of the decision pipeline. This lecture examines how it is built: not how to use it, but which design patterns make it work. These patterns matter because they recur across production agentic systems: registry dispatch, versioned prompt templates, layered specialization, and structured output parsing.

In Part I we develop the conceptual foundation for each pattern. In Part II we read the source code of the evaluate tool to see each pattern in practice.

---

## Part I: Design Patterns for Agentic Systems

Agentic systems use LLMs as reasoning components within a structured software pipeline — not open-ended chat, but constrained evaluation with typed inputs and outputs. Four design patterns recur across well-engineered agentic systems.

## 1. Registry + Dispatch

**The problem**: A system needs to route different inputs to different handlers — but hardcoding that routing creates fragile, hard-to-extend code.

**The pattern**: Each handler registers itself with a central registry, keyed by an identifier. At runtime, the system reads the identifier from the input (e.g., from a configuration file or manifest) and dispatches to the correct handler automatically.

```python
@MethodReviewerRegistry.register("experiment")
class ExperimentReviewer(MethodReviewer):
    confidence_range = (0.85, 1.0)
    ...
```

The registry pattern separates *what handlers exist* from *how they are selected*. Adding support for a new methodology means implementing a new class and registering it — the dispatch logic remains unchanged. The same pattern works for LLM backends: a `BackendRegistry` routes to Anthropic, OpenAI, or LiteLLM based on configuration.
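To make the mechanics concrete, here is a minimal, self-contained sketch of a decorator-based registry. It is not the package's implementation: `Reviewer`, `ReviewerRegistry`, and the body of `create()` are hypothetical stand-ins that only mirror the shape described above.

```python
from typing import Callable, Dict, Tuple, Type


class Reviewer:
    """Hypothetical base class; stands in for the package's MethodReviewer."""

    confidence_range: Tuple[float, float] = (0.0, 1.0)


class ReviewerRegistry:
    """Illustrative registry: maps a method name to a reviewer class."""

    _reviewers: Dict[str, Type[Reviewer]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Type[Reviewer]], Type[Reviewer]]:
        """Return a class decorator that stores the class under `name`."""

        def decorator(reviewer_cls: Type[Reviewer]) -> Type[Reviewer]:
            if name in cls._reviewers:
                raise ValueError(f"Reviewer already registered for {name!r}")
            cls._reviewers[name] = reviewer_cls
            return reviewer_cls  # the class itself is returned unchanged

        return decorator

    @classmethod
    def create(cls, name: str) -> Reviewer:
        """Instantiate the reviewer registered under `name`."""
        if name not in cls._reviewers:
            known = ", ".join(sorted(cls._reviewers)) or "none"
            raise KeyError(f"Unknown method {name!r}; registered: {known}")
        return cls._reviewers[name]()


@ReviewerRegistry.register("experiment")
class ExperimentReviewer(Reviewer):
    confidence_range = (0.85, 1.0)


# Dispatch: the identifier would normally come from a manifest or config file.
reviewer = ReviewerRegistry.create("experiment")
print(type(reviewer).__name__, reviewer.confidence_range)
```

Note that registration happens as a side effect of the class definition, so a handler is only available once its module has been imported.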

## 2. Prompt Engineering as Software

**The problem**: Prompts written inline as strings become unmaintainable — they mix concerns, lack versioning, and cannot be reviewed like code.

**The pattern**: Prompts are versioned artifacts stored in structured files (YAML, JSON) with explicit metadata. Templates use a dedicated engine (e.g., Jinja2) to inject context at runtime, separating *what to evaluate* from *how to evaluate*.

```yaml
name: experiment_review
version: "1.0"
dimensions:
  - randomization_integrity
  - statistical_inference
  ...

system: |
  You are a methodological reviewer...
  {{ knowledge_context }}

user: |
  Review the following artifact:
  {{ artifact }}
```

**Knowledge injection** is a key element: domain expertise files — markdown documents encoding design principles, common pitfalls, and diagnostic standards — are loaded from disk and injected into the prompt. This grounds the LLM's assessment in documented domain knowledge rather than relying solely on its training data.
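The sketch below illustrates the render step under simplifying assumptions. The template is a plain dict standing in for the parsed YAML file, and a small regex substitution stands in for Jinja2 so the example runs with the standard library only. `TEMPLATE`, `render`, `build_messages`, and the knowledge snippets are all hypothetical.

```python
import re
from typing import Dict

# Hypothetical template, shaped like the YAML above after parsing into a dict.
TEMPLATE = {
    "name": "experiment_review",
    "version": "1.0",
    "system": "You are a methodological reviewer.\n{{ knowledge_context }}",
    "user": "Review the following artifact:\n{{ artifact }}",
}

_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(text: str, context: Dict[str, str]) -> str:
    """Replace each {{ name }} placeholder with context[name].

    A stand-in for a real template engine. It raises KeyError on a missing
    variable instead of silently rendering an empty string.
    """
    return _PLACEHOLDER.sub(lambda match: context[match.group(1)], text)


def build_messages(template: Dict[str, str], context: Dict[str, str]) -> Dict[str, str]:
    """Render the system and user prompts from one versioned template."""
    return {
        "system": render(template["system"], context),
        "user": render(template["user"], context),
    }


# Hypothetical knowledge snippets; in the pattern above they come from files on disk.
knowledge = "\n\n".join(
    [
        "# Randomization\nCheck covariate balance across arms.",
        "# Inference\nReport confidence intervals, not only p-values.",
    ]
)
messages = build_messages(
    TEMPLATE,
    {"knowledge_context": knowledge, "artifact": "effect=0.12, ci=[0.03, 0.21]"},
)
print(messages["system"])
print(messages["user"])
```

Keeping the template text in data rather than in code means a prompt change shows up as a reviewable diff with its own version number.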

## 3. Layered Specialization

**The problem**: Multiple reviewers share common orchestration logic but differ in method-specific details. Duplicating the orchestration code across reviewers creates a maintenance burden.

**The pattern**: An abstract base class defines the interface — the contract that all reviewers must satisfy. Concrete subclasses supply only the method-specific details (prompt templates, knowledge files, confidence ranges). The orchestration layer operates against the interface, unaware of which concrete class it is using.

| Layer | Responsibility |
|-------|----------------|
| `MethodReviewer` (ABC) | Interface: `load_artifact()`, `prompt_template_dir()`, `knowledge_content_dir()`, `confidence_range` |
| `ExperimentReviewer` | Experiment-specific: prompt directory, knowledge directory, `(0.85, 1.0)` range |

Adding a new methodology (e.g., difference-in-differences) requires implementing one class with the method-specific details — the `Evaluate` and `ReviewEngine` orchestration logic remains unchanged.
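As an illustration, the following sketch shows an abstract interface with one concrete subclass. It borrows the method names from the table but is otherwise hypothetical: `BaseReviewer`, `DiffInDiffReviewer`, the directory paths, and the `(0.6, 0.8)` range are invented for the example, and `load_artifact()` is left out because its signature is not shown here.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Tuple


class BaseReviewer(ABC):
    """Hypothetical interface; method names follow the table above."""

    confidence_range: Tuple[float, float] = (0.0, 1.0)

    @abstractmethod
    def prompt_template_dir(self) -> Path:
        """Directory holding this method's prompt templates."""

    @abstractmethod
    def knowledge_content_dir(self) -> Path:
        """Directory holding this method's knowledge files."""


class DiffInDiffReviewer(BaseReviewer):
    """Hypothetical new methodology: only method-specific details live here."""

    confidence_range = (0.6, 0.8)  # illustrative values, not from the package

    def prompt_template_dir(self) -> Path:
        return Path("methods/did/templates")  # hypothetical path

    def knowledge_content_dir(self) -> Path:
        return Path("methods/did/knowledge")  # hypothetical path


def describe(reviewer: BaseReviewer) -> str:
    """Orchestration code sees only the interface, never the concrete class."""
    lo, hi = reviewer.confidence_range
    templates = reviewer.prompt_template_dir().as_posix()
    return f"templates={templates} confidence=({lo:.2f}, {hi:.2f})"


print(describe(DiffInDiffReviewer()))
```

Because the base class declares abstract methods, Python refuses to instantiate it directly, so a subclass that forgets part of the contract fails at construction time rather than midway through a review.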

## 4. Structured Output

**The problem**: LLMs produce free-form text, but downstream systems need typed, machine-readable data. Free-form output cannot be consumed reliably by code.

**The pattern**: The system constrains the LLM's output format in the prompt, parses the response into typed objects, and validates the result. A fallback parser handles format deviations:

1. The prompt specifies an exact format: `DIMENSION: / SCORE: / JUSTIFICATION:` blocks
2. A regex parser extracts per-dimension scores and justifications
3. A JSON fallback handles alternative response formats
4. All scores are clamped to `[0.0, 1.0]` and assembled into a typed `ReviewResult`

This **constrain → parse → validate** cycle is fundamental to agentic systems. The LLM provides reasoning and judgment; the surrounding code ensures the output is reliable and machine-readable.
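The four steps above can be sketched as follows. This is a simplified illustration, not the package's parser: the `Dimension` dataclass, the regex, and the JSON shape (`{"dimensions": [...]}`) are assumptions chosen for the example.

```python
import json
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Dimension:
    """Hypothetical typed record for one scored dimension."""

    name: str
    score: float
    justification: str


# One labelled block: DIMENSION / SCORE / JUSTIFICATION on consecutive lines.
_BLOCK = re.compile(
    r"DIMENSION:[ \t]*(?P<name>[^\n]+?)[ \t]*\n"
    r"SCORE:[ \t]*(?P<score>-?\d+(?:\.\d+)?)[ \t]*\n"
    r"JUSTIFICATION:[ \t]*(?P<why>.+?)(?=\nDIMENSION:|\Z)",
    re.DOTALL,
)


def clamp(score: float) -> float:
    """Force a score into the closed interval [0.0, 1.0]."""
    return max(0.0, min(1.0, score))


def parse_review(text: str) -> List[Dimension]:
    """Parse labelled blocks first, then fall back to JSON."""
    matches = list(_BLOCK.finditer(text))
    if matches:
        return [
            Dimension(m["name"].strip(), clamp(float(m["score"])), m["why"].strip())
            for m in matches
        ]
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError("Response matched neither the block format nor JSON") from exc
    # Assumed JSON shape: {"dimensions": [{"name", "score", "justification"}, ...]}
    return [
        Dimension(str(d["name"]), clamp(float(d["score"])), str(d.get("justification", "")))
        for d in payload["dimensions"]
    ]


block_response = (
    "DIMENSION: randomization_integrity\n"
    "SCORE: 0.9\n"
    "JUSTIFICATION: Assignment was randomized at the user level.\n"
    "\n"
    "DIMENSION: statistical_inference\n"
    "SCORE: 1.4\n"
    "JUSTIFICATION: Confidence intervals are reported.\n"
)
json_response = (
    '{"dimensions": [{"name": "statistical_inference", '
    '"score": -0.2, "justification": "No intervals."}]}'
)

for dim in parse_review(block_response) + parse_review(json_response):
    print(dim)
```

The out-of-range scores in the sample (`1.4` and `-0.2`) are clamped to `1.0` and `0.0`, and a response that matches neither format raises an error instead of producing a silently empty result.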

---

## Part II: Implementation

We now read the source code of the evaluate tool to see each pattern in practice. The goal is not to memorize the implementation, but to recognize the design choices that each pattern reflects.

```python
import inspect

from impact_engine_evaluate import Evaluate
from impact_engine_evaluate.review.methods.base import MethodReviewer, MethodReviewerRegistry
from impact_engine_evaluate.review.methods.experiment.reviewer import ExperimentReviewer
from impact_engine_evaluate.review.models import ReviewDimension, ReviewResult
from IPython.display import Code
```

## 1. Registry + Dispatch in Practice

The `MethodReviewerRegistry` class implements the registry pattern. It maintains a dictionary mapping method names to reviewer classes, and exposes `register()`, which takes a method name and returns a class decorator:

```python
Code(inspect.getsource(MethodReviewerRegistry), language="python")
```

The registry is populated when reviewer modules are imported. We can inspect which methods are currently registered and what confidence range each carries:

```python
for name, (lo, hi) in MethodReviewerRegistry.confidence_map().items():
    print(f"  {name}: confidence_range = ({lo:.2f}, {hi:.2f})")
```

The `ExperimentReviewer` shows how registration works in practice — a single decorator line registers the class with the registry:

```python
Code(inspect.getsource(ExperimentReviewer), language="python")
```

## 2. Prompt Engineering as Software in Practice

The prompt template for experiment review is a YAML file with explicit metadata, structured dimensions, and Jinja2 template variables. We read it directly from the package:

```python
reviewer = ExperimentReviewer()
template_path = reviewer.prompt_template_dir() / "experiment_review.yaml"
print(template_path.read_text())
```

Knowledge files ground the LLM's assessment in documented domain expertise. Each file encodes a specific aspect of experimental design:

```python
knowledge_dir = reviewer.knowledge_content_dir()
for path in sorted(knowledge_dir.iterdir()):
    if path.suffix in (".md", ".txt"):
        lines = path.read_text().splitlines()
        print(f"--- {path.name} ({len(lines)} lines) ---")
        print("\n".join(lines[:8]))
        print("...")
        print()
```

## 3. Layered Specialization in Practice

The `MethodReviewer` abstract base class defines the interface that all reviewers must satisfy. Concrete subclasses provide method-specific implementations:

```python
Code(inspect.getsource(MethodReviewer), language="python")
```

## 4. Structured Output in Practice

The `ReviewResult` and `ReviewDimension` dataclasses define the typed output structure that the parsing logic assembles from the LLM's free-form response:

```python
Code(inspect.getsource(ReviewResult), language="python")
```

```python
Code(inspect.getsource(ReviewDimension), language="python")
```

## Connecting the Patterns: Evaluate.execute

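The `Evaluate.execute` method shows how the patterns compose into the full pipeline. Layered specialization is implicit throughout: each step below works against the `MethodReviewer` interface rather than a concrete reviewer class.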

1. **Registry dispatch**: `MethodReviewerRegistry.create(manifest.model_type)` selects the reviewer
2. **Strategy dispatch**: branches on `"score"` vs `"review"` — only the confidence source differs
3. **Prompt templates + knowledge injection**: handled inside `review()`, using the reviewer's template and knowledge directories
4. **Structured output**: `review()` returns a typed `ReviewResult`; the method returns `asdict(result)`, a plain `dict` whose keys mirror the dataclass fields

```python
Code(inspect.getsource(Evaluate), language="python")
```
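The final `asdict(result)` step is worth a closer look. `dataclasses.asdict` converts a dataclass instance into plain dicts, recursing into nested dataclasses and lists. The sketch below uses hypothetical stand-ins (`Dimension`, `Result`, and their field names) rather than the package's `ReviewDimension` and `ReviewResult`.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Dimension:
    """Hypothetical stand-in for ReviewDimension."""

    name: str
    score: float
    justification: str


@dataclass
class Result:
    """Hypothetical stand-in for ReviewResult; field names are illustrative."""

    overall_score: float
    dimensions: List[Dimension] = field(default_factory=list)


result = Result(
    overall_score=0.9,
    dimensions=[Dimension("randomization_integrity", 0.9, "Balanced arms.")],
)

# asdict recurses: the nested Dimension becomes a plain dict as well.
payload = asdict(result)
print(payload)
```

The result is an ordinary `dict`, so it serializes cleanly, but any type guarantees come from the dataclass that produced it rather than from the dict itself.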

## Additional Resources

- [Anthropic: Building Effective Agents](https://www.anthropic.com/engineering/building-effective-agents) — Design patterns for LLM-powered systems
- [Anthropic: Claude Agent SDK](https://docs.anthropic.com/en/docs/agents-and-tools/claude-agent-sdk/overview) — SDK for building agents with Claude