# Agentic Review

This tutorial walks through the agentic evaluation path, which sends
measurement artifacts to an LLM for a structured, per-dimension review with
scores and justifications.

```{note}
This notebook requires an LLM API key and is not executed during the docs
build. Code cells include pre-computed output.
```

## Workflow overview

1. Prepare a job directory
2. Configure the LLM backend
3. Run `review()`
4. Inspect the `ReviewResult`
5. Examine the output files

## Prerequisites

Install the package with an LLM backend extra:

```bash
pip install "impact-engine-evaluate[anthropic]"
```

Set the API key:

```bash
export ANTHROPIC_API_KEY="sk-ant-..."
```
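
Because the notebook fails only once the LLM is called, it can help to check for the key up front. The following is a minimal sketch; `require_api_key` is a hypothetical helper written for this tutorial, not part of `impact-engine-evaluate`.

```python
import os


def require_api_key(env=os.environ, name="ANTHROPIC_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or empty.

    Hypothetical helper for this tutorial; not part of the package.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the review.")
    return key
```

Calling `require_api_key()` in the first cell surfaces a missing key before any job directory is written.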

## Step 1: Prepare a job directory

An upstream producer writes a job directory with `manifest.json` and
`impact_results.json`. Here we create one manually with realistic
experiment results.

In [1]:
import json
from pathlib import Path

job_dir = Path("/tmp/job-impact-engine-review-demo")
job_dir.mkdir(parents=True, exist_ok=True)

manifest = {
    "schema_version": "2.0",
    "model_type": "experiment",
    "evaluate_strategy": "agentic",
    "created_at": "2025-06-01T12:00:00+00:00",
    "files": {
        "impact_results": {"path": "impact_results.json", "format": "json"},
    },
}

impact_results = {
    "schema_version": "2.0",
    "model_type": "experiment",
    "ci_upper": 15.2,
    "effect_estimate": 10.5,
    "ci_lower": 5.8,
    "cost_to_scale": 250.0,
    "sample_size": 1200,
    "data": {
        "model_params": {
            "dependent_variable": "revenue",
            "treatment_variable": "treatment",
            "covariates": ["region", "segment"],
        },
        "impact_estimates": {
            "effect_estimate": 10.5,
            "ci_lower": 5.8,
            "ci_upper": 15.2,
            "p_value": 0.001,
            "standard_error": 2.4,
        },
        "model_summary": {
            "r_squared": 0.42,
            "f_statistic": 38.7,
            "n_observations": 1200,
            "n_treatment": 600,
            "n_control": 600,
        },
    },
}

(job_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
(job_dir / "impact_results.json").write_text(json.dumps(impact_results, indent=2))

print(f"Job directory: {job_dir}")
print(f"Files: {[p.name for p in sorted(job_dir.iterdir())]}")

Job directory: /tmp/job-impact-engine-review-demo
Files: ['impact_results.json', 'manifest.json']

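Before handing a job directory to the reviewer, a quick consistency check can catch a manifest that points at a file that was never written. This is a sketch under the assumption that every entry under `files` carries a relative `path`, as in the manifest above; `check_job_dir` is a hypothetical helper, not a package function.

```python
import json
from pathlib import Path


def check_job_dir(job_dir):
    """Return a list of problems found in a job directory (empty if none).

    Hypothetical helper; assumes the manifest layout used in this tutorial.
    """
    job_dir = Path(job_dir)
    manifest_path = job_dir / "manifest.json"
    if not manifest_path.is_file():
        return ["manifest.json is missing"]

    manifest = json.loads(manifest_path.read_text())
    problems = []
    # Every file the manifest declares should exist next to it.
    for name, entry in manifest.get("files", {}).items():
        if not (job_dir / entry["path"]).is_file():
            problems.append(f"declared file {name!r} not found at {entry['path']}")
    return problems
```

For the directory created above, `check_job_dir(job_dir)` should return an empty list.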

## Step 2: Configure the backend

The review engine needs to know which LLM to call. Configuration can come
from a YAML file, a dict, or environment variables. Here we use a dict.

In [2]:
config = {
    "backend": {
        "type": "anthropic",
        "model": "claude-sonnet-4-5-20250929",
        "temperature": 0.0,
        "max_tokens": 4096,
    }
}

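The same dict can also be assembled from the environment. This tutorial does not show the variable names the package itself reads, so the `REVIEW_BACKEND_*` names below are invented for illustration; the sketch only builds a plain dict with the same shape as `config`.

```python
import os


def backend_config_from_env(env=os.environ):
    """Build a config dict from hypothetical REVIEW_BACKEND_* variables.

    The variable names are assumptions made for this example; the defaults
    mirror the dict used in this tutorial.
    """
    return {
        "backend": {
            "type": env.get("REVIEW_BACKEND_TYPE", "anthropic"),
            "model": env.get("REVIEW_BACKEND_MODEL", "claude-sonnet-4-5-20250929"),
            # Environment values are strings, so numeric fields are converted.
            "temperature": float(env.get("REVIEW_BACKEND_TEMPERATURE", "0.0")),
            "max_tokens": int(env.get("REVIEW_BACKEND_MAX_TOKENS", "4096")),
        }
    }
```

With none of the variables set, the function returns a dict equal to `config` above, so it can be passed to the reviewer in the same way.
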
## Step 3: Run `review()`

The `review()` function reads the manifest, dispatches to the experiment
reviewer, renders the prompt with domain knowledge, calls the LLM, parses
the structured response, writes `review_result.json` back to the job
directory, and returns a `ReviewResult`.

In [3]:
from impact_engine_evaluate import review

result = review(job_dir, config=config)
print(f"Review complete. Overall score: {result.overall_score:.2f}")

Review complete. Overall score: 0.82

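To make the sequence of steps concrete, here is a simplified stand-in that follows the same order: read the manifest, build a prompt, call a backend, parse the response, and write `review_result.json`. It is not the package's implementation. `sketch_review`, `fake_llm`, the prompt text, and the response shape are all assumptions, and the LLM call is replaced by a stub so the code runs offline.

```python
import json
from pathlib import Path


def fake_llm(prompt):
    """Stand-in for the LLM call; returns a fixed structured response."""
    return json.dumps({"dimensions": [
        {"name": "statistical_inference", "score": 0.9, "justification": "stub"},
        {"name": "threats_to_validity", "score": 0.7, "justification": "stub"},
    ]})


def sketch_review(job_dir, llm=fake_llm):
    """Illustrative pipeline only; the real review() may differ in every step."""
    job_dir = Path(job_dir)
    # 1. Read the manifest and load the results file it declares.
    manifest = json.loads((job_dir / "manifest.json").read_text())
    results_path = job_dir / manifest["files"]["impact_results"]["path"]
    results = json.loads(results_path.read_text())
    # 2. Render a prompt for the declared model type (dispatch reduced to a string).
    prompt = f"Review this {manifest['model_type']} result:\n{json.dumps(results)}"
    # 3. Call the backend and parse its structured response.
    raw = llm(prompt)
    dimensions = json.loads(raw)["dimensions"]
    # 4. Aggregate, write the result next to the inputs, and return it.
    overall = sum(d["score"] for d in dimensions) / len(dimensions)
    review = {"overall_score": overall, "dimensions": dimensions, "raw_response": raw}
    (job_dir / "review_result.json").write_text(json.dumps(review, indent=2))
    return review
```

Passing the backend as a parameter keeps the sketch testable without network access; the manifest is only read, never written.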

## Step 4: Inspect the `ReviewResult`

The result contains per-dimension scores with justifications, an overall
score (the mean of dimension scores), and the raw LLM response for audit.

In [4]:
print(f"Initiative:     {result.initiative_id}")
print(f"Prompt:         {result.prompt_name} v{result.prompt_version}")
print(f"Backend:        {result.backend_name} ({result.model})")
print(f"Overall score:  {result.overall_score:.2f}")
print()
print("Dimensions:")
for dim in result.dimensions:
    print(f"  {dim.name:30s} {dim.score:.2f}  {dim.justification[:50]}...")

Initiative:     job-impact-engine-review-demo
Prompt:         experiment_review v1.0
Backend:        anthropic (claude-sonnet-4-5-20250929)
Overall score:  0.82

Dimensions:
  randomization_integrity        0.90  Balanced treatment/control split (600/600)...
  specification_adequacy         0.85  OLS with covariates (region, segment)...
  statistical_inference          0.88  Strong p-value (0.001), narrow CI...
  threats_to_validity            0.70  No attrition data reported...
  effect_size_plausibility       0.78  Effect of 10.5 is within plausible range...

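Since the overall score is described as the mean of the dimension scores, the figure in the output above can be checked by hand. The snippet below recomputes it from the five scores shown and also picks out the weakest dimension; it uses plain Python values rather than the `ReviewResult` object.

```python
from statistics import mean

# Dimension scores copied from the pre-computed output above.
scores = {
    "randomization_integrity": 0.90,
    "specification_adequacy": 0.85,
    "statistical_inference": 0.88,
    "threats_to_validity": 0.70,
    "effect_size_plausibility": 0.78,
}

overall_score = mean(list(scores.values()))
weakest = min(scores, key=scores.get)

print(f"{overall_score:.2f}")  # prints 0.82
print(weakest)                 # prints threats_to_validity
```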

The experiment reviewer evaluates five dimensions:

| Dimension | What it checks |
|-----------|---------------|
| `randomization_integrity` | Covariate balance between treatment and control |
| `specification_adequacy` | OLS formula, covariates, functional form |
| `statistical_inference` | CIs, p-values, F-statistic, multiple testing |
| `threats_to_validity` | Attrition, non-compliance, spillover, SUTVA |
| `effect_size_plausibility` | Whether the treatment effect is realistic |

## Step 5: Examine the output files

After the review, the evaluate stage writes `review_result.json` alongside the
original artifacts. The manifest is treated as read-only and is not modified.

In [5]:
review_data = json.loads((job_dir / "review_result.json").read_text())
print("Review result keys:", list(review_data.keys()))
print(f"Overall score: {review_data['overall_score']}")
print(f"Dimensions: {len(review_data['dimensions'])}")
print()
print("Job directory contents:")
print([p.name for p in sorted(job_dir.iterdir())])

The job directory now contains:

```
job-impact-engine-review-demo/
├── manifest.json          # read-only (created by the producer)
├── impact_results.json    # original upstream output
└── review_result.json     # structured review from the LLM
```
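
One way to confirm the read-only claim for a given run is to compare a digest of `manifest.json` taken before and after the review. This is a sketch using only the standard library; `file_digest` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


# Intended use around the review call (left commented because it needs an API key):
# before = file_digest(job_dir / "manifest.json")
# result = review(job_dir, config=config)
# assert file_digest(job_dir / "manifest.json") == before
```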

## Using the Evaluate adapter

In the orchestrator pipeline, `Evaluate.execute()` wraps `review()` behind
a unified interface. When `evaluate_strategy` is `"agentic"`, the adapter
calls `review()` internally and uses `overall_score` as the confidence
value in the common 8-key output.

In [6]:
from impact_engine_evaluate import Evaluate

evaluator = Evaluate(config=config)
output = evaluator.execute({"job_dir": str(job_dir)})
print(json.dumps(output, indent=2))

{
  "initiative_id": "job-impact-engine-review-demo",
  "confidence": 0.82,
  "cost": 250.0,
  "return_best": 15.2,
  "return_median": 10.5,
  "return_worst": 5.8,
  "model_type": "experiment",
  "sample_size": 1200
}

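In this example, the eight keys line up with the top-level fields of `impact_results.json` plus the review's overall score: `return_best` matches `ci_upper`, `return_median` matches `effect_estimate`, and `return_worst` matches `ci_lower`. The sketch below reproduces that correspondence. It is inferred from the values shown here, not taken from the adapter's source, and `to_common_output` is a hypothetical name.

```python
def to_common_output(initiative_id, impact_results, overall_score):
    """Hypothetical mapping from review inputs to the 8-key output shown above."""
    return {
        "initiative_id": initiative_id,
        "confidence": overall_score,
        "cost": impact_results["cost_to_scale"],
        "return_best": impact_results["ci_upper"],
        "return_median": impact_results["effect_estimate"],
        "return_worst": impact_results["ci_lower"],
        "model_type": impact_results["model_type"],
        "sample_size": impact_results["sample_size"],
    }
```

Called with the `impact_results` dict from Step 1 and a score of `0.82`, this returns the same key-value pairs as the adapter output printed above.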

## Summary

- `review()` runs an end-to-end LLM review of a job directory.
- The experiment reviewer evaluates five methodology-specific dimensions.
- Results are written to the job directory as `review_result.json`.
- The manifest is read-only; evaluate never modifies it.
- The `Evaluate` adapter wraps this behind a unified interface for the
  orchestrator.
- The `overall_score` from the review becomes the `confidence` value
  downstream.