# Agentic Review

This tutorial walks through the agentic evaluation path. It sends measurement
artifacts to an LLM for structured, per-dimension review with scores and
justifications.

```{note}
This notebook requires an LLM API key and is not executed during the docs
build. Code cells include pre-computed output.
```

## Workflow overview

1. Prepare a job directory
2. Configure the LLM backend
3. Run `evaluate_confidence()`
4. Inspect the `EvaluateResult`
5. Examine the output files

## Prerequisites

Install the package with an LLM backend extra:

```bash
pip install "impact-engine-evaluate[anthropic]"
```

Set the API key:

```bash
export ANTHROPIC_API_KEY="sk-ant-..."
```
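
A missing key tends to surface only once the LLM call is made, so it can help to check for it at the top of the notebook. The following is a minimal sketch; `require_api_key` is a hypothetical helper written for this tutorial, not part of `impact-engine-evaluate`.

```python
import os


def require_api_key(var_name: str = "ANTHROPIC_API_KEY") -> str:
    """Return the key stored in `var_name`, or raise if it is missing or empty.

    Hypothetical helper for this tutorial; the package does not ship it.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before running the review."
        )
    return key
```

Calling `require_api_key()` before Step 3 fails fast with a readable message instead of an error raised deep inside the backend call.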

## Step 1: Prepare a job directory

An upstream producer writes a job directory with `manifest.json` and
`impact_results.json`. Here we create one manually with realistic
experiment results.

In [1]:
import json
from pathlib import Path

job_dir = Path("/tmp/job-impact-engine-review-demo")
job_dir.mkdir(exist_ok=True)

manifest = {
    "model_type": "experiment",
    "evaluate_strategy": "agentic",
    "created_at": "2025-06-01T12:00:00+00:00",
    "files": {
        "impact_results": {"path": "impact_results.json", "format": "json"},
    },
}

impact_results = {
    "model_type": "experiment",
    "ci_upper": 15.2,
    "effect_estimate": 10.5,
    "ci_lower": 5.8,
    "cost_to_scale": 250.0,
    "sample_size": 1200,
    "data": {
        "model_params": {
            "dependent_variable": "revenue",
            "treatment_variable": "treatment",
            "covariates": ["region", "segment"],
        },
        "impact_estimates": {
            "effect_estimate": 10.5,
            "ci_lower": 5.8,
            "ci_upper": 15.2,
            "p_value": 0.001,
            "standard_error": 2.4,
        },
        "model_summary": {
            "r_squared": 0.42,
            "f_statistic": 38.7,
            "n_observations": 1200,
            "n_treatment": 600,
            "n_control": 600,
        },
    },
}

(job_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
(job_dir / "impact_results.json").write_text(json.dumps(impact_results, indent=2))

print(f"Job directory: {job_dir}")
print(f"Files: {[p.name for p in sorted(job_dir.iterdir())]}")

Job directory: /tmp/job-impact-engine-review-demo
Files: ['impact_results.json', 'manifest.json']
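
Before handing a job directory to the reviewer, it may be worth confirming that every file listed in the manifest is actually on disk. The sketch below assumes the manifest layout used in this step (a `files` mapping whose entries carry a `path`); `missing_artifacts` is a hypothetical helper, not a function from the package.

```python
import json
from pathlib import Path


def missing_artifacts(job_dir: Path) -> list[str]:
    """Return the names of expected files that are absent from `job_dir`.

    Assumes the manifest shape shown above: {"files": {name: {"path": ...}}}.
    """
    manifest_path = job_dir / "manifest.json"
    if not manifest_path.exists():
        return ["manifest.json"]

    manifest = json.loads(manifest_path.read_text())
    missing = []
    for entry in manifest.get("files", {}).values():
        # Paths in the manifest are treated as relative to the job directory.
        if not (job_dir / entry["path"]).exists():
            missing.append(entry["path"])
    return missing
```

For the directory created above, `missing_artifacts(job_dir)` should return an empty list, since both files were just written.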


## Step 2: Configure the backend

The review engine needs to know which LLM to call. The recommended approach
is a YAML config file — reusable across many jobs and easy to swap backends:

```yaml
backend:
  model: "claude-sonnet-4-6"   # or "ollama_chat/llama3.2" for local
  temperature: 0.0
  max_tokens: 4096
```

Pass the file path to `evaluate_confidence()`. A dict also works for quick
experiments.

In [None]:
config = {
    "backend": {
        "model": "claude-sonnet-4-6",
        "temperature": 0.0,
        "max_tokens": 4096,
    }
}
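
When switching between a hosted and a local model, a small builder keeps the dict in the same shape as the YAML file. This is only a sketch: `make_config` is a hypothetical helper, and the defaults simply mirror the values shown above.

```python
def make_config(model: str, temperature: float = 0.0, max_tokens: int = 4096) -> dict:
    """Build a config dict with the same shape as the YAML example.

    Hypothetical convenience function; only the dict layout comes from the tutorial.
    """
    return {
        "backend": {
            "model": model,
            "temperature": temperature,
            "max_tokens": max_tokens,
        }
    }


# The two model names are the ones mentioned in the YAML comment above.
hosted_config = make_config("claude-sonnet-4-6")
local_config = make_config("ollama_chat/llama3.2")
```

Either dict can then be passed wherever `config` is used in the following steps.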

## Step 3: Run `evaluate_confidence()`

`evaluate_confidence(config, job_dir)` is the package-level entry point.
It reads the manifest, dispatches to the registered reviewer, renders the
prompt with domain knowledge, calls the LLM, parses the structured response,
and writes `evaluate_result.json` and `review_result.json` to the job directory.

In [None]:
from impact_engine_evaluate import evaluate_confidence

result = evaluate_confidence(config, job_dir)
print(f"Review complete. Overall score: {result.confidence:.2f}")

## Step 4: Inspect the EvaluateResult

The result contains the confidence score, strategy, and a full per-dimension
breakdown accessible via `result.report`:

In [None]:
print(f"Initiative : {result.initiative_id}")
print(f"Strategy   : {result.strategy}")
print(f"Confidence : {result.confidence:.2f}")
print(f"Range      : [{result.confidence_range[0]:.2f}, {result.confidence_range[1]:.2f}]")
print()
print("Dimensions (from result.report):")
for dim in result.report["dimensions"]:
    print(f"  {dim['name']:30s} {dim['score']:.2f}  {dim['justification'][:50]}...")

The experiment reviewer evaluates five dimensions:

| Dimension | What it checks |
|-----------|---------------|
| `randomization_integrity` | Covariate balance between treatment and control |
| `specification_adequacy` | OLS formula, covariates, functional form |
| `statistical_inference` | CIs, p-values, F-statistic, multiple testing |
| `threats_to_validity` | Attrition, non-compliance, spillover, SUTVA |
| `effect_size_plausibility` | Whether the treatment effect is realistic |
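
One way to use the per-dimension breakdown is to surface the lowest-scoring dimensions first. The sketch below works on a placeholder report whose scores are made up for illustration; it assumes that higher scores are better and that each entry has the `name`, `score`, and `justification` keys printed above. The `0.6` threshold is arbitrary.

```python
# Placeholder data with invented scores; a real report comes from result.report.
sample_report = {
    "dimensions": [
        {"name": "randomization_integrity", "score": 0.80, "justification": "placeholder"},
        {"name": "specification_adequacy", "score": 0.70, "justification": "placeholder"},
        {"name": "statistical_inference", "score": 0.75, "justification": "placeholder"},
        {"name": "threats_to_validity", "score": 0.40, "justification": "placeholder"},
        {"name": "effect_size_plausibility", "score": 0.55, "justification": "placeholder"},
    ]
}


def flag_weak_dimensions(report: dict, threshold: float = 0.6) -> list[dict]:
    """Return dimensions scoring below `threshold`, lowest score first."""
    weak = [dim for dim in report["dimensions"] if dim["score"] < threshold]
    return sorted(weak, key=lambda dim: dim["score"])


for dim in flag_weak_dimensions(sample_report):
    print(f"{dim['name']:30s} {dim['score']:.2f}")
```

With a real result, passing `result.report` in place of `sample_report` would list the dimensions that most deserve a closer read of their justifications.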

## Step 5: Examine the output files

After review, the evaluate stage writes `evaluate_result.json` and
`review_result.json` alongside the original artifacts. The manifest is treated
as read-only and is not modified.

In [None]:
review_data = json.loads((job_dir / "review_result.json").read_text())
print("Review result keys:", list(review_data.keys()))
print(f"Overall score: {review_data['overall_score']}")
print(f"Dimensions: {len(review_data['dimensions'])}")
print()
print("Job directory contents:")
print([p.name for p in sorted(job_dir.iterdir())])

The job directory now contains:

```
job-impact-engine-review-demo/
├── manifest.json          # read-only (created by the producer)
├── impact_results.json    # original upstream output
├── evaluate_result.json   # written by evaluate_confidence()
└── review_result.json     # structured review from the LLM
```
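
The read-only claim about the manifest can be checked by comparing a hash of the file before and after the review. The following is a sketch using only the standard library; `file_digest` and `manifest_unchanged` are hypothetical helpers, and the review call is passed in as a callable so the check itself does not depend on the package.

```python
import hashlib
from pathlib import Path
from typing import Callable


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def manifest_unchanged(job_dir: Path, action: Callable[[], object]) -> bool:
    """Run `action` and report whether manifest.json has the same bytes afterwards."""
    manifest_path = job_dir / "manifest.json"
    before = file_digest(manifest_path)
    action()  # e.g. the review call for this job directory
    return file_digest(manifest_path) == before
```

In this tutorial the check could be run as `manifest_unchanged(job_dir, lambda: evaluate_confidence(config, job_dir))`, which should return `True` if the manifest is left untouched.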

## Pipeline integration

In the orchestrator pipeline, the `Evaluate` component (defined in `impact_engine_orchestrator`) wraps `evaluate_confidence()` behind a unified `PipelineComponent` interface. It accepts `event["job_dir"]` and returns the same `EvaluateResult` fields as a plain dict.

The `evaluate` package is a pure science library — orchestration adapters live in the orchestrator.

In [None]:
from impact_engine_orchestrator.components.evaluate.evaluate import Evaluate

evaluator = Evaluate(config=config)
output = evaluator.execute({"job_dir": str(job_dir)})
print(json.dumps(output, indent=2))
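
To make the adapter idea concrete, here is a rough sketch of what such a wrapper could look like. It is not the orchestrator's actual `Evaluate` class: `EvaluateAdapter` and `StubResult` are hypothetical, the stub only mirrors the fields printed in Step 4, and the evaluation function is injected so the example runs without an LLM.

```python
from dataclasses import asdict, dataclass, field
from typing import Callable


@dataclass
class StubResult:
    """Stand-in carrying the EvaluateResult fields printed in Step 4 (assumed shape)."""
    initiative_id: str
    strategy: str
    confidence: float
    confidence_range: tuple
    report: dict = field(default_factory=dict)


class EvaluateAdapter:
    """Hypothetical adapter: takes an event dict, returns the result as a plain dict."""

    def __init__(self, config: dict, evaluate_fn: Callable):
        self.config = config
        self.evaluate_fn = evaluate_fn

    def execute(self, event: dict) -> dict:
        # Delegate to the injected function, then flatten the dataclass to a dict.
        result = self.evaluate_fn(self.config, event["job_dir"])
        return asdict(result)


def fake_evaluate(config: dict, job_dir: str) -> StubResult:
    """Placeholder for the real entry point; returns made-up values."""
    return StubResult("demo", "agentic", 0.5, (0.4, 0.6), {"dimensions": []})


adapter = EvaluateAdapter(config={"backend": {}}, evaluate_fn=fake_evaluate)
output = adapter.execute({"job_dir": "/tmp/job-impact-engine-review-demo"})
```

Keeping the adapter this thin matches the split described above: the science stays in the `evaluate` package and the pipeline glue stays in the orchestrator.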

## Summary

- `evaluate_confidence(config, job_dir)` is the package-level entry point —
  symmetric with `evaluate_impact()` in the measure component.
- The config file specifies the LLM backend; the job directory is a per-call
  runtime argument pointing to existing artifacts.
- The experiment reviewer evaluates five methodology-specific dimensions; the
  overall confidence score is returned as `result.confidence`.
- Per-dimension detail is accessible via `result.report["dimensions"]`.
- Results are written to `evaluate_result.json` and `review_result.json`
  in the job directory.
- The manifest is read-only — evaluate never modifies it.
- For pipeline use, the orchestrator's `Evaluate` component wraps
  `evaluate_confidence()` behind a `PipelineComponent` interface.