# Local LLM Review with Ollama

This tutorial demonstrates the review evaluation path using a model hosted
locally with [Ollama](https://ollama.com). Once the model has been pulled, no
API key or internet connection is required: the model runs entirely on your
machine.

```{note}
This notebook requires Ollama to be running locally and is not executed
during the docs build. Code cells include pre-computed output.
```

## Workflow overview

1. Inspect the job directory — a synthetic RCT with realistic artifacts
2. Configure the backend to call `ollama_chat/llama3.2`
3. Run `review()`
4. Inspect the `ReviewResult`
5. Examine the output file

## Prerequisites

Install and start Ollama, then pull a model:

```bash
ollama serve          # skip if the desktop app is already running
ollama pull llama3.2  # needs the server running; use a second terminal
```
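
Before running the notebook, it can help to confirm from Python that the server is up and the model is present. The sketch below is one way to do that, assuming Ollama listens on its default address and exposes its model listing at `/api/tags`; the helper name `list_ollama_models` is hypothetical and not part of `impact-engine-evaluate`.

```python
import http.client
import json
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=2.0):
    """Return names of locally pulled models, or None if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return None
    if not isinstance(payload, dict):
        return None
    return [m["name"] for m in payload.get("models", []) if "name" in m]


models = list_ollama_models()
if models is None:
    print("Ollama is not reachable; start it with `ollama serve`.")
elif not any(name.startswith("llama3.2") for name in models):
    print("Ollama is running, but llama3.2 has not been pulled yet.")
else:
    print("Ollama is running and llama3.2 is available.")
```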

No extra Python dependencies are needed beyond the base install:

```bash
pip install impact-engine-evaluate
```

## Step 1: Inspect the job directory

The `rct_job/` directory alongside this notebook holds the artifacts of a
synthetic early-literacy RCT. It contains:

- `manifest.json` — metadata, file references, and evaluation strategy
- `impact_results.json` — summary statistics (effect estimate, CI, sample size)
- `regression_output.json` — full OLS output with balance check and attrition data

The pre-computed listing below also shows `review_result.json`, which is
written in Step 3; on a fresh checkout that file may not be present yet.

In [1]:
import json
from pathlib import Path

JOB_DIR = Path("rct_job")

print(f"Job directory: {JOB_DIR}")
print(f"Files: {sorted(p.name for p in JOB_DIR.iterdir())}")
print()
print("manifest.json:")
manifest = json.loads((JOB_DIR / "manifest.json").read_text())
print(json.dumps(manifest, indent=2))

Job directory: rct_job
Files: ['impact_results.json', 'manifest.json', 'regression_output.json', 'review_result.json']

manifest.json:
{
  "schema_version": "2.0",
  "model_type": "experiment",
  "evaluate_strategy": "review",
  "initiative_id": "literacy-rct-2024",
  "created_at": "2025-03-15T09:00:00+00:00",
  "files": {
    "impact_results": {"path": "impact_results.json", "format": "json"},
    "regression_output": {"path": "regression_output.json", "format": "json"}
  }
}
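
Since the manifest refers to its artifacts by relative path, a quick sanity check is to confirm that every referenced file exists before starting a review. The following is a minimal sketch based only on the manifest layout shown above; `missing_artifacts` is a hypothetical helper, not a function of the library.

```python
import json
from pathlib import Path


def missing_artifacts(job_dir):
    """Return the manifest entry names whose files are absent from job_dir."""
    job_dir = Path(job_dir)
    manifest = json.loads((job_dir / "manifest.json").read_text())
    return [
        name
        for name, entry in manifest.get("files", {}).items()
        if not (job_dir / entry["path"]).is_file()
    ]


# Only runs when the tutorial's job directory is present.
if Path("rct_job", "manifest.json").is_file():
    print(f"Missing artifacts: {missing_artifacts('rct_job')}")
```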


## Step 2: Configure the backend

Pass a config dict to specify the model. The litellm library routes model names
with the `ollama_chat/` prefix to Ollama's default address,
`http://localhost:11434`, so no extra configuration is needed. Swap the model
name to use any model you have pulled locally.

In [2]:
config = {
    "backend": {
        "model": "ollama_chat/llama3.2",
        "temperature": 0.0,
        "max_tokens": 2048,
    }
}
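
If you switch between several local models, a small helper keeps the prefix and defaults in one place. This is only a convenience sketch: `ollama_config` is a hypothetical name, and the defaults simply mirror the values in the cell above.

```python
PREFIX = "ollama_chat/"


def ollama_config(model_name, temperature=0.0, max_tokens=2048):
    """Build a backend config dict for a locally pulled Ollama model."""
    if not model_name:
        raise ValueError("model_name must be a non-empty string")
    # Add the routing prefix only if the caller did not already include it.
    if not model_name.startswith(PREFIX):
        model_name = PREFIX + model_name
    return {
        "backend": {
            "model": model_name,
            "temperature": temperature,
            "max_tokens": max_tokens,
        }
    }


config = ollama_config("llama3.2")
print(config["backend"]["model"])  # ollama_chat/llama3.2
```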

## Step 3: Run `review()`

The `review()` function:

1. Reads `manifest.json` and loads the registered `ExperimentReviewer`
2. Concatenates all artifact files into a single text payload
3. Renders the prompt with domain knowledge from `knowledge/`
4. Calls the model via litellm and parses the structured JSON response
5. Writes `review_result.json` back to the job directory
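
To make the second step concrete, the sketch below shows one plausible way to concatenate the artifacts named in the manifest into a single labelled text payload. It is an illustration under stated assumptions, not the library's implementation: `build_payload` and the `=== name ===` header format are hypothetical.

```python
import json
from pathlib import Path


def build_payload(job_dir):
    """Concatenate the files listed in manifest.json into one labelled text blob."""
    job_dir = Path(job_dir)
    manifest = json.loads((job_dir / "manifest.json").read_text())
    sections = []
    for name, entry in manifest["files"].items():
        text = (job_dir / entry["path"]).read_text()
        # Label each artifact so the model can tell the sections apart.
        sections.append(f"=== {name} ({entry['format']}) ===\n{text}")
    return "\n\n".join(sections)


# Only runs when the tutorial's job directory is present.
if Path("rct_job", "manifest.json").is_file():
    print(build_payload("rct_job")[:200])
```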

In [3]:
from impact_engine_evaluate.review.api import review

result = review(JOB_DIR, config=config)
print(f"Review complete. Overall score: {result.overall_score:.2f}")

Review complete. Overall score: 0.75


## Step 4: Inspect the ReviewResult

The result contains an overall score and per-dimension scores, each with a
free-text justification. The cell below prints every dimension with a
20-character text bar:

In [4]:
print(f"Initiative  : {result.initiative_id}")
print(f"Model       : {result.model}")
print(f"Prompt      : {result.prompt_name} v{result.prompt_version}")
print(f"Timestamp   : {result.timestamp}")
print(f"\nOverall score : {result.overall_score:.3f}")
print("\nDimensions:")
for dim in result.dimensions:
    bar = "#" * int(dim.score * 20)
    print(f"  {dim.name:<30} {dim.score:.3f}  |{bar:<20}|")
    print(f"    {dim.justification}")
    print()

Initiative  : literacy-rct-2024
Model       : ollama_chat/llama3.2
Prompt      : experiment_review v1.0
Timestamp   : 2026-02-28T20:44:30.894861+00:00

Overall score : 0.750

Dimensions:
  randomization_integrity        0.800  |################    |
    The study reports a balanced attrition rate (5%) and differential attrition
    test p-value of 0.61, indicating that the treatment effect estimate is
    likely unbiased under worst-case attrition scenarios.

  specification_adequacy         0.900  |##################  |
    The study uses a well-specified OLS model with robust standard errors
    (HC2 heteroskedasticity-robust) and reports the F-statistic for covariate
    balance, indicating that the model is adequately specified.

  statistical_inference          0.700  |##############      |
    The study reports a p-value of 0.007 for the treatment coefficient, which
    is statistically significant. However, the confidence intervals (CI) are
    not reported in a way that allows 

The pre-computed output above is cut off partway through the third dimension.
In full, the experiment reviewer evaluates five dimensions:

| Dimension | What it checks |
|-----------|---------------|
| `randomization_integrity` | Attrition, balance, differential dropout |
| `specification_adequacy` | OLS formula, covariates, robust SEs |
| `statistical_inference` | CIs, p-values, F-statistic, multiple testing |
| `threats_to_validity` | Spillover, non-compliance, SUTVA, Hawthorne |
| `effect_size_plausibility` | Whether the treatment effect is realistic |
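
When triaging a review, it is often useful to look at the weakest dimensions first. The sketch below filters and sorts dimensions by score. The `Dimension` dataclass is a stand-in that exposes only the `name`, `score`, and `justification` attributes used in the cell above, and the 0.75 threshold is an arbitrary assumption; with a real result you would pass `result.dimensions` instead of `sample`.

```python
from dataclasses import dataclass


@dataclass
class Dimension:
    """Stand-in with the attributes used above: name, score, justification."""
    name: str
    score: float
    justification: str = ""


def weakest_first(dimensions, threshold=0.75):
    """Return the dimensions scoring below threshold, lowest score first."""
    flagged = [d for d in dimensions if d.score < threshold]
    return sorted(flagged, key=lambda d: d.score)


# Scores copied from the three dimensions visible in the pre-computed output.
sample = [
    Dimension("randomization_integrity", 0.8),
    Dimension("specification_adequacy", 0.9),
    Dimension("statistical_inference", 0.7),
]
for dim in weakest_first(sample):
    print(f"{dim.name}: {dim.score:.3f}")  # statistical_inference: 0.700
```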

## Step 5: Examine the output file

`review()` writes `review_result.json` to the job directory alongside the
original artifacts. The manifest is treated as read-only.
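
One way to confirm the read-only behavior for yourself is to hash the manifest before and after the call. This is a minimal sketch using only the standard library; `sha256_of` is a hypothetical helper, and the `review()` call is left as a comment so the block runs without Ollama.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


manifest_path = Path("rct_job") / "manifest.json"
if manifest_path.is_file():
    before = sha256_of(manifest_path)
    # Call review(JOB_DIR, config=config) here when running the notebook.
    after = sha256_of(manifest_path)
    print("manifest unchanged:", before == after)
```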

In [5]:
print(f"Job directory contents: {sorted(p.name for p in JOB_DIR.iterdir())}")
print()
review_data = json.loads((JOB_DIR / "review_result.json").read_text())
print(f"review_result.json keys: {list(review_data.keys())}")
print(f"Overall score : {review_data['overall_score']}")
print(f"Dimensions    : {len(review_data['dimensions'])}")

Job directory contents: ['impact_results.json', 'manifest.json', 'regression_output.json', 'review_result.json']

review_result.json keys: ['initiative_id', 'prompt_name', 'prompt_version', 'backend_name', 'model', 'dimensions', 'overall_score', 'raw_response', 'timestamp']
Overall score : 0.75
Dimensions    : 5


The job directory now contains:

```
rct_job/
├── manifest.json           # read-only (created by the producer)
├── impact_results.json     # summary statistics
├── regression_output.json  # full OLS output
└── review_result.json      # structured review written by evaluate
```
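
If downstream tooling consumes `review_result.json`, a lightweight structural check can catch a malformed file early. The sketch below uses the key names from the pre-computed output above; `check_review_file` is a hypothetical helper, and the assumption that `overall_score` lies in [0, 1] is inferred from the sample output rather than from a documented schema.

```python
import json
from pathlib import Path

# Key names taken from the pre-computed output above.
EXPECTED_KEYS = {
    "initiative_id", "prompt_name", "prompt_version", "backend_name",
    "model", "dimensions", "overall_score", "raw_response", "timestamp",
}


def check_review_file(path):
    """Return a list of problems found in a review_result.json file."""
    data = json.loads(Path(path).read_text())
    problems = [f"missing key: {k}" for k in sorted(EXPECTED_KEYS - set(data))]
    score = data.get("overall_score")
    # Assumed range; adjust if the schema defines a different scale.
    if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        problems.append(f"overall_score out of range: {score!r}")
    return problems


# Only runs when a review has already been written.
result_path = Path("rct_job") / "review_result.json"
if result_path.is_file():
    print(check_review_file(result_path) or "review_result.json looks well-formed")
```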

## Summary

- `review()` runs an end-to-end LLM review of a job directory — no API key
  needed when using a local Ollama model.
- The `ollama_chat/<model>` prefix routes requests to `http://localhost:11434`
  via litellm; swap the model name to use any locally available model.
- The experiment reviewer evaluates five methodology-specific dimensions and
  returns scores with free-text justifications.
- Results are written to `review_result.json` in the job directory.
- The manifest is read-only — `evaluate` never modifies it.

For cloud-hosted models (Anthropic, OpenAI), see the
[Agentic Review](demo_agentic_review.ipynb) tutorial.