35 changes: 21 additions & 14 deletions .github/workflows/eval-nightly.yml
@@ -1,22 +1,29 @@
# Eval harness nightly — disabled-by-default.
#
# This workflow runs the golden QA dataset against the agent / LLM loop. It
# is `workflow_dispatch`-only by default to prevent accidental LLM API
# spend. To enable nightly runs:
# This workflow runs the golden QA dataset + worked-pattern cases against a
# real Azure OpenAI deployment. It is `workflow_dispatch`-only by default
# to prevent accidental API spend. To enable nightly runs:
#
# 1. Set the Azure OpenAI secrets in repo settings:
# AZURE_OPENAI_ENDPOINT e.g. https://my.openai.azure.com
# AZURE_OPENAI_API_KEY the Azure resource key
# AZURE_OPENAI_DEPLOYMENT deployment name, e.g. gpt-4o-mini
# AZURE_OPENAI_API_VERSION optional, defaults to 2024-10-21
#
# 1. Set the LLM secrets in repo settings (LLM_API_KEY at minimum;
# LLM_BASE_URL / LLM_MODEL / LLM_PROVIDER if your judge differs from
# OpenAI defaults).
# 2. Replace the `on:` block below with:
#
# on:
# schedule:
# - cron: "0 6 * * *" # daily 06:00 UTC
# workflow_dispatch:
#
# 3. Add the `eval-nightly.yml` to EXEMPT_WORKFLOWS in
# `.github/scripts/check_required_contexts.py` if it's not already
# there (it is, by default — scheduled runs never gate PRs).
# 3. Confirm `eval-nightly.yml` is in EXEMPT_WORKFLOWS in
# `.github/scripts/check_required_contexts.py` (it is, by default
# — scheduled runs never gate PRs).
#
# When the Azure secrets are absent, eval/test_golden_patterns.py is
# skipped via pytestmark — the toy eval/test_golden_qa.py case still
# runs as a smoke check on the runner mechanics.
#
# See docs/EVAL_HARNESS.md for the full setup story.

@@ -43,11 +50,11 @@ jobs:
- uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5
with:
python-version: ${{ inputs.python_version || '3.14' }}
- run: uv sync --frozen --extra dev
- run: uv sync --frozen --extra dev --extra eval
- name: Run pytest eval/
env:
LLM_PROVIDER: ${{ secrets.LLM_PROVIDER }}
LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
LLM_BASE_URL: ${{ secrets.LLM_BASE_URL }}
LLM_MODEL: ${{ secrets.LLM_MODEL }}
AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
AZURE_OPENAI_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}
AZURE_OPENAI_DEPLOYMENT: ${{ secrets.AZURE_OPENAI_DEPLOYMENT }}
AZURE_OPENAI_API_VERSION: ${{ secrets.AZURE_OPENAI_API_VERSION }}
run: uv run pytest eval/ -v
52 changes: 44 additions & 8 deletions docs/EVAL_HARNESS.md
@@ -6,15 +6,19 @@ LLM-driven systems regress in ways unit tests don't catch: the prompt drifts, th

```
src/eval/
├── models.py # EvalCase, EvalResult (Pydantic)
├── runner.py # EvalRunner — generic, takes a Callable[[str], str]
├── judge.py # LLMClient Protocol + semantic-similarity judge
├── report.py # Markdown report generator
└── __main__.py # python -m src.eval
├── models.py # EvalCase, EvalResult (Pydantic)
├── runner.py # EvalRunner — generic, takes a Callable[[str], str]
├── judge.py # LLMClient Protocol + semantic-similarity judge
├── report.py # Markdown report generator
├── __main__.py # python -m src.eval
└── adapters/
└── azure_openai.py # Concrete LLMClient for Azure OpenAI (optional extra)

eval/
├── golden_qa.json # The dataset (one trivial example case ships)
└── test_golden_qa.py # Parametrised pytest runner
├── golden_qa.json # Toy smoke case — runs without LLM credentials
├── test_golden_qa.py # Parametrised runner for the toy case
├── golden_patterns.json # Four worked-pattern cases — require Azure OpenAI
└── test_golden_patterns.py # Skipped unless AZURE_OPENAI_* env vars are set
```

## How it works
@@ -86,11 +90,43 @@ python -m src.eval # CLI runner — prints the markdown report

The eval tests are marked `@pytest.mark.eval`, so the default `pytest tests/` skips them.

## Worked patterns (Azure OpenAI)

The four cases in `eval/golden_patterns.json` are *not* benchmarks. They exist to demonstrate what an eval case looks like against each of the runner's tolerance modes; together they cover the four LLM-eval patterns you most often need to write:

| Case ID | Tolerance | Pattern demonstrated |
|---|---|---|
| `factual-http-200` | `exact_match` | Format-constrained factual recall. The prompt forces a single canonical token; if the model wraps the answer in prose, the case fails loudly. |
| `numeric-seconds-per-day` | `numeric_close` | Numeric reasoning with extraction tolerance. The runner pulls the first number from each side and compares within 1 %, so `86,400` and `86400 seconds` both match. |
| `definitional-fastapi-depends` | `semantic_similar` | Free-form prose scored by an LLM judge at ≥ 0.8. Use for explanations and any case where wording can vary but the underlying claim is checkable. |
| `structured-json-status` | `exact_match` | Structured-output adherence. The prompt asks for raw JSON; markdown-fenced or prose-wrapped responses fail — which is the failure mode downstream parsers also hit. |

The cases all call a real Azure OpenAI deployment via the adapter at `src/eval/adapters/azure_openai.py`. When you fork the template for a real project, replace these four with cases that exercise your own product's prompts; the patterns transfer.
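
To make the two deterministic tolerance modes concrete, here is a minimal sketch of how such checks could work. It is not the code in `src/eval/runner.py`: the function names, the number-matching regex, and the exact normalisation steps are assumptions inferred from the case notes in `eval/golden_patterns.json` (lowercase plus whitespace collapse for `exact_match`, first number on each side within 1 % for `numeric_close`).

```python
from __future__ import annotations

import re

# Hypothetical helpers for illustration only; the real runner may differ.
# Matches an optional sign, digits with optional thousands separators,
# and an optional decimal part.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def normalise(text: str) -> str:
    """Lowercase and collapse runs of whitespace to single spaces."""
    return " ".join(text.lower().split())


def exact_match(expected: str, actual: str) -> bool:
    """Pass only when both sides are identical after normalisation."""
    return normalise(expected) == normalise(actual)


def first_number(text: str) -> float | None:
    """Return the first number in the text, ignoring thousands separators."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def numeric_close(expected: str, actual: str, rel_tol: float = 0.01) -> bool:
    """Pass when the first number on each side agrees within rel_tol."""
    want, got = first_number(expected), first_number(actual)
    if want is None or got is None:
        return False
    if want == 0:
        return got == 0
    return abs(got - want) / abs(want) <= rel_tol
```

Under these assumptions `numeric_close("86400", "86,400 seconds")` passes, while `exact_match("200", "The status code is 200.")` fails because the surrounding prose survives normalisation. That is the behaviour the table above describes for the first two cases.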

### Setup

```sh
uv sync --extra dev --extra eval # installs the openai SDK

export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com"
export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_DEPLOYMENT="gpt-4o-mini" # or whatever you deployed
export AZURE_OPENAI_API_VERSION="2024-10-21" # optional, this is the default

uv run pytest eval/test_golden_patterns.py -v
```

Without the env vars, `eval/test_golden_patterns.py` is skipped via `pytestmark` — `eval/test_golden_qa.py` still runs as a smoke check on the runner mechanics, so `uv run pytest eval/` always exits 0 on a fresh checkout.

### Swapping providers

`src/eval/judge.py` defines `LLMClient` as a `Protocol` — the eval core does not import `openai` anywhere. To target a different provider (Anthropic, vLLM, vanilla OpenAI), write a new adapter under `src/eval/adapters/` that implements `complete_json(*, model, prompt) -> str` and update the runner fixture in your test file. Nothing in `src/eval/` itself changes.
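
As a rough illustration of that seam, the sketch below restates the Protocol and adds a toy adapter. Only the `complete_json(*, model, prompt) -> str` signature comes from this document. `EchoClient`, `ask_judge`, the canned JSON reply shape, and the local `LLMClient` definition are stand-ins invented for the example, and the real Protocol in `src/eval/judge.py` may carry more detail.

```python
from __future__ import annotations

import json
from typing import Protocol


class LLMClient(Protocol):
    """Local restatement of the judge-facing Protocol (assumed shape)."""

    def complete_json(self, *, model: str, prompt: str) -> str: ...


class EchoClient:
    """Hypothetical adapter: no network, returns a canned JSON reply."""

    def __init__(self, score: float = 1.0) -> None:
        self._score = score

    def complete(self, prompt: str) -> str:
        # Shape expected by an answer_fn: Callable[[str], str].
        return prompt.strip()

    def complete_json(self, *, model: str, prompt: str) -> str:
        # `model` is accepted for Protocol conformance; a provider that
        # addresses by deployment name could ignore it, as this stub does.
        # The reply keys are made up for this example.
        return json.dumps({"score": self._score, "prompt_chars": len(prompt)})


def ask_judge(client: LLMClient, prompt: str) -> dict[str, float]:
    """Depends only on the Protocol, never on a vendor SDK."""
    return json.loads(client.complete_json(model="unused", prompt=prompt))
```

Because `Protocol` uses structural typing, `EchoClient` needs no base class. A test fixture would wire a real adapter the same way `eval/test_golden_patterns.py` wires the Azure one: pass `client.complete` as `answer_fn` and the client itself as `judge_client`.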

## Nightly opt-in

`.github/workflows/eval-nightly.yml` ships `workflow_dispatch`-only by default to avoid accidental LLM API spend. To turn on a real nightly:

1. Add the LLM secrets in repo settings: `LLM_API_KEY` (required), `LLM_PROVIDER`, `LLM_BASE_URL`, `LLM_MODEL` (optional, depending on adapter).
1. Add the Azure OpenAI secrets in repo settings: `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_DEPLOYMENT`, and optionally `AZURE_OPENAI_API_VERSION`.

2. Replace the workflow's `on:` block with:

38 changes: 38 additions & 0 deletions eval/golden_patterns.json
@@ -0,0 +1,38 @@
[
{
"id": "factual-http-200",
"question": "What HTTP status code means OK? Respond with only the number, no prose.",
"category": "factual-recall",
"expected_answer": "200",
"tolerance": "exact_match",
"difficulty": "easy",
"notes": "Pattern: factual recall with format-constrained output. exact_match works because the prompt forces a single canonical token. If the model adds prose (\"The status code is 200.\") this fails loudly — which is the point: format adherence is part of the assertion."
},
{
"id": "numeric-seconds-per-day",
"question": "How many seconds are in 24 hours? Respond with the integer only.",
"category": "numeric-reasoning",
"expected_answer": "86400",
"tolerance": "numeric_close",
"difficulty": "easy",
"notes": "Pattern: numeric extraction with 1% tolerance. The runner pulls the first number from each side and compares ratios, so '86,400', '86400 seconds', and '86400.0' all match. Use this tolerance for math, conversions, and any case where formatting around the number is uninteresting."
},
{
"id": "definitional-fastapi-depends",
"question": "In one sentence: what does FastAPI's Depends() do?",
"category": "definitional",
"expected_answer": "Depends declares a callable that FastAPI resolves at request time and injects the result into the parameter, enabling dependency injection for things like authentication, database sessions, or settings.",
"tolerance": "semantic_similar",
"difficulty": "medium",
"notes": "Pattern: free-form prose scored by LLM judge. semantic_similar passes at score >= 0.8 via the judge in src/eval/judge.py. Use this for definitions, explanations, and any case where wording can legitimately vary but the underlying claim is checkable."
},
{
"id": "structured-json-status",
"question": "Return exactly this JSON object and nothing else (no markdown fence, no prose, no trailing newline): {\"ok\": true, \"version\": 1}",
"category": "structured-output",
"expected_answer": "{\"ok\": true, \"version\": 1}",
"tolerance": "exact_match",
"difficulty": "medium",
"notes": "Pattern: format adherence on structured output. Models commonly wrap JSON in ```json``` fences or add a preamble; exact_match after normalisation (lowercase + whitespace-collapse) accepts a clean response but rejects the fenced or prose-wrapped version. This is the failure mode you want to catch — downstream parsers break the same way."
}
]
86 changes: 86 additions & 0 deletions eval/test_golden_patterns.py
@@ -0,0 +1,86 @@
"""LLM-eval pattern showcase — four worked cases that exercise the existing
tolerance modes against a real Azure OpenAI deployment.

Each case demonstrates a different eval *pattern* (see notes inside
`eval/golden_patterns.json`):

- factual recall with exact_match
- numeric reasoning with numeric_close
- free-form definitional with semantic_similar
- structured-output adherence with exact_match

This file is *skipped entirely* unless the Azure OpenAI env vars are set
(`AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_DEPLOYMENT`).
Run with::

uv sync --extra eval --extra dev
AZURE_OPENAI_ENDPOINT=... AZURE_OPENAI_API_KEY=... \\
AZURE_OPENAI_DEPLOYMENT=... uv run pytest eval/test_golden_patterns.py

The toy `eval/test_golden_qa.py` runs without any credentials — that one
exercises the runner mechanics; this one exercises the runner against a
real model.
"""

from __future__ import annotations

import os
from pathlib import Path

import pytest

from src.eval.models import EvalCase
from src.eval.runner import EvalRunner, load_golden_dataset

_PATTERNS_PATH = Path(__file__).resolve().parent / "golden_patterns.json"
_REQUIRED_ENV = (
"AZURE_OPENAI_ENDPOINT",
"AZURE_OPENAI_API_KEY",
"AZURE_OPENAI_DEPLOYMENT",
)

_missing = [name for name in _REQUIRED_ENV if not os.environ.get(name)]
pytestmark = [
pytest.mark.eval,
pytest.mark.skipif(
bool(_missing),
reason=f"requires Azure OpenAI env vars: missing {', '.join(_missing)}",
),
]

patterns = load_golden_dataset(_PATTERNS_PATH)

# Sentinel passed to EvalRunner.judge_model. The runner threads this through
# to LLMClient.complete_json(model=...), where the Azure adapter discards it
# — Azure addresses by deployment name (set at adapter construction), not by
# the model parameter. Named constant makes the intent obvious to a reader
# of this fixture without needing to chase into the adapter.
_AZURE_DEPLOYMENT_SENTINEL = "azure-deployment-from-env"


@pytest.fixture(scope="module")
def runner() -> EvalRunner:
"""Construct the runner with one Azure client serving both roles
(answer_fn and judge_client). Same deployment for cost simplicity;
a real project might split subject and judge models."""
from src.eval.adapters.azure_openai import AzureOpenAIClient

client = AzureOpenAIClient()
return EvalRunner(
answer_fn=client.complete,
judge_client=client,
judge_model=_AZURE_DEPLOYMENT_SENTINEL,
)


@pytest.mark.parametrize("case", patterns, ids=lambda c: c.id)
def test_golden_patterns(case: EvalCase, runner: EvalRunner) -> None:
"""Run one worked pattern case against the live Azure deployment."""
result = runner.evaluate(case)
assert result.pass_result, (
f"[{case.id}] {case.category}/{case.difficulty}\n"
f"Q: {case.question}\n"
f"Expected: {case.expected_answer}\n"
f"Got: {result.actual_answer}\n"
f"Reason: {result.failure_reason}"
)
14 changes: 13 additions & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "harness-python-react"
version = "0.2.11"
version = "0.2.12"
description = "Production-quality LLM-driven coding harness — Python (FastAPI) backend, Vite + React + TypeScript frontend."
readme = "README.md"
requires-python = ">=3.14"
@@ -55,6 +55,13 @@ dev = [
"commitizen>=4.0.0",
"pyyaml>=6.0.3",
]
# Optional extra for the eval harness's LLM-backed pattern cases. Kept
# separate from `dev` so a contributor working on backend/frontend code
# never pulls the openai SDK or its transitive deps. See
# docs/EVAL_HARNESS.md for the full setup.
eval = [
"openai>=1.40.0",
]

[project.urls]
Homepage = "https://github.com/constk/harness-python-react"
@@ -122,6 +129,11 @@ warn_unused_ignores = true
[[tool.mypy.overrides]]
module = [
"opentelemetry.*",
# `openai` is an optional extra (see [project.optional-dependencies]).
# mypy on a stock `uv sync --extra dev` checkout doesn't see it; the
# adapter in src/eval/adapters/azure_openai.py wraps it in `Any` at
# the import boundary so the rest of src/ stays fully typed.
"openai.*",
]
ignore_missing_imports = true

40 changes: 40 additions & 0 deletions src/eval/adapters/README.md
@@ -0,0 +1,40 @@
# `src/eval/adapters`

Concrete `LLMClient` adapters for the eval harness. The judge in [`src/eval/judge.py`](../judge.py) calls an `LLMClient` Protocol — never a vendor SDK directly. Each adapter in this package implements that Protocol for one provider, so the eval core stays vendor-neutral and a downstream consumer can swap providers by changing one wiring line in their test fixture.

## Key interfaces

Exported from this package:

- **`AzureOpenAIClient`** — implements `src.eval.judge.LLMClient`. Construct from env via `AzureOpenAIClient()`; call `complete(prompt)` for runner `answer_fn` use, `complete_json(*, model, prompt)` for judge use. The `model` argument on `complete_json` is accepted for Protocol conformance and discarded — Azure addresses by deployment name (set at construction time, read from `AZURE_OPENAI_DEPLOYMENT`).
- **`AzureOpenAIConfigError`** — raised at construction when required env is missing or the optional `openai` extra is not installed. Subclass of `RuntimeError`. The error message names every missing env var in one go so the caller doesn't have to fix-and-retry.

## Why this layer exists

Without the Protocol seam, swapping LLM providers would mean touching the eval core. With it, vendor lock-in is confined to one file per provider. The layer demonstrates that the harness's "provider-agnostic" claim is structural, not aspirational: the eval core has zero imports of any vendor SDK.

## Current adapters

| File | Provider | Optional extra | Env contract |
|---|---|---|---|
| [`azure_openai.py`](azure_openai.py) | Azure OpenAI | `uv sync --extra eval` | `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_DEPLOYMENT`, optional `AZURE_OPENAI_API_VERSION` (default `2024-10-21`) |

## Adding a new adapter

1. Add the SDK to `[project.optional-dependencies]` in `pyproject.toml` — either to the existing `eval` extra or a new provider-scoped one.
2. Add the SDK's top-level module to `[[tool.mypy.overrides]]` with `ignore_missing_imports = true`, matching the existing `openai.*` / `opentelemetry.*` entries. This keeps mypy clean on stock `uv sync --extra dev` checkouts.
3. Implement `complete_json(*, model: str, prompt: str) -> str` per the `LLMClient` Protocol in [`src/eval/judge.py`](../judge.py). Optionally add a `complete(prompt: str) -> str` for use as an `EvalRunner.answer_fn`.
4. **Lazy-import the SDK inside `__init__`** so the adapter module remains importable without the optional extra installed. The import error path should raise a clear, named exception (e.g. `AzureOpenAIConfigError`) telling the reader which `uv sync --extra ...` to run.
5. Read configuration from environment variables at construction time. Raise the same named exception listing every missing var when env is incomplete — fail fast, fail clear.
6. Add an offline unit test in [`tests/`](../../../tests/) that mocks the SDK at the `sys.modules` level (see `tests/test_eval_azure_openai_adapter.py` for the pattern). This keeps the unit suite credential-free; live-credential paths are exercised by [`eval/test_golden_patterns.py`](../../../eval/test_golden_patterns.py).
7. Document the env contract in this README's table above and in [`docs/EVAL_HARNESS.md`](../../../docs/EVAL_HARNESS.md)'s "Worked patterns" section.
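
Steps 4 to 6 can be sketched as follows. Everything named here is hypothetical: `example_sdk`, its `Client` class and `generate` method, the `EXAMPLE_*` variables, and the `example` extra are placeholders for illustration, not a real provider and not the contents of `azure_openai.py`.

```python
from __future__ import annotations

import importlib
import os
import sys
import types
from typing import Any

_REQUIRED_ENV = ("EXAMPLE_API_KEY", "EXAMPLE_MODEL")  # hypothetical names


class ExampleConfigError(RuntimeError):
    """Raised at construction when env or the optional SDK is missing."""


class ExampleClient:
    def __init__(self) -> None:
        # Report every missing variable at once, not one per retry.
        missing = [name for name in _REQUIRED_ENV if not os.environ.get(name)]
        if missing:
            raise ExampleConfigError(f"missing env vars: {', '.join(missing)}")
        try:
            # Lazy import: this module stays importable without the extra.
            sdk: Any = importlib.import_module("example_sdk")
        except ImportError as exc:
            raise ExampleConfigError(
                "example_sdk is not installed; run `uv sync --extra example`"
            ) from exc
        self._client = sdk.Client(api_key=os.environ["EXAMPLE_API_KEY"])
        self._model = os.environ["EXAMPLE_MODEL"]

    def complete_json(self, *, model: str, prompt: str) -> str:
        # The configured model wins; `model` exists for Protocol conformance.
        return str(self._client.generate(model=self._model, prompt=prompt))


def install_fake_sdk() -> None:
    """Offline-test helper: register a stub SDK at the sys.modules level."""

    class _FakeClient:
        def __init__(self, api_key: str) -> None:
            self.api_key = api_key

        def generate(self, *, model: str, prompt: str) -> str:
            return f"{model}:{prompt}"

    fake = types.ModuleType("example_sdk")
    fake.Client = _FakeClient  # type: ignore[attr-defined]
    sys.modules["example_sdk"] = fake
```

In a unit test you would call something like `install_fake_sdk()` (or pytest's `monkeypatch.setitem(sys.modules, "example_sdk", fake)`) and set the env vars with `monkeypatch.setenv` before constructing the client, so the suite never needs credentials or the optional extra.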

## Why adapters live under `src/eval/`

The import-linter contract in `pyproject.toml` puts `src.eval` at the top of the layered import order:

```
api | eval -> agent -> tools -> data -> observability -> models
```

Adapters can therefore depend on any lower layer in `src/`, while none of those layers depends on them. That asymmetry is exactly what the layered architecture exists to encode: vendor-specific code stays at the boundary and never leaks down into the eval primitives or the model layer.
13 changes: 13 additions & 0 deletions src/eval/adapters/__init__.py
@@ -0,0 +1,13 @@
"""Concrete LLM-client adapters for the eval harness.

The judge in `src.eval.judge` calls an `LLMClient` Protocol — never an SDK
directly. Each adapter in this package implements that Protocol for one
provider, so the eval core stays vendor-neutral and a downstream consumer
can swap providers by changing one wiring line.

Adapters are intentionally thin: env-driven construction, lazy SDK import,
one `complete_json(...)` method. No retries, no streaming, no batching —
the goal is "works for nightly eval runs", not "production-grade client".
"""

from __future__ import annotations