feat: eval pattern examples calling Azure OpenAI#104
Merged
Conversation
constk
added a commit
that referenced
this pull request
May 25, 2026
Addresses two gate failures on #104 surfaced by code review: 1. "Tests required" gate — feat: prefix declared a behaviour change but tests/ had no test for the new adapter (the eval/-side test only runs with live Azure credentials). Adds tests/test_eval_azure_openai_adapter.py: 13 fully-offline cases covering _resolve_config (defaults, override, empty-string fallback, missing-env error listing), the constructor (env wiring, explicit API version, missing-env, missing-SDK), and the two SDK call paths (complete_json structured-output mode, complete user-message dispatch, null-content returns "" / "{}"). The SDK is mocked at sys.modules level so the test never hits the network and never requires the openai extra to be installed. 2. "src/ README audit" gate — every src/ package needs a README.md per CLAUDE.md. Adds src/eval/adapters/README.md documenting the layer's purpose, the current adapter, a 7-step "adding a new adapter" recipe, and why the layer lives at the top of the import order. Also applies the reviewer's non-blocking sentinel-string suggestion: the magic "azure-deployment" string passed as judge_model in eval/test_golden_patterns.py is now the named constant _AZURE_DEPLOYMENT_SENTINEL with a comment explaining why the runner threads it through but the Azure adapter discards it. Local gates: 205 unit tests pass (was 192, +13 new), mypy clean on 43 source files, ruff/format/import-linter all green. Refs #94
5 tasks
The eval slice previously shipped one toy case (echo-hello) and a
disabled-by-default nightly. A reader expecting an LLM-eval story
found the infrastructure without conviction.
Adds four worked-pattern cases that exercise the existing three
tolerance modes against a real Azure OpenAI deployment. These are
not benchmarks — they demonstrate what an eval case *looks like* for
the four LLM-eval patterns you most often need to write:
- factual-http-200 exact_match format-constrained recall
- numeric-seconds-per-day numeric_close numeric reasoning + tolerance
- definitional-fastapi-depends semantic_similar free-form judge-scored prose
- structured-json-status exact_match structured-output adherence
When the template is forked for a real project, replace these four
with cases that exercise the project's own prompts; the patterns
transfer regardless of what product is bolted on.
Provider choice — Azure OpenAI via the openai SDK with AzureOpenAI
client — is intentionally distinct from the rest of the harness
(which uses Claude via Claude Code). Demonstrates that the LLMClient
Protocol in src/eval/judge.py does its job: the eval core never
imports openai, vendor lock-in lives only in the adapter.
Changes:
- src/eval/adapters/azure_openai.py — implements LLMClient via the
openai.AzureOpenAI SDK. Reads endpoint/key/deployment/api-version
from env. Lazy-imports the SDK so the module is importable without
the optional extra installed; the adapter raises a clear
AzureOpenAIConfigError if the env or SDK is missing.
- eval/golden_patterns.json — the four cases with notes explaining
which pattern each demonstrates.
- eval/test_golden_patterns.py — separate test file gated on the
Azure env vars via pytestmark. Skipped on a stock checkout, so
`uv run pytest eval/` always exits 0. The toy test_golden_qa.py
keeps running as before.
- pyproject.toml — new optional [project.optional-dependencies] eval
extra (just `openai>=1.40.0`), mypy override for openai.* matching
the existing opentelemetry.* pattern, and a 0.2.10 -> 0.2.11
self-version bump.
- .github/workflows/eval-nightly.yml — env vars renamed from the
placeholder LLM_* set to AZURE_OPENAI_*. Header comment updated
with the Azure setup recipe. uv sync now passes --extra eval.
- docs/EVAL_HARNESS.md — new "Worked patterns" section with the
table mapping case -> tolerance -> pattern, the local setup
recipe, and a "Swapping providers" note documenting the
Protocol-based extension path.
Local gates: mypy --strict clean on 42 source files (was 31), ruff
clean, ruff format clean, import-linter both contracts kept, 192
unit tests pass, eval/ runs 1 passed + 4 skipped without LLM env.
Closes #94
Addresses two gate failures on #104 surfaced by code review: 1. "Tests required" gate — feat: prefix declared a behaviour change but tests/ had no test for the new adapter (the eval/-side test only runs with live Azure credentials). Adds tests/test_eval_azure_openai_adapter.py: 13 fully-offline cases covering _resolve_config (defaults, override, empty-string fallback, missing-env error listing), the constructor (env wiring, explicit API version, missing-env, missing-SDK), and the two SDK call paths (complete_json structured-output mode, complete user-message dispatch, null-content returns "" / "{}"). The SDK is mocked at sys.modules level so the test never hits the network and never requires the openai extra to be installed. 2. "src/ README audit" gate — every src/ package needs a README.md per CLAUDE.md. Adds src/eval/adapters/README.md documenting the layer's purpose, the current adapter, a 7-step "adding a new adapter" recipe, and why the layer lives at the top of the import order. Also applies the reviewer's non-blocking sentinel-string suggestion: the magic "azure-deployment" string passed as judge_model in eval/test_golden_patterns.py is now the named constant _AZURE_DEPLOYMENT_SENTINEL with a comment explaining why the runner threads it through but the Azure adapter discards it. Local gates: 205 unit tests pass (was 192, +13 new), mypy clean on 43 source files, ruff/format/import-linter all green. Refs #94
src/ README audit gate looks for a `## Key interfaces` (or `## Public
surface`) anchor — the existing README had purpose / table /
extension recipe / layering rationale, but no exported-names section.
Adds a `## Key interfaces` section listing the two exported names:
- AzureOpenAIClient — the LLMClient implementation with notes on
complete() vs complete_json() and the discarded `model` arg
(Azure dispatches by deployment, not model).
- AzureOpenAIConfigError — the construction-time error type,
noting that it batches every missing env var into a single
message instead of failing-and-retrying.
Both already documented in the adapter docstrings; this section
hoists them to the README anchor the audit gate enforces.
Refs #94
0d6531a to
1a32080
Compare
constk
added a commit
that referenced
this pull request
May 26, 2026
…sed post-#103/#104) main moved ahead of develop on 2026-05-25 when PR #86 was merged directly to main rather than via develop -> release flow. The divergence is one squash commit (eff5b1c) carrying: - docs/BEADS.md (optional Beads issue-queue guidance) - .github/pull_request_template.md (Beads PR-template block) - .github/scripts/check_aspirational_tickets.py (PEP 758 reformat) - .github/scripts/check_pin_freshness.py / check_tests_present.py / check_version_bump.py (touch-ups) - .gitattributes / .gitignore (.beads/ ignore, Windows renormalise) - CONTRIBUTING.md (line-ending normalisation) - tests/test_scripts_compile.py (new CI-script compile gate) - docs/DEVELOPMENT.md / docs/HARNESS.md / docs/HARNESS_PRIMER.md cross-refs - pyproject.toml + uv.lock self-version 0.2.10 -> 0.2.11 This PR was rebased after #103 (CVE fix, develop -> 0.2.11) and #104 (eval pattern examples, develop -> 0.2.12) merged. The version on main (0.2.11) is now behind develop (0.2.12); the conflict is resolved by bumping develop -> 0.2.13. After this lands, develop is at 0.2.13 and contains everything main has. Remaining in-flight PRs (#99, #100, #101, #105) need to rebase to bump 0.2.13 -> 0.2.14 (and onward sequentially as they merge). No behaviour change beyond what #86 already added to main. # Conflicts: # pyproject.toml # uv.lock
This was referenced May 26, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What & why
The eval slice previously shipped one toy case (
echo-hello) and a disabled nightly. A reader expecting an LLM-eval story found the infrastructure without conviction.This PR adds four worked-pattern cases that exercise the existing three tolerance modes against a real Azure OpenAI deployment. They are not benchmarks — they demonstrate what an eval case looks like for the four LLM-eval patterns you most often need to write:
factual-http-200exact_matchnumeric-seconds-per-daynumeric_closedefinitional-fastapi-dependssemantic_similarstructured-json-statusexact_matchWhen the template is forked for a real project, the four cases get replaced with ones that exercise the project's own prompts; the patterns transfer regardless of what product is bolted on.
Provider choice — Azure OpenAI — is intentionally distinct from the rest of the harness (which uses Claude via Claude Code). Demonstrates that the existing
LLMClientProtocol in src/eval/judge.py does its job: the eval core never importsopenai, and vendor lock-in lives only in the new adapter.Closes #94.
Changes
src/eval/adapters/azure_openai.py(new)AzureOpenAIClientimplementingLLMClient. Lazy SDK import, env-driven config, clearAzureOpenAIConfigErroron missing config.src/eval/adapters/__init__.py(new)eval/golden_patterns.json(new)eval/test_golden_patterns.py(new)AZURE_OPENAI_*env viapytestmark— skipped on stock checkouts.pyproject.tomlevaloptional extra (openai>=1.40.0), mypy override foropenai.*matching the existingopentelemetry.*pattern, version bump 0.2.10 → 0.2.11.uv.lockevalextra..github/workflows/eval-nightly.ymlLLM_*→AZURE_OPENAI_*. Header updated with the Azure recipe.uv syncnow passes--extra eval.docs/EVAL_HARNESS.mdTest plan
Local gates (all green):
uv run --frozen mypy --strict src/ tests/— clean on 42 source files (was 31)uv run --frozen ruff check .— All checks passeduv run --frozen ruff format --check .— 57 files already formatteduv run --frozen lint-imports— both contracts keptuv run --frozen pytest tests/ -q— 192 passeduv run --frozen pytest eval/ -q— 1 passed, 4 skipped (pattern cases correctly skipped without Azure env)workflow_dispatchon the nightly workflow once secrets are configuredInvariants affected
None. The new adapter sits at the top of the layered import order (
src.evalis the top layer); no boundary changes.New deps / actions / external surface
openai>=1.40.0— added as an optional extra (uv sync --extra eval). Defaultuv sync --extra devdoesn't pull it.eval-nightly.ymlaction SHAs unchanged.Linked issue
Closes #94