release: bring main up to develop (0.2.17 — release-readiness docs + eval pattern examples + transitive CVE patches)#107
pip-audit on develop is flagging two transitive-dep CVEs:

- idna 3.13 CVE-2026-45409 (fix in 3.15+)
- starlette 1.0.0 PYSEC-2026-161 (fix in 1.0.1+)

Both are surfaced via fastapi/httpx. Bumps via:

    uv lock --upgrade-package idna --upgrade-package starlette

Resolves to idna 3.16 (3.15 was the listed fix; 3.16 is a further patch with the same fix) and starlette 1.1.0 (minor bump; FastAPI is compatible with it). All 192 unit tests pass on the upgraded lock.

Bumps the project self-version 0.2.10 -> 0.2.11 per docs/DEVELOPMENT.md.

Unblocks the pip-audit CI gate on #99, #100, #101, #102 (and any other PRs currently sitting on develop), all of which inherit the flagged transitive CVEs from develop and cannot pass that gate until this lands.
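The "resolved version is at or above the first fixed release" reasoning in this commit can be checked mechanically. The following is a minimal sketch, assuming plain dotted numeric release strings; `parse_version` and `meets_fix_floor` are hypothetical helper names and are not part of pip-audit or uv.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a plain dotted release string such as '3.16' into (3, 16)."""
    return tuple(int(part) for part in text.split("."))


def meets_fix_floor(resolved: str, first_fixed: str) -> bool:
    """True when `resolved` is at or above the first fixed release."""
    left, right = parse_version(resolved), parse_version(first_fixed)
    width = max(len(left), len(right))
    # Pad with zeros so '3.15' and '3.15.0' compare as equal.
    left += (0,) * (width - len(left))
    right += (0,) * (width - len(right))
    return left >= right


# (package, resolved version, first fixed version) taken from the commit message.
RESOLVED = [("idna", "3.16", "3.15"), ("starlette", "1.1.0", "1.0.1")]

for name, resolved, floor in RESOLVED:
    status = "ok" if meets_fix_floor(resolved, floor) else "still vulnerable"
    print(f"{name} {resolved} (fix in {floor}+): {status}")
```

Pre-release or local version suffixes are out of scope for this sketch; a real check would lean on a proper version parser.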
* feat: eval pattern examples calling Azure OpenAI (#94)

The eval slice previously shipped one toy case (echo-hello) and a disabled-by-default nightly. A reader expecting an LLM-eval story found the infrastructure without conviction.

Adds four worked-pattern cases that exercise the existing three tolerance modes against a real Azure OpenAI deployment. These are not benchmarks; they demonstrate what an eval case *looks like* for the four LLM-eval patterns you most often need to write:

- factual-http-200              exact_match       format-constrained recall
- numeric-seconds-per-day       numeric_close     numeric reasoning + tolerance
- definitional-fastapi-depends  semantic_similar  free-form judge-scored prose
- structured-json-status        exact_match       structured-output adherence

When the template is forked for a real project, replace these four with cases that exercise the project's own prompts; the patterns transfer regardless of what product is bolted on.

Provider choice, Azure OpenAI via the openai SDK with AzureOpenAI client, is intentionally distinct from the rest of the harness (which uses Claude via Claude Code). Demonstrates that the LLMClient Protocol in src/eval/judge.py does its job: the eval core never imports openai, vendor lock-in lives only in the adapter.

Changes:

- src/eval/adapters/azure_openai.py: implements LLMClient via the openai.AzureOpenAI SDK. Reads endpoint/key/deployment/api-version from env. Lazy-imports the SDK so the module is importable without the optional extra installed; the adapter raises a clear AzureOpenAIConfigError if the env or SDK is missing.
- eval/golden_patterns.json: the four cases with notes explaining which pattern each demonstrates.
- eval/test_golden_patterns.py: separate test file gated on the Azure env vars via pytestmark. Skipped on a stock checkout, so `uv run pytest eval/` always exits 0. The toy test_golden_qa.py keeps running as before.
- pyproject.toml: new optional [project.optional-dependencies] eval extra (just `openai>=1.40.0`), mypy override for openai.* matching the existing opentelemetry.* pattern, and a 0.2.10 -> 0.2.11 self-version bump.
- .github/workflows/eval-nightly.yml: env vars renamed from the placeholder LLM_* set to AZURE_OPENAI_*. Header comment updated with the Azure setup recipe. uv sync now passes --extra eval.
- docs/EVAL_HARNESS.md: new "Worked patterns" section with the table mapping case -> tolerance -> pattern, the local setup recipe, and a "Swapping providers" note documenting the Protocol-based extension path.

Local gates: mypy --strict clean on 42 source files (was 31), ruff clean, ruff format clean, import-linter both contracts kept, 192 unit tests pass, eval/ runs 1 passed + 4 skipped without LLM env.

Closes #94

* test: add adapter unit tests + adapters README (#94 review fixes)

Addresses two gate failures on #104 surfaced by code review:

1. "Tests required" gate: the feat: prefix declared a behaviour change but tests/ had no test for the new adapter (the eval/-side test only runs with live Azure credentials). Adds tests/test_eval_azure_openai_adapter.py: 13 fully-offline cases covering _resolve_config (defaults, override, empty-string fallback, missing-env error listing), the constructor (env wiring, explicit API version, missing-env, missing-SDK), and the two SDK call paths (complete_json structured-output mode, complete user-message dispatch, null-content returns "" / "{}"). The SDK is mocked at sys.modules level so the test never hits the network and never requires the openai extra to be installed.
2. "src/ README audit" gate: every src/ package needs a README.md per CLAUDE.md. Adds src/eval/adapters/README.md documenting the layer's purpose, the current adapter, a 7-step "adding a new adapter" recipe, and why the layer lives at the top of the import order.

Also applies the reviewer's non-blocking sentinel-string suggestion: the magic "azure-deployment" string passed as judge_model in eval/test_golden_patterns.py is now the named constant _AZURE_DEPLOYMENT_SENTINEL with a comment explaining why the runner threads it through but the Azure adapter discards it.

Local gates: 205 unit tests pass (was 192, +13 new), mypy clean on 43 source files, ruff/format/import-linter all green.

Refs #94

* docs: add Key interfaces section to adapters README (#94 review)

The src/ README audit gate looks for a `## Key interfaces` (or `## Public surface`) anchor. The existing README had purpose / table / extension recipe / layering rationale, but no exported-names section.

Adds a `## Key interfaces` section listing the two exported names:

- AzureOpenAIClient: the LLMClient implementation, with notes on complete() vs complete_json() and the discarded `model` arg (Azure dispatches by deployment, not model).
- AzureOpenAIConfigError: the construction-time error type, noting that it batches every missing env var into a single message instead of failing-and-retrying.

Both are already documented in the adapter docstrings; this section hoists them to the README anchor the audit gate enforces.

Refs #94

* chore: bump version to 0.2.12 (rebase onto develop after #103)
* chore: add optional Beads issue queue guidance

* chore: address PR-86 review feedback (BEADS doc + template + CI-script compile gate)

Applies the actionable items from the PR-86 review:

- docs/BEADS.md: lead with a one-sentence "what Beads is" + upstream link; state the stance explicitly (optional/additive, recommended for agent-driven flows, GitHub remains authoritative); add a YAML example block under Recommended Bead fields; replace the duplicated Closure checklist with a Bead-specific narrowing that cites the PR template + CONTRIBUTING; call out that .beads/ is wiped by git clean -fdx.
- .github/pull_request_template.md: collapse the "Local Beads" section into an HTML-commented opt-in block so it is invisible in the rendered preview until a Beads-using team uncomments it.
- CONTRIBUTING.md: document the one-shot git renormalisation step for Windows clones after the .gitattributes change lands.
- tests/test_scripts_compile.py: regression gate that py_compiles every .github/scripts/*.py. The "scripts unparseable" review finding was based on an older local Python; PEP 758 (3.14) makes the unparenthesised except clauses valid, so the scripts ARE fine on the project pin. The test guards against an actual syntax error landing in future.

* chore: bump version to 0.2.11

---------

Co-authored-by: jakelindsay87 <jacob.b.lindsay@gmail.com>
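A compile gate like the tests/test_scripts_compile.py one described above can be quite small. The sketch below is an assumption about its shape, not the committed file: `compile_errors` and `test_ci_scripts_compile` are hypothetical names, and the bytecode is written to a temporary directory so the check leaves no `__pycache__` behind.

```python
import py_compile
import tempfile
from pathlib import Path


def compile_errors(scripts_dir: Path) -> dict[str, str]:
    """Byte-compile every *.py in scripts_dir; map failing file names to the error text."""
    failures: dict[str, str] = {}
    with tempfile.TemporaryDirectory() as scratch:
        for script in sorted(scripts_dir.glob("*.py")):
            target = Path(scratch) / (script.name + "c")
            try:
                # doraise=True turns a syntax error into PyCompileError.
                py_compile.compile(str(script), cfile=str(target), doraise=True)
            except py_compile.PyCompileError as exc:
                failures[script.name] = exc.msg
    return failures


def test_ci_scripts_compile() -> None:
    # Path taken from the commit message; run from the repository root.
    assert compile_errors(Path(".github/scripts")) == {}
```

The result depends on the interpreter running the test: syntax that only newer Pythons accept (such as the PEP 758 except clauses mentioned above) compiles on the project pin but would be reported as a failure on an older local Python.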
* docs: mark admin-merge policy as transitional solo-owner state (#93)

The existing "Solo-owner merge policy" section accurately documented how merges work today, but read as standing policy. From an external contributor's perspective it could look like the maintainer routinely bypasses their own gates.

Adds a leading "Transitional" blockquote framing this as a single-owner workaround, not standing policy, and replaces the closing sentence with a numbered exit checklist (drop --admin, remove the subsection, update CODEOWNERS, optionally flip enforce_admins to true). All four changes land together when a second collaborator is onboarded.

Mechanics of the merge command itself are unchanged.

Closes #93

* chore: bump version to 0.2.11

* docs: make enforce_admins flip required in exit checklist (#93 review)

Code review on #101 pushed back on step 4 of the "When the exemption ends" checklist: "Optionally flip enforce_admins to true". Leaving it false in a 2-person setup keeps the admin-bypass door open even after the single-owner workaround is no longer needed, which defeats the point of having an exit checklist.

Drops "Optionally" and adds a one-line rationale so a future reader understands why the flip is non-optional.

Refs #93
* docs: reframe README opener around the human+agent audience (#90)

The previous opener led with what the harness is (a coding harness for Python+React) and folded the audience into a trailing clause. The new opener leads with who it's for (teams pairing AI agents with human engineers) and keeps the mechanism punchline ("every gate enforced mechanically in CI, not by discipline") that makes the harness story distinctive.

Wording matches the repo's GitHub description for consistency between the two surfaces.

Closes #90

* docs: tighten README opener — harness vocab + 0.2.11 bump (#90)

Review feedback on #99:

- "Production-grade SDLC scaffold" -> "Production-grade SDLC harness". Everywhere else (package name, docs/HARNESS.md, CLAUDE.md) calls it a harness; "scaffold" was an unintentional vocabulary drift.
- "regardless of who's at the keyboard" -> "regardless of who shipped the code". Agents don't have keyboards; the original metaphor leaked. The new phrasing covers humans and agents without forcing the human-only mental model.
- README opener now also mirrors the GitHub repo description verbatim ("human-LLM coding collaborations"), so the two surfaces stay aligned.

Also bumps the project version 0.2.10 -> 0.2.11 (docs change -> PATCH per docs/DEVELOPMENT.md) in pyproject.toml and the self-version line in uv.lock, unblocking the "Version bump check" CI gate that flagged the original commit.

The "enforced mechanically in CI, not by discipline" punchline is preserved verbatim.

Refs #90
* docs: add concrete agent-failure example to "Why a harness" (#91)

The "harness IS the product" claim reads abstract without a worked example. Adds a blockquoted, 3-line sidebar inside the "Why a harness" section showing one realistic failure mode: an agent reaches for a reverse import (src.models → src.tools), import-linter blocks it in CI against the "src.models depends on nothing in src/" contract, the agent's next iteration routes around it via docs/BOUNDARIES.md.

Names a real gate, cites the real contract, links the real doc, so the example is verifiable, not theatre.

Closes #91

* chore: bump version to 0.2.11
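To make the "src.models depends on nothing in src/" contract concrete, here is a small illustrative check built on the standard-library `ast` module. It is not how import-linter works internally and is not part of the repository; `contract_violations` is a hypothetical name, and relative imports are deliberately ignored to keep the sketch short.

```python
import ast


def contract_violations(
    source: str, own_package: str = "src.models", root: str = "src"
) -> list[str]:
    """List absolute imports that reach into `root` but outside `own_package`."""

    def inside(name: str, package: str) -> bool:
        return name == package or name.startswith(package + ".")

    imported: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imported += [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Qualify each name so `from src import tools` reads as src.tools.
            imported += [f"{node.module}.{alias.name}" for alias in node.names]
    return sorted(
        name for name in imported if inside(name, root) and not inside(name, own_package)
    )


# The reverse import from the sidebar: a src.models module reaching for src.tools.
agent_patch = "from src.tools import search\nimport src.models.base\nimport json\n"
print(contract_violations(agent_patch))
```

A non-empty result is the point at which a CI gate would fail the build and push the agent back toward the documented boundaries.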
* docs: replace Jaeger screenshot TODO with section scaffold (#92)

The observability story in README has one visible loose end: a TODO block where the Jaeger trace screenshot should go. The rest of the section reads cleanly, so the TODO sticks out.

Promotes the placeholder to a real subsection ("Jaeger trace") with the explanatory caption already written: what boots the stack, what endpoint produces the trace, where to view it, and that span attributes use only the constant-defined semconv keys from src/observability/spans.py.

The image itself still needs to be captured. The original capture recipe is preserved as an HTML comment so it remains discoverable, and the comment includes the exact one-line markdown to paste in once docs/images/jaeger-trace.png lands. Hiding the placeholder inside an HTML comment (rather than a broken-image ref) keeps the rendered README clean while the PNG is outstanding.

The image-capture step itself is a follow-up: it needs the maintainer to run docker compose locally and take the screenshot.

Closes #92 (capture step tracked separately as a single-line README edit when the PNG is committed).

* chore: bump version to 0.2.11
Closing: the direct develop → main path conflicts on pyproject.toml/uv.lock because #86 went to main directly and gave the version line two divergent histories. Reopening from
What ships in this release
- ea6b8b1, d256e32: `idna 3.13 → 3.16` (CVE-2026-45409), `starlette 1.0.0 → 1.1.0` (PYSEC-2026-161)
- 18b4d30: `src/eval/adapters/azure_openai.py` adapter, optional `[eval]` extra
- eb0136e, 722293d, 59ad7f0, 7c84f18, 8938eb7

Version
0.2.11 → 0.2.17. Six PATCH bumps cascaded as each in-flight PR rebased over the previous one: one bump per merge, as required by the version-bump gate.

Highlights
- The eval slice grows from the single toy `echo-hello` case to four worked-pattern cases that exercise factual recall, numeric reasoning, definitional prose, and structured-output adherence against a real Azure OpenAI deployment. Live cases are gated on `AZURE_OPENAI_*` env vars; `uv run pytest eval/` on a stock checkout still exits 0.
- Docs mark the solo-owner `--admin` merge workflow as transitional, with a numbered exit checklist (the `enforce_admins: true` flip is now required, not optional).

Test plan
- […] 8938eb7; full unit suite + mypy --strict + ruff + import-linter all passed on the final tip during the #106 sync (chore: align develop with main, backport #86 content + version)
- `pip-audit` clean on develop
- […] `changelog-prestage.yml`: verify after creation

Invariants affected
None new. #101 strengthened the wording around the admin-merge exemption (transitional framing): same invariant content, sharper documentation.
New deps / actions / external surface
- Optional extra `[eval]` with `openai>=1.40.0` (pulled in only on `uv sync --extra eval`).
- Live Azure OpenAI calls from `eval/test_golden_patterns.py`, only when `AZURE_OPENAI_*` env vars are set.

Tagging note
Per `.github/workflows/release.yml`, the public release (GHCR image push, CycloneDX SBOM, GitHub Release page) is tag-triggered. Tag `v0.2.17` against the merge commit when this PR lands to publish.

Linked issues
Closes none directly (all linked issues already closed on develop). This PR fans the closures out to main.