Add reference architecture, benchmarks, and discoverability polish by dgenio · Pull Request #203 · dgenio/contextweaver

dgenio · 2026-05-15T12:55:37Z

Summary

Lands the 5-issue "external-facing surface polish" bundle selected by the triage pass. Relates to #181, #197, #198, #199, #200 — none of these are auto-closed by the PR keyword grammar; some are fully addressed by this PR and some are partial with explicit follow-ups (see below).

Three major components:

Slack ops bot reference architecture (relates to [examples] Production reference architectures cookbook — Slack ops bot, code-review bot, voice agent #198 — partial) — a production-shaped, runnable example of a multi-turn agent fronting ~48 ops tools, demonstrating how routing, the context firewall, and persistent facts compose end-to-end. Code-review bot and voice agent architectures are deferred to follow-up issues [examples] Code-review bot reference architecture #204 and [examples] Voice agent reference architecture (Pipecat) #205 so [examples] Production reference architectures cookbook — Slack ops bot, code-review bot, voice agent #198 stays open.
Public benchmark scorecard (relates to [eval] Publish benchmark scorecard with reproducible public numbers (top-k recall, token savings, latency) #197 — partial) — a deterministic, committed markdown document (benchmarks/scorecard.md) regenerated from benchmarks/results/latest.json, so users can evaluate routing recall, token savings, and latency without running the harness. Per-backend matrix (tfidf/bm25/fuzzy) and weekly scheduled CI regeneration are deferred to follow-up issues [eval] Benchmark scorecard per-backend matrices (tfidf / bm25 / fuzzy) at 100/500/1000 catalog sizes #206 and [eval] Weekly scheduled scorecard regeneration CI job #207 so [eval] Publish benchmark scorecard with reproducible public numbers (top-k recall, token savings, latency) #197 stays open.
Discoverability polish (relates to [docs] Discoverability polish — README terminology, social preview, citation file, repo metadata #200 — full) — README badges (CI, PyPI, Python versions, license, docs, Discussions), extended keywords, CITATION.cff, updated pyproject.toml metadata, and a new decision-tree landing page (docs/which_pattern.md) that routes users to the smallest piece that fixes their symptom. The social-preview PNG remains deferred per the issue's own implementation notes (needs a GitHub UI step beyond a code PR).
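The bounded-choice pattern mentioned for the Slack ops bot (a router narrows a large catalog to a short list, then the bot picks from that list) can be illustrated with a toy sketch. This is not contextweaver's router: the `Tool` class, the `shortlist` function, and the token-overlap scoring are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    # Hypothetical catalog entry; the real catalog.yaml schema may differ.
    name: str
    description: str


def shortlist(query: str, catalog: list[Tool], k: int = 3) -> list[Tool]:
    """Rank tools by naive token overlap with the query and keep the top k."""
    query_tokens = set(query.lower().split())

    def score(tool: Tool) -> int:
        return len(query_tokens & set(tool.description.lower().split()))

    # sorted() is stable, so ties keep catalog order and the result is deterministic.
    ranked = sorted(catalog, key=score, reverse=True)
    return ranked[:k]


catalog = [
    Tool("logs.search", "search application logs for errors"),
    Tool("deploy.rollback", "roll back a deploy to the previous release"),
    Tool("oncall.page", "page the oncall engineer"),
    Tool("metrics.query", "query service metrics and latency"),
    Tool("tickets.create", "create an incident ticket"),
]

# The downstream agent only ever sees these k candidates, never the full catalog.
picked = shortlist("search logs for checkout errors", catalog, k=3)
```

The point of the pattern is that the prompt cost scales with `k` rather than with the catalog size, whatever scoring backend does the narrowing.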

Also fully landed in this PR: #181 (stress-test benchmark scenario forcing items_dropped > 0 + dedup_removed > 0) and #199 (decision-tree landing page).
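To make the two counters concrete, here is a minimal sketch of how a deduplication pass followed by budget-driven selection can produce `dedup_removed > 0` and `items_dropped > 0`. It assumes a crude whitespace token count and exact matching after case and whitespace normalization; `select_within_budget` is a hypothetical helper, not the library's pipeline, which may use different tokenization and near-duplicate detection.

```python
def select_within_budget(
    items: list[str], budget_tokens: int
) -> tuple[list[str], int, int]:
    """Return (kept, dedup_removed, items_dropped) for items ordered by priority."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in items:
        key = " ".join(text.lower().split())  # case- and whitespace-insensitive key
        if key in seen:
            continue
        seen.add(key)
        unique.append(text)
    dedup_removed = len(items) - len(unique)

    kept: list[str] = []
    used = 0
    for text in unique:
        cost = len(text.split())  # crude token estimate (assumption)
        if used + cost <= budget_tokens:
            kept.append(text)
            used += cost
    items_dropped = len(unique) - len(kept)
    return kept, dedup_removed, items_dropped


items = [
    "checkout latency is high",
    "Checkout  latency is HIGH",  # duplicate after normalization
    "rollback deploy 42 now",
    "page the oncall engineer for checkout",  # does not fit the budget below
]
kept, dedup_removed, items_dropped = select_within_budget(items, budget_tokens=8)
```

A scenario only exercises both counters when it contains repeated content and its total volume exceeds the budget, which is why the fixture has to be authored to hit those targets.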

Issue status after this PR

Issue	Status	Notes
#181	✅ Fully addressed	`stress_conversation.jsonl` triggers drops=7, dedup=4, compaction=3.29×
#197	🟡 Partial — leave open	Scorecard committed; follow-ups #206, #207
#198	🟡 Partial — leave open	Slack ops bot landed; follow-ups #204, #205
#199	✅ Fully addressed	`docs/which_pattern.md` + cross-links + mkdocs nav
#200	✅ Fully addressed (minus PNG)	Badges, CITATION, pyproject metadata, classifiers

Changes

Reference Architecture (Slack ops bot)

examples/architectures/slack_ops_bot/main.py — runnable example with a 6-turn on-call investigation transcript, demonstrating routing shortlist selection, firewall artifact interception, and persistent fact tracking across turns. The 240-line synthetic log dump and tool-response map are built lazily via functools.cache so import is cheap.
examples/architectures/slack_ops_bot/catalog.yaml — 48-tool catalog (logs, deploy, oncall, alerts, tickets, metrics, identity, infra, feature) with namespaces, side-effect flags, and cost hints.
examples/architectures/slack_ops_bot/README.md — local documentation explaining the architecture, setup, and key patterns.
examples/architectures/slack_ops_bot/OUTPUT.md — captured deterministic run output showing routing decisions, firewall behavior, and fact persistence.
tests/test_architectures_slack.py — smoke tests pinning deterministic invariants (catalog size, intent match rate, firewall artifact interception). The firewall assertion checks substring shape rather than the exact character count so cosmetic transcript edits don't churn the test.
docs/architectures/slack_ops_bot.md — public documentation linking to the architecture and explaining its shape.
docs/architectures/index.md — landing page for reference architectures.
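The lazy construction described for main.py can be sketched roughly as follows. `functools.cache` is the standard-library decorator (Python 3.9+); the function name `log_dump`, the log line format, and the `BUILD_CALLS` counter are hypothetical and exist only to show that the builder runs once, on first use, rather than at import time.

```python
import functools

BUILD_CALLS = 0  # hypothetical counter, only here to show the builder runs once


@functools.cache
def log_dump() -> str:
    """Build the synthetic 240-line log dump on first use, then reuse it."""
    global BUILD_CALLS
    BUILD_CALLS += 1
    lines = [
        f"2026-01-01T00:00:{i % 60:02d}Z ERROR checkout timeout #{i}"  # made-up format
        for i in range(240)
    ]
    return "\n".join(lines)


# Importing the module costs nothing; the first call pays for construction.
first = log_dump()
second = log_dump()  # served from the cache, same object
```

Because the cached function takes no arguments, it behaves like a lazily initialized module-level constant, so tests that only inspect the catalog never build the dump.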

Benchmark Scorecard

scripts/render_scorecard.py — deterministic stdlib-only renderer that converts benchmarks/results/latest.json to benchmarks/scorecard.md. No contextweaver imports, validates required fields, derives the answer-phase budget from the payload's context rows, and guarantees byte-identical output for identical input.
benchmarks/scorecard.md — committed, auto-generated markdown with routing accuracy (precision@k, recall@k, MRR) and context-pipeline metrics (token budgets, dedup, compaction) at three catalog sizes.
benchmarks/results/latest.json — updated with fresh latency measurements (p50/p95/p99 in ms).
benchmarks/scenarios/stress_conversation.jsonl — new 30+ turn SEV2 incident-response scenario with 3 large tool results to exercise the firewall, near-duplicate pairs to exercise dedup, and total token volume that pushes past the 6000-token answer budget so items_dropped > 0. The fixture is hand-authored to hit specific firewall/dedup/drop targets; regeneration must satisfy the invariants pinned in [eval] Expand long_conversation scenario to stress-test budget-driven selection #181's acceptance criteria.
tests/test_render_scorecard.py — unit tests for determinism, field validation, and table formatting.
docs/benchmarks.md — public documentation explaining what is measured and how to reproduce.
benchmarks/README.md — harness reference with the per-scenario invariants table.
Makefile — new scorecard, scorecard-check, and architectures targets.
.github/workflows/ci.yml — new gating scorecard-check step.
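The renderer-plus-drift-check idea can be sketched in a few lines of stdlib Python. This is an illustration under assumptions, not scripts/render_scorecard.py itself: the payload shape (`scenarios`, `name`, `recall_at_k`), the toy values, and the function names are hypothetical.

```python
import json


def render(payload: dict) -> str:
    """Render a markdown table from a results payload; same input, same bytes."""
    for field in ("scenarios",):
        if field not in payload:
            raise ValueError(f"missing required field: {field}")
    lines = ["| Scenario | Recall@k |", "| --- | --- |"]
    # Sort by name so ordering inside the JSON cannot change the output.
    for row in sorted(payload["scenarios"], key=lambda r: r["name"]):
        # Fixed-width float formatting keeps the rendered bytes stable.
        lines.append(f"| {row['name']} | {row['recall_at_k']:.3f} |")
    return "\n".join(lines) + "\n"


def check(payload: dict, committed: str) -> bool:
    """Drift check: True when the committed text matches a fresh render."""
    return render(payload) == committed


# Toy payload with made-up numbers, standing in for benchmarks/results/latest.json.
payload = json.loads(
    '{"scenarios": [{"name": "b", "recall_at_k": 0.5}, {"name": "a", "recall_at_k": 1}]}'
)
text = render(payload)
```

A gating check of this kind compares the committed markdown against a render of the committed JSON, so it fails on hand edits or stale output without depending on fresh, machine-specific benchmark numbers.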

Discoverability & Documentation

README.md — added badge row (CI, PyPI version, Python versions, license, docs, Discussions); clarified "context engineering" as a secondary phrase; added decision-tree link and benchmark scorecard link.
llms-full.txt and llms.txt — regenerated to pick up the README and pyproject changes.
docs/which_pattern.md — new decision tree routing users to the smallest piece that fixes their symptom (routing-only, firewall-only, or full pipeline).
docs/architecture.md — added pointer to which_pattern.md for new users.
docs/interop.md — added "In a hurry?" link.
docs/cookbook.md — added link to reference architectures.
CITATION.cff — minimal CFF v1.2 citation metadata for GitHub's "Cite this repository" button. version and date-released are documented as release-process owned (updated at tag-cut time). The identifiers: block is intentionally omitted so the file passes CFF validators without a placeholder DOI; it will be added once a real DOI is minted.
pyproject.toml — Documentation / Issues / Discussions URLs, Framework :: AsyncIO + Topic :: Scientific/Engineering :: Artificial Intelligence classifiers, extended keywords.
AGENTS.md and docs/agent-context/workflows.md — updated commands tables with the new make targets.

Non-goals

LLM inference, tool execution, or model wiring — out of scope per contextweaver's stated boundary.
Code-review bot and voice-agent reference architectures — deferred to [examples] Code-review bot reference architecture #204 / [examples] Voice agent reference architecture (Pipecat) #205 to keep this PR reviewable.
Per-backend scorecard matrix and scheduled regeneration — deferred to [eval] Benchmark scorecard per-backend matrices (tfidf / bm25 / fuzzy) at 100/500/1000 catalog sizes #206 / [eval] Weekly scheduled scorecard regeneration CI job #207.
Social-preview PNG — needs a GitHub UI step; tracked under [docs] Discoverability polish — README terminology, social preview, citation file, repo metadata #200.
Real DOI in CITATION.cff — added once the first tagged release is minted on Zenodo.

Follow-up issues (deliberately out of scope)

#	Title	Source issue
#204	code_review_bot reference architecture	#198
#205	voice_agent reference architecture (Pipecat)	#198
#206	per-backend scorecard matrix (tfidf/bm25/fuzzy at 100/500/1000)	#197
#207	weekly scheduled scorecard regeneration CI job	#197

#198 and #197 remain open after this PR so the follow-up work has a parent issue to track against.

Risks & rollback

Low risk: the change is additive (new example, new benchmark fixture, new docs, new optional Makefile targets) plus small metadata edits. Library behavior is unchanged. scorecard-check is the new gating CI step; if it fails because benchmarks/scorecard.md has drifted from benchmarks/results/latest.json, rerun make scorecard from a clean checkout to regenerate. Rollback is a git revert of this PR's merge commit.

Verification

All 6 make ci targets pass locally on Python 3.12 (ruff format --check, ruff check, mypy src/, pytest — 899 passed / 8 skipped, make example, make demo), plus make scorecard-check (CI-gated, separate from make ci). CI is green across Python 3.10 / 3.11 / 3.12 on the latest head.

…#199, #200)

Lands the five-issue "external-facing surface polish" group selected by the triage pass. No changes to src/contextweaver/ pipeline code.

#200: Discoverability polish
- README badges (CI, PyPI, Python versions, license, docs, Discussions)
- "context engineering" as a secondary phrase in intro + Problem
- CITATION.cff at repo root (CFF v1.2)
- pyproject Documentation/Issues/Discussions URLs, Framework :: AsyncIO and Topic :: Scientific/Engineering :: AI classifiers, extended keywords
- Social-preview PNG deferred (needs GitHub UI step)

#199: Decision-tree landing page
- docs/which_pattern.md branches on user symptom -> one concrete next step
- Linked from README Quickstart and top of architecture.md / interop.md
- mkdocs nav entry added

#198: Production reference architectures cookbook (Slack ops bot)
- examples/architectures/slack_ops_bot/ with 48-tool catalog.yaml, 6-turn scripted main.py, README, OUTPUT.md
- Demonstrates bounded-choice pattern (Router narrows 48 -> 3; bot picks from shortlist), firewall on 34 KB log dump, persistent facts across turns
- docs/architectures/index.md + slack_ops_bot.md pages
- make architectures umbrella target wired into make example
- Code-review bot + voice agent deferred to follow-up issues

#197: Benchmark scorecard
- scripts/render_scorecard.py: stdlib-only deterministic renderer
- make scorecard / make scorecard-check targets
- benchmarks/scorecard.md committed, regeneratable in one command
- docs/benchmarks.md methodology page
- README "Why Trust" gains scorecard link
- CI: new gating scorecard-check step (runs before the non-gating benchmark step so it validates the committed pair, not the runner's fresh numbers)
- Per-backend matrix and weekly CI regen tracked as follow-ups

#181: Stress-test benchmark scenario
- benchmarks/scenarios/stress_conversation.jsonl: SEV2 incident-response transcript, 30 user turns, 3 large tool results (firewall fires), 4 near-duplicate pairs (dedup fires), total tokens push past the 6000-token answer budget so items_dropped > 0
- benchmarks/results/latest.json regenerated; benchmark README baseline metrics table refreshed and corrected the stale "git-ignored" note

Tests
- tests/test_render_scorecard.py: 12 tests covering renderer determinism, routing/context table formatting, missing-field validation, --check mode
- tests/test_architectures_slack.py: 8 tests pinning catalog size, intent match count, firewall trigger, persisted facts, final prompt rendering

Verification
- ruff format --check src/ tests/ examples/ scripts/: clean
- ruff check src/ tests/ examples/ scripts/: all checks passed
- mypy src/: 54 files, no issues
- pytest: 768 passed, 8 skipped (was 748 baseline; +20 new tests)
- make example: all scripts including Slack ops bot run cleanly
- make demo: clean
- make scorecard-check: deterministic (no drift)
- make llms-check: up to date

https://claude.ai/code/session_01HeDDydF9mkCZfBx6tYo8Jj

Copilot

Pull request overview

This PR bundles three discoverability/reference-material initiatives: a Slack ops bot reference architecture (#198), a deterministic public benchmark scorecard with make scorecard/scorecard-check targets (#197), and README/PyPI metadata polish including CITATION.cff, badges, extended keywords/classifiers, and a symptom-based decision-tree landing page (#200). It also adds a stress-test benchmark scenario, expands make example with an architectures umbrella, and refreshes agent docs to reflect the new commands.

Changes:

Add Slack ops bot reference architecture (catalog.yaml, runnable main.py, README/OUTPUT, smoke tests) and link it from cookbook/docs/README.
Add deterministic stdlib-only scripts/render_scorecard.py plus benchmarks/scorecard.md, new stress_conversation.jsonl scenario, docs/benchmarks.md, and a gating scorecard-check CI step.
Discoverability polish: README badges, "context engineering" phrasing, CITATION.cff, extended pyproject.toml keywords/classifiers/URLs, docs/which_pattern.md decision tree, mkdocs nav additions, and llms.txt/llms-full.txt regeneration.

Reviewed changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
`examples/architectures/slack_ops_bot/{main.py,catalog.yaml,README.md,OUTPUT.md,__init__.py}`	New Slack ops bot reference architecture wiring router, firewall, and FactStore
`examples/architectures/__init__.py`	Architectures package marker
`tests/test_architectures_slack.py`	Smoke tests pinning catalog size, intent match rate, firewall + fact invariants
`scripts/render_scorecard.py`	Deterministic stdlib renderer with `--check` drift mode
`tests/test_render_scorecard.py`	Determinism, validation, and CLI behavior tests
`benchmarks/scorecard.md`, `benchmarks/results/latest.json`, `benchmarks/scenarios/stress_conversation.jsonl`, `benchmarks/README.md`	Scorecard output, refreshed baseline, new stress scenario, doc refresh
`docs/{which_pattern.md,benchmarks.md,architectures/index.md,architectures/slack_ops_bot.md,architecture.md,interop.md,cookbook.md,agent-context/workflows.md}`	New decision tree, benchmarks doc, architecture pages, cross-links
`README.md`, `llms.txt`, `llms-full.txt`, `mkdocs.yml`, `scripts/gen_llms.py`	Badges, phrasing, nav additions, regenerated LLM context files
`pyproject.toml`, `CITATION.cff`	Extended metadata, classifiers, URLs, citation file
`Makefile`, `.github/workflows/ci.yml`, `AGENTS.md`, `CHANGELOG.md`	New `scorecard`/`scorecard-check`/`architectures` targets, gating CI step, docs/changelog updates

dgenio · 2026-05-15T16:51:37Z

Review — REQUEST_CHANGES (Phase 2/3 of review cycle)

Large PR — 33 files, 2,471 additions across 5 functional areas. All 6 make ci targets pass locally (fmt ✓ lint ✓ type ✓ test 899/899 ✓ example ✓ demo ✓).

Must-fix ✅ (resolved in `97c3263` — already on branch)

.github/workflows/ci.yml — missing --retry in the gating Weaver-spec conformance step.

- curl -fsSL "https://raw.githubusercontent.com/dgenio/weaver-spec/main/contracts/json/${s}.schema.json" \
+ curl -fsSL --retry 3 --retry-delay 2 "https://raw.githubusercontent.com/dgenio/weaver-spec/main/contracts/json/${s}.schema.json" \

The original curl -fsSL had no --retry flag. A transient hiccup on raw.githubusercontent.com (CDN, DNS, GitHub incident) would fail all four schema fetches, fail the gating step, and block every open PR on this repo — including unrelated hotfixes. Fixed in 97c3263.
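For readers less familiar with the curl flags, the behavior of `--retry 3 --retry-delay 2` is roughly analogous to the Python sketch below: one initial attempt, up to three retries, and a fixed two-second pause between them. The helper `fetch_with_retry` and the `flaky` fake are hypothetical. One simplification to note: curl only retries errors it classifies as transient (timeouts and certain HTTP status codes), whereas this sketch treats any `OSError` as retryable.

```python
import time
from typing import Callable


def fetch_with_retry(
    fetch: Callable[[], str],
    retries: int = 3,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(), retrying up to `retries` extra times with a fixed delay."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except OSError:  # covers ConnectionError, TimeoutError, urllib's URLError
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(delay_seconds)
    raise AssertionError("unreachable")


attempts: list[int] = []


def flaky() -> str:
    """Fake fetch that fails twice before succeeding."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient")
    return "schema"


# Inject a no-op sleep so the demonstration does not actually wait.
result = fetch_with_retry(flaky, sleep=lambda _: None)
```

With retries in place, a single failed fetch no longer fails the gating step on its own; a sustained outage still does, which is the intended behavior for a conformance gate.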

Strengths

Slack ops bot is a production-quality reference architecture: bounded-choice routing pattern correctly demonstrated (Router narrows → bot selects), firewall artifact interception works, persistent facts carry across all 6 turns, 100% intent match on scripted transcript.
Scorecard renderer (scripts/render_scorecard.py) is stdlib-only, deterministic, and byte-stable, with 13 unit tests covering determinism, field validation, --check drift mode, and missing-file handling. It correctly follows the gen_llms.py convention.
stress_conversation.jsonl fully satisfies [eval] Expand long_conversation scenario to stress-test budget-driven selection #181: 7 dropped, 4 dedup removed, 3.29× avg compaction — all four previously-untested pipeline stages now appear in benchmark output.
Discoverability changes complete: Documentation/Issues URLs, Framework :: AsyncIO + AI classifier, extended keywords, CITATION.cff, docs/which_pattern.md (162 lines), linked from README/architecture.md/interop.md/mkdocs.yml.
All architectural boundaries maintained. No print() in library code. No business logic in __init__.py. AGENTS.md, CHANGELOG.md, docs/agent-context/workflows.md all updated.

Informational (non-blocking)

Follow-up issues created for deferred scope:

[examples] Code-review bot reference architecture #204 — code_review_bot reference architecture ([examples] Production reference architectures cookbook — Slack ops bot, code-review bot, voice agent #198 remainder)
[examples] Voice agent reference architecture (Pipecat) #205 — voice_agent reference architecture (Pipecat) ([examples] Production reference architectures cookbook — Slack ops bot, code-review bot, voice agent #198 remainder)
[eval] Benchmark scorecard per-backend matrices (tfidf / bm25 / fuzzy) at 100/500/1000 catalog sizes #206 — per-backend scoring matrices tfidf/bm25/fuzzy at 100/500/1000 sizes ([eval] Publish benchmark scorecard with reproducible public numbers (top-k recall, token savings, latency) #197 remainder)
[eval] Weekly scheduled scorecard regeneration CI job #207 — weekly scheduled scorecard regeneration CI job ([eval] Publish benchmark scorecard with reproducible public numbers (top-k recall, token savings, latency) #197 remainder)

Suggest updating Closes #198 → Relates to #198 in the PR description so the issue stays open for #204/#205.

Nit: CITATION.cff is missing date-released and a DOI placeholder (doi: "10.5281/zenodo.XXXXXXX"). Both are CFF 1.2 best practice, but neither is required for GitHub's "Cite this repository" button today.

…nit)

Address the CFF 1.2 best-practice nit from the PR #203 review. Adds `date-released: 2026-05-11` (matches the 0.3.0 release date in CHANGELOG) and an `identifiers:` entry with a Zenodo DOI placeholder. Both are optional for GitHub's "Cite this repository" button to render but recommended by the spec, and trivial to add now so they don't get forgotten when the project does get a real DOI. The DOI value is intentionally a placeholder (`10.5281/zenodo.XXXXXXX`) and is marked as such in the inline description; replace once a real Zenodo archive of a tagged release exists.

https://claude.ai/code/session_01HeDDydF9mkCZfBx6tYo8Jj

- CITATION.cff: remove DOI placeholder identifier and document that version/date-released are owned by the release process.
- tests/test_architectures_slack.py: loosen brittle exact-char-count firewall assertion to substring checks for stability.
- examples/architectures/slack_ops_bot/main.py: lazy-init the large log dump and tool-response map via @functools.cache so import is cheap.
- scripts/render_scorecard.py: derive answer-phase budget from the benchmark payload instead of hard-coding 6000.
- benchmarks/README.md: note that stress_conversation.jsonl is hand-authored and point regeneration at the #181 acceptance criteria.

Copilot AI review requested due to automatic review settings May 15, 2026 12:55

Copilot started reviewing on behalf of dgenio May 15, 2026 12:56

Copilot AI reviewed May 15, 2026

dgenio added 3 commits May 15, 2026 16:22

2c0b364 merge: resolve conflicts with main (scorecard+weaver-spec ci steps, PHONY targets, CHANGELOG entries)
b720fde fix(lint): replace Any with object in _spec_to_jsonable (ANN401)
97c3263 fix(ci): add --retry 3 --retry-delay 2 to curl in Weaver-spec conformance step

dgenio mentioned this pull request May 15, 2026

[eval] Expand long_conversation scenario to stress-test budget-driven selection #181

Closed

5 tasks

dgenio merged commit c2f49ee into main May 16, 2026
3 checks passed

dgenio deleted the claude/triage-issues-o7Ofs branch May 16, 2026 06:55

dgenio merged 6 commits into main from claude/triage-issues-o7Ofs
