Add reference architecture, benchmarks, and discoverability polish #203
…#199, #200)

Lands the five-issue "external-facing surface polish" group selected by the triage pass. No changes to src/contextweaver/ pipeline code.

#200: Discoverability polish
- README badges (CI, PyPI, Python versions, license, docs, Discussions)
- "context engineering" as a secondary phrase in intro + Problem
- CITATION.cff at repo root (CFF v1.2)
- pyproject Documentation/Issues/Discussions URLs, Framework :: AsyncIO and Topic :: Scientific/Engineering :: AI classifiers, extended keywords
- Social-preview PNG deferred (needs GitHub UI step)

#199: Decision-tree landing page
- docs/which_pattern.md branches on user symptom -> one concrete next step
- Linked from README Quickstart and top of architecture.md / interop.md
- mkdocs nav entry added

#198: Production reference architectures cookbook (Slack ops bot)
- examples/architectures/slack_ops_bot/ with 48-tool catalog.yaml, 6-turn scripted main.py, README, OUTPUT.md
- Demonstrates bounded-choice pattern (Router narrows 48 -> 3; bot picks from shortlist), firewall on 34 KB log dump, persistent facts across turns
- docs/architectures/index.md + slack_ops_bot.md pages
- make architectures umbrella target wired into make example
- Code-review bot + voice agent deferred to follow-up issues

#197: Benchmark scorecard
- scripts/render_scorecard.py: stdlib-only deterministic renderer
- make scorecard / make scorecard-check targets
- benchmarks/scorecard.md committed, regeneratable in one command
- docs/benchmarks.md methodology page
- README "Why Trust" gains scorecard link
- CI: new gating scorecard-check step (runs before the non-gating benchmark step so it validates the committed pair, not the runner's fresh numbers)
- Per-backend matrix and weekly CI regen tracked as follow-ups

#181: Stress-test benchmark scenario
- benchmarks/scenarios/stress_conversation.jsonl: SEV2 incident-response transcript, 30 user turns, 3 large tool results (firewall fires), 4 near-duplicate pairs (dedup fires), total tokens push past the 6000-token answer budget so items_dropped > 0
- benchmarks/results/latest.json regenerated; benchmark README baseline metrics table refreshed and corrected the stale "git-ignored" note

Tests
- tests/test_render_scorecard.py: 12 tests covering renderer determinism, routing/context table formatting, missing-field validation, --check mode
- tests/test_architectures_slack.py: 8 tests pinning catalog size, intent match count, firewall trigger, persisted facts, final prompt rendering

Verification
- ruff format --check src/ tests/ examples/ scripts/ - clean
- ruff check src/ tests/ examples/ scripts/ - all checks passed
- mypy src/ - 54 files, no issues
- pytest - 768 passed, 8 skipped (was 748 baseline; +20 new tests)
- make example - all scripts including Slack ops bot run cleanly
- make demo - clean
- make scorecard-check - deterministic (no drift)
- make llms-check - up to date

https://claude.ai/code/session_01HeDDydF9mkCZfBx6tYo8Jj
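The bounded-choice pattern named in the commit message (a router narrows the catalog to a shortlist, and the bot may only pick from that shortlist) can be pictured with a minimal sketch. This is not the contextweaver Router: the tiny catalog, the keyword-overlap scoring, and the names `shortlist` and `choose` are all assumptions made for illustration.

```python
# Hypothetical miniature catalog; the real catalog.yaml in the PR has 48 tools.
CATALOG = {
    "logs.search": "search application logs by service and time range",
    "deploy.rollback": "roll back a service deploy to the previous release",
    "oncall.page": "page the on-call engineer for a service",
    "metrics.query": "query latency and error rate metrics for a service",
    "tickets.create": "create an incident ticket",
}

# Small stopword list so filler words do not count as matches (assumption).
STOPWORDS = {"a", "an", "and", "by", "for", "the", "to"}


def tokenize(text: str) -> set[str]:
    """Lowercase, split on whitespace, and drop stopwords."""
    return {word for word in text.lower().split() if word not in STOPWORDS}


def shortlist(query: str, catalog: dict[str, str], k: int = 3) -> list[str]:
    """Return the k tool names whose descriptions share the most words with the query."""
    query_tokens = tokenize(query)
    scored = []
    for name, description in catalog.items():
        overlap = len(query_tokens & tokenize(description))
        # Negative overlap sorts best-first; the name breaks ties deterministically.
        scored.append((-overlap, name))
    scored.sort()
    return [name for _, name in scored[:k]]


def choose(candidate: str, allowed: list[str]) -> str:
    """Accept a tool choice only if it is on the shortlist (the bounded choice)."""
    if candidate not in allowed:
        raise ValueError(f"{candidate!r} is not in the shortlist {allowed}")
    return candidate
```

The point of the second function is that the model never sees, and can never select, a tool outside the shortlist; a choice such as `choose("tickets.create", shortlist(...))` is rejected rather than silently executed.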
Pull request overview
This PR bundles three discoverability/reference-material initiatives: a Slack ops bot reference architecture (#198), a deterministic public benchmark scorecard with make scorecard/scorecard-check targets (#197), and README/PyPI metadata polish including CITATION.cff, badges, extended keywords/classifiers, and a symptom-based decision-tree landing page (#200). It also adds a stress-test benchmark scenario, expands make example with an architectures umbrella, and refreshes agent docs to reflect the new commands.
Changes:
- Add Slack ops bot reference architecture (`catalog.yaml`, runnable `main.py`, README/OUTPUT, smoke tests) and link it from cookbook/docs/README.
- Add deterministic stdlib-only `scripts/render_scorecard.py` plus `benchmarks/scorecard.md`, new `stress_conversation.jsonl` scenario, `docs/benchmarks.md`, and a gating `scorecard-check` CI step.
- Discoverability polish: README badges, "context engineering" phrasing, `CITATION.cff`, extended `pyproject.toml` keywords/classifiers/URLs, `docs/which_pattern.md` decision tree, mkdocs nav additions, and `llms.txt`/`llms-full.txt` regeneration.
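As a rough illustration of what a deterministic renderer with a drift check involves, the sketch below renders a JSON-like payload to markdown and compares it with the committed text. It is not the code in `scripts/render_scorecard.py`: the field names (`routing`, `catalog_size`, `recall_at_k`, `mrr`) and the function names are assumptions.

```python
# Hypothetical required fields for each routing row.
REQUIRED_FIELDS = {"catalog_size", "recall_at_k", "mrr"}


def render_scorecard(payload: dict) -> str:
    """Render routing rows as a markdown table; identical input gives identical bytes."""
    rows = payload["routing"]
    for row in rows:
        missing = REQUIRED_FIELDS - row.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")

    lines = ["| Catalog size | Recall@k | MRR |", "|---|---|---|"]
    # Sort rows and fix the float format so output never depends on input order.
    for row in sorted(rows, key=lambda r: r["catalog_size"]):
        lines.append(
            f"| {row['catalog_size']} | {row['recall_at_k']:.3f} | {row['mrr']:.3f} |"
        )
    return "\n".join(lines) + "\n"


def check_drift(payload: dict, committed: str) -> bool:
    """Return True when the committed scorecard matches a fresh render."""
    return render_scorecard(payload) == committed
```

A gating CI step in this style would load the committed JSON and markdown, call `check_drift`, and exit non-zero on `False`. It needs no timestamps, no environment lookups, and nothing outside the standard library.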
Reviewed changes
Copilot reviewed 32 out of 32 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `examples/architectures/slack_ops_bot/{main.py,catalog.yaml,README.md,OUTPUT.md,__init__.py}` | New Slack ops bot reference architecture wiring router, firewall, and FactStore |
| `examples/architectures/__init__.py` | Architectures package marker |
| `tests/test_architectures_slack.py` | Smoke tests pinning catalog size, intent match rate, firewall + fact invariants |
| `scripts/render_scorecard.py` | Deterministic stdlib renderer with `--check` drift mode |
| `tests/test_render_scorecard.py` | Determinism, validation, and CLI behavior tests |
| `benchmarks/scorecard.md`, `benchmarks/results/latest.json`, `benchmarks/scenarios/stress_conversation.jsonl`, `benchmarks/README.md` | Scorecard output, refreshed baseline, new stress scenario, doc refresh |
| `docs/{which_pattern.md,benchmarks.md,architectures/index.md,architectures/slack_ops_bot.md,architecture.md,interop.md,cookbook.md,agent-context/workflows.md}` | New decision tree, benchmarks doc, architecture pages, cross-links |
| `README.md`, `llms.txt`, `llms-full.txt`, `mkdocs.yml`, `scripts/gen_llms.py` | Badges, phrasing, nav additions, regenerated LLM context files |
| `pyproject.toml`, `CITATION.cff` | Extended metadata, classifiers, URLs, citation file |
| `Makefile`, `.github/workflows/ci.yml`, `AGENTS.md`, `CHANGELOG.md` | New scorecard/scorecard-check/architectures targets, gating CI step, docs/changelog updates |
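The firewall behaviour listed in the file summary (large tool results intercepted before they reach the prompt) might look roughly like the sketch below. It is a guess at the general shape, not contextweaver's firewall: the threshold, the in-memory `ARTIFACTS` store, the handle format, and the function name are all hypothetical.

```python
import hashlib

MAX_INLINE_CHARS = 2000  # hypothetical threshold; the library's real limit is not stated

# Hypothetical out-of-band store mapping artifact handles to full payloads.
ARTIFACTS: dict[str, str] = {}


def firewall(tool_output: str, max_inline: int = MAX_INLINE_CHARS) -> str:
    """Pass small outputs through; replace large ones with a short stub plus a handle."""
    if len(tool_output) <= max_inline:
        return tool_output

    # A content hash gives a stable handle for the same payload.
    digest = hashlib.sha256(tool_output.encode("utf-8")).hexdigest()[:12]
    handle = f"artifact:{digest}"
    ARTIFACTS[handle] = tool_output

    lines = tool_output.splitlines()
    first_line = lines[0] if lines else ""
    return (
        f"[{handle}] {len(tool_output)} chars stored out of band; "
        f"first line: {first_line[:80]}"
    )
```

Asserting on the shape of the stub (it starts with a handle, it contains a known substring) rather than on its exact length keeps a test stable when the transcript changes cosmetically, which is the direction the later review fix takes.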
Review: REQUEST_CHANGES (Phase 2/3 of review cycle)
Large PR: 33 files, 2,471 additions across 5 functional areas.
All 6 Must-fix ✅ (resolved in 97c3263, already on branch)
- curl -fsSL "https://raw.githubusercontent.com/dgenio/weaver-spec/main/contracts/json/${s}.schema.json" \
+ curl -fsSL --retry 3 --retry-delay 2 "https://raw.githubusercontent.com/dgenio/weaver-spec/main/contracts/json/${s}.schema.json" \
The original
Strengths
Informational (non-blocking)
Follow-up issues created for deferred scope:
Suggest updating
Nit:
…nit) Address the CFF 1.2 best-practice nit from the PR #203 review. Adds `date-released: 2026-05-11` (matches the 0.3.0 release date in CHANGELOG) and an `identifiers:` entry with a Zenodo DOI placeholder. Both are optional for GitHub's "Cite this repository" button to render but recommended by the spec — and trivial to add now so they don't get forgotten when the project does get a real DOI. The DOI value is intentionally a placeholder (`10.5281/zenodo.XXXXXXX`) and is marked as such in the inline description; replace once a real Zenodo archive of a tagged release exists. https://claude.ai/code/session_01HeDDydF9mkCZfBx6tYo8Jj
- CITATION.cff: remove DOI placeholder identifier and document that version/date-released are owned by the release process.
- tests/test_architectures_slack.py: loosen brittle exact-char-count firewall assertion to substring checks for stability.
- examples/architectures/slack_ops_bot/main.py: lazy-init the large log dump and tool-response map via @functools.cache so import is cheap.
- scripts/render_scorecard.py: derive answer-phase budget from the benchmark payload instead of hard-coding 6000.
- benchmarks/README.md: note that stress_conversation.jsonl is hand-authored and point regeneration at the #181 acceptance criteria.
Summary
Lands the 5-issue "external-facing surface polish" bundle selected by the triage pass. Relates to #181, #197, #198, #199, #200 — none of these are auto-closed by the PR keyword grammar; some are fully addressed by this PR and some are partial with explicit follow-ups (see below).
Three major components:
- … `benchmarks/scorecard.md`) regenerated from `benchmarks/results/latest.json`, so users can evaluate routing recall, token savings, and latency without running the harness. Per-backend matrix (tfidf/bm25/fuzzy) and weekly scheduled CI regeneration are deferred to follow-up issues #206 ("[eval] Benchmark scorecard per-backend matrices (tfidf / bm25 / fuzzy) at 100/500/1000 catalog sizes") and #207 ("[eval] Weekly scheduled scorecard regeneration CI job") so #197 ("[eval] Publish benchmark scorecard with reproducible public numbers (top-k recall, token savings, latency)") stays open.
- … `CITATION.cff`, updated `pyproject.toml` metadata, and a new decision-tree landing page (`docs/which_pattern.md`) that routes users to the smallest piece that fixes their symptom. The social-preview PNG remains deferred per the issue's own implementation notes (needs a GitHub UI step beyond a code PR).

Also fully landed in this PR: #181 (stress-test benchmark scenario forcing `items_dropped > 0` + `dedup_removed > 0`) and #199 (decision-tree landing page).

Issue status after this PR
- `stress_conversation.jsonl` triggers drops=7, dedup=4, compaction=3.29×
- `docs/which_pattern.md` + cross-links + mkdocs nav

Changes
Reference Architecture (Slack ops bot)
- `examples/architectures/slack_ops_bot/main.py`: runnable example with a 6-turn on-call investigation transcript, demonstrating routing shortlist selection, firewall artifact interception, and persistent fact tracking across turns. The 240-line synthetic log dump and tool-response map are built lazily via `functools.cache` so import is cheap.
- `examples/architectures/slack_ops_bot/catalog.yaml`: 48-tool catalog (logs, deploy, oncall, alerts, tickets, metrics, identity, infra, feature) with namespaces, side-effect flags, and cost hints.
- `examples/architectures/slack_ops_bot/README.md`: local documentation explaining the architecture, setup, and key patterns.
- `examples/architectures/slack_ops_bot/OUTPUT.md`: captured deterministic run output showing routing decisions, firewall behavior, and fact persistence.
- `tests/test_architectures_slack.py`: smoke tests pinning deterministic invariants (catalog size, intent match rate, firewall artifact interception). The firewall assertion checks substring shape rather than the exact character count so cosmetic transcript edits don't churn the test.
- `docs/architectures/slack_ops_bot.md`: public documentation linking to the architecture and explaining its shape.
- `docs/architectures/index.md`: landing page for reference architectures.

Benchmark Scorecard
- `scripts/render_scorecard.py`: deterministic stdlib-only renderer that converts `benchmarks/results/latest.json` to `benchmarks/scorecard.md`. No contextweaver imports, validates required fields, derives the answer-phase budget from the payload's context rows, and guarantees byte-identical output for identical input.
- `benchmarks/scorecard.md`: committed, auto-generated markdown with routing accuracy (precision@k, recall@k, MRR) and context-pipeline metrics (token budgets, dedup, compaction) at three catalog sizes.
- `benchmarks/results/latest.json`: updated with fresh latency measurements (p50/p95/p99 in ms).
- `benchmarks/scenarios/stress_conversation.jsonl`: new 30+ turn SEV2 incident-response scenario with 3 large tool results to exercise the firewall, near-duplicate pairs to exercise dedup, and total token volume that pushes past the 6000-token answer budget so `items_dropped > 0`. The fixture is hand-authored to hit specific firewall/dedup/drop targets; regeneration must satisfy the invariants pinned in the acceptance criteria of #181 ("[eval] Expand long_conversation scenario to stress-test budget-driven selection").
- `tests/test_render_scorecard.py`: unit tests for determinism, field validation, and table formatting.
- `docs/benchmarks.md`: public documentation explaining what is measured and how to reproduce.
- `benchmarks/README.md`: harness reference with the per-scenario invariants table.
- `Makefile`: new `scorecard`, `scorecard-check`, and `architectures` targets.
- `.github/workflows/ci.yml`: new gating `scorecard-check` step.

Discoverability & Documentation
- `README.md`: added badge row (CI, PyPI version, Python versions, license, docs, Discussions); clarified "context engineering" as a secondary phrase; added decision-tree link and benchmark scorecard link.
- `llms-full.txt` and `llms.txt`: regenerated to pick up the README and pyproject changes.
- `docs/which_pattern.md`: new decision tree routing users to the smallest piece that fixes their symptom (routing-only, firewall-only, or full pipeline).
- `docs/architecture.md`: added pointer to `which_pattern.md` for new users.
- `docs/interop.md`: added "In a hurry?" link.
- `docs/cookbook.md`: added link to reference architectures.
- `CITATION.cff`: minimal CFF v1.2 citation metadata for GitHub's "Cite this repository" button. `version` and `date-released` are documented as release-process owned (updated at tag-cut time). The `identifiers:` block is intentionally omitted so the file passes CFF validators without a placeholder DOI; it will be added once a real DOI is minted.
- `pyproject.toml`: `Documentation`/`Issues`/`Discussions` URLs, `Framework :: AsyncIO` + `Topic :: Scientific/Engineering :: Artificial Intelligence` classifiers, extended keywords.
- `AGENTS.md` and `docs/agent-context/workflows.md`: updated commands tables with the new `make` targets.

Non-goals
- `CITATION.cff`: added once the first tagged release is minted on Zenodo.

Follow-up issues (deliberately out of scope)
#198 and #197 remain open after this PR so the follow-up work has a parent issue to track against.
Risks & rollback
Low risk — additive surface (new example, new benchmark fixture, new docs, new optional Makefile targets) plus small metadata edits. The only behavioral surface inside the library is unchanged.
`scorecard-check` is the new gating CI step; if it ever drifts, rerun `make scorecard` from a clean checkout to regenerate. Rollback is `git revert` of this PR's merge commit.

Verification
All 6 `make ci` targets pass locally on Python 3.12 (`ruff format --check`, `ruff check`, `mypy src/`, `pytest` (899 passed / 8 skipped), `make example`, `make demo`), plus `make scorecard-check` (CI-gated, separate from `make ci`). CI is green across Python 3.10 / 3.11 / 3.12 on the latest head.