docs+bench: ship 5-issue user-facing surface bundle (#181, #197, #198, #199, #200) #228
…#199, #200)

Lands the five-issue "external-facing surface polish" group selected by the triage pass. No changes to src/contextweaver/ pipeline code.

#200: Discoverability polish
- README badges (CI, PyPI, Python versions, license, docs, Discussions)
- "context engineering" as a secondary phrase in intro + Problem
- CITATION.cff at repo root (CFF v1.2)
- pyproject Documentation/Issues/Discussions URLs, Framework :: AsyncIO and Topic :: Scientific/Engineering :: AI classifiers, extended keywords
- Social-preview PNG deferred (needs GitHub UI step)

#199: Decision-tree landing page
- docs/which_pattern.md branches on user symptom -> one concrete next step
- Linked from README Quickstart and top of architecture.md / interop.md
- mkdocs nav entry added

#198: Production reference architectures cookbook (Slack ops bot)
- examples/architectures/slack_ops_bot/ with 48-tool catalog.yaml, 6-turn scripted main.py, README, OUTPUT.md
- Demonstrates bounded-choice pattern (Router narrows 48 -> 3; bot picks from shortlist), firewall on 34 KB log dump, persistent facts across turns
- docs/architectures/index.md + slack_ops_bot.md pages
- make architectures umbrella target wired into make example
- Code-review bot + voice agent deferred to follow-up issues

#197: Benchmark scorecard
- scripts/render_scorecard.py: stdlib-only deterministic renderer
- make scorecard / make scorecard-check targets
- benchmarks/scorecard.md committed, regeneratable in one command
- docs/benchmarks.md methodology page
- README "Why Trust" gains scorecard link
- CI: new gating scorecard-check step (runs before the non-gating benchmark step so it validates the committed pair, not the runner's fresh numbers)
- Per-backend matrix and weekly CI regen tracked as follow-ups

#181: Stress-test benchmark scenario
- benchmarks/scenarios/stress_conversation.jsonl: SEV2 incident-response transcript, 30 user turns, 3 large tool results (firewall fires), 4 near-duplicate pairs (dedup fires), total tokens push past the 6000-token answer budget so items_dropped > 0
- benchmarks/results/latest.json regenerated; benchmark README baseline metrics table refreshed and corrected the stale "git-ignored" note

Tests
- tests/test_render_scorecard.py: 12 tests covering renderer determinism, routing/context table formatting, missing-field validation, --check mode
- tests/test_architectures_slack.py: 8 tests pinning catalog size, intent match count, firewall trigger, persisted facts, final prompt rendering

Verification
- ruff format --check src/ tests/ examples/ scripts/: clean
- ruff check src/ tests/ examples/ scripts/: all checks passed
- mypy src/: 54 files, no issues
- pytest: 768 passed, 8 skipped (was 748 baseline; +20 new tests)
- make example: all scripts including Slack ops bot run cleanly
- make demo: clean
- make scorecard-check: deterministic (no drift)
- make llms-check: up to date

https://claude.ai/code/session_01HeDDydF9mkCZfBx6tYo8Jj
…HONY targets, CHANGELOG entries)
…nit)

Address the CFF 1.2 best-practice nit from the PR #203 review. Adds `date-released: 2026-05-11` (matches the 0.3.0 release date in CHANGELOG) and an `identifiers:` entry with a Zenodo DOI placeholder. Both are optional for GitHub's "Cite this repository" button to render but recommended by the spec, and trivial to add now so they don't get forgotten when the project does get a real DOI.

The DOI value is intentionally a placeholder (`10.5281/zenodo.XXXXXXX`) and is marked as such in the inline description; replace once a real Zenodo archive of a tagged release exists.

https://claude.ai/code/session_01HeDDydF9mkCZfBx6tYo8Jj
…208, #209, #213, #215)

- _utils.tokenize: split on `.`, `-`, `/`, `:` (single source of truth); emit dotted/hyphenated/slashed compound + per-segment sub-tokens; underscore intentionally kept as non-separator (synthetic catalog crosstalk noise; see _OUTER_SPLIT_RE docstring rationale captured during #213 measurement). Closes the dead-code replace(":"/" "/"_"/"/") workaround in routing/router.py.
- benchmarks/benchmark.py: gain `--matrix`, `--backends`, `--sizes`, and `--no-naive-delta` flags. Emit additive `routing_matrix` (per-backend × per-size, #208), `routing_per_namespace` (#209), and per-row `naive_delta` (#215) keys. Legacy `routing` and `context` keys preserved so PR #203's scorecard renderer keeps working.
- scripts/baseline_naive.py: new stdlib-only naïve-concat baseline. Exposes `compute_naive_delta` for benchmark.py and a CLI for annotating an existing latest.json in place.
- scripts/render_scorecard.py: render the new matrix, per-namespace, and naive-delta sections additively when the JSON carries them.

Measured impact (50-query gold, current synthetic catalog):
- routing summary: -0pp at 50, 83, 1000 vs v0.3.0 baseline
- matrix tfidf 100: 0.6100 (==)
- matrix tfidf 500: 0.5300 (-2pp; 1-query noise floor)
- matrix tfidf 1000: 0.3100 (==)

#208 #209 #213 #215
…mpts, weekly cron

Lands the remaining 5 eval-substrate issues on top of the matrix/baseline/tokenizer work in the previous commit:

- #209: routing_gold.json expanded 50 → 200 queries, 25 per namespace, with explicit `namespace` field. Catalog-validated. Reveals more honest recall numbers (the original 50 were curated easy wins).
- #211: scripts/benchmark_delta.py renders head-vs-base markdown with shared ✅/⚠️ markers (latency × 1.30, accuracy -1pp). CI job `benchmark-comment` posts a sticky one-per-PR comment via peter-evans/find-comment + create-or-update-comment. Adds "Reproducibility" subsection to the PR template (encouraged, not required).
- #214: scripts/sweep_scoring.py + Makefile target + initial benchmarks/sweep_scoring.md. Full 243-cell grid (3⁵), composite formula documented in the report header. Default config rank reported; Pareto candidates surfaced as follow-up only, with defaults intentionally untouched.
- #216: .github/prompts/add-eval.prompt.md matching the structure of the other three prompts. Cross-linked from AGENTS.md "Adding a Feature" and docs/agent-context/workflows.md.
- #207: .github/workflows/scorecard-weekly.yml runs Mon 06:00 UTC, regenerates latest.json + scorecard.md, opens a PR via peter-evans/create-pull-request on drift.

Plus README.md naive-baseline footnote with measured token reduction (42–75% range, 55.8% average across the four scenarios; replaces the unsubstantiated "70% lower cost" claim per #215).

#207 #209 #211 #214 #216
Pull request overview
Lands a five-issue "external-facing surface polish" bundle: discoverability metadata, a decision-tree landing page, a Slack-ops-bot reference architecture, a reproducible benchmark scorecard, and a stress-test scenario that exercises budget-driven drops and dedup. No changes to src/contextweaver/ pipeline code beyond a tokeniser refactor referenced by the gold-set expansion.
Changes:
- Discoverability: README badges + "context engineering" phrasing, `CITATION.cff`, expanded `pyproject.toml` URLs/classifiers/keywords.
- Docs: new `docs/which_pattern.md` decision tree, `docs/benchmarks.md`, `docs/architectures/` pages; cross-linked from architecture/interop/cookbook and mkdocs nav.
- Tooling/benchmarks: `scripts/render_scorecard.py`, `scripts/benchmark_delta.py`, `scripts/baseline_naive.py`, `scripts/sweep_scoring.py`; new `make` targets (`scorecard`, `scorecard-check`, `benchmark-matrix`, `sweep-scoring`, `architectures`); CI gating `scorecard-check` step and sticky benchmark-delta comment job; weekly scorecard cron; new `stress_conversation.jsonl` scenario; namespace-aware tokenizer (#213) with router cleanup; expanded 200-query gold set; Slack ops bot architecture under `examples/architectures/`.
Reviewed changes
Copilot reviewed 45 out of 45 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| README.md, llms.txt, llms-full.txt | Badges, "context engineering" phrasing, scorecard link, decision-tree link, examples table addition. |
| pyproject.toml | New URLs, classifiers, keywords for PyPI discoverability. |
| CITATION.cff | New CFF v1.2 file for "Cite this repository" button. |
| mkdocs.yml | Adds Which-pattern, Architectures, Benchmarks nav entries. |
| Makefile, AGENTS.md, docs/agent-context/workflows.md | New make targets (scorecard*, benchmark-matrix, sweep-scoring, architectures), scripts/ added to lint/fmt scope. |
| docs/which_pattern.md, docs/benchmarks.md, docs/architectures/{index,slack_ops_bot}.md | New documentation pages. |
| docs/architecture.md, docs/interop.md, docs/cookbook.md | Cross-links to the new decision-tree and architectures pages. |
| examples/architectures/slack_ops_bot/{main.py,catalog.yaml,README.md,OUTPUT.md,`__init__.py`}, examples/architectures/`__init__.py` | New Slack ops bot reference architecture. |
| scripts/render_scorecard.py, scripts/benchmark_delta.py, scripts/baseline_naive.py, scripts/sweep_scoring.py | New stdlib-only renderer, PR-delta generator, naive baseline, and weight-sweep tool. |
| scripts/gen_llms.py, scripts/weaver_spec_conformance.py | Formatting/cleanup. |
| benchmarks/benchmark.py | --matrix, --backends, --sizes, --no-naive-delta flags; per-backend/per-namespace cells; v1.1 schema bump. |
| benchmarks/scorecard.md, benchmarks/sweep_scoring.md, benchmarks/results/latest.json, benchmarks/README.md, benchmarks/routing_gold.json | Generated/committed scorecard outputs and expanded 200-query gold set. |
| src/contextweaver/_utils.py, src/contextweaver/routing/router.py, tests/test_utils.py | Namespace-aware tokenizer with retained underscore behavior and tests. |
| tests/test_render_scorecard.py, tests/test_architectures_slack.py | New test files for the renderer and the Slack ops bot. |
| .github/workflows/{ci.yml,scorecard-weekly.yml}, .github/pull_request_template.md, .github/prompts/add-eval.prompt.md | CI gating scorecard-check, sticky benchmark-delta comment, weekly regen cron, eval prompt. |
| CHANGELOG.md | Documents the bundled changes. |
Benchmark delta (vs …)

| size | recall@k (head Δ vs base) | MRR (head Δ vs base) | p99 (ms) |
|---|---|---|---|
| 50 | ✅ 0.5649 (+0.0000) | ✅ 0.4978 (+0.0000) | ✅ 0.498 (base 0.463) |
| 83 | ✅ 0.3825 (+0.0000) | ✅ 0.3242 (+0.0000) | ✅ 0.741 (base 0.876) |
| 1000 | ✅ 0.1475 (+0.0000) | ✅ 0.1456 (+0.0000) | ✅ 40.009 (base 31.897) |
Per-backend × per-size matrix
| backend | size | recall@k (Δ) | MRR (Δ) | p99 (ms) |
|---|---|---|---|---|
| bm25 | 100 | ✅ 0.3825 (+0.0000) | ✅ 0.3399 (+0.0000) | ✅ 6.290 (base 5.642) |
| bm25 | 500 | ✅ 0.2250 (+0.0000) | ✅ 0.2165 (+0.0000) | ✅ 29.525 (base 27.538) |
| bm25 | 1000 | ✅ 0.1575 (+0.0000) | ✅ 0.1525 (+0.0000) | ✅ 88.189 (base 78.368) |
| fuzzy | 100 | ✅ 0.0000 (+0.0000) | ✅ 0.0000 (+0.0000) | ✅ 0.000 (base 0.000) |
| fuzzy | 500 | ✅ 0.0000 (+0.0000) | ✅ 0.0000 (+0.0000) | ✅ 0.000 (base 0.000) |
| fuzzy | 1000 | ✅ 0.0000 (+0.0000) | ✅ 0.0000 (+0.0000) | ✅ 0.000 (base 0.000) |
| tfidf | 100 | ✅ 0.3825 (+0.0000) | ✅ 0.3220 (+0.0000) | ✅ 1.035 (base 0.872) |
| tfidf | 500 | ✅ 0.2325 (+0.0000) | ✅ 0.2314 (+0.0000) | ✅ 9.078 (base 8.660) |
| tfidf | 1000 | ✅ 0.1475 (+0.0000) | ✅ 0.1456 (+0.0000) | ✅ 36.170 (base 30.071) |
Context pipeline (per scenario)
| scenario | tokens | dropped | dedup |
|---|---|---|---|
| large_catalog | 1514 (base 1514, Δ+0) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| long_conversation | 2548 (base 2548, Δ+0) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| short_conversation | 496 (base 496, Δ+0) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| stress_conversation | 6651 (base 6651, Δ+0) | 7 (base 7, Δ+0) | 4 (base 4, Δ+0) |
Numbers come from make benchmark / make benchmark-matrix.
Latency is hardware-dependent — treat the markers as a rough guide.
See benchmarks/scorecard.md for the full picture.
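
The ✅/⚠️ markers in the tables above can be read against the thresholds named in the commit message for scripts/benchmark_delta.py (latency × 1.30, accuracy -1pp). The following is a minimal sketch of that comparison, not the script's actual code: the function names, the strict inequalities, and the treatment of recall/MRR as fractions in [0, 1] are all assumptions made for illustration.

```python
# Thresholds quoted in the commit message; everything else here is hypothetical.
LATENCY_FACTOR = 1.30      # warn when head latency exceeds base * 1.30
ACCURACY_DROP_PP = 1.0     # warn when accuracy falls by more than 1 percentage point

OK, WARN = "✅", "⚠️"


def latency_marker(head_ms: float, base_ms: float) -> str:
    """Flag a latency regression beyond the tolerated slowdown factor."""
    return WARN if head_ms > base_ms * LATENCY_FACTOR else OK


def accuracy_marker(head: float, base: float) -> str:
    """Flag an accuracy drop (recall@k or MRR, as fractions) beyond the tolerance."""
    drop_pp = (base - head) * 100.0  # convert fraction difference to percentage points
    return WARN if drop_pp > ACCURACY_DROP_PP else OK


if __name__ == "__main__":
    # Values taken from the size-1000 row of the routing table above.
    print(accuracy_marker(0.1475, 0.1475), latency_marker(40.009, 31.897))
```

Under these assumptions the size-1000 p99 of 40.009 ms stays green because the tolerated ceiling is 31.897 × 1.30 ≈ 41.47 ms.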
… default_id param

- scripts/render_scorecard.py: update hardcoded '50 hand-curated queries' to '200' to match the actual routing_gold.json size
- docs/benchmarks.md: same stale reference fixed
- scripts/sweep_scoring.py: remove unused default_id parameter from _row_line() and its call site (dead code)
- benchmarks/scorecard.md: regenerated to reflect updated text
…, diff hint

- scripts/benchmark_delta.py: wrap JSON loading in try/except for FileNotFoundError and JSONDecodeError so CI gets a clean error instead of an unhandled traceback (F2)
- scripts/sweep_scoring.py: replace hardcoded ScoringConfig float comparisons with imported _DEFAULT_SCORING instance so the default detection stays in sync with the library (F3)
- scripts/render_scorecard.py: show line-diff count in --check mode error message so CI output is actionable (F4)
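
Fixes F2 and F4 in the commit above describe two small patterns: converting JSON load failures into a one-line error, and reporting how many lines differ when a committed file drifts from its regenerated form. A stdlib-only sketch of both follows. The helper names `load_json_or_exit` and `count_changed_lines` and the message wording are hypothetical; the actual scripts may structure this differently.

```python
import difflib
import json
import sys
from pathlib import Path


def load_json_or_exit(path: str) -> dict:
    """Load a results file; exit with a short message instead of a traceback."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except FileNotFoundError:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"error: results file not found: {path}")
    except json.JSONDecodeError as exc:
        sys.exit(f"error: {path} is not valid JSON: {exc.msg} (line {exc.lineno})")


def count_changed_lines(committed: str, rendered: str) -> int:
    """Count added plus removed lines between the committed and rendered text."""
    diff = difflib.ndiff(committed.splitlines(), rendered.splitlines())
    # ndiff prefixes removed lines with "- " and added lines with "+ ".
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))
```

A check mode built on this could exit non-zero with a message such as "scorecard out of date: N lines differ" whenever `count_changed_lines` returns a non-zero value.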
Summary
Ship the 5-issue "external-facing surface polish" bundle plus tokenizer
improvements and expanded evaluation infrastructure. The bundle addresses
discoverability, documentation, benchmarking, and reference-architecture gaps
identified during triage.
Fixes #181, #197, #198, #199, #200
Related: #208, #209, #213, #214, #215
Motivation
contextweaver's core pipeline was maturing but its external surface lagged:
no benchmark scorecard, no decision-tree for new users, no reference
architecture beyond the quickstart, no structured citation, and scattered
README metadata. Meanwhile the evaluation harness needed a stress scenario
that exercises budget-driven drops and dedup, a naïve-baseline comparison,
a gold-set expansion from 50→200 items, and a weight-sweep tool for
ScoringConfig tuning. The namespace-aware tokenizer rewrite (#213) also
landed in this bundle because it directly affects the routing recall measured
by the expanded benchmark.
Changes
User-facing documentation & discoverability (#199, #200)
should I use?"), linked from README and architecture/interop docs
engineering" phrasing; CITATION.cff; pyproject classifiers and URLs
Reference architecture (#198)
demonstrating bounded-choice routing, firewall on large log dumps,
and persistent facts across 6 turns. Includes catalog, scripted main,
README, and OUTPUT.md
`make architectures` umbrella target wired into `make example`
Benchmark scorecard & evaluation expansion (#181, #197, #208, #209, #213, #214, #215)
incident transcript forcing drops, dedup, and firewall ([eval] Expand long_conversation scenario to stress-test budget-driven selection #181)
`make scorecard` / `make scorecard-check` targets; `--check` mode now reports line-diff count on mismatch
PR comments; robust JSON error handling for missing/corrupt files
for ScoringConfig tuning; default detection uses
`ScoringConfig()` instance rather than hardcoded floats
benchmark-delta sticky comment
Tokenizer rewrite & routing cleanup (#213)
splits `slack:list_channels` into meaningful tokens via configurable separators, improving routing recall on namespaced tool catalogs
`replace(":"/"_"/"/" )` workaround now that the tokenizer handles compound names natively
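
To make the tokenizer behaviour concrete, here is a minimal sketch based only on the commit description: split on `.`, `-`, `/`, `:`, emit the compound plus its segments, and keep `_` as a non-separator. It is not the code in `_utils.py`. The names `tokenize_sketch`, `_WORD_RE`, and `_SEGMENT_SPLIT_RE` are hypothetical, and whether the real tokenizer lowercases input or also emits the colon-joined compound is an assumption.

```python
import re

# Hypothetical patterns; the real _OUTER_SPLIT_RE in _utils.py may differ.
_WORD_RE = re.compile(r"[A-Za-z0-9_.\-/:]+")   # a "word" may contain separators
_SEGMENT_SPLIT_RE = re.compile(r"[.\-/:]")     # underscore is deliberately absent


def tokenize_sketch(text: str) -> list[str]:
    """Emit each compound name once, followed by its per-segment sub-tokens."""
    tokens: list[str] = []
    for word in _WORD_RE.findall(text.lower()):
        word = word.strip(".-/:")  # drop stray punctuation such as a sentence-final "."
        if not word:
            continue
        segments = [s for s in _SEGMENT_SPLIT_RE.split(word) if s]
        if len(segments) > 1:
            tokens.append(word)    # keep the full compound for exact-name matches
        tokens.extend(segments)    # plus each segment for partial matches
    return tokens


if __name__ == "__main__":
    print(tokenize_sketch("slack:list_channels"))
```

With this sketch, `slack:list_channels` yields the compound, `slack`, and `list_channels`; the underscore-joined segment stays whole, matching the stated choice to keep `_` as a non-separator.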
Agent & CI docs
Scope
In scope: documentation, benchmarks, evaluation tooling, one reference
architecture, tokenizer improvement that directly feeds benchmark results.
Out of scope (deferred):
Testing & verification
- `ruff format --check` + `ruff check`: clean
- `mypy src/`: 54 files, no issues
- `pytest`: 768 passed, 8 skipped (+20 new: 12 scorecard, 8 Slack arch)
- `make example`: all scripts including Slack ops bot
- `make demo`: clean
- `make scorecard-check`: deterministic (no drift)
- `make llms-check`: up to date

Risks & rollback
Low risk. Changes are additive: new docs, new scripts, new benchmark
scenarios, new tests. The only `src/` changes are the tokenizer
(`_utils.py`) and router cleanup (`router.py`), both covered by existing …

Checklist
- `make ci` passes locally (fmt + lint + type + test + example + demo)
- `CHANGELOG.md` updated under `## [Unreleased]`