docs+bench: ship 5-issue user-facing surface bundle (#181, #197, #198, #199, #200) by dgenio · Pull Request #228 · dgenio/contextweaver

dgenio · 2026-05-16T06:42:47Z

Summary

Ship the 5-issue "external-facing surface polish" bundle plus tokenizer
improvements and expanded evaluation infrastructure. The bundle addresses
discoverability, documentation, benchmarking, and reference-architecture gaps
identified during triage.

Fixes #181, #197, #198, #199, #200
Related: #208, #209, #213, #214, #215

Motivation

contextweaver's core pipeline was maturing but its external surface lagged:
no benchmark scorecard, no decision-tree for new users, no reference
architecture beyond the quickstart, no structured citation, and scattered
README metadata. Meanwhile the evaluation harness needed a stress scenario
that exercises budget-driven drops and dedup, a naïve-baseline comparison,
a gold-set expansion from 50→200 items, and a weight-sweep tool for
ScoringConfig tuning. The namespace-aware tokenizer rewrite (#213) also
landed in this bundle because it directly improved routing recall measured
by the expanded benchmark.

Changes

User-facing documentation & discoverability (#199, #200)

docs/which_pattern.md — decision-tree landing page ("Which pattern
should I use?"), linked from README and architecture/interop docs
README.md — CI/PyPI/license/docs/Discussions badges; "context
engineering" phrasing; CITATION.cff; pyproject classifiers and URLs

Reference architecture (#198)

examples/architectures/slack_ops_bot/ — 48-tool Slack ops bot
demonstrating bounded-choice routing, firewall on large log dumps,
and persistent facts across 6 turns. Includes catalog, scripted main,
README, and OUTPUT.md
docs/architectures/ — index + Slack ops bot write-up
make architectures umbrella target wired into make example
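
The bounded-choice pattern the Slack ops bot demonstrates (the router narrows a large catalog to a short list, and the bot picks only from that list) can be sketched roughly as below. This is an illustration under stated assumptions, not contextweaver's Router API: `shortlist`, the word-overlap score, and the four-entry catalog are all hypothetical.

```python
def shortlist(query: str, catalog: dict[str, str], k: int = 3) -> list[str]:
    """Return the ids of the k tools whose descriptions share the most words with the query."""
    query_words = set(query.lower().split())
    scored = []
    for tool_id, description in catalog.items():
        overlap = len(query_words & set(description.lower().split()))
        # Negate the overlap so an ascending sort puts the best match first;
        # the tool id breaks ties deterministically.
        scored.append((-overlap, tool_id))
    scored.sort()
    return [tool_id for _, tool_id in scored[:k]]


# Hypothetical catalog; the real example ships a 48-tool catalog.yaml.
catalog = {
    "slack:list_channels": "list channels in the workspace",
    "slack:post_message": "post a message to a channel",
    "logs:tail": "tail recent log lines for a service",
    "pager:ack": "acknowledge a paging alert",
}
choices = shortlist("post a message to the incident channel", catalog, k=2)
```

The downstream model would then be shown only `choices`, which keeps the decision space small no matter how large the catalog grows.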

Benchmark scorecard & evaluation expansion (#181, #197, #208, #209, #213, #214, #215)

benchmarks/scenarios/stress_conversation.jsonl — 30-turn SEV2
incident transcript forcing drops, dedup, and firewall ([eval] Expand long_conversation scenario to stress-test budget-driven selection #181)
scripts/render_scorecard.py — deterministic stdlib-only renderer;
make scorecard / make scorecard-check targets; --check mode now
reports line-diff count on mismatch
scripts/benchmark_delta.py — head-vs-base delta renderer for sticky
PR comments; robust JSON error handling for missing/corrupt files
scripts/sweep_scoring.py — weight-sweep grid search (243 configs)
for ScoringConfig tuning; default detection uses ScoringConfig() instance
rather than hardcoded floats
scripts/baseline_naive.py — naïve-concat baseline comparison
benchmarks/routing_gold.json — expanded from 50→200 gold-set items
.github/workflows/ci.yml — scorecard-check gating step;
benchmark-delta sticky comment
.github/workflows/scorecard-weekly.yml — weekly scorecard regen cron
.github/prompts/add-eval.prompt.md — eval-expansion prompt for agents
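
As a rough illustration of what a deterministic renderer with a `--check` mode involves, the sketch below renders rows in sorted order and counts changed lines between a committed file and a fresh render. It is not the code in scripts/render_scorecard.py; `render`, `count_line_diffs`, and the two-column table are assumptions made for the example.

```python
import difflib


def render(results: dict[str, int]) -> str:
    """Render a results dict as a markdown table with a stable (sorted) row order."""
    lines = ["| scenario | tokens |", "| --- | --- |"]
    for name in sorted(results):
        lines.append(f"| {name} | {results[name]} |")
    return "\n".join(lines) + "\n"


def count_line_diffs(committed: str, fresh: str) -> int:
    """Count lines that were added or removed between the two renderings."""
    diff = difflib.ndiff(committed.splitlines(), fresh.splitlines())
    # ndiff prefixes removed lines with "- " and added lines with "+ ";
    # "? " hint lines and unchanged lines are ignored.
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


committed = render({"short": 496, "long": 2548})
same = render({"long": 2548, "short": 496})      # same data, different insertion order
drifted = render({"short": 496, "long": 2600})   # one value changed
```

A check mode along these lines would exit non-zero when `count_line_diffs` is greater than zero and print the count so CI output points at the size of the drift.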

Tokenizer rewrite & routing cleanup (#213)

src/contextweaver/_utils.py — namespace-aware compound tokenizer:
splits slack:list_channels into meaningful tokens via configurable
separators, improving routing recall on namespaced tool catalogs
src/contextweaver/routing/router.py — removed legacy
replace(":"/"_"/"/" ) workaround now that the tokenizer handles
compound names natively
tests/test_utils.py — 15 new tests covering the tokenizer rewrite
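
The tokenizer behaviour described in this PR (split on `.`, `-`, `/`, `:`; emit the compound plus per-segment sub-tokens; keep underscore as a non-separator) can be sketched as follows. This is a minimal approximation written for illustration, not the code in src/contextweaver/_utils.py; the regexes and the function body are assumptions.

```python
import re

# Separators that split a compound name; underscore is deliberately absent.
_SEPARATORS = re.compile(r"[.\-/:]")
# A "word" may contain letters, digits, underscore, and the separators above.
_WORD = re.compile(r"[A-Za-z0-9_.\-/:]+")


def tokenize(text: str) -> list[str]:
    """Emit each compound token followed by its separator-delimited segments."""
    tokens: list[str] = []
    for word in _WORD.findall(text.lower()):
        word = word.strip(".-/:")  # drop sentence punctuation such as a trailing period
        if not word:
            continue
        tokens.append(word)
        parts = [part for part in _SEPARATORS.split(word) if part]
        if len(parts) > 1:  # only compounds contribute extra sub-tokens
            tokens.extend(parts)
    return tokens


namespaced = tokenize("slack:list_channels")
mixed = tokenize("Call github.repos/list-issues now.")
```

Under these assumptions a query for "slack" can match `slack:list_channels` through the `slack` sub-token, while `list_channels` stays intact because underscore does not split.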

Agent & CI docs

AGENTS.md — updated command table for new make targets
docs/agent-context/workflows.md — added eval-expansion workflow
docs/benchmarks.md — methodology page for the scorecard
llms-full.txt — regenerated

Scope

In scope: documentation, benchmarks, evaluation tooling, one reference
architecture, tokenizer improvement that directly feeds benchmark results.

Out of scope (deferred):

Social-preview PNG (requires GitHub UI upload)
Code-review bot and voice-agent reference architectures
Per-backend benchmark matrix in CI (tracked as follow-up)
ANN/graph retriever integration

Testing & verification

ruff format --check + ruff check — clean
mypy src/ — 54 files, no issues
pytest — 768 passed, 8 skipped (+20 new: 12 scorecard, 8 Slack arch)
make example — all scripts including Slack ops bot
make demo — clean
make scorecard-check — deterministic (no drift)
make llms-check — up to date

Risks & rollback

Low risk. Changes are additive: new docs, new scripts, new benchmark
scenarios, new tests. The only src/ changes are the tokenizer
(_utils.py) and router cleanup (router.py), both covered by existing and
new tests. Rollback: revert the merge commit.

Checklist

Tests added or updated for every new/changed public function
make ci passes locally (fmt + lint + type + test + example + demo)
CHANGELOG.md updated under ## [Unreleased]
Docstrings added for all new public APIs (Google-style)
Every modified module stays ≤ 300 lines (or a decomposition issue is linked above)
Related issue linked in the summary above

…#199, #200)

Lands the five-issue "external-facing surface polish" group selected by the triage pass. No changes to src/contextweaver/ pipeline code.

#200: Discoverability polish
- README badges (CI, PyPI, Python versions, license, docs, Discussions)
- "context engineering" as a secondary phrase in intro + Problem
- CITATION.cff at repo root (CFF v1.2)
- pyproject Documentation/Issues/Discussions URLs, Framework :: AsyncIO and Topic :: Scientific/Engineering :: AI classifiers, extended keywords
- Social-preview PNG deferred (needs GitHub UI step)

#199: Decision-tree landing page
- docs/which_pattern.md branches on user symptom -> one concrete next step
- Linked from README Quickstart and top of architecture.md / interop.md
- mkdocs nav entry added

#198: Production reference architectures cookbook (Slack ops bot)
- examples/architectures/slack_ops_bot/ with 48-tool catalog.yaml, 6-turn scripted main.py, README, OUTPUT.md
- Demonstrates bounded-choice pattern (Router narrows 48 -> 3; bot picks from shortlist), firewall on 34 KB log dump, persistent facts across turns
- docs/architectures/index.md + slack_ops_bot.md pages
- make architectures umbrella target wired into make example
- Code-review bot + voice agent deferred to follow-up issues

#197: Benchmark scorecard
- scripts/render_scorecard.py: stdlib-only deterministic renderer
- make scorecard / make scorecard-check targets
- benchmarks/scorecard.md committed, regeneratable in one command
- docs/benchmarks.md methodology page
- README "Why Trust" gains scorecard link
- CI: new gating scorecard-check step (runs before the non-gating benchmark step so it validates the committed pair, not the runner's fresh numbers)
- Per-backend matrix and weekly CI regen tracked as follow-ups

#181: Stress-test benchmark scenario
- benchmarks/scenarios/stress_conversation.jsonl: SEV2 incident-response transcript, 30 user turns, 3 large tool results (firewall fires), 4 near-duplicate pairs (dedup fires), total tokens push past the 6000-token answer budget so items_dropped > 0
- benchmarks/results/latest.json regenerated; benchmark README baseline metrics table refreshed and corrected the stale "git-ignored" note

Tests
- tests/test_render_scorecard.py: 12 tests covering renderer determinism, routing/context table formatting, missing-field validation, --check mode
- tests/test_architectures_slack.py: 8 tests pinning catalog size, intent match count, firewall trigger, persisted facts, final prompt rendering

Verification
- ruff format --check src/ tests/ examples/ scripts/: clean
- ruff check src/ tests/ examples/ scripts/: all checks passed
- mypy src/: 54 files, no issues
- pytest: 768 passed, 8 skipped (was 748 baseline; +20 new tests)
- make example: all scripts including Slack ops bot run cleanly
- make demo: clean
- make scorecard-check: deterministic (no drift)
- make llms-check: up to date

https://claude.ai/code/session_01HeDDydF9mkCZfBx6tYo8Jj
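
To make the stress scenario's two mechanisms concrete (dedup firing on repeated content, and items being dropped once the budget is exhausted), here is a small sketch. It is not the contextweaver pipeline: `select`, the normalized-text dedup key, the newest-first fill order, and the whitespace word count used as a token estimate are all assumptions for illustration.

```python
def select(items: list[str], budget: int) -> tuple[list[str], int, int]:
    """Return (kept items in original order, dedup count, dropped count)."""
    seen: set[str] = set()
    unique: list[tuple[int, str]] = []
    deduped = 0
    for index, text in enumerate(items):
        key = " ".join(text.lower().split())  # case- and whitespace-insensitive key
        if key in seen:
            deduped += 1  # a repeat of something already seen
            continue
        seen.add(key)
        unique.append((index, text))

    kept: list[tuple[int, str]] = []
    used = 0
    dropped = 0
    for index, text in reversed(unique):  # fill the budget newest-first
        cost = len(text.split())  # word count stands in for a real token count
        if used + cost <= budget:
            kept.append((index, text))
            used += cost
        else:
            dropped += 1  # does not fit in the remaining budget
    kept.sort()  # restore chronological order
    return [text for _, text in kept], deduped, dropped


transcript = ["db is down", "DB  is down", "restart the primary database now", "ack"]
kept, deduped, dropped = select(transcript, budget=7)
```

With a budget smaller than the deduplicated transcript, `dropped` becomes positive, which is the condition the stress scenario is built to trigger.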

…nit) Address the CFF 1.2 best-practice nit from the PR #203 review. Adds `date-released: 2026-05-11` (matches the 0.3.0 release date in CHANGELOG) and an `identifiers:` entry with a Zenodo DOI placeholder. Both are optional for GitHub's "Cite this repository" button to render but recommended by the spec — and trivial to add now so they don't get forgotten when the project does get a real DOI. The DOI value is intentionally a placeholder (`10.5281/zenodo.XXXXXXX`) and is marked as such in the inline description; replace once a real Zenodo archive of a tagged release exists. https://claude.ai/code/session_01HeDDydF9mkCZfBx6tYo8Jj

…208, #209, #213, #215)

- _utils.tokenize: split on `.`, `-`, `/`, `:` (single source of truth); emit dotted/hyphenated/slashed compound + per-segment sub-tokens; underscore intentionally kept as non-separator (synthetic catalog crosstalk noise; see _OUTER_SPLIT_RE docstring rationale captured during #213 measurement). Closes the dead-code replace(":"/" "/"_"/"/") workaround in routing/router.py.
- benchmarks/benchmark.py: gain `--matrix`, `--backends`, `--sizes`, and `--no-naive-delta` flags. Emit additive `routing_matrix` (per-backend × per-size, #208), `routing_per_namespace` (#209), and per-row `naive_delta` (#215) keys. Legacy `routing` and `context` keys preserved so PR #203's scorecard renderer keeps working.
- scripts/baseline_naive.py: new stdlib-only naïve-concat baseline. Exposes `compute_naive_delta` for benchmark.py and a CLI for annotating an existing latest.json in place.
- scripts/render_scorecard.py: render the new matrix, per-namespace, and naive-delta sections additively when the JSON carries them.

Measured impact (50-query gold, current synthetic catalog):

routing summary   : -0pp at 50, 83, 1000 vs v0.3.0 baseline
matrix tfidf 100  : 0.6100 (==)
matrix tfidf 500  : 0.5300 (-2pp; 1-query noise floor)
matrix tfidf 1000 : 0.3100 (==)

#208 #209 #213 #215

…mpts, weekly cron

Lands the remaining 5 eval-substrate issues on top of the matrix/baseline/tokenizer work in the previous commit:

- #209: routing_gold.json expanded 50 → 200 queries, 25 per namespace, with explicit `namespace` field. Catalog-validated. Reveals more honest recall numbers (the original 50 were curated easy wins).
- #211: scripts/benchmark_delta.py renders head-vs-base markdown with shared ✅/⚠️ markers (latency × 1.30, accuracy -1pp). CI job `benchmark-comment` posts a sticky one-per-PR comment via peter-evans/find-comment + create-or-update-comment. Adds "Reproducibility" subsection to the PR template (encouraged, not required).
- #214: scripts/sweep_scoring.py + Makefile target + initial benchmarks/sweep_scoring.md. Full 243-cell grid (3⁵), composite formula documented in the report header. Default config rank reported; Pareto candidates surfaced as follow-up only, defaults intentionally untouched.
- #216: .github/prompts/add-eval.prompt.md matching the structure of the other three prompts. Cross-linked from AGENTS.md "Adding a Feature" and docs/agent-context/workflows.md.
- #207: .github/workflows/scorecard-weekly.yml runs Mon 06:00 UTC, regenerates latest.json + scorecard.md, opens a PR via peter-evans/create-pull-request on drift.

Plus README.md naive-baseline footnote with measured token reduction (42–75% range, 55.8% average across the four scenarios; replaces the unsubstantiated "70% lower cost" claim per #215).

#207 #209 #211 #214 #216
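
The 243-cell grid mentioned for the weight sweep is three levels for each of five weights (3⁵ = 243). A minimal sketch of building such a grid and reporting where one configuration ranks is below. The field names, the three levels, and the scoring function are placeholders; the source does not list ScoringConfig's fields or the composite formula, so nothing here should be read as the actual sweep_scoring.py logic.

```python
from itertools import product
from typing import Callable

# Placeholder names and levels, chosen only to make the example runnable.
FIELDS = ("weight_a", "weight_b", "weight_c", "weight_d", "weight_e")
LEVELS = (0.5, 1.0, 1.5)

Config = dict[str, float]


def build_grid() -> list[Config]:
    """Enumerate every combination of LEVELS across FIELDS."""
    return [dict(zip(FIELDS, combo)) for combo in product(LEVELS, repeat=len(FIELDS))]


def rank_of(target: Config, grid: list[Config], score: Callable[[Config], float]) -> int:
    """Return the 1-based rank of target when the grid is sorted by descending score."""
    ordered = sorted(grid, key=score, reverse=True)
    return ordered.index(target) + 1


grid = build_grid()
middle = {name: 1.0 for name in FIELDS}
best = {name: 1.5 for name in FIELDS}
# Toy objective: sum of weights. A real sweep would run the benchmark per cell.
top_rank = rank_of(best, grid, score=lambda cfg: sum(cfg.values()))
```

Reporting the rank of the default configuration, rather than silently adopting the top cell, matches the commit's stated intent of leaving defaults untouched.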

Copilot

Pull request overview

Lands a five-issue "external-facing surface polish" bundle: discoverability metadata, a decision-tree landing page, a Slack-ops-bot reference architecture, a reproducible benchmark scorecard, and a stress-test scenario that exercises budget-driven drops and dedup. No changes to src/contextweaver/ pipeline code beyond a tokeniser refactor referenced by the gold-set expansion.

Changes:

Discoverability: README badges + "context engineering" phrasing, CITATION.cff, expanded pyproject.toml URLs/classifiers/keywords.
Docs: new docs/which_pattern.md decision tree, docs/benchmarks.md, docs/architectures/ pages; cross-linked from architecture/interop/cookbook and mkdocs nav.
Tooling/benchmarks: scripts/render_scorecard.py, scripts/benchmark_delta.py, scripts/baseline_naive.py, scripts/sweep_scoring.py; new make targets (scorecard, scorecard-check, benchmark-matrix, sweep-scoring, architectures); CI gating scorecard-check step and sticky benchmark-delta comment job; weekly scorecard cron; new stress_conversation.jsonl scenario; namespace-aware tokenizer (#213) with router cleanup; expanded 200-query gold set; Slack ops bot architecture under examples/architectures/.

Reviewed changes

Copilot reviewed 45 out of 45 changed files in this pull request and generated 4 comments.

File	Description
README.md, llms.txt, llms-full.txt	Badges, "context engineering" phrasing, scorecard link, decision-tree link, examples table addition.
pyproject.toml	New URLs, classifiers, keywords for PyPI discoverability.
CITATION.cff	New CFF v1.2 file for "Cite this repository" button.
mkdocs.yml	Adds Which-pattern, Architectures, Benchmarks nav entries.
Makefile, AGENTS.md, docs/agent-context/workflows.md	New make targets (`scorecard*`, `benchmark-matrix`, `sweep-scoring`, `architectures`), `scripts/` added to lint/fmt scope.
docs/which_pattern.md, docs/benchmarks.md, docs/architectures/{index,slack_ops_bot}.md	New documentation pages.
docs/architecture.md, docs/interop.md, docs/cookbook.md	Cross-links to the new decision-tree and architectures pages.
examples/architectures/slack_ops_bot/{main.py,catalog.yaml,README.md,OUTPUT.md,__init__.py}, examples/architectures/__init__.py	New Slack ops bot reference architecture.
scripts/render_scorecard.py, scripts/benchmark_delta.py, scripts/baseline_naive.py, scripts/sweep_scoring.py	New stdlib-only renderer, PR-delta generator, naive baseline, and weight-sweep tool.
scripts/gen_llms.py, scripts/weaver_spec_conformance.py	Formatting/cleanup.
benchmarks/benchmark.py	`--matrix`, `--backends`, `--sizes`, `--no-naive-delta` flags; per-backend/per-namespace cells; v1.1 schema bump.
benchmarks/scorecard.md, benchmarks/sweep_scoring.md, benchmarks/results/latest.json, benchmarks/README.md, benchmarks/routing_gold.json	Generated/committed scorecard outputs and expanded 200-query gold set.
src/contextweaver/_utils.py, src/contextweaver/routing/router.py, tests/test_utils.py	Namespace-aware tokenizer with retained underscore behavior and tests.
tests/test_render_scorecard.py, tests/test_architectures_slack.py	New test files for the renderer and the Slack ops bot.
.github/workflows/{ci.yml,scorecard-weekly.yml}, .github/pull_request_template.md, .github/prompts/add-eval.prompt.md	CI gating scorecard-check, sticky benchmark-delta comment, weekly regen cron, eval prompt.
CHANGELOG.md	Documents the bundled changes.

github-actions · 2026-05-16T06:46:16Z

Benchmark delta (vs `main`)

Soft regression feedback only — this comment never blocks the PR.
Latency budget: ⚠️ when head > base × 1.3. Accuracy budget: ⚠️ when head < base - 1pp.
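
The two budget rules above translate directly into threshold checks. The sketch below is one way to express them, assuming recall and MRR are on a 0 to 1 scale so that 1pp equals 0.01; the function names are hypothetical and this is not the code in scripts/benchmark_delta.py.

```python
WARN = "⚠️"
OK = "✅"


def latency_marker(head_ms: float, base_ms: float, factor: float = 1.3) -> str:
    """Warn when head latency exceeds base latency times the allowed factor."""
    return WARN if head_ms > base_ms * factor else OK


def accuracy_marker(head: float, base: float, tolerance: float = 0.01) -> str:
    """Warn when head accuracy falls more than `tolerance` below base."""
    return WARN if head < base - tolerance else OK


# Size-1000 row from the routing summary: 40.009 ms vs base 31.897 ms.
p99_status = latency_marker(40.009, 31.897)
recall_status = accuracy_marker(0.1475, 0.1475)
```

The size-1000 p99 is about 25% above base, which stays under the 1.3× threshold and therefore still shows ✅.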

Routing summary (single backend × catalog sizes)

size	recall@k (head Δ vs base)	MRR (head Δ vs base)	p99 (ms)
50	✅ 0.5649 (+0.0000)	✅ 0.4978 (+0.0000)	✅ 0.498 (base 0.463)
83	✅ 0.3825 (+0.0000)	✅ 0.3242 (+0.0000)	✅ 0.741 (base 0.876)
1000	✅ 0.1475 (+0.0000)	✅ 0.1456 (+0.0000)	✅ 40.009 (base 31.897)
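
For readers unfamiliar with the two routing metrics in the table, the sketch below uses their standard definitions. The benchmark's exact definitions (for example, the value of k or how queries with several gold tools are handled) are not stated here, so treat this as an assumption rather than a description of benchmarks/benchmark.py.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top k of a ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / position of the first relevant item, or 0.0 if none is ranked."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def mean(values: list[float]) -> float:
    """Average over queries; MRR is the mean of per-query reciprocal ranks."""
    return sum(values) / len(values) if values else 0.0


# Hypothetical single query: two gold tools, one of them found in the top 3.
ranking = ["tool_a", "tool_b", "tool_c", "tool_d"]
gold = {"tool_b", "tool_d"}
query_recall = recall_at_k(ranking, gold, k=3)
query_rr = reciprocal_rank(ranking, gold)
```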

Per-backend × per-size matrix

backend	size	recall@k (Δ)	MRR (Δ)	p99 (ms)
bm25	100	✅ 0.3825 (+0.0000)	✅ 0.3399 (+0.0000)	✅ 6.290 (base 5.642)
bm25	500	✅ 0.2250 (+0.0000)	✅ 0.2165 (+0.0000)	✅ 29.525 (base 27.538)
bm25	1000	✅ 0.1575 (+0.0000)	✅ 0.1525 (+0.0000)	✅ 88.189 (base 78.368)
fuzzy	100	✅ 0.0000 (+0.0000)	✅ 0.0000 (+0.0000)	✅ 0.000 (base 0.000)
fuzzy	500	✅ 0.0000 (+0.0000)	✅ 0.0000 (+0.0000)	✅ 0.000 (base 0.000)
fuzzy	1000	✅ 0.0000 (+0.0000)	✅ 0.0000 (+0.0000)	✅ 0.000 (base 0.000)
tfidf	100	✅ 0.3825 (+0.0000)	✅ 0.3220 (+0.0000)	✅ 1.035 (base 0.872)
tfidf	500	✅ 0.2325 (+0.0000)	✅ 0.2314 (+0.0000)	✅ 9.078 (base 8.660)
tfidf	1000	✅ 0.1475 (+0.0000)	✅ 0.1456 (+0.0000)	✅ 36.170 (base 30.071)

Context pipeline (per scenario)

scenario	tokens	dropped	dedup
large_catalog	1514 (base 1514, Δ+0)	0 (base 0, Δ+0)	0 (base 0, Δ+0)
long_conversation	2548 (base 2548, Δ+0)	0 (base 0, Δ+0)	0 (base 0, Δ+0)
short_conversation	496 (base 496, Δ+0)	0 (base 0, Δ+0)	0 (base 0, Δ+0)
stress_conversation	6651 (base 6651, Δ+0)	7 (base 7, Δ+0)	4 (base 4, Δ+0)

Numbers come from make benchmark / make benchmark-matrix.
Latency is hardware-dependent — treat the markers as a rough guide.
See benchmarks/scorecard.md for the full picture.

… default_id param

- scripts/render_scorecard.py: update hardcoded '50 hand-curated queries' to '200' to match the actual routing_gold.json size
- docs/benchmarks.md: same stale reference fixed
- scripts/sweep_scoring.py: remove unused default_id parameter from _row_line() and its call site (dead code)
- benchmarks/scorecard.md: regenerated to reflect updated text

…, diff hint

- scripts/benchmark_delta.py: wrap JSON loading in try/except for FileNotFoundError and JSONDecodeError so CI gets a clean error instead of an unhandled traceback (F2)
- scripts/sweep_scoring.py: replace hardcoded ScoringConfig float comparisons with imported _DEFAULT_SCORING instance so the default detection stays in sync with the library (F3)
- scripts/render_scorecard.py: show line-diff count in --check mode error message so CI output is actionable (F4)

claude and others added 7 commits May 15, 2026 11:33

merge: resolve conflicts with main (scorecard+weaver-spec ci steps, P…

2c0b364

…HONY targets, CHANGELOG entries)

fix(lint): replace Any with object in _spec_to_jsonable (ANN401)

b720fde

fix(ci): add --retry 3 --retry-delay 2 to curl in Weaver-spec conform…

97c3263

…ance step

Copilot AI review requested due to automatic review settings May 16, 2026 06:42

Copilot started reviewing on behalf of dgenio May 16, 2026 06:43 View session

Copilot AI reviewed May 16, 2026

Comment thread scripts/render_scorecard.py Outdated

Comment thread docs/benchmarks.md Outdated

Comment thread scripts/sweep_scoring.py Outdated

Comment thread README.md

dgenio added 2 commits May 16, 2026 10:57

dgenio merged commit bfcbe0e into main May 16, 2026
4 checks passed

dgenio deleted the claude/triage-issues-hccyU branch May 16, 2026 10:24

This was referenced May 17, 2026

fix(benchmarks,scripts): rescue 2 Copilot fixes from duplicate PR #229 #235

Open

feat(eval): scorecard maturation — matrix + gold expansion + naive + sweep + CI (#207, #208, #209, #211, #213, #214, #215) #229

Closed

dgenio merged 9 commits into main from claude/triage-issues-hccyU

3 participants