
docs+bench: ship 5-issue user-facing surface bundle (#181, #197, #198, #199, #200) #228

Merged
dgenio merged 9 commits into main from claude/triage-issues-hccyU
May 16, 2026

@dgenio dgenio commented May 16, 2026

Summary

Ship the 5-issue "external-facing surface polish" bundle plus tokenizer
improvements and expanded evaluation infrastructure. The bundle addresses
discoverability, documentation, benchmarking, and reference-architecture gaps
identified during triage.

Fixes #181, #197, #198, #199, #200
Related: #208, #209, #213, #214, #215

Motivation

contextweaver's core pipeline was maturing but its external surface lagged:
no benchmark scorecard, no decision-tree for new users, no reference
architecture beyond the quickstart, no structured citation, and scattered
README metadata. Meanwhile the evaluation harness needed a stress scenario
that exercises budget-driven drops and dedup, a naïve-baseline comparison,
a gold-set expansion from 50→200 items, and a weight-sweep tool for
ScoringConfig tuning. The namespace-aware tokenizer rewrite (#213) also
landed in this bundle because it directly improved routing recall measured
by the expanded benchmark.

Changes

User-facing documentation & discoverability (#199, #200)

  • docs/which_pattern.md — decision-tree landing page ("Which pattern
    should I use?"), linked from README and architecture/interop docs
  • README.md — CI/PyPI/license/docs/Discussions badges; "context
    engineering" phrasing; CITATION.cff; pyproject classifiers and URLs

Reference architecture (#198)

  • examples/architectures/slack_ops_bot/ — 48-tool Slack ops bot
    demonstrating bounded-choice routing, firewall on large log dumps,
    and persistent facts across 6 turns. Includes catalog, scripted main,
    README, and OUTPUT.md
  • docs/architectures/ — index + Slack ops bot write-up
  • make architectures umbrella target wired into make example

Benchmark scorecard & evaluation expansion (#181, #197, #208, #209, #213, #214, #215)

  • benchmarks/scenarios/stress_conversation.jsonl — 30-turn SEV2
    incident transcript forcing drops, dedup, and firewall (#181)
  • scripts/render_scorecard.py — deterministic stdlib-only renderer;
    make scorecard / make scorecard-check targets; --check mode now
    reports line-diff count on mismatch
  • scripts/benchmark_delta.py — head-vs-base delta renderer for sticky
    PR comments; robust JSON error handling for missing/corrupt files
  • scripts/sweep_scoring.py — weight-sweep grid search (243 configs)
    for ScoringConfig tuning; default detection uses ScoringConfig() instance
    rather than hardcoded floats
  • scripts/baseline_naive.py — naïve-concat baseline comparison
  • benchmarks/routing_gold.json — expanded from 50→200 gold-set items
  • .github/workflows/ci.yml — scorecard-check gating step;
    benchmark-delta sticky comment
  • .github/workflows/scorecard-weekly.yml — weekly scorecard regen cron
  • .github/prompts/add-eval.prompt.md — eval-expansion prompt for agents
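
The 243-config weight sweep listed above corresponds to a full grid of three candidate values over five weights (3⁵). A minimal sketch of that enumeration follows, together with default detection that compares against a default instance instead of hard-coded floats. `SketchScoringConfig`, its field names, and the candidate values are hypothetical stand-ins; the real `ScoringConfig` fields and the script's actual grid may differ.

```python
from dataclasses import dataclass, fields
from itertools import product


@dataclass(frozen=True)
class SketchScoringConfig:
    """Hypothetical stand-in for ScoringConfig; field names are invented."""

    w1: float = 1.0
    w2: float = 1.0
    w3: float = 1.0
    w4: float = 1.0
    w5: float = 1.0


# Three candidate values per weight (illustrative, not the script's values).
CANDIDATES = (0.5, 1.0, 1.5)


def build_grid() -> list[SketchScoringConfig]:
    """Enumerate every combination: 3 values ** 5 weights = 243 configs."""
    n_weights = len(fields(SketchScoringConfig))
    return [
        SketchScoringConfig(*combo)
        for combo in product(CANDIDATES, repeat=n_weights)
    ]


def is_default(config: SketchScoringConfig) -> bool:
    """Compare against a fresh default instance, not hard-coded floats."""
    return config == SketchScoringConfig()


grid = build_grid()
default_cells = [cfg for cfg in grid if is_default(cfg)]
print(len(grid), len(default_cells))  # 243 1
```

Comparing against `SketchScoringConfig()` means the "default" row stays correct if the defaults change, which is the motivation given for the `ScoringConfig()` change in this PR.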

Tokenizer rewrite & routing cleanup (#213)

  • src/contextweaver/_utils.py — namespace-aware compound tokenizer:
    splits slack:list_channels into meaningful tokens via configurable
    separators, improving routing recall on namespaced tool catalogs
  • src/contextweaver/routing/router.py — removed legacy
    replace(":"/"_"/"/" ) workaround now that the tokenizer handles
    compound names natively
  • tests/test_utils.py — 15 new tests covering the tokenizer rewrite
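
To illustrate the tokenizer behaviour described above, here is a minimal sketch, assuming the separator set named in the commit notes (`.`, `-`, `/`, `:`) and underscore kept as a non-separator. `tokenize_sketch` and both regexes are illustrative names, not the code in `src/contextweaver/_utils.py`.

```python
import re

# Separators assumed from the commit notes: '.', '-', '/', ':'.
# Underscore is deliberately NOT a separator in this sketch.
_SEPARATOR_RE = re.compile(r"[.\-/:]")
# A compound is one or more word segments joined by single separators.
_COMPOUND_RE = re.compile(r"[a-z0-9_]+(?:[.\-/:][a-z0-9_]+)*")


def tokenize_sketch(text: str) -> list[str]:
    """Return each compound token followed by its per-segment sub-tokens."""
    tokens: list[str] = []
    for compound in _COMPOUND_RE.findall(text.lower()):
        tokens.append(compound)
        segments = _SEPARATOR_RE.split(compound)
        if len(segments) > 1:
            # Emit the sub-tokens so "slack" alone can match "slack:..." ids.
            tokens.extend(segments)
    return tokens


print(tokenize_sketch("slack:list_channels"))
# ['slack:list_channels', 'slack', 'list_channels']
```

Because both the compound and its segments are emitted, a query mentioning only `slack` can still overlap with a namespaced tool id, which is the recall effect the PR attributes to the rewrite.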

Agent & CI docs

  • AGENTS.md — updated command table for new make targets
  • docs/agent-context/workflows.md — added eval-expansion workflow
  • docs/benchmarks.md — methodology page for the scorecard
  • llms-full.txt — regenerated

Scope

In scope: documentation, benchmarks, evaluation tooling, one reference
architecture, tokenizer improvement that directly feeds benchmark results.

Out of scope (deferred):

  • Social-preview PNG (requires GitHub UI upload)
  • Code-review bot and voice-agent reference architectures
  • Per-backend benchmark matrix in CI (tracked as follow-up)
  • ANN/graph retriever integration

Testing & verification

  • ruff format --check + ruff check — clean
  • mypy src/ — 54 files, no issues
  • pytest — 768 passed, 8 skipped (+20 new: 12 scorecard, 8 Slack arch)
  • make example — all scripts including Slack ops bot
  • make demo — clean
  • make scorecard-check — deterministic (no drift)
  • make llms-check — up to date

Risks & rollback

Low risk. Changes are additive: new docs, new scripts, new benchmark
scenarios, new tests. The only src/ changes are the tokenizer
(_utils.py) and router cleanup (router.py), both covered by existing and
new tests. Rollback: revert the merge commit.

Checklist

  • Tests added or updated for every new/changed public function
  • make ci passes locally (fmt + lint + type + test + example + demo)
  • CHANGELOG.md updated under ## [Unreleased]
  • Docstrings added for all new public APIs (Google-style)
  • Every modified module stays ≤ 300 lines (or a decomposition issue is linked above)
  • Related issue linked in the summary above

claude and others added 7 commits May 15, 2026 11:33
…#199, #200)

Lands the five-issue "external-facing surface polish" group selected by the
triage pass. No changes to src/contextweaver/ pipeline code.

#200 — Discoverability polish
  - README badges (CI, PyPI, Python versions, license, docs, Discussions)
  - "context engineering" as a secondary phrase in intro + Problem
  - CITATION.cff at repo root (CFF v1.2)
  - pyproject Documentation/Issues/Discussions URLs, Framework :: AsyncIO
    and Topic :: Scientific/Engineering :: AI classifiers, extended keywords
  - Social-preview PNG deferred (needs GitHub UI step)

#199 — Decision-tree landing page
  - docs/which_pattern.md branches on user symptom -> one concrete next step
  - Linked from README Quickstart and top of architecture.md / interop.md
  - mkdocs nav entry added

#198 — Production reference architectures cookbook (Slack ops bot)
  - examples/architectures/slack_ops_bot/ with 48-tool catalog.yaml,
    6-turn scripted main.py, README, OUTPUT.md
  - Demonstrates bounded-choice pattern (Router narrows 48 -> 3; bot picks
    from shortlist), firewall on 34 KB log dump, persistent facts across turns
  - docs/architectures/index.md + slack_ops_bot.md pages
  - make architectures umbrella target wired into make example
  - Code-review bot + voice agent deferred to follow-up issues
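
The bounded-choice pattern mentioned above (narrow a large catalog to a shortlist, then let the bot pick only from that shortlist) can be sketched as follows. This is a toy word-overlap ranker, not contextweaver's Router; `shortlist` and every catalog entry except `slack:list_channels` are hypothetical.

```python
def shortlist(query: str, catalog: dict[str, str], k: int = 3) -> list[str]:
    """Return up to k tool ids whose descriptions share words with the query."""
    query_words = set(query.lower().split())
    scored = []
    for tool_id, description in catalog.items():
        overlap = len(query_words & set(description.lower().split()))
        scored.append((overlap, tool_id))
    # Highest overlap first; ties broken by tool id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [tool_id for overlap, tool_id in scored[:k] if overlap > 0]


# Hypothetical mini-catalog; the real example ships 48 tools in catalog.yaml.
catalog = {
    "slack:list_channels": "list slack channels",
    "slack:post_message": "post a message to a slack channel",
    "pagerduty:ack": "acknowledge pagerduty incident",
    "logs:tail": "tail service logs",
    "deploy:rollback": "rollback a deploy",
}

print(shortlist("list slack channels", catalog))
# ['slack:list_channels', 'slack:post_message']
```

The downstream model then only sees the shortlist, so its choice is bounded no matter how large the full catalog grows.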

#197 — Benchmark scorecard
  - scripts/render_scorecard.py: stdlib-only deterministic renderer
  - make scorecard / make scorecard-check targets
  - benchmarks/scorecard.md committed, regeneratable in one command
  - docs/benchmarks.md methodology page
  - README "Why Trust" gains scorecard link
  - CI: new gating scorecard-check step (runs before the non-gating benchmark
    step so it validates the committed pair, not the runner's fresh numbers)
  - Per-backend matrix and weekly CI regen tracked as follow-ups
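
A deterministic renderer plus a check mode, as described above, might look like the sketch below. It assumes a simplified results shape and counts changed lines with `difflib`; `render_scorecard` and `count_drift` here are illustrative and do not reproduce `scripts/render_scorecard.py`.

```python
import difflib


def render_scorecard(results: dict[str, dict[str, int]]) -> str:
    """Render a small markdown table; sorted keys keep output deterministic."""
    lines = ["| scenario | tokens |", "|---|---|"]
    for name in sorted(results):
        lines.append(f"| {name} | {results[name]['tokens']} |")
    return "\n".join(lines) + "\n"


def count_drift(results: dict[str, dict[str, int]], committed: str) -> int:
    """Return the number of added/removed lines between committed and fresh."""
    fresh = render_scorecard(results)
    diff = list(
        difflib.unified_diff(
            committed.splitlines(), fresh.splitlines(), lineterm="", n=0
        )
    )
    # The first two entries are the '---' / '+++' file headers; skip them.
    return sum(1 for line in diff[2:] if line.startswith(("+", "-")))


results = {
    "stress_conversation": {"tokens": 6651},
    "short_conversation": {"tokens": 496},
}
committed = render_scorecard(results)
print(count_drift(results, committed))  # 0
```

A non-zero count would be the signal to fail the gating step; one edited row counts as two changed lines (one removed, one added) in this sketch.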

#181 — Stress-test benchmark scenario
  - benchmarks/scenarios/stress_conversation.jsonl: SEV2 incident-response
    transcript, 30 user turns, 3 large tool results (firewall fires), 4
    near-duplicate pairs (dedup fires), total tokens push past the 6000-token
    answer budget so items_dropped > 0
  - benchmarks/results/latest.json regenerated; benchmark README baseline
    metrics table refreshed and corrected the stale "git-ignored" note
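
As a rough illustration of why exceeding the budget forces `items_dropped > 0`, here is a greedy selection sketch under a fixed token budget. The scoring and ordering are assumptions for illustration only; contextweaver's actual selection logic is not shown in this PR and may differ.

```python
def select_within_budget(
    items: list[tuple[str, float, int]], budget: int
) -> tuple[list[str], list[str], int]:
    """Greedily keep the highest-scoring items that still fit in the budget.

    Each item is (item_id, score, tokens). Returns (kept, dropped, used).
    """
    kept: list[str] = []
    dropped: list[str] = []
    used = 0
    for item_id, _score, tokens in sorted(items, key=lambda it: -it[1]):
        if used + tokens <= budget:
            kept.append(item_id)
            used += tokens
        else:
            dropped.append(item_id)
    return kept, dropped, used


# Hypothetical items whose total (6900 tokens) exceeds a 6000-token budget.
items = [
    ("turn-1", 0.9, 3000),
    ("tool-result", 0.8, 2500),
    ("turn-2", 0.7, 1000),
    ("turn-3", 0.6, 400),
]
kept, dropped, used = select_within_budget(items, budget=6000)
print(kept, dropped, used)
# ['turn-1', 'tool-result', 'turn-3'] ['turn-2'] 5900
```

A scenario whose candidates all fit would never exercise the `dropped` branch, which is why the stress transcript is sized to overflow the budget.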

Tests
  - tests/test_render_scorecard.py: 12 tests covering renderer determinism,
    routing/context table formatting, missing-field validation, --check mode
  - tests/test_architectures_slack.py: 8 tests pinning catalog size, intent
    match count, firewall trigger, persisted facts, final prompt rendering

Verification
  - ruff format --check src/ tests/ examples/ scripts/ - clean
  - ruff check src/ tests/ examples/ scripts/ - all checks passed
  - mypy src/ - 54 files, no issues
  - pytest - 768 passed, 8 skipped (was 748 baseline; +20 new tests)
  - make example - all scripts including Slack ops bot run cleanly
  - make demo - clean
  - make scorecard-check - deterministic (no drift)
  - make llms-check - up to date

https://claude.ai/code/session_01HeDDydF9mkCZfBx6tYo8Jj
…nit)

Address the CFF 1.2 best-practice nit from the PR #203 review. Adds
`date-released: 2026-05-11` (matches the 0.3.0 release date in CHANGELOG)
and an `identifiers:` entry with a Zenodo DOI placeholder. Both are
optional for GitHub's "Cite this repository" button to render but
recommended by the spec — and trivial to add now so they don't get
forgotten when the project does get a real DOI.

The DOI value is intentionally a placeholder (`10.5281/zenodo.XXXXXXX`)
and is marked as such in the inline description; replace once a real
Zenodo archive of a tagged release exists.

https://claude.ai/code/session_01HeDDydF9mkCZfBx6tYo8Jj
…208, #209, #213, #215)

- _utils.tokenize: split on `.`, `-`, `/`, `:` (single source of truth);
  emit dotted/hyphenated/slashed compound + per-segment sub-tokens;
  underscore intentionally kept as non-separator (synthetic catalog
  crosstalk noise — see _OUTER_SPLIT_RE docstring rationale captured during
  #213 measurement). Closes the dead-code replace(":"/" "/"_"/"/") workaround
  in routing/router.py.

- benchmarks/benchmark.py: gain `--matrix`, `--backends`, `--sizes`, and
  `--no-naive-delta` flags. Emit additive `routing_matrix` (per-backend ×
  per-size, #208), `routing_per_namespace` (#209), and per-row
  `naive_delta` (#215) keys. Legacy `routing` and `context` keys preserved
  so PR #203's scorecard renderer keeps working.

- scripts/baseline_naive.py: new stdlib-only naïve-concat baseline.
  Exposes `compute_naive_delta` for benchmark.py and a CLI for annotating
  an existing latest.json in place.

- scripts/render_scorecard.py: render the new matrix, per-namespace, and
  naive-delta sections additively when the JSON carries them.

Measured impact (50-query gold, current synthetic catalog):
  routing summary  : -0pp at 50, 83, 1000 vs v0.3.0 baseline
  matrix tfidf 100 : 0.6100 (==)
  matrix tfidf 500 : 0.5300 (-2pp; 1-query noise floor)
  matrix tfidf 1000: 0.3100 (==)

#208 #209 #213 #215
…mpts, weekly cron

Lands the remaining 5 eval-substrate issues on top of the matrix/baseline/tokenizer
work in the previous commit:

- #209: routing_gold.json expanded 50 → 200 queries, 25 per namespace, with
  explicit `namespace` field. Catalog-validated. Reveals more honest recall
  numbers (the original 50 were curated easy wins).

- #211: scripts/benchmark_delta.py renders head-vs-base markdown with shared
  ✅/⚠️ markers (latency × 1.30, accuracy -1pp). CI job `benchmark-comment`
  posts a sticky one-per-PR comment via peter-evans/find-comment +
  create-or-update-comment. Adds "Reproducibility" subsection to the PR
  template (encouraged, not required).

- #214: scripts/sweep_scoring.py + Makefile target + initial
  benchmarks/sweep_scoring.md. Full 243-cell grid (3⁵), composite formula
  documented in the report header. Default config rank reported; Pareto
  candidates surfaced as follow-up only — defaults intentionally untouched.

- #216: .github/prompts/add-eval.prompt.md matching the structure of the
  other three prompts. Cross-linked from AGENTS.md "Adding a Feature" and
  docs/agent-context/workflows.md.

- #207: .github/workflows/scorecard-weekly.yml runs Mon 06:00 UTC,
  regenerates latest.json + scorecard.md, opens a PR via
  peter-evans/create-pull-request on drift.

Plus README.md naive-baseline footnote with measured token reduction
(42–75% range, 55.8% average across the four scenarios — replaces the
unsubstantiated "70% lower cost" claim per #215).

#207 #209 #211 #214 #216
Copilot AI review requested due to automatic review settings May 16, 2026 06:42

Copilot AI left a comment

Pull request overview

Lands a five-issue "external-facing surface polish" bundle: discoverability metadata, a decision-tree landing page, a Slack-ops-bot reference architecture, a reproducible benchmark scorecard, and a stress-test scenario that exercises budget-driven drops and dedup. No changes to src/contextweaver/ pipeline code beyond a tokenizer refactor referenced by the gold-set expansion.

Changes:

  • Discoverability: README badges + "context engineering" phrasing, CITATION.cff, expanded pyproject.toml URLs/classifiers/keywords.
  • Docs: new docs/which_pattern.md decision tree, docs/benchmarks.md, docs/architectures/ pages; cross-linked from architecture/interop/cookbook and mkdocs nav.
  • Tooling/benchmarks: scripts/render_scorecard.py, scripts/benchmark_delta.py, scripts/baseline_naive.py, scripts/sweep_scoring.py; new make targets (scorecard, scorecard-check, benchmark-matrix, sweep-scoring, architectures); CI gating scorecard-check step and sticky benchmark-delta comment job; weekly scorecard cron; new stress_conversation.jsonl scenario; namespace-aware tokenizer (#213) with router cleanup; expanded 200-query gold set; Slack ops bot architecture under examples/architectures/.

Reviewed changes

Copilot reviewed 45 out of 45 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| README.md, llms.txt, llms-full.txt | Badges, "context engineering" phrasing, scorecard link, decision-tree link, examples table addition. |
| pyproject.toml | New URLs, classifiers, keywords for PyPI discoverability. |
| CITATION.cff | New CFF v1.2 file for "Cite this repository" button. |
| mkdocs.yml | Adds Which-pattern, Architectures, Benchmarks nav entries. |
| Makefile, AGENTS.md, docs/agent-context/workflows.md | New make targets (scorecard*, benchmark-matrix, sweep-scoring, architectures), scripts/ added to lint/fmt scope. |
| docs/which_pattern.md, docs/benchmarks.md, docs/architectures/{index,slack_ops_bot}.md | New documentation pages. |
| docs/architecture.md, docs/interop.md, docs/cookbook.md | Cross-links to the new decision-tree and architectures pages. |
| examples/architectures/slack_ops_bot/{main.py,catalog.yaml,README.md,OUTPUT.md,__init__.py}, examples/architectures/__init__.py | New Slack ops bot reference architecture. |
| scripts/render_scorecard.py, scripts/benchmark_delta.py, scripts/baseline_naive.py, scripts/sweep_scoring.py | New stdlib-only renderer, PR-delta generator, naive baseline, and weight-sweep tool. |
| scripts/gen_llms.py, scripts/weaver_spec_conformance.py | Formatting/cleanup. |
| benchmarks/benchmark.py | --matrix, --backends, --sizes, --no-naive-delta flags; per-backend/per-namespace cells; v1.1 schema bump. |
| benchmarks/scorecard.md, benchmarks/sweep_scoring.md, benchmarks/results/latest.json, benchmarks/README.md, benchmarks/routing_gold.json | Generated/committed scorecard outputs and expanded 200-query gold set. |
| src/contextweaver/_utils.py, src/contextweaver/routing/router.py, tests/test_utils.py | Namespace-aware tokenizer with retained underscore behavior and tests. |
| tests/test_render_scorecard.py, tests/test_architectures_slack.py | New test files for the renderer and the Slack ops bot. |
| .github/workflows/{ci.yml,scorecard-weekly.yml}, .github/pull_request_template.md, .github/prompts/add-eval.prompt.md | CI gating scorecard-check, sticky benchmark-delta comment, weekly regen cron, eval prompt. |
| CHANGELOG.md | Documents the bundled changes. |

Comment thread scripts/render_scorecard.py Outdated
Comment thread docs/benchmarks.md Outdated
Comment thread scripts/sweep_scoring.py Outdated
Comment thread README.md

github-actions Bot commented May 16, 2026

Benchmark delta (vs main)

Soft regression feedback only — this comment never blocks the PR.
Latency budget: ⚠️ when head > base × 1.3. Accuracy budget: ⚠️ when head < base - 1pp.
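
The two budget rules above can be written down directly. This is a minimal sketch of the marker logic only, assuming accuracy metrics are fractions so that 1pp equals 0.01; the function names are illustrative and not taken from `scripts/benchmark_delta.py`.

```python
LATENCY_FACTOR = 1.3  # warn when head > base * 1.3
ACCURACY_DROP = 0.01  # warn when head < base - 1pp (metrics as fractions)


def latency_marker(head_ms: float, base_ms: float) -> str:
    """Return a warning marker when head latency exceeds the budget."""
    return "⚠️" if head_ms > base_ms * LATENCY_FACTOR else "✅"


def accuracy_marker(head: float, base: float) -> str:
    """Return a warning marker when accuracy drops by more than 1pp."""
    return "⚠️" if head < base - ACCURACY_DROP else "✅"


# p99 at size 1000 from the routing summary: 40.009 ms vs base 31.897 ms.
print(latency_marker(40.009, 31.897))   # ✅
print(accuracy_marker(0.5649, 0.5649))  # ✅
```

The first call stays green because 31.897 × 1.3 ≈ 41.47 ms, so a 40.009 ms head is still inside the latency budget despite being slower than base.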

Routing summary (single backend × catalog sizes)

| size | recall@k (head Δ vs base) | MRR (head Δ vs base) | p99 (ms) |
|---|---|---|---|
| 50 | ✅ 0.5649 (+0.0000) | ✅ 0.4978 (+0.0000) | ✅ 0.498 (base 0.463) |
| 83 | ✅ 0.3825 (+0.0000) | ✅ 0.3242 (+0.0000) | ✅ 0.741 (base 0.876) |
| 1000 | ✅ 0.1475 (+0.0000) | ✅ 0.1456 (+0.0000) | ✅ 40.009 (base 31.897) |

Per-backend × per-size matrix

| backend | size | recall@k (Δ) | MRR (Δ) | p99 (ms) |
|---|---|---|---|---|
| bm25 | 100 | ✅ 0.3825 (+0.0000) | ✅ 0.3399 (+0.0000) | ✅ 6.290 (base 5.642) |
| bm25 | 500 | ✅ 0.2250 (+0.0000) | ✅ 0.2165 (+0.0000) | ✅ 29.525 (base 27.538) |
| bm25 | 1000 | ✅ 0.1575 (+0.0000) | ✅ 0.1525 (+0.0000) | ✅ 88.189 (base 78.368) |
| fuzzy | 100 | ✅ 0.0000 (+0.0000) | ✅ 0.0000 (+0.0000) | ✅ 0.000 (base 0.000) |
| fuzzy | 500 | ✅ 0.0000 (+0.0000) | ✅ 0.0000 (+0.0000) | ✅ 0.000 (base 0.000) |
| fuzzy | 1000 | ✅ 0.0000 (+0.0000) | ✅ 0.0000 (+0.0000) | ✅ 0.000 (base 0.000) |
| tfidf | 100 | ✅ 0.3825 (+0.0000) | ✅ 0.3220 (+0.0000) | ✅ 1.035 (base 0.872) |
| tfidf | 500 | ✅ 0.2325 (+0.0000) | ✅ 0.2314 (+0.0000) | ✅ 9.078 (base 8.660) |
| tfidf | 1000 | ✅ 0.1475 (+0.0000) | ✅ 0.1456 (+0.0000) | ✅ 36.170 (base 30.071) |

Context pipeline (per scenario)

| scenario | tokens | dropped | dedup |
|---|---|---|---|
| large_catalog | 1514 (base 1514, Δ+0) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| long_conversation | 2548 (base 2548, Δ+0) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| short_conversation | 496 (base 496, Δ+0) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| stress_conversation | 6651 (base 6651, Δ+0) | 7 (base 7, Δ+0) | 4 (base 4, Δ+0) |

Numbers come from make benchmark / make benchmark-matrix.
Latency is hardware-dependent — treat the markers as a rough guide.
See benchmarks/scorecard.md for the full picture.

dgenio added 2 commits May 16, 2026 10:57
… default_id param

- scripts/render_scorecard.py: update hardcoded '50 hand-curated queries'
  to '200' to match the actual routing_gold.json size
- docs/benchmarks.md: same stale reference fixed
- scripts/sweep_scoring.py: remove unused default_id parameter from
  _row_line() and its call site (dead code)
- benchmarks/scorecard.md: regenerated to reflect updated text
…, diff hint

- scripts/benchmark_delta.py: wrap JSON loading in try/except for
  FileNotFoundError and JSONDecodeError so CI gets a clean error
  instead of an unhandled traceback (F2)
- scripts/sweep_scoring.py: replace hardcoded ScoringConfig float
  comparisons with imported _DEFAULT_SCORING instance so the default
  detection stays in sync with the library (F3)
- scripts/render_scorecard.py: show line-diff count in --check mode
  error message so CI output is actionable (F4)
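
The JSON-loading fix described above (F2) follows a common pattern: catch the two expected failure modes and exit with a one-line message. The sketch below assumes a hypothetical `load_results` helper and message wording; it is not the code in `scripts/benchmark_delta.py`.

```python
import json
from pathlib import Path


def load_results(path: str) -> dict:
    """Load a benchmark JSON file, exiting cleanly on missing/corrupt input."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except FileNotFoundError:
        # SystemExit with a string prints it to stderr and exits with code 1.
        raise SystemExit(f"error: benchmark results not found: {path}")
    except json.JSONDecodeError as exc:
        raise SystemExit(
            f"error: {path} is not valid JSON ({exc.msg} at line {exc.lineno})"
        )
```

Raising `SystemExit` with a message keeps the CI log to a single actionable line while still returning a non-zero exit status.
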
@dgenio dgenio merged commit bfcbe0e into main May 16, 2026
4 checks passed
@dgenio dgenio deleted the claude/triage-issues-hccyU branch May 16, 2026 10:24