feat(eval): scorecard maturation — matrix + gold expansion + naive + sweep + CI (#207, #208, #209, #211, #213, #214, #215) by dgenio · Pull Request #229 · dgenio/contextweaver

dgenio · 2026-05-16T11:21:39Z

Lands the 7-issue benchmark-scorecard maturation group selected by the
triage pass. Same surfaces (benchmarks/, scripts/, .github/workflows/,
one tokenizer change in src/contextweaver/_utils.py); one latest.json
regeneration instead of seven progressive drift commits.

#208 — Per-backend × per-size matrix

--matrix, --backends, --sizes, --no-naive flags on benchmark.py
make benchmark-matrix target
latest.json gains matrix array + per_namespace dict (additive)
skipped fuzzy backend records a status field instead of vanishing

#209 — Gold-set expansion

50 -> 200 hand-authored queries in routing_gold.json
Every namespace has >=23 queries; every entry gains namespace field
Existing 50 preserved verbatim (just augmented with the new field)

#215 — Naive baseline harness

scripts/baseline_naive.py + per-scenario naive_delta block
Same CharDivFourEstimator on both sides for fair ratios
README's "70% lower cost" replaced with measured 58.3% on stress
scenario + scorecard footnote linking the full breakdown

#214 — Weight sweep tool

scripts/sweep_scoring.py + make sweep-scoring
243-point grid over the 5 ScoringConfig weights
Composite (50% coverage + 30% util-overrun + 20% drop-rate) ranking
Pareto-dominators surfaced as candidates for a follow-up; defaults
untouched per the issue's strict non-goal
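
A 243-point grid over five weights is consistent with three candidate values per weight (3^5 = 243). The following is a minimal sketch of that enumeration and of a Pareto-dominance check, assuming every compared axis is lower-is-better. The weight names, candidate levels, and function names are hypothetical and are not taken from scripts/sweep_scoring.py.

```python
from itertools import product

# Hypothetical weight names and candidate levels (not from the PR).
WEIGHT_NAMES = ("w_a", "w_b", "w_c", "w_d", "w_e")
LEVELS = (0.5, 1.0, 1.5)


def build_grid() -> list[dict[str, float]]:
    """Enumerate every combination of candidate levels across the weights."""
    return [
        dict(zip(WEIGHT_NAMES, combo))
        for combo in product(LEVELS, repeat=len(WEIGHT_NAMES))
    ]


def dominates(candidate: tuple[float, ...], default: tuple[float, ...]) -> bool:
    """True if candidate is <= default on every axis and < on at least one.

    Assumes each axis has already been oriented so that lower is better.
    """
    no_worse = all(c <= d for c, d in zip(candidate, default))
    strictly_better = any(c < d for c, d in zip(candidate, default))
    return no_worse and strictly_better
```

A configuration that ties the default on every axis does not dominate it, which keeps the default itself out of its own candidate list.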

#207 — Weekly scheduled regen

.github/workflows/scorecard-weekly.yml — Mondays 06:00 UTC
peter-evans/create-pull-request opens drift PRs tagged
area/eval + automated

#211 — Soft PR regression comment

scripts/benchmark_delta.py renders sticky comment markdown
.github/workflows/benchmark-delta.yml runs on PR events
peter-evans/create-or-update-comment edits in place
continue-on-error: true (soft gate per Round 2 Q4=A)
PR template gains optional "Reproducibility" subsection

#213 — Namespace-aware tokenizer

tokenize() emits sub-tokens for ids with internal . _ - /
Original joined form retained; STOPWORDS + len>=2 filters apply
On the 3x3 matrix: net-positive on 8/9 cells (+0.75pp to +3.50pp).
One cell (fuzzy at catalog 100) drops 1.50pp — slightly over
#213's strict <=1pp regression bound. Documented in CHANGELOG.

Scorecard renderer + scorecard.md

New "Per-backend matrix", "Per-namespace recall", "vs naive concat"
sections with stable ✅/⚠️ markers at base × 1.30
Latency budget constant shared between scorecard and PR delta
(test_benchmark_delta::test_latency_threshold_matches_scorecard_renderer
locks them together)
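
The shared-constant idea can be sketched as follows. This is an illustration only: the constant name, the function name, and the inclusive comparison are assumptions, and only the 1.30 factor and the two markers come from the description above.

```python
# Hypothetical name; the PR text only states "base x 1.30".
LATENCY_BUDGET_FACTOR = 1.30


def latency_marker(base_ms: float, head_ms: float) -> str:
    """Return the pass marker while head latency stays within the budget.

    Assumes the budget is inclusive (head <= base * factor passes).
    """
    within_budget = head_ms <= base_ms * LATENCY_BUDGET_FACTOR
    return "✅" if within_budget else "⚠️"
```

Importing one constant from both the scorecard renderer and the PR delta script, rather than repeating the literal, is what lets a single test pin the two thresholds together.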

Schema: latest.json bumped to 1.1. Additive only — legacy routing array
and existing context fields preserved verbatim for back-compat.

Tests

91 new test cases across test_utils, test_benchmark,
test_render_scorecard, test_benchmark_delta, test_baseline_naive,
test_sweep_scoring
Full suite: 982 passed, 5 skipped (+91 new tests vs main)

Verification

ruff format --check src/ tests/ examples/ scripts/ — clean
ruff check src/ tests/ examples/ scripts/ — All checks passed
mypy src/ — Success: no issues found in 64 source files
pytest -q — 982 passed, 5 skipped
make example, make demo — clean
make scorecard-check — clean (deterministic regen)
make llms-check — clean

https://claude.ai/code/session_01RMDbkdJg3svPM2sTHVFJUu


Copilot

Pull request overview

This PR matures the evaluation/scorecard infrastructure by expanding the benchmark dataset and outputs (matrix + per-namespace + naïve baseline), adding a ScoringConfig sweep report, and wiring CI automation (weekly regen + per-PR delta comments) so regressions are visible without manual runs.

Changes:

Extend the benchmark harness + JSON schema (v1.1) to emit a per-backend×per-size routing matrix, per-namespace recall, and per-scenario naïve-baseline deltas.
Add new tooling/scripts and reports: sticky benchmark delta renderer, ScoringConfig sweep runner + report, gold-set expansion helper, and updated scorecard renderer/markdown outputs.
Add CI workflows for weekly scorecard regeneration PRs and per-PR sticky delta comments; update docs/README accordingly.

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
`tests/test_utils.py`	Adds tokenizer sub-token + tokenize_list tests
`tests/test_sweep_scoring.py`	Adds sweep tool grid/report/pareto tests
`tests/test_render_scorecard.py`	Adds tests for new scorecard sections/validation
`tests/test_benchmark.py`	Adds tests for matrix mode + helpers
`tests/test_benchmark_delta.py`	Adds tests for delta renderer + threshold sync
`tests/test_baseline_naive.py`	Adds tests for naïve-baseline computations
`src/contextweaver/_utils.py`	Implements namespace-aware tokenization + tokenize_list
`scripts/sweep_scoring.py`	Implements ScoringConfig grid sweep + report
`scripts/render_scorecard.py`	Renders matrix/per-namespace/naïve sections
`scripts/benchmark_delta.py`	Renders sticky PR comment benchmark deltas
`scripts/baseline_naive.py`	Computes naïve concat token/coverage deltas
`scripts/_gold_expansion.py`	One-shot generator for expanded gold set
`README.md`	Replaces illustrative cost claim with measured footnote
`Makefile`	Adds `benchmark-matrix` and `sweep-scoring` targets
`llms-full.txt`	Regenerated docs bundle reflecting README/workflow updates
`docs/benchmarks.md`	Documents new benchmark surfaces + CI integration
`docs/agent-context/workflows.md`	Documents new make targets
`CHANGELOG.md`	Documents new eval features + schema bump
`benchmarks/sweep_scoring.md`	Adds generated sweep report
`benchmarks/scorecard.md`	Regenerates scorecard with new sections
`benchmarks/routing_gold.json`	Expands gold set to 200 + adds namespace field
`benchmarks/results/latest.json`	Regenerates baseline output with v1.1 schema
`benchmarks/README.md`	Updates benchmark methodology + matrix + naïve columns
`benchmarks/benchmark.py`	Adds matrix mode, per-namespace, naïve delta emission
`AGENTS.md`	Documents new make targets
`.github/workflows/scorecard-weekly.yml`	Weekly scheduled regen workflow (PR on drift)
`.github/workflows/benchmark-delta.yml`	Per-PR delta workflow posting sticky comment
`.github/pull_request_template.md`	Adds optional reproducibility section

dgenio · 2026-05-16T21:11:30Z

@@ -174,9 +180,12 @@ def tokenize_list(text: str) -> list[str]:

    Steps:
    1. Lower-case the input.
-    2. Split on one or more non-word characters.
-    3. Discard tokens shorter than 2 characters.
-    4. Remove :data:`STOPWORDS`.
+    2. Scan word-like runs that may include internal ``.``/``_``/``-``/``/``
+       (so dotted / snake_case ids stay intact).
+    3. Strip surrounding delimiter characters from each run.
+    4. Emit the joined form (length ≥ 2 and not a STOPWORD).
+    5. If the run still has internal delimiters, also emit each sub-token
+       — again filtered by length ≥ 2 and STOPWORD membership.

    Args:
        text: Raw input string.
@@ -185,8 +194,16 @@ def tokenize_list(text: str) -> list[str]:
        A ``list[str]`` of normalised, stop-word-filtered tokens in
        occurrence order. Duplicate tokens are preserved.
    """
-    tokens = _SPLIT_RE.split(text.lower())
-    return [t for t in tokens if len(t) >= 2 and t not in STOPWORDS]
+    out: list[str] = []
+    for raw in _SURFACE_RE.findall(text.lower()):
+        stripped = raw.strip(_DELIM_CHARS)
+        if len(stripped) >= 2 and stripped not in STOPWORDS:
+            out.append(stripped)
+        if _SUB_SPLIT_RE.search(stripped):
+            for sub in _SUB_SPLIT_RE.split(stripped):
+                if len(sub) >= 2 and sub not in STOPWORDS:
+                    out.append(sub)


Verified false positive — _DELIM_CHARS = "._-/" already contains _ (4 chars: ., _, -, /). Behavioural test confirms tokenize_list("_foo_") → ["foo"] (leading/trailing underscores correctly stripped). No code change needed.
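
For readers without the full module, here is a self-contained sketch of the loop in the diff above. Only `_DELIM_CHARS = "._-/"` and the loop body come from this thread; the two regex definitions and the small `STOPWORDS` set are assumptions made so the example runs, and the real definitions in src/contextweaver/_utils.py may differ.

```python
import re

# Hypothetical subset; the real STOPWORDS set is not shown in this thread.
STOPWORDS = frozenset({"a", "an", "the", "of", "to"})
_DELIM_CHARS = "._-/"  # as quoted in the reply above
_SURFACE_RE = re.compile(r"[\w./-]+")  # assumed: word runs with internal delimiters
_SUB_SPLIT_RE = re.compile(r"[._/-]+")  # assumed: split on delimiter runs


def tokenize_list(text: str) -> list[str]:
    """Emit each joined id plus its sub-tokens, filtered by length and stopwords."""
    out: list[str] = []
    for raw in _SURFACE_RE.findall(text.lower()):
        stripped = raw.strip(_DELIM_CHARS)
        if len(stripped) >= 2 and stripped not in STOPWORDS:
            out.append(stripped)
        if _SUB_SPLIT_RE.search(stripped):
            for sub in _SUB_SPLIT_RE.split(stripped):
                if len(sub) >= 2 and sub not in STOPWORDS:
                    out.append(sub)
    return out
```

Under these assumed patterns, `tokenize_list("_foo_")` yields `["foo"]`, and a dotted id such as `github.create_issue` yields the joined form followed by `github`, `create`, and `issue`.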

Resolving thread.

Generated by Claude Code

dgenio · 2026-05-16T21:11:34Z

+/ 1000) plus **per-namespace recall@5** under `routing.matrix` and
+`routing.per_namespace`. The legacy single-backend rows (50 / 83 / 1000
+on `tfidf`) are preserved verbatim for back-compat.


Good catch — fixed. README now says the matrix and per-namespace data live at top-level matrix and per_namespace in results/latest.json, matching what the harness actually writes and what the scorecard renderer reads.

Addressed in 7afea33.

Generated by Claude Code

dgenio · 2026-05-16T21:11:37Z

+  `namespace` field. The benchmark aggregates recall@k into
+  `routing.per_namespace` keyed by backend so the scorecard can show
+  which namespaces are weakest on each scorer.


Good catch — fixed. Changelog now refers to a top-level per_namespace object (alongside top-level matrix), matching the actual schema.
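
As an illustration of the per-namespace aggregation, here is a minimal sketch for a single backend. It assumes each evaluated query carries a `namespace`, one `expected` tool id, and a `ranked` list of returned ids; those key names and the function name are hypothetical, and the harness may allow several expected ids per query.

```python
from collections import defaultdict


def per_namespace_recall(results: list[dict], k: int = 5) -> dict[str, float]:
    """Fraction of queries per namespace whose expected id is in the top k."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for row in results:
        namespace = row["namespace"]
        totals[namespace] += 1
        if row["expected"] in row["ranked"][:k]:
            hits[namespace] += 1
    # Sorted keys keep the output order stable across runs.
    return {ns: hits[ns] / totals[ns] for ns in sorted(totals)}
```

Running this once per backend and storing the results under the backend name gives the "keyed by backend" shape described in the changelog text.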

Addressed in 7afea33.

Generated by Claude Code

dgenio · 2026-05-16T21:11:40Z

+    base_matrix = list(base.get("matrix", []))
+    head_matrix = list(head.get("matrix", []))
+    if base_matrix or head_matrix:
+        # The matrix uses the same row shape as the routing table — but with
+        # a non-default backend key — so the same renderer handles both.
+        parts += [
+            "### Matrix (#208)",
+            "",
+            _routing_delta_table(base_matrix, head_matrix, k),
+            "",
+        ]


Good catch — fixed. _routing_delta_table now detects a non-empty status on either the base or head row and renders a _skipped_ / _skipped_ / _skipped_ | <status text> row, suppressing the latency-budget marker. Added two regression tests in tests/test_benchmark_delta.py (test_matrix_row_with_status_rendered_as_skipped covers the head-skipped case; test_matrix_row_with_status_on_base_only covers the inverse).
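
A minimal sketch of the skipped-row behaviour described above. The row keys (`recall`, `status`), the column layout, and the function name are hypothetical; the real `_routing_delta_table` is not shown in this thread.

```python
from __future__ import annotations


def render_row(label: str, base: dict | None, head: dict | None) -> str:
    """Render one delta-table row, or a skipped row if either side has a status."""
    status = (head or {}).get("status") or (base or {}).get("status") or ""
    if status:
        # Zeroed metrics on a skipped cell are not a regression, so no delta
        # and no latency-budget marker are shown.
        return f"| {label} | _skipped_ | _skipped_ | _skipped_ | {status} |"
    base_recall = base["recall"]
    head_recall = head["recall"]
    delta = head_recall - base_recall
    return f"| {label} | {base_recall:.2f} | {head_recall:.2f} | {delta:+.2f} | |"
```

Checking the status on both sides covers the two regression tests named above: head skipped and base-only skipped.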

Addressed in 7afea33.

Generated by Claude Code

dgenio · 2026-05-16T21:11:44Z

+def _backend_available(backend: str) -> tuple[bool, str]:
+    """Report whether *backend* can run in this environment.
+
+    Returns ``(True, "")`` if available, else ``(False, reason)``. The
+    reason is recorded into ``latest.json`` so the scorecard can render
+    a skipped-row marker instead of silently omitting cells.
+    """
+    if backend == "fuzzy" and FuzzyScorer is None:
+        return False, "skipped: rapidfuzz not installed (contextweaver[retrieval] extra)"
+    return True, ""



Good catch — fixed. _parse_args now validates args.backends against _BACKENDS_DEFAULT after parsing and calls parser.error() (exit 2) with a clear message listing the supported set. Added two tests in tests/test_benchmark.py: test_parse_args_rejects_unknown_backends asserts SystemExit(2) on --backends tfidf,bogus; test_parse_args_accepts_supported_backends pins the happy path.
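
The validation pattern can be sketched with the standard library alone. The supported set (tfidf, bm25, fuzzy) comes from the fix summary in this PR; the parser setup, the comma-splitting, and the message wording are assumptions, not the actual `_parse_args` body.

```python
import argparse

_BACKENDS_DEFAULT = ("tfidf", "bm25", "fuzzy")


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse CLI args and reject unknown backends with an argparse-style error."""
    parser = argparse.ArgumentParser(prog="benchmark.py")
    parser.add_argument("--backends", default=",".join(_BACKENDS_DEFAULT))
    args = parser.parse_args(argv)

    requested = [name.strip() for name in args.backends.split(",") if name.strip()]
    unknown = sorted(set(requested) - set(_BACKENDS_DEFAULT))
    if unknown:
        # parser.error() prints usage plus the message to stderr and exits with 2.
        parser.error(
            f"unknown backend(s): {', '.join(unknown)}; "
            f"supported: {', '.join(_BACKENDS_DEFAULT)}"
        )
    args.backends = requested
    return args
```

Failing at parse time keeps a typo such as `--backends tfidf,bogus` from surfacing later as a traceback during router construction.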

Addressed in 7afea33.

Generated by Claude Code

dgenio · 2026-05-16T21:11:48Z

+#!/usr/bin/env python3
+"""ScoringConfig weight sweep + measured-defaults report (issue #214).
+
+Grid-searches the ``ScoringConfig`` weight space against the committed
+benchmark scenarios and writes a deterministic markdown report to
+``benchmarks/sweep_scoring.md`` showing:
+
+* The composite score and per-axis metrics for every configuration.
+* The rank of the current default within the swept grid.
+* Any configurations that Pareto-dominate the default (lower or equal on
+  every axis we care about, strictly better on at least one) — surfaced
+  as candidates for a separate, deliberately-scoped follow-up issue.
+
+Per the issue spec, this script does **not** change ``ScoringConfig``
+defaults. The report is the deliverable; defaults remain the
+maintainer's call.
+
+Composite formula (also documented in the report header)::
+
+    composite = 0.50 * coverage_pct_avg


Non-actionable in this PR. The 410-line size was anticipated in the PR description's "Tradeoffs / risks" section — scripts/ has existing precedent above the 300-line guideline (render_scorecard.py = 526 lines), and the maintainer's "delivery over splitting" preference applies. The AGENTS.md guideline targets src/ modules with documented exemptions for types.py, envelope.py, __main__.py; extending the policy to scripts/ is a docs change that belongs in a separate PR.

Resolving thread.

Generated by Claude Code

dgenio · 2026-05-16T21:11:51Z

+    if default_row is not None and not default_row.is_default:
+        pass  # unreachable — default_row is found by is_default


Good catch — fixed. Removed the unreachable if default_row is not None and not default_row.is_default: pass branch from render_report() (default_row is selected by is_default upstream, so the negation can never trigger).

Addressed in 7afea33.

Generated by Claude Code

dgenio · 2026-05-16T21:11:54Z

+        naive_delta: dict[str, float] | None = None
+        if with_naive:
+            delta = compute_naive_delta(events=events, pack=pack, cw_tokens=prompt_toks)
+            naive_delta = {
+                "naive_tokens": float(delta.naive_tokens),
+                "cw_tokens": float(delta.cw_tokens),


Good catch — fixed. _collect_context_stats now writes int(delta.naive_tokens) / int(delta.cw_tokens) into the naive_delta block, matching the NaiveDelta dataclass field types (int). The ContextStats.naive_delta annotation is broadened to dict[str, float | int] | None to reflect the mixed-type payload (counts as int, percentages as float). The renderer's defensive int(float(...)) continues to accept legacy float values in pre-existing latest.json files.
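
A small sketch of the mixed-type payload and the defensive read path. The `saved_pct` key, both function names, and the rounding are hypothetical; only `naive_tokens`, `cw_tokens`, and the `int(float(...))` coercion are taken from the reply above.

```python
from __future__ import annotations


def build_naive_delta(naive_tokens: int, cw_tokens: int) -> dict[str, float | int]:
    """Store token counts as ints and the relative saving as a float percentage."""
    if naive_tokens == 0:
        saved_pct = 0.0
    else:
        saved_pct = 100.0 * (naive_tokens - cw_tokens) / naive_tokens
    return {
        "naive_tokens": int(naive_tokens),
        "cw_tokens": int(cw_tokens),
        "saved_pct": round(saved_pct, 1),  # hypothetical key name
    }


def read_count(block: dict, key: str) -> int:
    """Accept both new int counts and legacy float counts from older JSON files."""
    return int(float(block[key]))
```

Going through `float` first means a legacy value such as `1000.0` still reads back as `1000`.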

Addressed in 7afea33.

Generated by Claude Code

- benchmark_delta.py: render status-bearing matrix rows as "skipped" with the reason text instead of treating zeroed metrics as a real regression (Copilot #4). +2 tests.
- benchmark.py: validate --backends against the supported set (tfidf, bm25, fuzzy) and exit with argparse-style code 2 on typos rather than a ConfigError traceback from Router init (Copilot #5). +2 tests.
- sweep_scoring.py: drop the unreachable `if default_row is not None and not default_row.is_default: pass` branch in render_report() (Copilot #7).
- benchmark.py: emit naive_delta.naive_tokens / cw_tokens as ints (counts), matching the NaiveDelta dataclass and avoiding the int(float(...)) coercion in renderers (Copilot #8).
- README.md / CHANGELOG.md: matrix and per_namespace live at the top of latest.json, not nested under routing (Copilot #2, #3).

Comments #1 (_DELIM_CHARS missing _) and #6 (sweep_scoring.py size) are non-actionable: #1 is a false positive (underscore is already present; verified by behavioural test on _foo_) and #6 follows the existing scripts/ precedent (render_scorecard.py = 526 lines) acknowledged in the PR description.

make ci: all 6 targets clean. make scorecard-check + make llms-check also clean.

dgenio · 2026-05-17T06:10:11Z

Closing as a duplicate of #228

PR #228 (already merged into main at 813aaef's parent) implemented the same issue group (#207, #208, #209, #211, #213, #214, #215) that this PR targets. Both branches were authored independently and the implementations differ in detail (different file additions for scripts/benchmark_delta.py, scripts/sweep_scoring.py, scripts/baseline_naive.py, benchmarks/sweep_scoring.md; both expanded benchmarks/routing_gold.json with different query sets; both reworked src/contextweaver/_utils.py's tokenizer).

A rebase onto current main surfaces 12+ files in deep conflict — resolving them would be equivalent to rewriting this PR from scratch to match main's already-shipped implementation.

What was rescued

The Phase-1 review-fix commit on this PR (7afea33) contained six small Copilot-suggested fixes that the maintainer is welcome to keep. Re-applied them against current main as a fresh PR — #NEW_PR (see below). Two of the six fixes still apply (the other four were already correct on main via #228):

--backends typo validation in benchmarks/benchmark.py (Copilot #5)
Skipped-row rendering in scripts/benchmark_delta.py (Copilot #4)

The other four (int casts in baseline_naive, dead branch in sweep_scoring, top-level matrix/per_namespace doc location, _DELIM_CHARS) are already correct on main via #228's implementation.

Closing this PR. Tracking the rescued work via the new branch claude/pr-229-followup-cherry-picks.

Generated by Claude Code

dgenio · 2026-05-17T06:10:17Z

(Edit: the rescue PR is #235.)

Generated by Claude Code

Copilot AI review requested due to automatic review settings May 16, 2026 11:21

Copilot started reviewing on behalf of dgenio May 16, 2026 11:22

Copilot AI reviewed May 16, 2026


dgenio mentioned this pull request May 17, 2026

fix(benchmarks,scripts): rescue 2 Copilot fixes from duplicate PR #229 #235

Open

dgenio closed this May 17, 2026

dgenio wants to merge 2 commits into main from claude/triage-issues-xiSV0
