feat(eval): scorecard maturation — matrix + gold expansion + naive + sweep + CI (#207, #208, #209, #211, #213, #214, #215) #229

Closed
dgenio wants to merge 2 commits into main from claude/triage-issues-xiSV0

@dgenio dgenio commented May 16, 2026

Lands the 7-issue benchmark-scorecard maturation group selected by the
triage pass. Same surfaces (benchmarks/, scripts/, .github/workflows/,
one tokenizer change in src/contextweaver/_utils.py); one latest.json
regeneration instead of seven progressive drift commits.

#208 — Per-backend × per-size matrix

  • --matrix, --backends, --sizes, --no-naive flags on benchmark.py
  • make benchmark-matrix target
  • latest.json gains matrix array + per_namespace dict (additive)
  • skipped fuzzy backend records a status field instead of vanishing
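
As a rough illustration of the matrix shape described above, the sketch below builds one row per backend and catalog size and keeps a `status` string for a backend that cannot run. The row keys (`backend`, `size`, `recall_at_5`), the `run_cell` callback and the `available` mapping are hypothetical; only the `matrix` array and the `status` field come from this PR.

```python
from typing import Callable


def build_matrix(
    backends: list[str],
    sizes: list[int],
    available: dict[str, str],
    run_cell: Callable[[str, int], float],
) -> list[dict[str, object]]:
    """Return one row per (backend, size); unavailable backends keep a row.

    `available` maps a backend name to "" when it can run, or to a
    human-readable skip reason otherwise (hypothetical convention).
    """
    rows: list[dict[str, object]] = []
    for backend in backends:
        reason = available.get(backend, "")
        for size in sizes:
            row: dict[str, object] = {"backend": backend, "size": size}
            if reason:
                # Record why the cell is empty instead of dropping the row.
                row["status"] = reason
            else:
                row["recall_at_5"] = run_cell(backend, size)
            rows.append(row)
    return rows


matrix = build_matrix(
    backends=["tfidf", "fuzzy"],
    sizes=[50, 100],
    available={"tfidf": "", "fuzzy": "skipped: rapidfuzz not installed"},
    run_cell=lambda backend, size: 0.9,  # placeholder metric, not a real result
)
```

Keeping the skipped rows means a renderer can always draw the full grid and show the reason in place of numbers.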

#209 — Gold-set expansion

  • 50 -> 200 hand-authored queries in routing_gold.json
  • Every namespace has >=23 queries; every entry gains namespace field
  • Existing 50 preserved verbatim (just augmented with the new field)
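
The coverage rule above (every entry carries a namespace, every namespace has at least 23 queries) can be checked mechanically. A minimal sketch, assuming the gold file is a JSON list of objects with a `namespace` key; the real layout of routing_gold.json may differ, and `check_gold` is a hypothetical helper.

```python
from collections import Counter

MIN_PER_NAMESPACE = 23  # floor stated in the PR description


def check_gold(entries: list[dict], minimum: int = MIN_PER_NAMESPACE) -> list[str]:
    """Return a list of problems; an empty list means the set passes."""
    problems: list[str] = []
    counts = Counter()
    for index, entry in enumerate(entries):
        namespace = entry.get("namespace")
        if not namespace:
            problems.append(f"entry {index} has no namespace")
            continue
        counts[namespace] += 1
    # Sorted iteration keeps the report deterministic across runs.
    for namespace, count in sorted(counts.items()):
        if count < minimum:
            problems.append(f"{namespace}: {count} queries, need {minimum}")
    return problems


toy_gold = [{"query": f"q{i}", "namespace": "billing"} for i in range(23)]
toy_problems = check_gold(toy_gold)
```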

#215 — Naive baseline harness

  • scripts/baseline_naive.py + per-scenario naive_delta block
  • Same CharDivFourEstimator on both sides for fair ratios
  • README's "70% lower cost" replaced with measured 58.3% on stress
    scenario + scorecard footnote linking the full breakdown
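
To make the "same estimator on both sides" point concrete, here is a sketch of how a reduction ratio might be computed. It assumes, from the name alone, that CharDivFourEstimator approximates tokens as character count divided by four; `estimate_tokens` and `reduction_pct` are hypothetical stand-ins, not the functions in scripts/baseline_naive.py.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate: one token per four characters (assumed)."""
    return len(text) // 4


def reduction_pct(naive_text: str, packed_text: str) -> float:
    """Percentage of estimated tokens saved versus naive concatenation."""
    naive = estimate_tokens(naive_text)
    packed = estimate_tokens(packed_text)
    if naive == 0:
        return 0.0  # nothing to save on an empty baseline
    return 100.0 * (naive - packed) / naive


naive_prompt = "x" * 4000   # stands in for every event concatenated
packed_prompt = "x" * 1000  # stands in for the budgeted context pack
saving = reduction_pct(naive_prompt, packed_prompt)
```

Because both sides go through the same estimator, any systematic bias in the estimate largely cancels in the ratio.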

#214 — Weight sweep tool

  • scripts/sweep_scoring.py + make sweep-scoring
  • 243-point grid over the 5 ScoringConfig weights
  • Composite (50% coverage + 30% util-overrun + 20% drop-rate) ranking
  • Pareto-dominators surfaced as candidates for a follow-up; defaults
    untouched per the issue's strict non-goal
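
A 243-point grid over five weights is consistent with three candidate values per weight (3^5 = 243). The sketch below enumerates such a grid and applies a Pareto-dominance check worded the same way as the sweep script's docstring (lower or equal on every axis, strictly better on at least one). The weight names, the candidate values and the lower-is-better axis tuples are hypothetical.

```python
from itertools import product

WEIGHT_NAMES = ("w1", "w2", "w3", "w4", "w5")  # hypothetical weight names
CANDIDATES = (0.5, 1.0, 2.0)                   # hypothetical values per weight


def build_grid() -> list[dict[str, float]]:
    """Enumerate every combination of candidate values across the weights."""
    combos = product(CANDIDATES, repeat=len(WEIGHT_NAMES))
    return [dict(zip(WEIGHT_NAMES, combo)) for combo in combos]


def dominates(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
    """True if `a` is no worse than `b` on every axis and better on one.

    Every axis is treated as lower-is-better in this sketch.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


grid = build_grid()
```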

#207 — Weekly scheduled regen

  • .github/workflows/scorecard-weekly.yml — Mondays 06:00 UTC
  • peter-evans/create-pull-request opens drift PRs tagged
    area/eval + automated

#211 — Soft PR regression comment

  • scripts/benchmark_delta.py renders sticky comment markdown
  • .github/workflows/benchmark-delta.yml runs on PR events
  • peter-evans/create-or-update-comment edits in place
  • continue-on-error: true (soft gate per Round 2 Q4=A)
  • PR template gains optional "Reproducibility" subsection

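#213 — Namespace-aware tokenizer

  • tokenize() emits sub-tokens for ids with internal . _ - /
  • Original joined form retained; STOPWORDS + len>=2 filters apply
  • On the 3x3 matrix: net-positive on 8/9 cells (+0.75pp to +3.50pp).
    One cell (fuzzy at catalog 100) drops 1.50pp, slightly over
    #213's strict <=1pp regression bound. Documented in CHANGELOG.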

Scorecard renderer + scorecard.md

  • New "Per-backend matrix", "Per-namespace recall", "vs naive concat"
    sections with stable ✅/⚠️ markers at base × 1.30
  • Latency budget constant shared between scorecard and PR delta
    (test_benchmark_delta::test_latency_threshold_matches_scorecard_renderer
    locks them together)
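
The shared latency budget can be pictured as a single constant used by both renderers. A minimal sketch, assuming the marker compares head latency against base × 1.30 with an inclusive bound; the constant name, the function name and the inclusive comparison are assumptions.

```python
LATENCY_BUDGET_FACTOR = 1.30  # hypothetical name for the shared constant


def latency_marker(
    base_ms: float,
    head_ms: float,
    factor: float = LATENCY_BUDGET_FACTOR,
) -> str:
    """Return a pass marker while head latency stays within the budget."""
    return "✅" if head_ms <= base_ms * factor else "⚠️"
```

Importing one constant in both places is what lets a single test assert that the scorecard and the PR delta comment cannot drift apart.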

Schema: latest.json bumped to 1.1. Additive only — legacy routing array
and existing context fields preserved verbatim for back-compat.
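
Additive here means that a reader written for the older schema keeps working and a 1.1 reader tolerates older files. A sketch of a tolerant loader, assuming `routing`, `matrix` and `per_namespace` are top-level keys (as the review replies in this thread state) and that the version lives under a `schema_version` key, which is a guess.

```python
import json


def load_results(raw: str) -> dict:
    """Parse a results document, defaulting the fields added in 1.1."""
    data = json.loads(raw)
    return {
        "schema_version": str(data.get("schema_version", "1.0")),
        "routing": list(data.get("routing", [])),
        "matrix": list(data.get("matrix", [])),                # new in 1.1
        "per_namespace": dict(data.get("per_namespace", {})),  # new in 1.1
    }


legacy = load_results('{"routing": [{"size": 50}]}')
current = load_results(
    '{"schema_version": "1.1", "routing": [], "matrix": [{"backend": "bm25"}]}'
)
```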

Tests

  • 91 new test cases across test_utils, test_benchmark,
    test_render_scorecard, test_benchmark_delta, test_baseline_naive,
    test_sweep_scoring
  • Full suite: 982 passed, 5 skipped (+91 new tests vs main)

Verification

  • ruff format --check src/ tests/ examples/ scripts/ — clean
  • ruff check src/ tests/ examples/ scripts/ — All checks passed
  • mypy src/ — Success: no issues found in 64 source files
  • pytest -q — 982 passed, 5 skipped
  • make example, make demo — clean
  • make scorecard-check — clean (deterministic regen)
  • make llms-check — clean

https://claude.ai/code/session_01RMDbkdJg3svPM2sTHVFJUu

…sweep + CI (#207, #208, #209, #211, #213, #214, #215)

Lands the 7-issue benchmark-scorecard maturation group selected by the
triage pass. Same surfaces (`benchmarks/`, `scripts/`, `.github/workflows/`,
one tokenizer change in `src/contextweaver/_utils.py`); one `latest.json`
regeneration instead of seven progressive drift commits.

#208 — Per-backend × per-size matrix
  - `--matrix`, `--backends`, `--sizes`, `--no-naive` flags on benchmark.py
  - `make benchmark-matrix` target
  - latest.json gains `matrix` array + `per_namespace` dict (additive)
  - skipped fuzzy backend records a `status` field instead of vanishing

#209 — Gold-set expansion
  - 50 -> 200 hand-authored queries in routing_gold.json
  - Every namespace has >=23 queries; every entry gains `namespace` field
  - Existing 50 preserved verbatim (just augmented with the new field)

#215 — Naive baseline harness
  - scripts/baseline_naive.py + per-scenario `naive_delta` block
  - Same CharDivFourEstimator on both sides for fair ratios
  - README's "70% lower cost" replaced with measured 58.3% on stress
    scenario + scorecard footnote linking the full breakdown

#214 — Weight sweep tool
  - scripts/sweep_scoring.py + `make sweep-scoring`
  - 243-point grid over the 5 ScoringConfig weights
  - Composite (50% coverage + 30% util-overrun + 20% drop-rate) ranking
  - Pareto-dominators surfaced as candidates for a follow-up; defaults
    untouched per the issue's strict non-goal

#207 — Weekly scheduled regen
  - .github/workflows/scorecard-weekly.yml — Mondays 06:00 UTC
  - peter-evans/create-pull-request opens drift PRs tagged
    area/eval + automated

#211 — Soft PR regression comment
  - scripts/benchmark_delta.py renders sticky comment markdown
  - .github/workflows/benchmark-delta.yml runs on PR events
  - peter-evans/create-or-update-comment edits in place
  - continue-on-error: true (soft gate per Round 2 Q4=A)
  - PR template gains optional "Reproducibility" subsection

#213 — Namespace-aware tokenizer
  - tokenize() emits sub-tokens for ids with internal . _ - /
  - Original joined form retained; STOPWORDS + len>=2 filters apply
  - On the 3x3 matrix: net-positive on 8/9 cells (+0.75pp to +3.50pp).
    One cell (fuzzy at catalog 100) drops 1.50pp — slightly over
    #213's strict <=1pp regression bound. Documented in CHANGELOG.

Scorecard renderer + scorecard.md
  - New "Per-backend matrix", "Per-namespace recall", "vs naive concat"
    sections with stable ✅/⚠️ markers at base × 1.30
  - Latency budget constant shared between scorecard and PR delta
    (test_benchmark_delta::test_latency_threshold_matches_scorecard_renderer
    locks them together)

Schema: latest.json bumped to 1.1. Additive only — legacy `routing` array
and existing context fields preserved verbatim for back-compat.

Tests
  - 91 new test cases across test_utils, test_benchmark,
    test_render_scorecard, test_benchmark_delta, test_baseline_naive,
    test_sweep_scoring
  - Full suite: 982 passed, 5 skipped (+91 new tests vs main)

Verification
  - ruff format --check src/ tests/ examples/ scripts/ — clean
  - ruff check src/ tests/ examples/ scripts/ — All checks passed
  - mypy src/ — Success: no issues found in 64 source files
  - pytest -q — 982 passed, 5 skipped
  - make example, make demo — clean
  - make scorecard-check — clean (deterministic regen)
  - make llms-check — clean

https://claude.ai/code/session_01RMDbkdJg3svPM2sTHVFJUu
Copilot AI review requested due to automatic review settings May 16, 2026 11:21

Copilot AI left a comment

Pull request overview

This PR matures the evaluation/scorecard infrastructure by expanding the benchmark dataset and outputs (matrix + per-namespace + naïve baseline), adding a ScoringConfig sweep report, and wiring CI automation (weekly regen + per-PR delta comments) so regressions are visible without manual runs.

Changes:

  • Extend the benchmark harness + JSON schema (v1.1) to emit a per-backend×per-size routing matrix, per-namespace recall, and per-scenario naïve-baseline deltas.
  • Add new tooling/scripts and reports: sticky benchmark delta renderer, ScoringConfig sweep runner + report, gold-set expansion helper, and updated scorecard renderer/markdown outputs.
  • Add CI workflows for weekly scorecard regeneration PRs and per-PR sticky delta comments; update docs/README accordingly.

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| tests/test_utils.py | Adds tokenizer sub-token + tokenize_list tests |
| tests/test_sweep_scoring.py | Adds sweep tool grid/report/pareto tests |
| tests/test_render_scorecard.py | Adds tests for new scorecard sections/validation |
| tests/test_benchmark.py | Adds tests for matrix mode + helpers |
| tests/test_benchmark_delta.py | Adds tests for delta renderer + threshold sync |
| tests/test_baseline_naive.py | Adds tests for naïve-baseline computations |
| src/contextweaver/_utils.py | Implements namespace-aware tokenization + tokenize_list |
| scripts/sweep_scoring.py | Implements ScoringConfig grid sweep + report |
| scripts/render_scorecard.py | Renders matrix/per-namespace/naïve sections |
| scripts/benchmark_delta.py | Renders sticky PR comment benchmark deltas |
| scripts/baseline_naive.py | Computes naïve concat token/coverage deltas |
| scripts/_gold_expansion.py | One-shot generator for expanded gold set |
| README.md | Replaces illustrative cost claim with measured footnote |
| Makefile | Adds benchmark-matrix and sweep-scoring targets |
| llms-full.txt | Regenerated docs bundle reflecting README/workflow updates |
| docs/benchmarks.md | Documents new benchmark surfaces + CI integration |
| docs/agent-context/workflows.md | Documents new make targets |
| CHANGELOG.md | Documents new eval features + schema bump |
| benchmarks/sweep_scoring.md | Adds generated sweep report |
| benchmarks/scorecard.md | Regenerates scorecard with new sections |
| benchmarks/routing_gold.json | Expands gold set to 200 + adds namespace field |
| benchmarks/results/latest.json | Regenerates baseline output with v1.1 schema |
| benchmarks/README.md | Updates benchmark methodology + matrix + naïve columns |
| benchmarks/benchmark.py | Adds matrix mode, per-namespace, naïve delta emission |
| AGENTS.md | Documents new make targets |
| .github/workflows/scorecard-weekly.yml | Weekly scheduled regen workflow (PR on drift) |
| .github/workflows/benchmark-delta.yml | Per-PR delta workflow posting sticky comment |
| .github/pull_request_template.md | Adds optional reproducibility section |

Comment on lines 165 to +205
@@ -174,9 +180,12 @@ def tokenize_list(text: str) -> list[str]:
 
     Steps:
     1. Lower-case the input.
-    2. Split on one or more non-word characters.
-    3. Discard tokens shorter than 2 characters.
-    4. Remove :data:`STOPWORDS`.
+    2. Scan word-like runs that may include internal ``.``/``_``/``-``/``/``
+       (so dotted / snake_case ids stay intact).
+    3. Strip surrounding delimiter characters from each run.
+    4. Emit the joined form (length ≥ 2 and not a STOPWORD).
+    5. If the run still has internal delimiters, also emit each sub-token
+       — again filtered by length ≥ 2 and STOPWORD membership.
 
     Args:
         text: Raw input string.
@@ -185,8 +194,16 @@ def tokenize_list(text: str) -> list[str]:
         A ``list[str]`` of normalised, stop-word-filtered tokens in
         occurrence order. Duplicate tokens are preserved.
     """
-    tokens = _SPLIT_RE.split(text.lower())
-    return [t for t in tokens if len(t) >= 2 and t not in STOPWORDS]
+    out: list[str] = []
+    for raw in _SURFACE_RE.findall(text.lower()):
+        stripped = raw.strip(_DELIM_CHARS)
+        if len(stripped) >= 2 and stripped not in STOPWORDS:
+            out.append(stripped)
+        if _SUB_SPLIT_RE.search(stripped):
+            for sub in _SUB_SPLIT_RE.split(stripped):
+                if len(sub) >= 2 and sub not in STOPWORDS:
+                    out.append(sub)
Owner Author

Verified false positive: _DELIM_CHARS = "._-/" already contains _ (4 chars: ., _, -, /). A behavioural test confirms that tokenize_list("_foo_") returns ["foo"] (leading and trailing underscores are stripped correctly). No code change needed.

Resolving thread.
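
For readers who want to reproduce the `_foo_` check, here is a self-contained sketch of the new tokenizer body. `_DELIM_CHARS` matches the value quoted in this reply; the two regexes, the tiny `STOPWORDS` set and the final `return out` are assumptions, since the diff excerpt only shows how those names are used and stops before the end of the function.

```python
import re

_DELIM_CHARS = "._-/"                            # value quoted in the reply above
_SURFACE_RE = re.compile(r"[a-z0-9._/-]+")       # assumed pattern
_SUB_SPLIT_RE = re.compile(r"[._/-]+")           # assumed pattern
STOPWORDS = frozenset({"the", "a", "an", "of"})  # illustrative subset only


def tokenize_list(text: str) -> list[str]:
    """Emit each id in joined form, then its delimiter-separated parts."""
    out: list[str] = []
    for raw in _SURFACE_RE.findall(text.lower()):
        stripped = raw.strip(_DELIM_CHARS)
        if len(stripped) >= 2 and stripped not in STOPWORDS:
            out.append(stripped)
        if _SUB_SPLIT_RE.search(stripped):
            for sub in _SUB_SPLIT_RE.split(stripped):
                if len(sub) >= 2 and sub not in STOPWORDS:
                    out.append(sub)
    return out


tokens = tokenize_list("call billing.invoice_create")
```

With these assumed patterns, a dotted id contributes both its joined form and its parts, so a query for `invoice` can still match `billing.invoice_create`.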


Generated by Claude Code

Comment thread benchmarks/README.md Outdated
Comment on lines +73 to +75
/ 1000) plus **per-namespace recall@5** under `routing.matrix` and
`routing.per_namespace`. The legacy single-backend rows (50 / 83 / 1000
on `tfidf`) are preserved verbatim for back-compat.
Owner Author

Good catch — fixed. README now says the matrix and per-namespace data live at top-level matrix and per_namespace in results/latest.json, matching what the harness actually writes and what the scorecard renderer reads.

Addressed in 7afea33.


Generated by Claude Code

Comment thread CHANGELOG.md Outdated
Comment on lines +23 to +25
`namespace` field. The benchmark aggregates recall@k into
`routing.per_namespace` keyed by backend so the scorecard can show
which namespaces are weakest on each scorer.
Owner Author

Good catch — fixed. Changelog now refers to a top-level per_namespace object (alongside top-level matrix), matching the actual schema.

Addressed in 7afea33.


Generated by Claude Code

Comment on lines +174 to +184
base_matrix = list(base.get("matrix", []))
head_matrix = list(head.get("matrix", []))
if base_matrix or head_matrix:
    # The matrix uses the same row shape as the routing table — but with
    # a non-default backend key — so the same renderer handles both.
    parts += [
        "### Matrix (#208)",
        "",
        _routing_delta_table(base_matrix, head_matrix, k),
        "",
    ]
Owner Author

Good catch — fixed. _routing_delta_table now detects a non-empty status on either the base or head row and renders a _skipped_ / _skipped_ / _skipped_ | <status text> row, suppressing the latency-budget marker. Added two regression tests in tests/test_benchmark_delta.py (test_matrix_row_with_status_rendered_as_skipped covers the head-skipped case; test_matrix_row_with_status_on_base_only covers the inverse).

Addressed in 7afea33.
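
The skipped-row behaviour described in this reply might look roughly like the following. The row keys (`backend`, `size`, `recall`) and the exact cell layout are hypothetical; only the rule that a non-empty `status` on either side yields `_skipped_` cells plus the reason text, with no latency marker, comes from the reply.

```python
def render_row(base: dict, head: dict) -> str:
    """Render one markdown table row comparing a base and a head matrix cell."""
    backend = head.get("backend", base.get("backend", "?"))
    size = head.get("size", base.get("size", "?"))
    label = f"{backend} @ {size}"
    status = head.get("status") or base.get("status") or ""
    if status:
        # Either side was skipped: show the reason, never a fake regression.
        return f"| {label} | _skipped_ | _skipped_ | _skipped_ | {status} |"
    delta = head["recall"] - base["recall"]
    return f"| {label} | {base['recall']:.3f} | {head['recall']:.3f} | {delta:+.3f} | |"


skipped = render_row(
    {"backend": "fuzzy", "size": 100, "recall": 0.8},
    {"backend": "fuzzy", "size": 100, "status": "skipped: rapidfuzz not installed"},
)
```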


Generated by Claude Code

Comment thread benchmarks/benchmark.py
Comment on lines +130 to 140
def _backend_available(backend: str) -> tuple[bool, str]:
    """Report whether *backend* can run in this environment.

    Returns ``(True, "")`` if available, else ``(False, reason)``. The
    reason is recorded into ``latest.json`` so the scorecard can render
    a skipped-row marker instead of silently omitting cells.
    """
    if backend == "fuzzy" and FuzzyScorer is None:
        return False, "skipped: rapidfuzz not installed (contextweaver[retrieval] extra)"
    return True, ""

Owner Author

Good catch — fixed. _parse_args now validates args.backends against _BACKENDS_DEFAULT after parsing and calls parser.error() (exit 2) with a clear message listing the supported set. Added two tests in tests/test_benchmark.py: test_parse_args_rejects_unknown_backends asserts SystemExit(2) on --backends tfidf,bogus; test_parse_args_accepts_supported_backends pins the happy path.

Addressed in 7afea33.
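
A minimal sketch of the validation pattern this reply describes: parse first, then reject unknown names through `parser.error()`, which prints the usage line and exits with status 2. The `SUPPORTED_BACKENDS` name and the message wording are illustrative; the real code validates against `_BACKENDS_DEFAULT`.

```python
import argparse

SUPPORTED_BACKENDS = ("tfidf", "bm25", "fuzzy")  # set named in the follow-up commit


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse a comma-separated --backends list and reject unknown names."""
    parser = argparse.ArgumentParser(prog="benchmark.py")
    parser.add_argument("--backends", default=",".join(SUPPORTED_BACKENDS))
    args = parser.parse_args(argv)
    args.backends = [name.strip() for name in args.backends.split(",") if name.strip()]
    unknown = [name for name in args.backends if name not in SUPPORTED_BACKENDS]
    if unknown:
        # parser.error() raises SystemExit(2) after printing the message.
        parser.error(
            f"unknown backend(s): {', '.join(unknown)}; "
            f"supported: {', '.join(SUPPORTED_BACKENDS)}"
        )
    return args
```

Failing at parse time gives a one-line usage error on a typo instead of a traceback from deeper in the run.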


Generated by Claude Code

Comment thread scripts/sweep_scoring.py
Comment on lines +1 to +20
#!/usr/bin/env python3
"""ScoringConfig weight sweep + measured-defaults report (issue #214).

Grid-searches the ``ScoringConfig`` weight space against the committed
benchmark scenarios and writes a deterministic markdown report to
``benchmarks/sweep_scoring.md`` showing:

* The composite score and per-axis metrics for every configuration.
* The rank of the current default within the swept grid.
* Any configurations that Pareto-dominate the default (lower or equal on
  every axis we care about, strictly better on at least one) — surfaced
  as candidates for a separate, deliberately-scoped follow-up issue.

Per the issue spec, this script does **not** change ``ScoringConfig``
defaults. The report is the deliverable; defaults remain the
maintainer's call.

Composite formula (also documented in the report header)::

    composite = 0.50 * coverage_pct_avg
Owner Author

Non-actionable in this PR. The 410-line size was anticipated in the PR description's "Tradeoffs / risks" section — scripts/ has existing precedent above the 300-line guideline (render_scorecard.py = 526 lines), and the maintainer's "delivery over splitting" preference applies. The AGENTS.md guideline targets src/ modules with documented exemptions for types.py, envelope.py, __main__.py; extending the policy to scripts/ is a docs change that belongs in a separate PR.

Resolving thread.


Generated by Claude Code

Comment thread scripts/sweep_scoring.py Outdated
Comment on lines +324 to +325
if default_row is not None and not default_row.is_default:
    pass  # unreachable — default_row is found by is_default
Owner Author

Good catch — fixed. Removed the unreachable if default_row is not None and not default_row.is_default: pass branch from render_report() (default_row is selected by is_default upstream, so the negation can never trigger).

Addressed in 7afea33.


Generated by Claude Code

Comment thread benchmarks/benchmark.py Outdated
Comment on lines +471 to +476
naive_delta: dict[str, float] | None = None
if with_naive:
    delta = compute_naive_delta(events=events, pack=pack, cw_tokens=prompt_toks)
    naive_delta = {
        "naive_tokens": float(delta.naive_tokens),
        "cw_tokens": float(delta.cw_tokens),
Owner Author

Good catch — fixed. _collect_context_stats now writes int(delta.naive_tokens) / int(delta.cw_tokens) into the naive_delta block, matching the NaiveDelta dataclass field types (int). The ContextStats.naive_delta annotation is broadened to dict[str, float | int] | None to reflect the mixed-type payload (counts as int, percentages as float). The renderer's defensive int(float(...)) continues to accept legacy float values in pre-existing latest.json files.

Addressed in 7afea33.


Generated by Claude Code

- benchmark_delta.py: render status-bearing matrix rows as "skipped"
  with the reason text instead of treating zeroed metrics as a real
  regression (Copilot #4). +2 tests.
- benchmark.py: validate --backends against the supported set
  (tfidf, bm25, fuzzy) and exit with argparse-style code 2 on typos
  rather than a ConfigError traceback from Router init (Copilot #5).
  +2 tests.
- sweep_scoring.py: drop the unreachable
  `if default_row is not None and not default_row.is_default: pass`
  branch in render_report() (Copilot #7).
- benchmark.py: emit naive_delta.naive_tokens / cw_tokens as ints
  (counts), matching the NaiveDelta dataclass and avoiding the
  int(float(...)) coercion in renderers (Copilot #8).
- README.md / CHANGELOG.md: matrix and per_namespace live at the
  top of latest.json, not nested under routing (Copilot #2, #3).

Comments #1 (_DELIM_CHARS missing _) and #6 (sweep_scoring.py size)
are non-actionable — #1 is a false positive (underscore is already
present; verified by behavioural test on _foo_) and #6 follows the
existing scripts/ precedent (render_scorecard.py = 526 lines)
acknowledged in the PR description.

make ci: all 6 targets clean. make scorecard-check + make llms-check
also clean.
Owner Author

dgenio commented May 17, 2026

Closing as a duplicate of #228

PR #228 (already merged into main at 813aaef's parent) implemented the same issue group (#207, #208, #209, #211, #213, #214, #215) that this PR targets. Both branches were authored independently and the implementations differ in detail (different file additions for scripts/benchmark_delta.py, scripts/sweep_scoring.py, scripts/baseline_naive.py, benchmarks/sweep_scoring.md; both expanded benchmarks/routing_gold.json with different query sets; both reworked src/contextweaver/_utils.py's tokenizer).

A rebase onto current main surfaces 12+ files in deep conflict — resolving them would be equivalent to rewriting this PR from scratch to match main's already-shipped implementation.

What was rescued

The Phase-1 review-fix commit on this PR (7afea33) contained six small Copilot-suggested fixes that the maintainer is welcome to keep. Re-applied them against current main as a fresh PR — #NEW_PR (see below). Two of the six fixes still apply (the other four were already correct on main via #228):

The other four (int casts in baseline_naive, dead branch in sweep_scoring, top-level matrix/per_namespace doc location, _DELIM_CHARS) are already correct on main via #228's implementation.

Closing this PR. Tracking the rescued work via the new branch claude/pr-229-followup-cherry-picks.


Generated by Claude Code

Owner Author

dgenio commented May 17, 2026

(Edit: the rescue PR is #235.)


Generated by Claude Code

@dgenio dgenio closed this May 17, 2026