fix(benchmarks,scripts): rescue 2 Copilot fixes from duplicate PR #229 by dgenio · Pull Request #235 · dgenio/contextweaver

dgenio · 2026-05-17T06:10:00Z

Summary

Re-applies the two Copilot-suggested fixes from PR #229's review-fix commit that remained applicable on top of main after the duplicate PR #228 merged. The other four Copilot-suggested fixes from #229 turned out to already be correct on main via #228's implementation.

PR #229 itself is a structural duplicate of merged PR #228 (both branches independently implemented the same #207-#215 issue group). A direct rebase surfaces 12+ files in deep conflict because the same logical work was implemented differently on each branch. Recommend closing #229 as a duplicate (see the comment posted there).

Changes

1. benchmarks/benchmark.py — --backends typo validation (PR #229 Copilot #5)

Typos like --backends tfidf,tifdf previously reached Router(scorer_backend=) and surfaced as a ConfigError traceback. Now _parse_args validates against _SUPPORTED_BACKENDS = {tfidf, bm25, fuzzy} and exits via parser.error() with argparse's standard code-2 convention.

Tests added in tests/test_benchmark.py:

test_parse_args_rejects_unknown_backend — code-2 exit on typo
test_parse_args_accepts_all_supported_backends
test_parse_args_accepts_subset_of_backends

2. scripts/benchmark_delta.py — skipped-row rendering (PR #229 Copilot #4)

When benchmarks/benchmark.py --matrix emits a (backend, size) cell with status: "skipped: rapidfuzz not installed" (zeroed metrics by design), the previous delta script treated those zeroes as a real accuracy/latency regression and emitted ⚠️ markers on every PR. Now _render_matrix_section short-circuits on a non-empty status and emits a single _skipped_ (<reason>) row.

Tests added in new tests/test_benchmark_delta.py:

test_matrix_delta_renders_skipped_row_for_status_bearing_cell — pins no ⚠️ on skipped cells
test_matrix_delta_real_cells_still_render_metrics — status-less cells unchanged
test_matrix_delta_empty_payload_returns_empty_string — empty-input safety

What did NOT need re-applying (already correct on main via #228)

PR #229 Copilot #	Concern	Already-on-main status
#8	int casts on `naive_tokens` / `cw_tokens`	`scripts/baseline_naive.py:141-142` already emits ints
#7	dead `if default_row is not None and not default_row.is_default: pass` branch	Main's `sweep_scoring.py` never had it
#2 / #3	matrix/per_namespace location in README + CHANGELOG	Main's `benchmarks/README.md:136` already says "additive top-level keys"
#1	`_DELIM_CHARS` missing underscore	Was a false positive in PR #229 (acknowledged in the original review-fix commit)
#6	`sweep_scoring.py` size	Acknowledged in PR #229 as following the existing `scripts/` precedent

Verification

ruff format src/ tests/ examples/ scripts/   →  clean
ruff check  src/ tests/ examples/ scripts/   →  All checks passed
mypy src/                                    →  Success: 0 issues / 72 source files
pytest -q                                    →  1011 passed, 5 skipped (+ 6 new regression tests)
make example                                 →  clean
make demo                                    →  Demo complete
make scorecard-check                         →  clean

Closes the rescue-path for the abandoned PR feat(eval): scorecard maturation — matrix + gold expansion + naive + sweep + CI (#207, #208, #209, #211, #213, #214, #215) #229's review-fix commit (7afea33)
Recommend closing PR feat(eval): scorecard maturation — matrix + gold expansion + naive + sweep + CI (#207, #208, #209, #211, #213, #214, #215) #229 as a duplicate of merged PR docs+bench: ship 5-issue user-facing surface bundle (#181, #197, #198, #199, #200) #228

Generated by Claude Code

PR #229 was a duplicate of merged PR #228 — both branches independently implemented the same #207-#215 issue group, making a direct rebase unworkable (12+ files in deep conflict). This commit re-applies the two of the six Copilot-suggested fixes from PR #229's review-fix commit that remained applicable on top of main's #228-merged implementation; the other four (int casts in baseline_naive, dead branch in sweep_scoring, top-level matrix/per_namespace doc location in README + CHANGELOG) are already correct on main. 1. **benchmark.py --backends validation** (PR #229 Copilot #5). Typos like `--backends tfidf,tifdf` previously reached `Router(scorer_backend=)` and surfaced as a `ConfigError` traceback. Now `_parse_args` validates against `_SUPPORTED_BACKENDS = {tfidf, bm25, fuzzy}` and exits via `parser.error()` with argparse's standard code-2 convention. Tests: `test_parse_args_rejects_unknown_backend`, `test_parse_args_accepts_all_supported_backends`, `test_parse_args_accepts_subset_of_backends`. 2. **benchmark_delta.py skipped-row rendering** (PR #229 Copilot #4). When `benchmarks/benchmark.py --matrix` emits a `(backend, size)` cell with `status: "skipped: rapidfuzz not installed"` (zeroed metrics by design), the previous delta script treated those zeroes as a real accuracy/latency regression and rendered ⚠️ markers on every PR. Now `_render_matrix_section` short-circuits on a non-empty `status` and emits a single `_skipped_ (<reason>)` row. Tests covering both the skipped-row path and the unchanged status-less path live in the new `tests/test_benchmark_delta.py`. This is the surviving piece of PR #229's PR-review-cycle work. PR #229 itself should be closed as a structural duplicate; this commit lands on a fresh branch (`claude/pr-229-followup-cherry-picks`) off current main. Verification: - ruff format/lint clean - mypy strict — 0 issues / 72 source files - pytest -q — 1011 passed, 5 skipped - make example + make demo clean - make scorecard-check clean

Copilot

Pull request overview

Re-applies two fixes from a duplicate PR: (1) validate benchmarks/benchmark.py --backends early via argparse conventions, and (2) adjust scripts/benchmark_delta.py matrix rendering to avoid treating “skipped” cells as regressions, with accompanying regression tests.

Changes:

Add _SUPPORTED_BACKENDS validation in benchmarks/benchmark.py::_parse_args() to fail fast on unknown backend names via parser.error() (exit code 2).
Update scripts/benchmark_delta.py::_render_matrix_section() to short-circuit rendering when a matrix cell carries a status.
Add/extend tests for _parse_args() and the matrix-delta “skipped cell” behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
`benchmarks/benchmark.py`	Adds supported-backend allowlist and argparse-level validation for `--backends`.
`scripts/benchmark_delta.py`	Attempts to suppress false-positive regressions by treating `status`-bearing cells as skipped.
`tests/test_benchmark.py`	Adds regression tests for `_parse_args()` backend validation behavior.
`tests/test_benchmark_delta.py`	Adds regression tests for skipped-cell rendering in matrix delta output.

dgenio · 2026-05-17T06:14:33Z

+        # Skip-cells carry zeroed metrics by design (e.g. fuzzy with no
+        # rapidfuzz). Treating them as accuracy/latency regressions would
+        # produce false-positive ⚠️ markers on every PR; emit a single
+        # "skipped" row that surfaces the reason text instead.
+        head_status = str(h.get("status", ""))
+        base_status = str(b.get("status", ""))
+        if head_status or base_status:
+            reason = head_status or base_status
+            lines.append(f"| {backend} | {size} | _skipped_ ({reason}) | — | — |")
+            continue


Confirmed and fixed in fad6462. You were exactly right — MatrixCell.status: str = "ok" (benchmark.py:184) so my if head_status or base_status: gate fired on every real cell. The github-actions benchmark-delta bot caught it live on this PR: every (backend, size) cell rendered as _skipped_ (ok) instead of its metrics.

The gate is now status != "ok", aligned with the existing convention in scripts/render_scorecard.py:150-151. Missing-status cells are treated as "ok" for backwards compat.

On the "availability improvements" suggestion (base-skipped/head-ok): I kept the simpler "either side is non-ok → render as skipped" treatment for now. A head-only metric without a base would show as recall@k 0.45 (+0.45) which is misleading (not a real delta) — surfacing the more nuanced rendering is a worthwhile but separate improvement. Happy to follow up if you'd like that in this PR.

Resolving thread.

Generated by Claude Code

dgenio · 2026-05-17T06:14:36Z

+def test_matrix_delta_real_cells_still_render_metrics() -> None:
+    """Status-less cells still produce a full recall/MRR/latency row."""
+    base = _payload(
+        [
+            {
+                "backend": "tfidf",
+                "catalog_size": 100,
+                "recall_at_k": 0.5,
+                "mrr": 0.4,
+                "latency_ms_p99": 1.0,
+            }
+        ]
+    )
+    head = _payload(
+        [
+            {
+                "backend": "tfidf",
+                "catalog_size": 100,
+                "recall_at_k": 0.5,
+                "mrr": 0.4,
+                "latency_ms_p99": 1.0,
+            }
+        ]
+    )
+    out = _render_matrix_section(base, head)
+    # Numeric markers present, no skipped marker, no regression marker.
+    assert "_skipped_" not in out
+    assert "0.5000" in out
+


Fixed in fad6462. The fixtures now explicitly carry status: "ok":

test_matrix_delta_ok_cells_render_metrics (renamed from the misleading _real_cells_still_render_metrics) — sets status: "ok" on both base and head and asserts metrics render (0.5000 present, no _skipped_). This is exactly the regression test that would have caught my original bug.

test_matrix_delta_missing_status_treated_as_ok — backwards-compat for cells with no status field at all.

Verified locally: pytest -q now reports 1012 passed (one more test than before the fix).

Resolving thread.

Generated by Claude Code

github-actions · 2026-05-17T06:13:16Z

Benchmark delta (vs `main`)

Soft regression feedback only — this comment never blocks the PR.
Latency budget: ⚠️ when head > base × 1.3. Accuracy budget: ⚠️ when head < base - 1pp.

Routing summary (single backend × catalog sizes)

size	recall@k (head Δ vs base)	MRR (head Δ vs base)	p99 (ms)
50	✅ 0.5649 (+0.0000)	✅ 0.4978 (+0.0000)	✅ 0.499 (base 0.463)
83	✅ 0.3825 (+0.0000)	✅ 0.3242 (+0.0000)	⚠️ 1.179 (base 0.876)
1000	✅ 0.1475 (+0.0000)	✅ 0.1456 (+0.0000)	✅ 39.815 (base 31.897)

Per-backend × per-size matrix

backend	size	recall@k (Δ)	MRR (Δ)	p99 (ms)
bm25	100	✅ 0.3825 (+0.0000)	✅ 0.3399 (+0.0000)	✅ 5.896 (base 5.642)
bm25	500	✅ 0.2250 (+0.0000)	✅ 0.2165 (+0.0000)	✅ 27.958 (base 27.538)
bm25	1000	✅ 0.1575 (+0.0000)	✅ 0.1525 (+0.0000)	✅ 86.709 (base 78.368)
fuzzy	100	skipped (skipped: missing rapidfuzz)	—	—
fuzzy	500	skipped (skipped: missing rapidfuzz)	—	—
fuzzy	1000	skipped (skipped: missing rapidfuzz)	—	—
tfidf	100	✅ 0.3825 (+0.0000)	✅ 0.3220 (+0.0000)	✅ 0.991 (base 0.872)
tfidf	500	✅ 0.2325 (+0.0000)	✅ 0.2314 (+0.0000)	✅ 8.804 (base 8.660)
tfidf	1000	✅ 0.1475 (+0.0000)	✅ 0.1456 (+0.0000)	✅ 33.536 (base 30.071)

Context pipeline (per scenario)

scenario	tokens	dropped	dedup
large_catalog	1514 (base 1514, Δ+0)	0 (base 0, Δ+0)	0 (base 0, Δ+0)
long_conversation	2548 (base 2548, Δ+0)	0 (base 0, Δ+0)	0 (base 0, Δ+0)
short_conversation	496 (base 496, Δ+0)	0 (base 0, Δ+0)	0 (base 0, Δ+0)
stress_conversation	6651 (base 6651, Δ+0)	7 (base 7, Δ+0)	4 (base 4, Δ+0)

Numbers come from make benchmark / make benchmark-matrix.
Latency is hardware-dependent — treat the markers as a rough guide.
See benchmarks/scorecard.md for the full picture.

Addresses both Copilot inline comments on PR #235. The previous fix used `if head_status or base_status:` to detect skipped matrix cells, but `benchmarks/benchmark.py::MatrixCell.status` defaults to `"ok"` for every real cell — making `"ok"` truthy and causing the delta script to render `_skipped_ (ok)` for every backend/size cell. The benchmark-delta bot comment on this PR caught the bug live: all 9 real cells (bm25/fuzzy/tfidf × 100/500/1000) were marked as skipped. The gate is now `status != "ok"`, mirroring the existing convention in `scripts/render_scorecard.py:150-151`. Cells with no `status` field are treated as "ok" for backwards compat. Test updates: - `test_matrix_delta_ok_cells_render_metrics` (renamed from `test_matrix_delta_real_cells_still_render_metrics`) now explicitly sets `status: "ok"` on the fixtures and asserts metrics render — this is the regression test that would have caught the original bug. - `test_matrix_delta_missing_status_treated_as_ok` added for backwards compatibility on cells that omit `status` entirely. Verification: - ruff format/lint/mypy clean - pytest -q: 1012 passed, 5 skipped (one more test than before the fix)

Copilot AI review requested due to automatic review settings May 17, 2026 06:10

dgenio mentioned this pull request May 17, 2026

feat(eval): scorecard maturation — matrix + gold expansion + naive + sweep + CI (#207, #208, #209, #211, #213, #214, #215) #229

Closed

Copilot started reviewing on behalf of dgenio May 17, 2026 06:10 View session

Copilot AI reviewed May 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(benchmarks,scripts): rescue 2 Copilot fixes from duplicate PR #229#235

fix(benchmarks,scripts): rescue 2 Copilot fixes from duplicate PR #229#235
dgenio wants to merge 2 commits into
mainfrom
claude/pr-229-followup-cherry-picks

dgenio commented May 17, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

dgenio May 17, 2026

Uh oh!

dgenio May 17, 2026

Uh oh!

github-actions Bot commented May 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

dgenio commented May 17, 2026

Summary

Changes

What did NOT need re-applying (already correct on main via #228)

Verification

Related

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

dgenio May 17, 2026

Choose a reason for hiding this comment

Uh oh!

dgenio May 17, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark delta (vs main)

Routing summary (single backend × catalog sizes)

Per-backend × per-size matrix

Context pipeline (per scenario)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

github-actions Bot commented May 17, 2026 •

edited

Loading

Benchmark delta (vs `main`)