fix(benchmarks,scripts): rescue 2 Copilot fixes from duplicate PR #229#235
fix(benchmarks,scripts): rescue 2 Copilot fixes from duplicate PR #229#235dgenio wants to merge 2 commits into
Conversation
PR #229 was a duplicate of merged PR #228 — both branches independently implemented the same #207-#215 issue group, making a direct rebase unworkable (12+ files in deep conflict). This commit re-applies the two of the six Copilot-suggested fixes from PR #229's review-fix commit that remained applicable on top of main's #228-merged implementation; the other four (int casts in baseline_naive, dead branch in sweep_scoring, top-level matrix/per_namespace doc location in README + CHANGELOG) are already correct on main. 1. **benchmark.py --backends validation** (PR #229 Copilot #5). Typos like `--backends tfidf,tifdf` previously reached `Router(scorer_backend=)` and surfaced as a `ConfigError` traceback. Now `_parse_args` validates against `_SUPPORTED_BACKENDS = {tfidf, bm25, fuzzy}` and exits via `parser.error()` with argparse's standard code-2 convention. Tests: `test_parse_args_rejects_unknown_backend`, `test_parse_args_accepts_all_supported_backends`, `test_parse_args_accepts_subset_of_backends`. 2. **benchmark_delta.py skipped-row rendering** (PR #229 Copilot #4). When `benchmarks/benchmark.py --matrix` emits a `(backend, size)` cell with `status: "skipped: rapidfuzz not installed"` (zeroed metrics by design), the previous delta script treated those zeroes as a real accuracy/latency regression and rendered⚠️ markers on every PR. Now `_render_matrix_section` short-circuits on a non-empty `status` and emits a single `_skipped_ (<reason>)` row. Tests covering both the skipped-row path and the unchanged status-less path live in the new `tests/test_benchmark_delta.py`. This is the surviving piece of PR #229's PR-review-cycle work. PR #229 itself should be closed as a structural duplicate; this commit lands on a fresh branch (`claude/pr-229-followup-cherry-picks`) off current main. Verification: - ruff format/lint clean - mypy strict — 0 issues / 72 source files - pytest -q — 1011 passed, 5 skipped - make example + make demo clean - make scorecard-check clean
There was a problem hiding this comment.
Pull request overview
Re-applies two fixes from a duplicate PR: (1) validate benchmarks/benchmark.py --backends early via argparse conventions, and (2) adjust scripts/benchmark_delta.py matrix rendering to avoid treating “skipped” cells as regressions, with accompanying regression tests.
Changes:
- Add
_SUPPORTED_BACKENDSvalidation inbenchmarks/benchmark.py::_parse_args()to fail fast on unknown backend names viaparser.error()(exit code 2). - Update
scripts/benchmark_delta.py::_render_matrix_section()to short-circuit rendering when a matrix cell carries astatus. - Add/extend tests for
_parse_args()and the matrix-delta “skipped cell” behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
benchmarks/benchmark.py |
Adds supported-backend allowlist and argparse-level validation for --backends. |
scripts/benchmark_delta.py |
Attempts to suppress false-positive regressions by treating status-bearing cells as skipped. |
tests/test_benchmark.py |
Adds regression tests for _parse_args() backend validation behavior. |
tests/test_benchmark_delta.py |
Adds regression tests for skipped-cell rendering in matrix delta output. |
| # Skip-cells carry zeroed metrics by design (e.g. fuzzy with no | ||
| # rapidfuzz). Treating them as accuracy/latency regressions would | ||
| # produce false-positive ⚠️ markers on every PR; emit a single | ||
| # "skipped" row that surfaces the reason text instead. | ||
| head_status = str(h.get("status", "")) | ||
| base_status = str(b.get("status", "")) | ||
| if head_status or base_status: | ||
| reason = head_status or base_status | ||
| lines.append(f"| {backend} | {size} | _skipped_ ({reason}) | — | — |") | ||
| continue |
There was a problem hiding this comment.
Confirmed and fixed in fad6462. You were exactly right — MatrixCell.status: str = "ok" (benchmark.py:184) so my if head_status or base_status: gate fired on every real cell. The github-actions benchmark-delta bot caught it live on this PR: every (backend, size) cell rendered as _skipped_ (ok) instead of its metrics.
The gate is now status != "ok", aligned with the existing convention in scripts/render_scorecard.py:150-151. Missing-status cells are treated as "ok" for backwards compat.
On the "availability improvements" suggestion (base-skipped/head-ok): I kept the simpler "either side is non-ok → render as skipped" treatment for now. A head-only metric without a base would show as recall@k 0.45 (+0.45) which is misleading (not a real delta) — surfacing the more nuanced rendering is a worthwhile but separate improvement. Happy to follow up if you'd like that in this PR.
Resolving thread.
Generated by Claude Code
| def test_matrix_delta_real_cells_still_render_metrics() -> None: | ||
| """Status-less cells still produce a full recall/MRR/latency row.""" | ||
| base = _payload( | ||
| [ | ||
| { | ||
| "backend": "tfidf", | ||
| "catalog_size": 100, | ||
| "recall_at_k": 0.5, | ||
| "mrr": 0.4, | ||
| "latency_ms_p99": 1.0, | ||
| } | ||
| ] | ||
| ) | ||
| head = _payload( | ||
| [ | ||
| { | ||
| "backend": "tfidf", | ||
| "catalog_size": 100, | ||
| "recall_at_k": 0.5, | ||
| "mrr": 0.4, | ||
| "latency_ms_p99": 1.0, | ||
| } | ||
| ] | ||
| ) | ||
| out = _render_matrix_section(base, head) | ||
| # Numeric markers present, no skipped marker, no regression marker. | ||
| assert "_skipped_" not in out | ||
| assert "0.5000" in out | ||
|
|
There was a problem hiding this comment.
Fixed in fad6462. The fixtures now explicitly carry status: "ok":
test_matrix_delta_ok_cells_render_metrics(renamed from the misleading_real_cells_still_render_metrics) — setsstatus: "ok"on both base and head and asserts metrics render (0.5000present, no_skipped_). This is exactly the regression test that would have caught my original bug.test_matrix_delta_missing_status_treated_as_ok— backwards-compat for cells with nostatusfield at all.
Verified locally: pytest -q now reports 1012 passed (one more test than before the fix).
Resolving thread.
Generated by Claude Code
Benchmark delta (vs
|
| size | recall@k (head Δ vs base) | MRR (head Δ vs base) | p99 (ms) |
|---|---|---|---|
| 50 | ✅ 0.5649 (+0.0000) | ✅ 0.4978 (+0.0000) | ✅ 0.499 (base 0.463) |
| 83 | ✅ 0.3825 (+0.0000) | ✅ 0.3242 (+0.0000) | |
| 1000 | ✅ 0.1475 (+0.0000) | ✅ 0.1456 (+0.0000) | ✅ 39.815 (base 31.897) |
Per-backend × per-size matrix
| backend | size | recall@k (Δ) | MRR (Δ) | p99 (ms) |
|---|---|---|---|---|
| bm25 | 100 | ✅ 0.3825 (+0.0000) | ✅ 0.3399 (+0.0000) | ✅ 5.896 (base 5.642) |
| bm25 | 500 | ✅ 0.2250 (+0.0000) | ✅ 0.2165 (+0.0000) | ✅ 27.958 (base 27.538) |
| bm25 | 1000 | ✅ 0.1575 (+0.0000) | ✅ 0.1525 (+0.0000) | ✅ 86.709 (base 78.368) |
| fuzzy | 100 | skipped (skipped: missing rapidfuzz) | — | — |
| fuzzy | 500 | skipped (skipped: missing rapidfuzz) | — | — |
| fuzzy | 1000 | skipped (skipped: missing rapidfuzz) | — | — |
| tfidf | 100 | ✅ 0.3825 (+0.0000) | ✅ 0.3220 (+0.0000) | ✅ 0.991 (base 0.872) |
| tfidf | 500 | ✅ 0.2325 (+0.0000) | ✅ 0.2314 (+0.0000) | ✅ 8.804 (base 8.660) |
| tfidf | 1000 | ✅ 0.1475 (+0.0000) | ✅ 0.1456 (+0.0000) | ✅ 33.536 (base 30.071) |
Context pipeline (per scenario)
| scenario | tokens | dropped | dedup |
|---|---|---|---|
| large_catalog | 1514 (base 1514, Δ+0) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| long_conversation | 2548 (base 2548, Δ+0) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| short_conversation | 496 (base 496, Δ+0) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| stress_conversation | 6651 (base 6651, Δ+0) | 7 (base 7, Δ+0) | 4 (base 4, Δ+0) |
Numbers come from make benchmark / make benchmark-matrix.
Latency is hardware-dependent — treat the markers as a rough guide.
See benchmarks/scorecard.md for the full picture.
Addresses both Copilot inline comments on PR #235. The previous fix used `if head_status or base_status:` to detect skipped matrix cells, but `benchmarks/benchmark.py::MatrixCell.status` defaults to `"ok"` for every real cell — making `"ok"` truthy and causing the delta script to render `_skipped_ (ok)` for every backend/size cell. The benchmark-delta bot comment on this PR caught the bug live: all 9 real cells (bm25/fuzzy/tfidf × 100/500/1000) were marked as skipped. The gate is now `status != "ok"`, mirroring the existing convention in `scripts/render_scorecard.py:150-151`. Cells with no `status` field are treated as "ok" for backwards compat. Test updates: - `test_matrix_delta_ok_cells_render_metrics` (renamed from `test_matrix_delta_real_cells_still_render_metrics`) now explicitly sets `status: "ok"` on the fixtures and asserts metrics render — this is the regression test that would have caught the original bug. - `test_matrix_delta_missing_status_treated_as_ok` added for backwards compatibility on cells that omit `status` entirely. Verification: - ruff format/lint/mypy clean - pytest -q: 1012 passed, 5 skipped (one more test than before the fix)
Summary
Re-applies the two Copilot-suggested fixes from PR #229's review-fix commit that remained applicable on top of main after the duplicate PR #228 merged. The other four Copilot-suggested fixes from #229 turned out to already be correct on main via #228's implementation.
PR #229 itself is a structural duplicate of merged PR #228 (both branches independently implemented the same #207-#215 issue group). A direct rebase surfaces 12+ files in deep conflict because the same logical work was implemented differently on each branch. Recommend closing #229 as a duplicate (see the comment posted there).
Changes
1.
benchmarks/benchmark.py—--backendstypo validation (PR #229 Copilot #5)Typos like
--backends tfidf,tifdfpreviously reachedRouter(scorer_backend=)and surfaced as aConfigErrortraceback. Now_parse_argsvalidates against_SUPPORTED_BACKENDS = {tfidf, bm25, fuzzy}and exits viaparser.error()with argparse's standard code-2 convention.Tests added in
tests/test_benchmark.py:test_parse_args_rejects_unknown_backend— code-2 exit on typotest_parse_args_accepts_all_supported_backendstest_parse_args_accepts_subset_of_backends2.
scripts/benchmark_delta.py— skipped-row rendering (PR #229 Copilot #4)When⚠️ markers on every PR. Now
benchmarks/benchmark.py --matrixemits a(backend, size)cell withstatus: "skipped: rapidfuzz not installed"(zeroed metrics by design), the previous delta script treated those zeroes as a real accuracy/latency regression and emitted_render_matrix_sectionshort-circuits on a non-emptystatusand emits a single_skipped_ (<reason>)row.Tests added in new
tests/test_benchmark_delta.py:test_matrix_delta_renders_skipped_row_for_status_bearing_cell— pins notest_matrix_delta_real_cells_still_render_metrics— status-less cells unchangedtest_matrix_delta_empty_payload_returns_empty_string— empty-input safetyWhat did NOT need re-applying (already correct on main via #228)
naive_tokens/cw_tokensscripts/baseline_naive.py:141-142already emits intsif default_row is not None and not default_row.is_default: passbranchsweep_scoring.pynever had itbenchmarks/README.md:136already says "additive top-level keys"_DELIM_CHARSmissing underscoresweep_scoring.pysizescripts/precedentVerification
Related
7afea33)Generated by Claude Code