test(preflight-audit): replay-mode eval fixture for the classifier #423
Merged
…ifier rule

PR apache#418 shipped the preflight-audit CLI with a replay mode but no fixture exercising the full classifier end-to-end. This adds:

- `tests/fixtures/synthetic_workspace_sweep.json`: a 12-issue GraphQL response, one issue per rule path (Rule 1 dispatch, Rule 1-yields-then-Rule-7, Rule 2 dispatch-urgent, Rules 3-7 skip-noop, GitHub-App bot login, personal-bot needing override, fall-through dispatch, recently-closed dispatch). Each issue node carries a `_purpose` annotation documenting which rule it should land on.
- `tests/test_eval_replay.py`: drives `classify_response` against the fixture with a pinned `now` (2026-06-01T12:00:00Z) and asserts:
  1. The full per-decision bucket distribution (positional identifiers per bucket).
  2. The same distribution under `extra_bot_logins`: one issue migrates from dispatch to skip-noop with the override.
  3. Per-issue assertions with reason-substring matches, keeping the fixture's `_purpose` annotations in lock-step with the classifier behaviour.
  4. A skip-rate floor (≥30%) matching the real-world target after apache#416's rule tuning.

A rule change that alters the distribution fails one of the asserts; the diff in the failing assertion tells the reviewer how the rule affects coverage before they ever look at real adopter data. The eval is deterministic (no live `gh` calls, fixed `now`) so CI runs it in milliseconds.

This closes the tune-then-verify loop one more rung up: PR apache#416 used a one-off `/tmp/` script, PR apache#418 promoted it to a CLI, and this PR locks the rule behaviour into the test suite.
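The bucket-distribution and override checks described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the project's code: `toy_classify`, the three-issue `FIXTURE`, its field names, and the 30-day staleness rule are all hypothetical stand-ins, since the real `classify_response` signature and fixture schema are not shown here. Only the pinned `now` value and the `extra_bot_logins` name come from the text above.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Pinned clock, matching the value quoted in the commit message.
NOW = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)

# Hypothetical three-issue fixture; the real file has 12 nodes and its own schema.
FIXTURE = [
    {"number": 1, "author": "alice", "updatedAt": "2026-05-31T12:00:00Z",
     "_purpose": "fall-through dispatch"},
    {"number": 2, "author": "alice", "updatedAt": "2026-01-01T12:00:00Z",
     "_purpose": "stale issue, skip-noop"},
    {"number": 3, "author": "my-personal-bot", "updatedAt": "2026-05-31T12:00:00Z",
     "_purpose": "personal bot needing override"},
]


def toy_classify(issue, now, extra_bot_logins=()):
    """Stand-in rules (NOT the project's): configured bots and stale issues are skipped."""
    if issue["author"] in extra_bot_logins:
        return "skip-noop", "author is a configured bot"
    updated = datetime.fromisoformat(issue["updatedAt"].replace("Z", "+00:00"))
    if now - updated > timedelta(days=30):
        return "skip-noop", "no activity for 30 days"
    return "dispatch", "fall-through"


def bucket(issues, now, **kwargs):
    """Group issue numbers by decision, preserving fixture order within each bucket."""
    out = defaultdict(list)
    for issue in issues:
        decision, _reason = toy_classify(issue, now, **kwargs)
        out[decision].append(issue["number"])
    return dict(out)


baseline = bucket(FIXTURE, NOW)
with_override = bucket(FIXTURE, NOW, extra_bot_logins=("my-personal-bot",))

# Full-distribution asserts: any rule change that moves an issue fails here.
assert baseline == {"dispatch": [1, 3], "skip-noop": [2]}
# With the override, exactly one issue (3) migrates from dispatch to skip-noop.
assert with_override == {"dispatch": [1], "skip-noop": [2, 3]}
```

Asserting on the whole bucket map, rather than on counts, means a failing diff names the specific issue that moved, which is what lets a reviewer read the effect of a rule change directly from the test output.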
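Summary

#418 shipped the `preflight-audit` CLI with a replay mode but no fixture exercising the full classifier end-to-end. This adds the eval-fixture pattern the tool's README promised.

What

- `tests/fixtures/synthetic_workspace_sweep.json`: a 12-issue GraphQL response, one issue per rule path (Rule 1 dispatch, Rule 1-yields-then-Rule-7, Rule 2 dispatch-urgent, Rules 3-7 skip-noop, GitHub-App bot login, personal-bot needing override, fall-through dispatch, recently-closed dispatch). Each issue node carries a `_purpose` annotation so the fixture documents its own intent.
- `tests/test_eval_replay.py`: drives `classify_response` against the fixture with a pinned `now` (2026-06-01T12:00:00Z) and asserts:
  - the full per-decision bucket distribution
  - the same distribution under `extra_bot_logins` (one issue migrates from `dispatch` → `skip-noop`)
  - per-issue decisions with reason-substring matches
  - a skip-rate floor (≥30%)

Why this matters

A rule change that alters the distribution fails one of the asserts; the diff tells the reviewer how the rule affects coverage before they touch any real adopter data. The eval is deterministic (no live `gh` calls, fixed `now`) so CI runs it in milliseconds.

This closes the tune-then-verify loop one more rung up:

- #416 used a one-off `/tmp/` script
- #418 promoted it to a CLI
- this PR locks the rule behaviour into the test suite

Test plan

- `prek run --all-files` green

🤖 Generated with Claude Code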
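For reference, the skip-rate floor mentioned in this PR could be checked along the following lines. This is a hedged sketch under stated assumptions: `skip_rate`, the convention that any bucket whose name starts with `skip` counts as skipped, and the example distribution are all hypothetical; only the 30% threshold and the `dispatch` / `skip-noop` bucket names come from the description.

```python
SKIP_RATE_FLOOR = 0.30  # the >=30% floor quoted in the PR description


def skip_rate(buckets):
    """Fraction of classified issues that landed in a skip bucket.

    Assumption: every bucket whose name starts with "skip" counts as skipped.
    """
    total = sum(len(ids) for ids in buckets.values())
    if total == 0:
        raise ValueError("no classified issues to compute a skip rate from")
    skipped = sum(len(ids) for name, ids in buckets.items() if name.startswith("skip"))
    return skipped / total


# Illustrative 12-issue distribution only; NOT the fixture's real one.
example = {
    "dispatch": [1, 2, 3, 4, 5, 6, 7],
    "skip-noop": [8, 9, 10, 11, 12],
}

rate = skip_rate(example)
assert rate >= SKIP_RATE_FLOOR, f"skip rate {rate:.0%} fell below the {SKIP_RATE_FLOOR:.0%} floor"
```

A floor, as opposed to an exact rate, leaves room for adding fixture issues without rewriting the threshold, while still failing if a rule change makes the classifier dispatch far more than intended.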