test(preflight-audit): replay-mode eval fixture for the classifier #423

Merged: potiuk merged 1 commit into apache:main from potiuk:feat-preflight-audit-eval-fixture on May 31, 2026
@potiuk potiuk commented May 31, 2026

Summary

#418 shipped the preflight-audit CLI with a replay mode but no fixture exercising the full classifier end-to-end. This adds the eval-fixture pattern the tool's README promised.

What

  • tests/fixtures/synthetic_workspace_sweep.json — 12-issue GraphQL response, one issue per rule path:

    • Rule 1 dispatch (recent human activity)
    • Rule 1 yields → Rule 7 fires (skill-drove-update)
    • Rule 2 dispatch-urgent (non-skill reply <24h after >7d gap)
    • Rules 3–7 skip-noop (post-announce, stale, all-phases-done, awaiting release, awaiting advisory)
    • GitHub-App bot login + personal-bot-needing-override
    • Fall-through dispatch + recently-closed dispatch

    Each issue node carries a _purpose annotation so the fixture documents its own intent.

  • tests/test_eval_replay.py — drives classify_response against the fixture with a pinned now (2026-06-01T12:00:00Z) and asserts:

    1. The full per-decision bucket distribution by issue number
    2. The same distribution under extra_bot_logins (one issue migrates from dispatch to skip-noop)
    3. Per-issue assertions with reason-substring matches
    4. A skip-rate floor (≥30%) matching the real-world target after #416's rule tuning
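
The four assertion styles above can be sketched on a toy scale. This is a minimal illustration, not the code in tests/test_eval_replay.py: the `classify` function, its rules, the 30-day staleness threshold, and the fixture field names (`last_author`, `last_activity`) are all hypothetical stand-ins. Only the pinned `now`, the `_purpose` annotation, the `extra_bot_logins` override, and the 30% floor come from the description.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Pinned clock from the PR description; everything else here is hypothetical.
NOW = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)

# Toy stand-in for the fixture: three issues instead of twelve.
FIXTURE = [
    {"number": 1, "_purpose": "recent human activity",
     "last_author": "alice", "last_activity": "2026-06-01T09:00:00Z"},
    {"number": 2, "_purpose": "stale",
     "last_author": "alice", "last_activity": "2026-01-01T00:00:00Z"},
    {"number": 3, "_purpose": "personal bot needing override",
     "last_author": "my-sync-bot", "last_activity": "2026-06-01T10:00:00Z"},
]


def parse(ts):
    """Parse an ISO-8601 timestamp with a trailing Z into an aware datetime."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def classify(issue, now, extra_bot_logins=frozenset()):
    """Toy rules (NOT the real classifier): return (decision, reason)."""
    author = issue["last_author"]
    if author.endswith("[bot]") or author in extra_bot_logins:
        return "skip-noop", "last activity by bot"
    if now - parse(issue["last_activity"]) > timedelta(days=30):
        return "skip-noop", "stale"
    return "dispatch", "recent human activity"


def buckets(issues, now, **kwargs):
    """Group issue numbers by decision, preserving fixture order."""
    out = defaultdict(list)
    for issue in issues:
        decision, _reason = classify(issue, now, **kwargs)
        out[decision].append(issue["number"])
    return dict(out)


def skip_rate(issues, now, **kwargs):
    """Fraction of issues that land in the skip-noop bucket."""
    skipped = buckets(issues, now, **kwargs).get("skip-noop", [])
    return len(skipped) / len(issues)


# 1. Full bucket distribution by issue number.
assert buckets(FIXTURE, NOW) == {"dispatch": [1, 3], "skip-noop": [2]}
# 2. Same fixture under the override: issue 3 moves to skip-noop.
assert buckets(FIXTURE, NOW, extra_bot_logins={"my-sync-bot"}) == {
    "dispatch": [1], "skip-noop": [2, 3]}
# 3. Per-issue check: the reason must mention the fixture's _purpose.
for issue in FIXTURE[:2]:
    assert issue["_purpose"] in classify(issue, NOW)[1]
# 4. Skip-rate floor.
assert skip_rate(FIXTURE, NOW) >= 0.30
```

Asserting on issue numbers per bucket, rather than on bucket sizes alone, means a failing comparison names exactly which issues moved.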

Why this matters

A rule change that alters the distribution fails one of the asserts; the diff tells the reviewer how the rule affects coverage before they touch any real adopter data. The eval is deterministic (no live gh calls, fixed now) so CI runs it in milliseconds.
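
To illustrate why the fixed `now` matters, here is a small sketch of a time-dependent rule that takes the clock as a parameter. The `is_stale` helper and its 30-day threshold are assumptions made for the example and are not taken from the classifier.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_activity, now, max_age=timedelta(days=30)):
    """Hypothetical rule: an issue is stale once it is older than max_age."""
    return now - last_activity > max_age


# Pinned clock from the PR description.
PINNED_NOW = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
last_activity = datetime(2026, 5, 10, tzinfo=timezone.utc)

# With the pinned clock the answer is the same on every run.
assert is_stale(last_activity, PINNED_NOW) is False

# The same fixture data gives a different answer once the clock advances,
# which is what a test calling datetime.now() would eventually hit.
assert is_stale(last_activity, PINNED_NOW + timedelta(days=30)) is True
```

Passing the clock in, instead of reading it inside the rule, keeps the fixture's timestamps meaningful no matter when CI runs.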

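This closes the tune-then-verify loop one more rung up: #416 used a one-off /tmp/ script, #418 promoted it to a CLI, and this PR locks the rule behaviour into the test suite.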

Test plan

  • 4 new tests pass (41 total in preflight-audit)
  • prek run --all-files green
  • CI: workspace pytest + lychee

🤖 Generated with Claude Code

…ifier rule

PR apache#418 shipped the preflight-audit CLI with a replay mode but no
fixture exercising the full classifier end-to-end. This adds:

- `tests/fixtures/synthetic_workspace_sweep.json` — 12-issue
  GraphQL response, one issue per rule path (Rule 1 dispatch,
  Rule 1-yields-then-Rule-7, Rule 2 dispatch-urgent, Rules 3-7
  skip-noop, GitHub-App bot login, personal-bot needing
  override, fall-through dispatch, recently-closed dispatch).
  Each issue node carries a `_purpose` annotation documenting
  which rule it should land on.

- `tests/test_eval_replay.py` — drives `classify_response`
  against the fixture with a pinned `now` (2026-06-01T12:00:00Z)
  and asserts:
  1. The full per-decision bucket distribution (positional
     identifiers per bucket).
  2. The same distribution under `extra_bot_logins` — one issue
     migrates from dispatch to skip-noop with the override.
  3. Per-issue assertions with reason-substring matches, keeping
     the fixture's `_purpose` annotations in lock-step with the
     classifier behaviour.
  4. A skip-rate floor (≥30%) matching the real-world target
     after apache#416's rule tuning.

A rule change that alters the distribution fails one of the
asserts; the diff in the failing assertion tells the reviewer
how the rule affects coverage before they ever look at real
adopter data. The eval is deterministic (no live `gh` calls,
fixed `now`) so CI runs it in milliseconds.

This closes the tune-then-verify loop one more rung up — PR
apache#416 used a one-off `/tmp/` script, PR apache#418 promoted it to a
CLI, and this PR locks the rule behaviour into the test suite.

potiuk merged commit ee5dd9c into apache:main on May 31, 2026
24 checks passed