test(preflight-audit): replay-mode eval fixture for the classifier by potiuk · Pull Request #423 · apache/airflow-steward

potiuk · 2026-05-31T18:06:11Z

Summary

#418 shipped the preflight-audit CLI with a replay mode but no fixture exercising the full classifier end-to-end. This adds the eval-fixture pattern the tool's README promised.

What

tests/fixtures/synthetic_workspace_sweep.json — 12-issue GraphQL response, one issue per rule path:
- Rule 1 dispatch (recent human activity)
- Rule 1 yields → Rule 7 fires (skill-drove-update)
- Rule 2 dispatch-urgent (non-skill reply <24h after >7d gap)
- Rules 3–7 skip-noop (post-announce, stale, all-phases-done, awaiting release, awaiting advisory)
- GitHub-App bot login + personal-bot-needing-override
- Fall-through dispatch + recently-closed dispatch
Each issue node carries a _purpose annotation so the fixture documents its own intent.
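As a rough illustration of a self-documenting fixture, the sketch below parses a two-issue excerpt and collects the annotations. Only `_purpose` is named in this PR; the `data.repository.issues.nodes` nesting, the other field names, the issue numbers, and the `purposes_by_issue` helper are assumptions modelled on a typical GitHub GraphQL response, not copied from the real fixture.

```python
import json

# Hypothetical excerpt: only `_purpose` comes from the PR description.
# The nesting, field names, and issue numbers are assumed for illustration.
FIXTURE_EXCERPT = """
{
  "data": {
    "repository": {
      "issues": {
        "nodes": [
          {
            "number": 101,
            "_purpose": "Rule 1 dispatch (recent human activity)",
            "comments": {"nodes": [
              {"author": {"login": "alice"},
               "createdAt": "2026-06-01T09:00:00Z"}
            ]}
          },
          {
            "number": 102,
            "_purpose": "Rule 3 skip-noop (post-announce)",
            "comments": {"nodes": []}
          }
        ]
      }
    }
  }
}
"""


def purposes_by_issue(raw: str) -> dict[int, str]:
    """Map each issue number to its `_purpose` annotation.

    Raises ValueError if any node lacks the annotation, so an undocumented
    fixture issue is caught before it reaches the classifier assertions.
    """
    nodes = json.loads(raw)["data"]["repository"]["issues"]["nodes"]
    missing = [n["number"] for n in nodes if not n.get("_purpose")]
    if missing:
        raise ValueError(f"issues without a _purpose annotation: {missing}")
    return {n["number"]: n["_purpose"] for n in nodes}


PURPOSES = purposes_by_issue(FIXTURE_EXCERPT)
```

A check like this keeps the annotation mandatory: an issue added to the fixture without a stated intent fails loudly instead of silently shifting the distribution.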
tests/test_eval_replay.py — drives classify_response against the fixture with a pinned now (2026-06-01T12:00:00Z) and asserts:
1. The full per-decision bucket distribution by issue number
2. The same distribution under extra_bot_logins (one issue migrates from dispatch → skip-noop)
3. Per-issue assertions with reason-substring matches
4. A skip-rate floor (≥30%) matching the real-world target after #416's rule tuning
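The shape of these assertions can be sketched with a toy stand-in. The real `classify_response` signature and rule logic are not shown in this PR, so `toy_classify`, its two-rule logic, the tuple layout of `ISSUES`, and the login names below are all hypothetical; only the pinned `now`, the `extra_bot_logins` override, the bucket names, and the 30% floor come from the description above.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Pinned clock, as in the PR; everything else is a hypothetical stand-in.
NOW = datetime(2026, 6, 1, 12, 0, 0, tzinfo=timezone.utc)

# Toy issues: (number, login of last commenter, time of last comment).
ISSUES = [
    (1, "alice", NOW - timedelta(hours=3)),
    (2, "my-personal-bot", NOW - timedelta(hours=2)),
    (3, "steward[bot]", NOW - timedelta(hours=1)),
    (4, "bob", NOW - timedelta(days=40)),
]


def toy_classify(issues, now, extra_bot_logins=()):
    """Stand-in for classify_response: bucket issue numbers by decision."""
    buckets = defaultdict(list)
    for number, login, last_comment in issues:
        # GitHub-App logins end in "[bot]"; personal bots need the override.
        is_bot = login.endswith("[bot]") or login in extra_bot_logins
        recent = now - last_comment < timedelta(days=7)
        decision = "dispatch" if recent and not is_bot else "skip-noop"
        buckets[decision].append(number)
    return dict(buckets)


def skip_rate(buckets):
    """Fraction of all issues that landed in the skip-noop bucket."""
    total = sum(len(numbers) for numbers in buckets.values())
    return len(buckets.get("skip-noop", [])) / total


def test_distribution():
    expected = {"dispatch": [1, 2], "skip-noop": [3, 4]}
    assert toy_classify(ISSUES, NOW) == expected


def test_override_moves_one_issue():
    got = toy_classify(ISSUES, NOW, extra_bot_logins={"my-personal-bot"})
    assert got == {"dispatch": [1], "skip-noop": [2, 3, 4]}


def test_skip_rate_floor():
    assert skip_rate(toy_classify(ISSUES, NOW)) >= 0.30
```

Passing `now` as an argument rather than reading the wall clock is what keeps the expected buckets stable from run to run.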

Why this matters

A rule change that alters the distribution fails one of the asserts; the diff tells the reviewer how the rule affects coverage before they touch any real adopter data. The eval is deterministic (no live gh calls, fixed now) so CI runs it in milliseconds.
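To make the "diff tells the reviewer" point concrete, here is a minimal sketch of how a changed distribution could be summarised as per-issue moves. The `moved_issues` helper and the sample distributions are hypothetical and not part of the test file; pytest's own dict comparison output already conveys similar information.

```python
def moved_issues(expected, actual):
    """Return {issue: (old_bucket, new_bucket)} for issues that changed bucket.

    Both arguments map a decision name to a list of issue numbers.
    An issue absent from one side is reported with None for that side.
    """
    def index(buckets):
        # Invert {bucket: [issues]} into {issue: bucket}.
        return {n: name for name, numbers in buckets.items() for n in numbers}

    before, after = index(expected), index(actual)
    return {
        n: (before.get(n), after.get(n))
        for n in sorted(before.keys() | after.keys())
        if before.get(n) != after.get(n)
    }


# Hypothetical distributions before and after a rule change.
EXPECTED = {"dispatch": [1, 2], "skip-noop": [3, 4]}
ACTUAL = {"dispatch": [1], "skip-noop": [2, 3, 4]}
MOVES = moved_issues(EXPECTED, ACTUAL)
```

With these sample inputs, `MOVES` contains a single entry showing issue 2 moving from `dispatch` to `skip-noop`, which is the kind of coverage change a reviewer would want to see before touching adopter data.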

This closes the tune-then-verify loop one more rung up:

feat(security-issue-sync): pre-flight v2 — skill-marker detection #416 used a one-off /tmp/ script
feat(preflight-audit): CLI for measuring bulk-mode skip-rate #418 promoted it to a CLI
This PR locks the rule behaviour into the test suite

Test plan

4 new tests pass (41 total in preflight-audit)
prek run --all-files green
CI: workspace pytest + lychee

🤖 Generated with Claude Code

…ifier rule

PR apache#418 shipped the preflight-audit CLI with a replay mode but no fixture exercising the full classifier end-to-end. This adds:

- `tests/fixtures/synthetic_workspace_sweep.json`: 12-issue GraphQL response, one issue per rule path (Rule 1 dispatch, Rule 1-yields-then-Rule-7, Rule 2 dispatch-urgent, Rules 3-7 skip-noop, GitHub-App bot login, personal-bot needing override, fall-through dispatch, recently-closed dispatch). Each issue node carries a `_purpose` annotation documenting which rule it should land on.
- `tests/test_eval_replay.py`: drives `classify_response` against the fixture with a pinned `now` (2026-06-01T12:00:00Z) and asserts:
  1. The full per-decision bucket distribution (positional identifiers per bucket).
  2. The same distribution under `extra_bot_logins`: one issue migrates from dispatch to skip-noop with the override.
  3. Per-issue assertions with reason-substring matches, keeping the fixture's `_purpose` annotations in lock-step with the classifier behaviour.
  4. A skip-rate floor (≥30%) matching the real-world target after apache#416's rule tuning.

A rule change that alters the distribution fails one of the asserts; the diff in the failing assertion tells the reviewer how the rule affects coverage before they ever look at real adopter data. The eval is deterministic (no live `gh` calls, fixed `now`) so CI runs it in milliseconds.

This closes the tune-then-verify loop one more rung up: PR apache#416 used a one-off `/tmp/` script, PR apache#418 promoted it to a CLI, and this PR locks the rule behaviour into the test suite.

potiuk merged commit ee5dd9c into apache:main May 31, 2026
24 checks passed

potiuk merged 1 commit into apache:main from potiuk:feat-preflight-audit-eval-fixture
