feat(security-issue-sync): pre-flight v2 — skill-marker detection by potiuk · Pull Request #416 · apache/airflow-steward

potiuk · 2026-05-31T12:29:53Z

Summary

A dry-run of #414's pre-flight against a real adopter tracker surfaced two issues that kept the skip rate at ~5% instead of the predicted 30–50%:

"Last comment author is a bot" was structurally unreachable on single-operator private trackers. The sync skill itself writes rollup updates as the operator's personal GitHub user, not as a *[bot] account — so the gate never fired.
The 7-day updatedAt safety override caught most trackers because the sync itself bumps updatedAt every time it writes a rollup append or flips a label. The override conflated skill activity with substantive activity.

Changes

Skill-or-bot detection. Treat a comment as bot-equivalent when its body starts with the framework's marker prefix <!-- apache-steward: (matches every status-rollup / RM hand-off / wrap-up comment). Falls back to the original *[bot] login check plus an override-file hook for adopters with personal-account bots.
Relaxed lifecycle skip rules. Dropped the "idle > 14 days" gates — they were the safety net for the broken bot-detection. Once skill-or-bot detection works, the lifecycle label set + skill-last-write is enough signal.
New rule: cve allocated + fix released + skill-last → skip-noop (fix released; awaiting advisory propagation). Largest contributor to the new skip count in dry-run.
Query change: the GraphQL query now fetches body on the last comment (was just author { login } + createdAt). Response size grows moderately depending on rollup length, still cheaper than one subagent transcript.

Measurement

Same set, same trackers, before/after on the rule change: skip rate ~5% → ~30%, and the new skips were all correctly identified steady-state trackers. The savings are recurring (every sync all invocation), so the per-sweep win compounds.

Test plan

lychee on the edited file — clean
skill-and-tool-validate — no new violations
prek (markdownlint, typos, format) — green
CI lychee + tests-ok on this PR
Real-world bulk sync after merge to confirm classifier still tracks reality

🤖 Generated with Claude Code

…detection + relaxed rules A dry-run of apache#414's pre-flight against a real adopter tracker revealed the original rules misfired in two ways: - The "last comment author is a bot" check was structurally unreachable on single-operator private trackers where the sync skill writes rollup updates as the operator's personal GitHub user, not as a *[bot] account. - The 7-day updatedAt safety override caught most trackers because every tracker had been touched by the recent sync itself (rollup-comment writes, label flips) — conflating skill activity with substantive activity. Skip rate measured ~5% in this setup vs the predicted 30-50%. This tunes the classifier with two changes: 1. Skill-or-bot detection. Treat a comment as bot-equivalent when its body starts with the skill marker `<!-- apache-steward: ` (matches every status-rollup, release-manager hand-off, and wrap-up comment the framework writes). Falls back to the original `*[bot]` login check, plus an override-file hook for adopters with personal-account bots. Requires fetching body on the last comment — bumps query response size moderately (still cheaper than one subagent transcript), and the body field is what enables the skill-marker detection that drives most of the real-world skip rate. 2. Relaxed lifecycle skip rules. The original "idle > 14d" gates were a safety net for the broken bot-detection. With skill-or-bot detection working, the "all phases done; awaiting release" / "fix released; awaiting advisory" patterns are skip-eligible regardless of comment age — the skill marker itself is the "nothing new since last sync" signal. Re-running the dry-run on the same setup: skip rate ~5% → ~30%, and the skipped trackers were all correctly steady-state ones. Adds a new "fix released; awaiting advisory propagation" skip rule for the `cve allocated + fix released` label set — the single largest contributor to the new skip count.

…ifier rule (#423) PR #418 shipped the preflight-audit CLI with a replay mode but no fixture exercising the full classifier end-to-end. This adds: - `tests/fixtures/synthetic_workspace_sweep.json` — 12-issue GraphQL response, one issue per rule path (Rule 1 dispatch, Rule 1-yields-then-Rule-7, Rule 2 dispatch-urgent, Rules 3-7 skip-noop, GitHub-App bot login, personal-bot needing override, fall-through dispatch, recently-closed dispatch). Each issue node carries a `_purpose` annotation documenting which rule it should land on. - `tests/test_eval_replay.py` — drives `classify_response` against the fixture with a pinned `now` (2026-06-01T12:00:00Z) and asserts: 1. The full per-decision bucket distribution (positional identifiers per bucket). 2. The same distribution under `extra_bot_logins` — one issue migrates from dispatch to skip-noop with the override. 3. Per-issue assertions with reason-substring matches, keeping the fixture's `_purpose` annotations in lock-step with the classifier behaviour. 4. A skip-rate floor (≥30%) matching the real-world target after #416's rule tuning. A rule change that alters the distribution fails one of the asserts; the diff in the failing assertion tells the reviewer how the rule affects coverage before they ever look at real adopter data. The eval is deterministic (no live `gh` calls, fixed `now`) so CI runs it in milliseconds. This closes the tune-then-verify loop one more rung up — PR #416 used a one-off `/tmp/` script, PR #418 promoted it to a CLI, and this PR locks the rule behaviour into the test suite.

…ity-suite refactor patterns Adds `optimize-skill` (capability:setup) — the refactoring sibling of `write-skill`. It takes an existing framework skill (or sweeps a set) and applies the five restructuring patterns proven on the security suite, as behavior-preserving proposals gated by the validator (green-before / green-after): - split — slim an oversized SKILL.md into linked siblings (the apache#410 pattern; addresses the PRINCIPLES.md P14 cap) - config-lift — move concrete values into <project-config> (apache#386/apache#387/apache#388) - out-of-context — read/PATCH one field without loading the body (apache#412 github-body-field, apache#424 github-rollup) - fetch-upfront — batch per-item round-trips (apache#347) - preflight-classifier — skip obvious no-ops before LLM passes (apache#414/apache#416) SKILL.md is 297 lines; the pass catalogue (smell / exemplar PR / mechanics / behavior-preservation guarantee / validation) lives in the patterns.md sibling. Reads only framework-internal files, so no injection-guard / Privacy-LLM callouts. Ships a step-diagnose eval (5 auto-comparable cases incl. an injection-resistance case) so the skill is not released without an eval (P8). Wires the skill into the capability->skill map and the eval index. Generated-by: Claude Code (Opus 4.8)

…ity-suite refactor patterns (#427) Adds `optimize-skill` (capability:setup) — the refactoring sibling of `write-skill`. It takes an existing framework skill (or sweeps a set) and applies the five restructuring patterns proven on the security suite, as behavior-preserving proposals gated by the validator (green-before / green-after): - split — slim an oversized SKILL.md into linked siblings (the #410 pattern; addresses the PRINCIPLES.md P14 cap) - config-lift — move concrete values into <project-config> (#386/#387/#388) - out-of-context — read/PATCH one field without loading the body (#412 github-body-field, #424 github-rollup) - fetch-upfront — batch per-item round-trips (#347) - preflight-classifier — skip obvious no-ops before LLM passes (#414/#416) SKILL.md is 297 lines; the pass catalogue (smell / exemplar PR / mechanics / behavior-preservation guarantee / validation) lives in the patterns.md sibling. Reads only framework-internal files, so no injection-guard / Privacy-LLM callouts. Ships a step-diagnose eval (5 auto-comparable cases incl. an injection-resistance case) so the skill is not released without an eval (P8). Wires the skill into the capability->skill map and the eval index. Generated-by: Claude Code (Opus 4.8)

potiuk merged commit e2b58f1 into apache:main May 31, 2026
16 checks passed

This was referenced May 31, 2026

feat(preflight-audit): CLI for measuring bulk-mode skip-rate #418

Merged

test(preflight-audit): replay-mode eval fixture for the classifier #423

Merged

potiuk mentioned this pull request Jun 1, 2026

feat(optimize-skill): skill to optimize existing skills via the security-suite refactor patterns #427

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(security-issue-sync): pre-flight v2 — skill-marker detection#416

feat(security-issue-sync): pre-flight v2 — skill-marker detection#416
potiuk merged 1 commit into
apache:mainfrom
potiuk:feat-preflight-v2-skill-marker-detection

potiuk commented May 31, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

potiuk commented May 31, 2026

Summary

Changes

Measurement

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant