Skip to content

feat(security-issue-sync): pre-flight v2 — skill-marker detection#416

Merged
potiuk merged 1 commit into
apache:mainfrom
potiuk:feat-preflight-v2-skill-marker-detection
May 31, 2026
Merged

feat(security-issue-sync): pre-flight v2 — skill-marker detection#416
potiuk merged 1 commit into
apache:mainfrom
potiuk:feat-preflight-v2-skill-marker-detection

Conversation

@potiuk
Copy link
Copy Markdown
Member

@potiuk potiuk commented May 31, 2026

Summary

A dry-run of #414's pre-flight against a real adopter tracker surfaced two issues that kept the skip rate at ~5% instead of the predicted 30–50%:

  1. "Last comment author is a bot" was structurally unreachable on single-operator private trackers. The sync skill itself writes rollup updates as the operator's personal GitHub user, not as a *[bot] account — so the gate never fired.
  2. The 7-day updatedAt safety override caught most trackers because the sync itself bumps updatedAt every time it writes a rollup append or flips a label. The override conflated skill activity with substantive activity.

Changes

  • Skill-or-bot detection. Treat a comment as bot-equivalent when its body starts with the framework's marker prefix <!-- apache-steward: (matches every status-rollup / RM hand-off / wrap-up comment). Falls back to the original *[bot] login check plus an override-file hook for adopters with personal-account bots.
  • Relaxed lifecycle skip rules. Dropped the "idle > 14 days" gates — they were the safety net for the broken bot-detection. Once skill-or-bot detection works, the lifecycle label set + skill-last-write is enough signal.
  • New rule: cve allocated + fix released + skill-lastskip-noop (fix released; awaiting advisory propagation). Largest contributor to the new skip count in dry-run.
  • Query change: the GraphQL query now fetches body on the last comment (was just author { login } + createdAt). Response size grows moderately depending on rollup length, still cheaper than one subagent transcript.

Measurement

Same set, same trackers, before/after on the rule change: skip rate ~5% → ~30%, and the new skips were all correctly identified steady-state trackers. The savings are recurring (every sync all invocation), so the per-sweep win compounds.

Test plan

  • lychee on the edited file — clean
  • skill-and-tool-validate — no new violations
  • prek (markdownlint, typos, format) — green
  • CI lychee + tests-ok on this PR
  • Real-world bulk sync after merge to confirm classifier still tracks reality

🤖 Generated with Claude Code

…detection + relaxed rules

A dry-run of apache#414's pre-flight against a real adopter tracker
revealed the original rules misfired in two ways:

- The "last comment author is a bot" check was structurally
  unreachable on single-operator private trackers where the sync
  skill writes rollup updates as the operator's personal GitHub
  user, not as a *[bot] account.
- The 7-day updatedAt safety override caught most trackers
  because every tracker had been touched by the recent sync
  itself (rollup-comment writes, label flips) — conflating
  skill activity with substantive activity. Skip rate measured
  ~5% in this setup vs the predicted 30-50%.

This tunes the classifier with two changes:

1. Skill-or-bot detection. Treat a comment as bot-equivalent
   when its body starts with the skill marker
   `<!-- apache-steward: ` (matches every status-rollup,
   release-manager hand-off, and wrap-up comment the framework
   writes). Falls back to the original `*[bot]` login check, plus
   an override-file hook for adopters with personal-account bots.
   Requires fetching body on the last comment — bumps query
   response size moderately (still cheaper than one subagent
   transcript), and the body field is what enables the
   skill-marker detection that drives most of the real-world
   skip rate.

2. Relaxed lifecycle skip rules. The original "idle > 14d"
   gates were a safety net for the broken bot-detection. With
   skill-or-bot detection working, the "all phases done; awaiting
   release" / "fix released; awaiting advisory" patterns are
   skip-eligible regardless of comment age — the skill marker
   itself is the "nothing new since last sync" signal.

Re-running the dry-run on the same setup: skip rate ~5% → ~30%,
and the skipped trackers were all correctly steady-state ones.
Adds a new "fix released; awaiting advisory propagation" skip
rule for the `cve allocated + fix released` label set — the
single largest contributor to the new skip count.
@potiuk potiuk merged commit e2b58f1 into apache:main May 31, 2026
16 checks passed
potiuk added a commit that referenced this pull request May 31, 2026
…ifier rule (#423)

PR #418 shipped the preflight-audit CLI with a replay mode but no
fixture exercising the full classifier end-to-end. This adds:

- `tests/fixtures/synthetic_workspace_sweep.json` — 12-issue
  GraphQL response, one issue per rule path (Rule 1 dispatch,
  Rule 1-yields-then-Rule-7, Rule 2 dispatch-urgent, Rules 3-7
  skip-noop, GitHub-App bot login, personal-bot needing
  override, fall-through dispatch, recently-closed dispatch).
  Each issue node carries a `_purpose` annotation documenting
  which rule it should land on.

- `tests/test_eval_replay.py` — drives `classify_response`
  against the fixture with a pinned `now` (2026-06-01T12:00:00Z)
  and asserts:
  1. The full per-decision bucket distribution (positional
     identifiers per bucket).
  2. The same distribution under `extra_bot_logins` — one issue
     migrates from dispatch to skip-noop with the override.
  3. Per-issue assertions with reason-substring matches, keeping
     the fixture's `_purpose` annotations in lock-step with the
     classifier behaviour.
  4. A skip-rate floor (≥30%) matching the real-world target
     after #416's rule tuning.

A rule change that alters the distribution fails one of the
asserts; the diff in the failing assertion tells the reviewer
how the rule affects coverage before they ever look at real
adopter data. The eval is deterministic (no live `gh` calls,
fixed `now`) so CI runs it in milliseconds.

This closes the tune-then-verify loop one more rung up — PR
#416 used a one-off `/tmp/` script, PR #418 promoted it to a
CLI, and this PR locks the rule behaviour into the test suite.
potiuk added a commit to potiuk/airflow-steward that referenced this pull request Jun 1, 2026
…ity-suite refactor patterns

Adds `optimize-skill` (capability:setup) — the refactoring sibling of
`write-skill`. It takes an existing framework skill (or sweeps a set)
and applies the five restructuring patterns proven on the security
suite, as behavior-preserving proposals gated by the validator
(green-before / green-after):

- split — slim an oversized SKILL.md into linked siblings (the apache#410
  pattern; addresses the PRINCIPLES.md P14 cap)
- config-lift — move concrete values into <project-config> (apache#386/apache#387/apache#388)
- out-of-context — read/PATCH one field without loading the body
  (apache#412 github-body-field, apache#424 github-rollup)
- fetch-upfront — batch per-item round-trips (apache#347)
- preflight-classifier — skip obvious no-ops before LLM passes (apache#414/apache#416)

SKILL.md is 297 lines; the pass catalogue (smell / exemplar PR /
mechanics / behavior-preservation guarantee / validation) lives in
the patterns.md sibling. Reads only framework-internal files, so no
injection-guard / Privacy-LLM callouts.

Ships a step-diagnose eval (5 auto-comparable cases incl. an
injection-resistance case) so the skill is not released without an
eval (P8). Wires the skill into the capability->skill map and the
eval index.

Generated-by: Claude Code (Opus 4.8)
potiuk added a commit to potiuk/airflow-steward that referenced this pull request Jun 1, 2026
…ity-suite refactor patterns

Adds `optimize-skill` (capability:setup) — the refactoring sibling of
`write-skill`. It takes an existing framework skill (or sweeps a set)
and applies the five restructuring patterns proven on the security
suite, as behavior-preserving proposals gated by the validator
(green-before / green-after):

- split — slim an oversized SKILL.md into linked siblings (the apache#410
  pattern; addresses the PRINCIPLES.md P14 cap)
- config-lift — move concrete values into <project-config> (apache#386/apache#387/apache#388)
- out-of-context — read/PATCH one field without loading the body
  (apache#412 github-body-field, apache#424 github-rollup)
- fetch-upfront — batch per-item round-trips (apache#347)
- preflight-classifier — skip obvious no-ops before LLM passes (apache#414/apache#416)

SKILL.md is 297 lines; the pass catalogue (smell / exemplar PR /
mechanics / behavior-preservation guarantee / validation) lives in
the patterns.md sibling. Reads only framework-internal files, so no
injection-guard / Privacy-LLM callouts.

Ships a step-diagnose eval (5 auto-comparable cases incl. an
injection-resistance case) so the skill is not released without an
eval (P8). Wires the skill into the capability->skill map and the
eval index.

Generated-by: Claude Code (Opus 4.8)
potiuk added a commit that referenced this pull request Jun 1, 2026
…ity-suite refactor patterns (#427)

Adds `optimize-skill` (capability:setup) — the refactoring sibling of
`write-skill`. It takes an existing framework skill (or sweeps a set)
and applies the five restructuring patterns proven on the security
suite, as behavior-preserving proposals gated by the validator
(green-before / green-after):

- split — slim an oversized SKILL.md into linked siblings (the #410
  pattern; addresses the PRINCIPLES.md P14 cap)
- config-lift — move concrete values into <project-config> (#386/#387/#388)
- out-of-context — read/PATCH one field without loading the body
  (#412 github-body-field, #424 github-rollup)
- fetch-upfront — batch per-item round-trips (#347)
- preflight-classifier — skip obvious no-ops before LLM passes (#414/#416)

SKILL.md is 297 lines; the pass catalogue (smell / exemplar PR /
mechanics / behavior-preservation guarantee / validation) lives in
the patterns.md sibling. Reads only framework-internal files, so no
injection-guard / Privacy-LLM callouts.

Ships a step-diagnose eval (5 auto-comparable cases incl. an
injection-resistance case) so the skill is not released without an
eval (P8). Wires the skill into the capability->skill map and the
eval index.

Generated-by: Claude Code (Opus 4.8)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant