Conversation
…floor observer Adds a focused PTY observer that exits at the first non-permission numbered-option render. Catches the May 2026 transcript-bug class (model wrote plan + ExitPlanMode without firing any AUQ) without needing to fingerprint or navigate past the AUQ. Why separate from runPlanSkillCounting: plan-mode AUQs render every option on a single logical line via cursor-positioning escapes that stripAnsi can't simulate, so parseNumberedOptions returns < 2 options and never records a fingerprint. Counting tests work on 25-min budgets because eventually one frame parses cleanly; gate-tier floor tests need to exit early on the first observation. Trades fingerprint precision for early-exit reliability. Also drops COMPLETION_SUMMARY_RE check from this helper — it matches "GSTACK REVIEW REPORT" anywhere in the buffer including when the agent does recon by reading existing plan files. plan_ready (claude's actual "Ready to execute" confirmation) is the reliable terminal signal for "agent finished without asking." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds {{ANTI_SHORTCUT_CLAUSE}} placeholder backed by a single resolver
function in scripts/resolvers/review.ts. Plan-* review skills can now
include the clause via one placeholder line in their .tmpl rather than
cloning the paragraph four times. Future tightening edits one resolver,
all four skills update on next gen-skill-docs.
Wired into the existing RESOLVERS map alongside generateReviewDashboard
and generatePlanFileReviewReport — no gen-skill-docs.ts change needed
because the generator already does generic placeholder substitution
against that map.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Inserts {{ANTI_SHORTCUT_CLAUSE}} placeholder immediately after the
**Anti-skip rule:** paragraph in plan-{eng,ceo,design,devex}-review
SKILL.md.tmpl. The four templates use different surrounding section
headers (eng "Review Sections (after scope is agreed)" vs ceo/design/devex
variants), so anchoring on the paragraph rather than the heading works
across all four.
Closes the May 2026 transcript-bug loophole: existing STOP gates name
forbidden actions only AFTER a per-section finding is identified. The
anti-shortcut clause adds the pre-emptive rule — "the plan file is the
OUTPUT of the interactive review, not a substitute for it" — covering
the case the transcript exhibited (skip per-section walk, dump every
finding into one plan write, call ExitPlanMode).
Regenerated SKILL.md for all hosts via bun run gen:skill-docs --host all.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds 4 finding-floor tests (one per plan-* skill) that catch the May 2026 transcript-bug class — model wrote a plan and called ExitPlanMode without firing any review-phase AskUserQuestion. Asserts via runPlanSkillFloorCheck that ANY non-permission AUQ render fires before the agent reaches plan_ready. Verified: - Eng floor: passed in 59s - CEO floor: passed in 197s - Design floor: passed - Devex floor: passed - Total ~$2-6 per CI run; only triggers on diff against the 4 plan-* templates, the shared resolver review.ts, the seeds fixture, or the PTY runner helper. Fixtures live in test/fixtures/forcing-finding-seeds.ts, one constant per skill. Each seed is engineered to force at least one obvious finding under that skill's review focus (architectural smell for eng, scope-creep for ceo, UI-slop for design, painful onboarding for devex). Touchfiles wiring: - E2E_TOUCHFILES: 4 plan-*-finding-floor entries with deps on the matching skill template, the shared resolver, the seeds fixture, and the PTY runner helper - E2E_TIERS: all 4 entries marked 'gate' - touchfiles.test.ts: count assertion bumped 21→22 with explicit plan-ceo-finding-floor containment check Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…skuser-fix # Conflicts: # test/helpers/touchfiles.ts
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
E2E Evals: ✅ PASS66/66 tests passed | $8.61 total cost | 12 parallel runners
12x ubicloud-standard-2 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Plan-mode reviews now refuse to dump findings without asking. Four gate-tier tests catch the regression on every PR.
The four
/plan-*-reviewskills (eng, ceo, design, devex) gain an anti-shortcut clause baked in via a single shared resolver. The clause names the May 2026 transcript-bug failure mode directly: model explores, finds issues, dumps every finding into one plan write, calls ExitPlanMode without firing AskUserQuestion. The new clause closes that loophole: "the plan file is the OUTPUT of the interactive review, not a substitute for it." Future tightening edits one resolver, all four skills update on the next gen-skill-docs.Four gate-tier E2E tests catch the regression class on every PR that touches the four templates, the shared resolver, or the seeds fixture. Each test drives the matching skill against a small "forcing finding" seed and asserts the agent fires at least one AskUserQuestion before reaching plan_ready.
Verified end-to-end: Eng floor 59s, CEO floor 197s, Design + Devex floor pass. All four assert
auq_observedoutcome via a focused observer (runPlanSkillFloorCheck) that exits at the first non-permission numbered-option render.Architecture / Resolver wiring:
generateAntiShortcutClauselives alongsidegenerateReviewDashboardandgeneratePlanFileReviewReportinscripts/resolvers/review.ts{{ANTI_SHORTCUT_CLAUSE}}in theRESOLVERSmap (scripts/resolvers/index.ts)**Anti-skip rule:**paragraphTest architecture:
runPlanSkillFloorCheckis a deliberately narrower variant ofrunPlanSkillCounting. The counting helper handles 25-min periodic budgets and tries to fingerprint every AUQ across the session. The floor variant exits at the first AUQ render so it fits gate-tier wall-time constraints. Both live side-by-side intest/helpers/claude-pty-runner.ts.COMPLETION_SUMMARY_REcheck from its terminal detection — that regex matches "GSTACK REVIEW REPORT" anywhere in the visible buffer, including when the agent does recon by reading existing plan files.isPlanReadyVisible(claude's actual "Ready to execute" confirmation) is the reliable terminal signal for "agent finished without asking."Test Coverage
The 4 gaps in
runPlanSkillFloorCheckare defensive/error paths (silent_write, exited, timeout) that mirror existing logic inrunPlanSkillCounting, which already has unit-test coverage for those paths viaclaude-pty-runner.unit.test.ts. The helper's logic is parallel.Tests: 0 → 4 new floor tests + 1 new fixture file + 1 helper function (+ unit assertion in
touchfiles.test.ts). The plan-mode template change is exercised by every existing periodic finding-count test in addition to the new gate-tier tests.Pre-Landing Review
No issues found. The diff is additive infrastructure with no SQL, no LLM trust boundary changes, no auth changes, no migrations, no API surface changes.
Plan Completion
Plan file:
~/.claude/plans/system-instruction-you-are-working-gleaming-sonnet.md(revised after Codex consult).forcing-finding-seeds.tsfixture with 4 per-skill seedstest/fixtures/forcing-finding-seeds.tsgenerateAntiShortcutClauseresolverscripts/resolvers/review.ts:161-163ANTI_SHORTCUT_CLAUSEin RESOLVERS mapscripts/resolvers/index.ts:42{{ANTI_SHORTCUT_CLAUSE}}after Anti-skip rule in 4 templatestest/helpers/touchfiles.tsrunPlanSkillCountingrunPlanSkillFloorCheckhelper after the original approach hit the inline-options parsing bug Codex flagged in finding #4. Same contract (reviewCount >= 1↔auq_observed), simpler implementation, works on gate-tier wall time.All 4 floor tests pass against the new template. Codex's 7 findings from the consult review are all addressed.
TODOS
No items completed in this PR. The plan filed one V2 follow-up TODO during the eng review (per-finding session-content assertion — match each finding paragraph in the final plan to a specific AUQ fingerprint observed earlier in the session). Not added to TODOS.md in this PR; will file on next /retro pass.
Test plan
bun testpasses (free-tier: 93 pass, 0 fail in unit + touchfiles)bun run gen:skill-docs --host allregenerates cleanly for all 7 hostsplan-*-finding-floor.test.tspass against the current branch (~$3 wall, ~5 min cumulative)selectTestsconfirms each floor test triggers on its template / fixture /scripts/resolvers/review.ts/ PTY runner helper🤖 Generated with Claude Code
Need help on this PR? Tag
@codesmithwith what you need.