feat: mode-posture energy fix for /plan-ceo-review and /office-hours (v1.1.2.0) by garrytan · Pull Request #1065 · garrytan/gstack

garrytan · 2026-04-18T16:07:47Z

Summary

Restores the distinctive energy of two skills after V1's Writing Style rules shipped.

The problem: V1 (v1.0.0.0) added Writing Style rules 2–4 to scripts/resolvers/preamble.ts. Each rule had one example, and all three examples pointed to diagnostic-pain framing ("3-second spinner", "double-click button"). Models follow concrete examples more reliably than abstract rules, so any skill with a non-diagnostic mode posture got flattened at runtime — /plan-ceo-review SCOPE EXPANSION collapsed into feature bullets, /office-hours Q3 softened into "Who's your target user?", builder mode turned into PRD-voice brainstorms.

The fix: rewrite rules 2 + 4 to show three paired framings (pain / upside / forcing). Rule 3 gets an exception for stacked multi-part questions. Add inline exemplars anchored on stable headings in both skill templates. Add three gate-tier E2E tests + Sonnet judge so the regression can't silently ship again.

Commits on this branch

feat: rewrite preamble rule 2-4 examples, add plan-ceo-review 0D-prelude + office-hours inline exemplars
chore: regenerate SKILL.md cascade (29 files across claude + external hosts)
test: gate-tier mode-posture E2E tests + judgePosture helper + 3 fixtures + touchfile registrations
test: update golden ship baselines + touchfile count
chore: bump to v1.1.2.0

Test Coverage

3 new gate-tier E2E tests registered in E2E_TOUCHFILES + E2E_TIERS:
- plan-ceo-review-expansion-energy (Opus generator, Sonnet judge)
- office-hours-forcing-energy (Sonnet generator, Sonnet judge)
- office-hours-builder-wildness (Sonnet generator, Sonnet judge)
Each test has a dual-axis judge rubric (surface framing + decision preservation / stacking + domain-matched / unexpected-combinations + excitement-over-optimization). Pass threshold 4/5 per axis.
Free-tier bun test suite: 766 pass, 0 fail after golden-baseline regen + touchfile count update.

Pre-Landing Review

Plan went through /plan-ceo-review HOLD SCOPE (clean, 0 unresolved).
Codex outside voice found 30 issues: 16 resolved in plan, 8 accepted as known risks, 4 noted as V1.2 candidates, 2 addressed with mitigation. The big win was Codex's approach pivot: rewrite the rule 2-4 examples instead of adding a new rule #5 taxonomy. Simpler, less preamble bloat, no skill-name coupling in shared preamble.
/plan-eng-review (clean, 1 finding — the original test-infrastructure target was wrong: test/skill-llm-eval.test.ts does static analysis, needed to be E2E). Fixed before merge.
Token-ceiling warnings on 5 SKILL.md files are pre-existing on main (verified by sizes on origin/main); this branch adds a few KB to two of them, neither crosses a new threshold.

Known limitations (accepted V1.1 tradeoffs)

Judge tests prose markers more than deep decision quality. V1.2 candidate: add Codex as adversarial cross-provider judge on the same output.
No captured v1.0.0.0 transcript proving the regression was real prior to the fix. Hypothetical-evidence accepted during CEO review.
Writing Style rule carve-out ships alongside the existing clarity posture — power users on EXPLAIN_LEVEL: terse still get V0 prose. No regression for them.

CHANGELOG

User-facing entry at v1.1.2.0. Two "Fixed" items (expansion mode stays expansive, forcing questions stay forcing). One "Added" item (gate-tier eval tests). Contributor section at the bottom covers implementation details.

Test plan

bun test passes (766/766 on targeted suites)
bun run gen:skill-docs --host all regenerates all hosts cleanly
No new token-ceiling violations (5 pre-existing)
VERSION + package.json both bumped to 1.1.2.0
Golden ship baselines updated after preamble change
Touchfile E2E_TOUCHFILES + E2E_TIERS have matching keys (98 each)
Gate-tier eval run: EVALS_TIER=gate EVALS=1 bun run test:e2e requires ANTHROPIC_API_KEY — run manually post-merge to verify the three new cases fire
Manual spot check: invoke /plan-ceo-review SCOPE EXPANSION on a real plan, verify proposals lead with felt-experience framing
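
The key-parity item in the test plan above can be checked mechanically. A minimal sketch, assuming both registries are plain objects keyed by test name. The entries and path prefixes below are illustrative stand-ins rather than the real contents of `test/helpers/touchfiles.ts`, and `keyMismatches` is a hypothetical helper.

```typescript
// Illustrative stand-ins: the real maps live in test/helpers/touchfiles.ts,
// and their exact shapes are an assumption here.
const E2E_TOUCHFILES: Record<string, string[]> = {
  "plan-ceo-review-expansion-energy": ["scripts/resolvers/preamble.ts", "plan-ceo-review/"],
  "office-hours-forcing-energy": ["scripts/resolvers/preamble.ts", "office-hours/"],
  "office-hours-builder-wildness": ["scripts/resolvers/preamble.ts", "office-hours/"],
};

const E2E_TIERS: Record<string, string> = {
  "plan-ceo-review-expansion-energy": "gate",
  "office-hours-forcing-energy": "gate",
  "office-hours-builder-wildness": "gate",
};

// Hypothetical helper: lists keys that appear in one map but not the other.
function keyMismatches(
  a: Record<string, unknown>,
  b: Record<string, unknown>,
): { onlyInA: string[]; onlyInB: string[] } {
  const has = (obj: Record<string, unknown>, key: string): boolean =>
    Object.prototype.hasOwnProperty.call(obj, key);
  return {
    onlyInA: Object.keys(a).filter((k) => !has(b, k)).sort(),
    onlyInB: Object.keys(b).filter((k) => !has(a, k)).sort(),
  };
}

// Both lists are empty when every registered test also has a tier.
const mismatches = keyMismatches(E2E_TOUCHFILES, E2E_TIERS);
```

Reporting the missing keys by name, rather than only comparing counts, makes a failed check point straight at the test that was registered in one map and forgotten in the other.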

🤖 Generated with Claude Code

…tput

Rewrites Writing Style rule 2-4 examples in scripts/resolvers/preamble.ts to cover three framing families (pain reduction, upside/delight, forcing pressure) instead of diagnostic-pain only. Adds inline exemplars to plan-ceo-review (0D-prelude shared between SCOPE + SELECTIVE EXPANSION) and office-hours (Q3 forcing exemplar with career/day/weekend domain gating, builder operating principles wild exemplar).

V1 shipped rule 2-4 examples that all pointed to diagnostic-pain framing ("3-second spinner", "double-click button"). Models follow concrete examples over abstract taxonomies, so any skill with a non-diagnostic mode posture (expansion, forcing, delight) got flattened at runtime even when the template itself said "dream big" or "direct to the point of discomfort."

This change targets the actual lever: swap the single diagnostic example for three paired framings, one per posture family.

Preserves V1 clarity gains: rules 2, 3, 4 principles unchanged, only examples expanded. Terse mode (EXPLAIN_LEVEL: terse) still skips the block entirely.

Mechanical cascade from `bun run gen:skill-docs --host all` after the Writing Style rule 2-4 example rewrite and the plan-ceo-review / office-hours template exemplar additions. No hand edits — every change flows from the prior commit's templates.

Three gate-tier E2E tests detect when preamble / template changes flatten the distinctive posture of /plan-ceo-review SCOPE EXPANSION or /office-hours (startup Q3, builder mode). The V1 regression that this PR fixes shipped without anyone catching it at ship time; this is the ongoing signal so the same thing doesn't happen again.

Pieces:
- `judgePosture(mode, text)` in `test/helpers/llm-judge.ts`. Sonnet judge with mode-specific dual-axis rubric (expansion: surface_framing + decision_preservation; forcing: stacking_preserved + domain_matched_consequence; builder: unexpected_combinations + excitement_over_optimization). Pass threshold 4/5 on both axes.
- Three fixtures in `test/fixtures/mode-posture/`: deterministic input for expansion proposal generation, Q3 forcing question, and builder adjacent-unlock riffing.
- `plan-ceo-review-expansion-energy` case appended to `test/skill-e2e-plan.test.ts`. Generator: Opus (skill default). Judge: Sonnet.
- New `test/skill-e2e-office-hours.test.ts` with `office-hours-forcing-energy` + `office-hours-builder-wildness` cases. Generator: Sonnet. Judge: Sonnet.
- Touchfile registration in `test/helpers/touchfiles.ts`: all three as `gate` tier in `E2E_TIERS`, triggered by changes to `scripts/resolvers/preamble.ts`, the relevant skill template, the judge helper, or any mode-posture fixture.

Cost: ~$0.50-$1.50 per triggered PR. Sonnet judge is cheap; Opus generator for the plan-ceo-review case dominates.

Known V1.1 tradeoff: judges test prose markers more than deep behavior. V1.2 candidate is a cross-provider (Codex) adversarial judge on the same output to decouple house-style bias.
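
The pass rule described above (both axes at 4/5 or better) can be sketched as follows. The axis names come from the commit message. The score-record shape and the `passesPosture` helper are assumptions, since the PR does not show what `judgePosture` actually returns.

```typescript
// Mode names follow the commit message; the type itself is hypothetical.
type PostureMode = "expansion" | "forcing" | "builder";

// Axis pairs per mode, as listed in the commit message.
const AXES: Record<PostureMode, readonly [string, string]> = {
  expansion: ["surface_framing", "decision_preservation"],
  forcing: ["stacking_preserved", "domain_matched_consequence"],
  builder: ["unexpected_combinations", "excitement_over_optimization"],
};

// "Pass threshold 4/5 on both axes."
const PASS_THRESHOLD = 4;

// Hypothetical helper: a missing axis score counts as a failure.
function passesPosture(mode: PostureMode, scores: Record<string, number>): boolean {
  return AXES[mode].every((axis) => (scores[axis] ?? 0) >= PASS_THRESHOLD);
}

// One strong axis does not compensate for a weak one.
const expansionOk = passesPosture("expansion", { surface_framing: 5, decision_preservation: 4 });
const expansionFlat = passesPosture("expansion", { surface_framing: 5, decision_preservation: 3 });
```

Requiring every axis to clear the threshold, instead of averaging, is what lets the gate catch output that keeps the surface tone but drops the decision content (or the reverse).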

… entries

Mechanical test updates after the mode-posture work:
- Golden ship SKILL.md baselines (claude + codex + factory hosts) regenerate with the rewritten Writing Style rule 2-4 examples from preamble.ts.
- Touchfile selection test expects 6 matches for a plan-ceo-review/ change (was 5) because E2E_TOUCHFILES now includes plan-ceo-review-expansion-energy.
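
The selection count mentioned above follows from how touchfile matching works: a test is selected when a changed path matches one of its registered patterns, so registering one more test against `plan-ceo-review/` raises the match count by one. A minimal sketch, assuming plain path-prefix patterns rather than whatever pattern syntax the repo actually uses. The `selectTests` helper, the two-entry registry, and the example file path are all hypothetical.

```typescript
// Hypothetical registry: test name -> path prefixes that should trigger it.
const touchfiles: Record<string, string[]> = {
  "plan-ceo-review-expansion-energy": ["scripts/resolvers/preamble.ts", "plan-ceo-review/"],
  "office-hours-forcing-energy": ["scripts/resolvers/preamble.ts", "office-hours/"],
};

// Hypothetical helper: returns the tests whose prefixes match any changed path.
function selectTests(changedPaths: string[], registry: Record<string, string[]>): string[] {
  return Object.keys(registry)
    .filter((name) =>
      registry[name].some((prefix) => changedPaths.some((p) => p.startsWith(prefix))),
    )
    .sort();
}

// A template-only change selects just the skill's own test;
// a preamble change selects every test registered against it.
const templateChange = selectTests(["plan-ceo-review/example-template.md"], touchfiles);
const preambleChange = selectTests(["scripts/resolvers/preamble.ts"], touchfiles);
```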

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-18T16:19:19Z

E2E Evals: ✅ PASS

64/64 tests passed | $7.58 total cost | 12 parallel runners

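| Suite | Result | Status | Cost |
| --- | --- | --- | --- |
| e2e-browse | 9/9 | ✅ | $0.42 |
| e2e-deploy | 6/6 | ✅ | $1.12 |
| e2e-design | 3/3 | ✅ | $0.42 |
| e2e-plan | 8/8 | ✅ | $1.70 |
| e2e-qa-workflow | 3/3 | ✅ | $1.19 |
| e2e-review | 6/6 | ✅ | $1.68 |
| e2e-workflow | 4/4 | ✅ | $0.55 |
| llm-judge | 25/25 | ✅ | $0.50 |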

12x ubicloud-standard-2 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite

garrytan and others added 7 commits April 18, 2026 23:45

Merge remote-tracking branch 'origin/main' into garrytan/ceo-fix

6d18667

Merge remote-tracking branch 'origin/main' into garrytan/ceo-fix

ec710b4

chore: bump version and changelog (v1.1.2.0)

82857af

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

garrytan merged commit 8ee16b8 into main Apr 18, 2026
18 checks passed

garrytan merged 7 commits into main from garrytan/ceo-fix
