
feat: mode-posture energy fix for /plan-ceo-review and /office-hours (v1.1.2.0)#1065

Merged
garrytan merged 7 commits into main from garrytan/ceo-fix
Apr 18, 2026

@garrytan
Owner

Summary

Restores the distinctive energy of two skills that went flat after V1's Writing Style rules shipped.

The problem: V1 (v1.0.0.0) added Writing Style rules 2–4 to scripts/resolvers/preamble.ts. Each rule had one example, and all three examples pointed to diagnostic-pain framing ("3-second spinner", "double-click button"). Models follow concrete examples more reliably than abstract rules, so any skill with a non-diagnostic mode posture got flattened at runtime — /plan-ceo-review SCOPE EXPANSION collapsed into feature bullets, /office-hours Q3 softened into "Who's your target user?", builder mode turned into PRD-voice brainstorms.

The fix: rewrite rules 2 + 4 to show three paired framings (pain / upside / forcing). Rule 3 gets an exception for stacked multi-part questions. Add inline exemplars anchored on stable headings in both skill templates. Add three gate-tier E2E tests + Sonnet judge so the regression can't silently ship again.

Commits on this branch

  • feat: rewrite preamble rule 2-4 examples, add plan-ceo-review 0D-prelude + office-hours inline exemplars
  • chore: regenerate SKILL.md cascade (29 files across claude + external hosts)
  • test: gate-tier mode-posture E2E tests + judgePosture helper + 3 fixtures + touchfile registrations
  • test: update golden ship baselines + touchfile count
  • chore: bump to v1.1.2.0

Test Coverage

  • 3 new gate-tier E2E tests registered in E2E_TOUCHFILES + E2E_TIERS:
    • plan-ceo-review-expansion-energy (Opus generator, Sonnet judge)
    • office-hours-forcing-energy (Sonnet generator, Sonnet judge)
    • office-hours-builder-wildness (Sonnet generator, Sonnet judge)
  • Each test has a dual-axis judge rubric (surface framing + decision preservation / stacking + domain-matched / unexpected-combinations + excitement-over-optimization). Pass threshold 4/5 per axis.
  • Free-tier bun test suite: 766 pass, 0 fail after golden-baseline regen + touchfile count update.
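
A minimal sketch of the dual-axis pass rule described above, for readers who want to see the threshold logic in isolation. The axis names follow the `judgePosture` commit message in this PR; the `PostureMode` type, the `AXES` table, and `passesPosture` are hypothetical names for illustration and are not taken from `test/helpers/llm-judge.ts`.

```typescript
// Hypothetical shapes: the real judge helper's types are not shown in this PR.
type PostureMode = "expansion" | "forcing" | "builder";

// Two rubric axes per mode, named as in the judgePosture commit message.
const AXES: Record<PostureMode, readonly [string, string]> = {
  expansion: ["surface_framing", "decision_preservation"],
  forcing: ["stacking_preserved", "domain_matched_consequence"],
  builder: ["unexpected_combinations", "excitement_over_optimization"],
};

// Pass threshold is 4 out of 5, applied to each axis separately.
const PASS_THRESHOLD = 4;

// A case passes only when BOTH axes meet the threshold.
// A missing score is treated as 0, so it fails closed.
function passesPosture(
  mode: PostureMode,
  scores: Record<string, number>,
): boolean {
  return AXES[mode].every((axis) => (scores[axis] ?? 0) >= PASS_THRESHOLD);
}

// One strong axis does not compensate for a weak one.
console.log(
  passesPosture("expansion", { surface_framing: 5, decision_preservation: 3 }),
); // false
```

Requiring both axes, rather than averaging them, means a response cannot pass on vivid framing alone while dropping the decision it was supposed to preserve.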

Pre-Landing Review

  • Plan went through /plan-ceo-review HOLD SCOPE (clean, 0 unresolved).
  • Codex outside voice found 30 issues: 16 resolved in plan, 8 accepted as known risks, 4 noted as V1.2 candidates, 2 addressed with mitigation. The big win was Codex's approach pivot: rewrite the rule 2-4 examples instead of adding a new rule #5 taxonomy. Simpler, less preamble bloat, no skill-name coupling in shared preamble.
  • /plan-eng-review (clean, 1 finding — the original test-infrastructure target was wrong: test/skill-llm-eval.test.ts does static analysis, needed to be E2E). Fixed before merge.
  • Token-ceiling warnings on 5 SKILL.md files are pre-existing on main (verified by sizes on origin/main); this branch adds a few KB to two of them, and neither crosses a new threshold.

Known limitations (accepted V1.1 tradeoffs)

  • Judge tests prose markers more than deep decision quality. V1.2 candidate: add Codex as adversarial cross-provider judge on the same output.
  • No captured v1.0.0.0 transcript proving the regression was real prior to the fix. Hypothetical-evidence accepted during CEO review.
  • Writing Style rule carve-out ships alongside the existing clarity posture — power users on EXPLAIN_LEVEL: terse still get V0 prose. No regression for them.

CHANGELOG

User-facing entry at v1.1.2.0. Two "Fixed" items (expansion mode stays expansive, forcing questions stay forcing). One "Added" item (gate-tier eval tests). Contributor section at the bottom covers implementation details.

Test plan

  • bun test passes (766/766 on targeted suites)
  • bun run gen:skill-docs --host all regenerates all hosts cleanly
  • No new token-ceiling violations (5 pre-existing)
  • VERSION + package.json both bumped to 1.1.2.0
  • Golden ship baselines updated after preamble change
  • Touchfile E2E_TOUCHFILES + E2E_TIERS have matching keys (98 each)
  • Gate-tier eval run: EVALS_TIER=gate EVALS=1 bun run test:e2e requires ANTHROPIC_API_KEY — run manually post-merge to verify the three new cases fire
  • Manual spot check: invoke /plan-ceo-review SCOPE EXPANSION on a real plan, verify proposals lead with felt-experience framing
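
The key-parity item in the test plan (matching keys in `E2E_TOUCHFILES` and `E2E_TIERS`) can be checked mechanically. The sketch below uses a tiny stand-in for both registries; the entries, the path patterns, and the `keyMismatches` helper are assumptions for illustration, not the contents of `test/helpers/touchfiles.ts`.

```typescript
// Hypothetical miniature registries. Per this PR the real ones hold 98 keys each.
const E2E_TOUCHFILES: Record<string, string[]> = {
  "plan-ceo-review-expansion-energy": ["scripts/resolvers/preamble.ts"],
  "office-hours-forcing-energy": ["scripts/resolvers/preamble.ts"],
  "office-hours-builder-wildness": ["scripts/resolvers/preamble.ts"],
};

const E2E_TIERS: Record<string, string> = {
  "plan-ceo-review-expansion-energy": "gate",
  "office-hours-forcing-energy": "gate",
  "office-hours-builder-wildness": "gate",
};

// Reports keys present in one registry but absent from the other.
function keyMismatches(
  touchfiles: Record<string, unknown>,
  tiers: Record<string, unknown>,
): { missingTier: string[]; missingTouchfiles: string[] } {
  const touchKeys = new Set(Object.keys(touchfiles));
  const tierKeys = new Set(Object.keys(tiers));
  return {
    missingTier: Object.keys(touchfiles).filter((k) => !tierKeys.has(k)),
    missingTouchfiles: Object.keys(tiers).filter((k) => !touchKeys.has(k)),
  };
}

const mismatches = keyMismatches(E2E_TOUCHFILES, E2E_TIERS);
console.log(mismatches.missingTier.length === 0 && mismatches.missingTouchfiles.length === 0);
```

Comparing key sets in both directions catches a test that was registered with touchfiles but never assigned a tier, as well as the reverse; comparing counts alone would miss a rename on one side.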

🤖 Generated with Claude Code

garrytan and others added 7 commits April 18, 2026 23:45
…tput

Rewrites Writing Style rule 2-4 examples in scripts/resolvers/preamble.ts
to cover three framing families (pain reduction, upside/delight, forcing
pressure) instead of diagnostic-pain only. Adds inline exemplars to
plan-ceo-review (0D-prelude shared between SCOPE + SELECTIVE EXPANSION)
and office-hours (Q3 forcing exemplar with career/day/weekend domain
gating, builder operating principles wild exemplar).

V1 shipped rule 2-4 examples that all pointed to diagnostic-pain framing
("3-second spinner", "double-click button"). Models follow concrete
examples over abstract taxonomies, so any skill with a non-diagnostic
mode posture (expansion, forcing, delight) got flattened at runtime
even when the template itself said "dream big" or "direct to the point
of discomfort." This change targets the actual lever: swap the single
diagnostic example for three paired framings, one per posture family.

Preserves V1 clarity gains — rules 2, 3, 4 principles unchanged, only
examples expanded. Terse mode (EXPLAIN_LEVEL: terse) still skips the
block entirely.

Mechanical cascade from `bun run gen:skill-docs --host all` after the
Writing Style rule 2-4 example rewrite and the plan-ceo-review /
office-hours template exemplar additions. No hand edits — every change
flows from the prior commit's templates.

Three gate-tier E2E tests detect when preamble / template changes
flatten the distinctive posture of /plan-ceo-review SCOPE EXPANSION or
/office-hours (startup Q3, builder mode). The V1 regression that this
PR fixes shipped without anyone catching it at ship time — this is the
ongoing signal so the same thing doesn't happen again.

Pieces:
- `judgePosture(mode, text)` in `test/helpers/llm-judge.ts`. Sonnet
  judge with mode-specific dual-axis rubric (expansion: surface_framing
  + decision_preservation; forcing: stacking_preserved +
  domain_matched_consequence; builder: unexpected_combinations +
  excitement_over_optimization). Pass threshold 4/5 on both axes.
- Three fixtures in `test/fixtures/mode-posture/` — deterministic input
  for expansion proposal generation, Q3 forcing question, and builder
  adjacent-unlock riffing.
- `plan-ceo-review-expansion-energy` case appended to
  `test/skill-e2e-plan.test.ts`. Generator: Opus (skill default). Judge:
  Sonnet.
- New `test/skill-e2e-office-hours.test.ts` with
  `office-hours-forcing-energy` + `office-hours-builder-wildness`
  cases. Generator: Sonnet. Judge: Sonnet.
- Touchfile registration in `test/helpers/touchfiles.ts` — all three as
  `gate` tier in `E2E_TIERS`, triggered by changes to
  `scripts/resolvers/preamble.ts`, the relevant skill template, the
  judge helper, or any mode-posture fixture.
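
To illustrate the touchfile registration in the last item, here is a rough sketch of how changed paths could select the three new cases. It assumes simple prefix matching and assumes the skill templates live under `plan-ceo-review/` and `office-hours/`; the actual matching logic and paths in `test/helpers/touchfiles.ts` may differ, and `selectTests` is a hypothetical name.

```typescript
// Paths that trigger all three mode-posture cases, per the list above.
const SHARED_TRIGGERS = [
  "scripts/resolvers/preamble.ts",
  "test/helpers/llm-judge.ts",
  "test/fixtures/mode-posture/",
];

// Hypothetical registry: test name -> path prefixes that should trigger it.
const TOUCHFILES: Record<string, string[]> = {
  "plan-ceo-review-expansion-energy": [...SHARED_TRIGGERS, "plan-ceo-review/"],
  "office-hours-forcing-energy": [...SHARED_TRIGGERS, "office-hours/"],
  "office-hours-builder-wildness": [...SHARED_TRIGGERS, "office-hours/"],
};

// Returns the tests whose trigger prefixes match at least one changed file.
function selectTests(changedFiles: string[]): string[] {
  return Object.keys(TOUCHFILES).filter((name) =>
    changedFiles.some((file) =>
      TOUCHFILES[name].some((prefix) => file.startsWith(prefix)),
    ),
  );
}

// A preamble change fans out to all three cases.
console.log(selectTests(["scripts/resolvers/preamble.ts"]).length); // 3
```

The shared preamble is the reason all three cases list it as a trigger: a change there can flatten any skill's posture, which is the regression this PR guards against.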

Cost: ~$0.50-$1.50 per triggered PR. Sonnet judge is cheap; Opus
generator for the plan-ceo-review case dominates.

Known V1.1 tradeoff: judges test prose markers more than deep behavior.
V1.2 candidate is a cross-provider (Codex) adversarial judge on the
same output to decouple house-style bias.
… entries

Mechanical test updates after the mode-posture work:
- Golden ship SKILL.md baselines (claude + codex + factory hosts) regenerate with
  the rewritten Writing Style rule 2-4 examples from preamble.ts.
- Touchfile selection test expects 6 matches for a plan-ceo-review/ change (was 5)
  because E2E_TOUCHFILES now includes plan-ceo-review-expansion-energy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

E2E Evals: ✅ PASS

64/64 tests passed | $7.58 total cost | 12 parallel runners

| Suite | Result | Cost |
|---|---|---|
| e2e-browse | 9/9 | $0.42 |
| e2e-deploy | 6/6 | $1.12 |
| e2e-design | 3/3 | $0.42 |
| e2e-plan | 8/8 | $1.7 |
| e2e-qa-workflow | 3/3 | $1.19 |
| e2e-review | 6/6 | $1.68 |
| e2e-workflow | 4/4 | $0.55 |
| llm-judge | 25/25 | $0.5 |

12x ubicloud-standard-2 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite

@garrytan garrytan merged commit 8ee16b8 into main Apr 18, 2026
18 checks passed