Merged
23 commits
ab66193
refactor(plan-ceo-review): carve review body into on-demand section
garrytan May 31, 2026
5fe32c9
chore: bump VERSION + CHANGELOG for plan-ceo-review carve (v1.56.0.0)
garrytan May 31, 2026
d5af85d
docs(todos): file P3 follow-up — carve the shared {{PREAMBLE}} refere…
garrytan May 31, 2026
8bd21a4
test(auq): Layer 0 — guarantee AUQ format spec is always-loaded
garrytan Jun 2, 2026
e98bdeb
test(auq): SDK capture engine + verbose-vs-carved no-degradation A/B
garrytan Jun 2, 2026
2819937
test(auq): consistency — same trigger N runs, stable format + substance
garrytan Jun 2, 2026
63ea6bb
test(auq): behavioral matrix across AUQ-heavy skills
garrytan Jun 2, 2026
9e53ec8
test(codex): live recommendation-substance grade for /codex
garrytan Jun 2, 2026
ed996ca
test(auq): deterministic trigger for format-compliance gate
garrytan Jun 2, 2026
be3ebbb
refactor(office-hours): carve Phase 5+6 into on-demand section
garrytan Jun 2, 2026
ca41305
chore: bump VERSION + CHANGELOG for office-hours carve + AUQ suite (v…
garrytan Jun 2, 2026
c52d3ab
refactor(preamble): carve CJK-escaping manual to on-demand doc
garrytan Jun 2, 2026
0329319
refactor(plan-eng-review): carve review body into on-demand section
garrytan Jun 2, 2026
d4938b4
refactor(plan-design-review): carve review body into on-demand section
garrytan Jun 2, 2026
4cbcc4f
refactor(plan-devex-review): carve review body into on-demand section
garrytan Jun 2, 2026
dce715a
chore: bump VERSION + CHANGELOG for plan-* family carves (v1.59.0.0)
garrytan Jun 2, 2026
b0a6977
test: refresh ship golden baselines + gbrain-detection union after ca…
garrytan Jun 2, 2026
0690066
test(auq): grade format-compliance gate from SDK capture, not the TUI
garrytan Jun 3, 2026
6c610bf
test: fix carve-broken CI evals (union reads + section fixtures)
garrytan Jun 3, 2026
659d968
test: copy carved sections into all e2e fixtures (prevent more carve-…
garrytan Jun 3, 2026
857f100
Merge origin/main into garrytan/token-usage-reduction
garrytan Jun 3, 2026
8bb733f
test: migrate section-loading E2E to lossless SDK tool-stream detection
garrytan Jun 3, 2026
da47259
chore: consolidate branch to v1.56.0.0 (single MINOR above main)
garrytan Jun 3, 2026
60 changes: 60 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,65 @@
# Changelog

## [1.56.0.0] - 2026-06-03

## **Five heavy skills now load their bulk on demand, the shared question preamble slimmed corpus-wide, and a paranoid test suite proves the questions never got worse.**

The token-reduction program lands its biggest wave. Five of the heaviest skills — `/plan-ceo-review`, `/office-hours`, `/plan-eng-review`, `/plan-design-review`, and `/plan-devex-review` — are now a small always-loaded skeleton plus an on-demand `sections/` file the agent opens only when it reaches the work. The conversational front half (Step 0 scope, the live interview) stays always-loaded; the deep review bodies, design-doc templates, outside-voice rules, and required-output writers move behind a single STOP-Read. The shared AskUserQuestion preamble also shed its rarely-needed CJK escaping manual, so every interactive skill is a little lighter at once. And because the most user-facing surface in gstack is the question it asks you, a new paranoid test suite proves that slimming a skill never strands or degrades that question.

### The numbers that matter

Measured from the generated skeletons (`wc -c <skill>/SKILL.md`), regenerated for all hosts:

| Skill | Before | After | Δ |
|-------|--------|-------|---|
| plan-ceo-review | 138,838 B | 80,731 B | -41.9% |
| office-hours | 118,280 B | 88,975 B | -24.8% |
| plan-eng-review | 106,984 B | 54,892 B | -48.7% |
| plan-design-review | 112,057 B | 76,024 B | -32.2% |
| plan-devex-review | 110,621 B | 69,658 B | -37.0% |

On top of the per-skill carves, the shared AskUserQuestion preamble dropped its inline CJK manual to a one-line rule + a doc pointer, trimming ~29,524 B across the Claude-host corpus (every interactive skill, ~900 B each).
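
As a quick way to re-derive the Δ column from the byte counts above, here is a minimal sketch. The function name `percentChange` and the `skillSizes` array are illustrative and not part of the repo; the byte counts are copied from the table.

```typescript
// [skill, bytes before, bytes after], copied from the table above.
const skillSizes: Array<[string, number, number]> = [
  ["plan-ceo-review", 138838, 80731],
  ["office-hours", 118280, 88975],
  ["plan-eng-review", 106984, 54892],
  ["plan-design-review", 112057, 76024],
  ["plan-devex-review", 110621, 69658],
];

// Relative change from `before` to `after`, as a percentage rounded to one decimal.
// (Hypothetical helper, shown only to make the arithmetic explicit.)
function percentChange(before: number, after: number): number {
  return Math.round(((after - before) / before) * 1000) / 10;
}

for (const [skill, before, after] of skillSizes) {
  console.log(`${skill}: ${percentChange(before, after).toFixed(1)}%`);
}
```

The same helper could be pointed at fresh `wc -c` output after a regeneration to confirm that a skeleton has not grown back.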

The AskUserQuestion proof, measured by SDK capture (`test/skill-e2e-auq-matrix.test.ts`):

| Guarantee | Result |
|-----------|--------|
| Carved vs verbose, same trigger | carved: 7/7 format, substance 5; verbose: 7/7 format, substance 5 (no degradation) |
| Matrix across 7 AUQ-heavy skills | all 7/7 format, substance 4-5 |
| Same trigger 3× (consistency) | stable, every format element every run |
| AUQ format spec always-loaded | guaranteed in every skeleton (Layer 0) |

Every review runs identically, pass for pass; the only thing that changed is what sits in context before the work begins.

### What this means for you

The skills you run most start markedly lighter: a `/plan-ceo-review` opens ~42% smaller, the others 25-49%, and they pull in their review bodies only when they reach them. You will not notice any behavior change in the reviews themselves; they run section for section as before. What you get is more of the context window left for your actual work, paid back on every invocation. And every skill that asks you questions now carries a guarantee, enforced by tests: the decision brief (plain-English ELI10, an explicit recommendation with a real reason, pros and cons, the stakes) is provably in context the instant any question fires. External hosts (codex, factory, kiro, opencode) still receive the full inline skill, so nothing regresses off Claude.

### Itemized changes

#### Added
- `plan-ceo-review/sections/review-sections.md` — the 11-section deep review, outside-voice rules, required-output registries, completion summary, review report writer, next-step chaining, and mode quick reference, behind a STOP-Read pointer with a passive `manifest.json`.
- `office-hours/sections/design-and-handoff.md` — Phase 5 design-doc templates + Phase 6 tiered handoff, behind a STOP-Read pointer with a passive `manifest.json`.
- `/plan-eng-review`, `/plan-design-review`, `/plan-devex-review` each gain a `sections/review-sections.md` carved behind a post-Step-0 STOP-Read.
- `docs/askuserquestion-cjk.md` — full non-ASCII / CJK escaping rationale + worked example, read on demand.
- `test/auq-format-always-loaded.test.ts` — free per-PR keystone: every interactive skill must carry the full AskUserQuestion format spec in its always-loaded skeleton, never stranded in a section. 51 cases plus a negative control.
- `test/skill-e2e-auq-matrix.test.ts` — drives each AUQ-heavy skill to its first question and grades it (7/7 format, substance >=4).
- `test/skill-e2e-auq-verbose-vs-carved-ab.test.ts` — proves a carved skill's question is not worse than the pre-carve monolith's, on the same trigger.
- `test/skill-e2e-auq-consistency.test.ts` — same trigger N times, fails on any format element that flickers between runs.
- `test/codex-e2e-recommendation-substance.test.ts` — grades `/codex`'s live recommendation substance.
- `test/skill-ceo-section-ordering.test.ts` — gate-tier static guard: the STOP fires after Step 0, the review body is absent from the skeleton, the report writer lives in the section, and nothing review-governing sits below the STOP.
- `test/skill-e2e-plan-ceo-review-section-loading.test.ts` — periodic backstop that asserts the carved section is Read before the report.
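
The consistency check described in the list above ("fails on any format element that flickers between runs") can be pictured roughly as follows. This is a sketch under assumptions, not the code in `test/skill-e2e-auq-consistency.test.ts`: the element names are hypothetical, and each run is reduced to the list of format elements that were detected in it.

```typescript
// Returns the format elements that appear in at least one run but not in every run.
// An element missing from ALL runs is not a flicker; a format-compliance check
// would have to catch that case separately.
function flickeringElements(runs: string[][]): string[] {
  const seen: string[] = [];
  for (const run of runs) {
    for (const element of run) {
      if (seen.indexOf(element) === -1) seen.push(element);
    }
  }
  return seen
    .filter((element) => !runs.every((run) => run.indexOf(element) !== -1))
    .sort();
}

// Hypothetical element names; the real 7-element spec is not listed in this changelog.
const exampleRuns: string[][] = [
  ["eli10", "recommendation", "pros-cons", "stakes"],
  ["eli10", "recommendation", "stakes"],
  ["eli10", "recommendation", "pros-cons", "stakes"],
];
console.log(flickeringElements(exampleRuns));
```

A test built on this idea would fail whenever the returned list is non-empty, which makes a single unstable element visible by name rather than as a bare pass/fail.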

#### Changed
- `/plan-ceo-review`, `/office-hours`, `/plan-eng-review`, `/plan-design-review`, `/plan-devex-review` are each a skeleton + one on-demand section on Claude; the conversational front (Step 0 / Phases 1-4.5) stays always-loaded; external hosts still receive the full inline skill (no behavior change off Claude).
- The AskUserQuestion preamble trims the inline CJK manual to the operative rule + a doc pointer; the always-loaded self-check is unchanged.
- Parity, size-budget, section-manifest, gen-skill-docs, and skill-validation treat every carved skill consistently (content + size floors run against the skeleton + section union; skeleton-shrink assertions guard the always-loaded win).
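
To illustrate what running content floors against the skeleton + section union while guarding skeleton size might look like, here is a hedged sketch. The interfaces, the `checkCarve` function, and the marker strings are all hypothetical; the repo's actual parity and size-budget tests may be structured quite differently.

```typescript
// Hypothetical shape for one carved skill; field names are illustrative only.
interface CarvedSkill {
  skeleton: string;   // always-loaded SKILL.md text
  sections: string[]; // on-demand section texts
}

interface CarveReport {
  missingFromUnion: string[];    // required content found in neither skeleton nor sections
  missingFromSkeleton: string[]; // always-loaded markers that are absent from the skeleton
  skeletonBytes: number;
  skeletonOverBudget: boolean;
}

// UTF-8 byte length computed by hand, so the sketch needs no Node or DOM typings.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
    if (code < 0x80) bytes += 1;
    else if (code < 0x800) bytes += 2;
    else if (code >= 0xd800 && code <= 0xdbff && next >= 0xdc00 && next <= 0xdfff) {
      bytes += 4; // surrogate pair encodes one 4-byte code point
      i++;
    } else bytes += 3;
  }
  return bytes;
}

function checkCarve(
  skill: CarvedSkill,
  requiredAnywhere: string[],
  requiredInSkeleton: string[],
  maxSkeletonBytes: number,
): CarveReport {
  const union = [skill.skeleton].concat(skill.sections).join("\n");
  const skeletonBytes = utf8ByteLength(skill.skeleton);
  return {
    missingFromUnion: requiredAnywhere.filter((s) => union.indexOf(s) === -1),
    missingFromSkeleton: requiredInSkeleton.filter((s) => skill.skeleton.indexOf(s) === -1),
    skeletonBytes,
    skeletonOverBudget: skeletonBytes > maxSkeletonBytes,
  };
}
```

The split between `requiredAnywhere` and `requiredInSkeleton` mirrors the two guarantees in the entry: review content may live in a section, but anything that must be in context when a question fires has to stay in the skeleton.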

#### For contributors
- `test/helpers/auq-sdk-capture.ts` — reusable SDK capture engine: drives a skill to its AUQ and captures the verbatim generated text cleanly (the real PTY mangles plan-mode questions), grades format and recommendation substance in a way that is robust to the connective, and detects section reads losslessly from the tool-use stream.
- `section-manifest-consistency` discovers every carved skill automatically, so the next carve is covered the moment its manifest lands.
- The `/ship` and `/plan-ceo-review` section-loading E2E tests detect section reads from the `claude -p` tool-use stream instead of scraping the real-PTY screen buffer, so they are reliable (the PTY path silently saw nothing in some terminals) and run hermetically against the worktree carve without mutating the installed skill.
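
Detecting section reads from a tool-use stream, as opposed to scraping a terminal buffer, could look something like the sketch below. The event shape is an assumption (one JSON object per line, with assistant messages carrying `tool_use` content blocks that name the tool and its `file_path` input); the function name `sectionReads` is hypothetical and this is not the code in `test/helpers/auq-sdk-capture.ts`.

```typescript
// Assumed event shape, not a documented contract.
interface ContentBlock {
  type?: string;
  name?: string;
  input?: { file_path?: string };
}
interface StreamEvent {
  message?: { content?: ContentBlock[] | string };
}

// Returns every file path that a "Read" tool call opened under a sections/ directory.
function sectionReads(jsonl: string, sectionMarker: string = "/sections/"): string[] {
  const paths: string[] = [];
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    let event: StreamEvent;
    try {
      event = JSON.parse(trimmed) as StreamEvent;
    } catch {
      continue; // tolerate non-JSON noise in the stream
    }
    const blocks = event && event.message ? event.message.content : undefined;
    if (!Array.isArray(blocks)) continue;
    for (const block of blocks) {
      const path = block && block.input ? block.input.file_path : undefined;
      if (
        block.type === "tool_use" &&
        block.name === "Read" &&
        typeof path === "string" &&
        path.indexOf(sectionMarker) !== -1
      ) {
        paths.push(path);
      }
    }
  }
  return paths;
}
```

Because the check works on structured events rather than rendered screen text, it does not depend on terminal width, redraws, or whether the PTY surfaced the tool call at all.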

## [1.55.1.0] - 2026-06-02

## **Telemetry now tells you exactly what it records and where it stays. The project-slug helper hands the shell a safe identifier on every path.**
32 changes: 32 additions & 0 deletions TODOS.md
@@ -19,6 +19,38 @@ v1.47.0.0 baselines retained in `test/fixtures/` for the v1→v2 audit trail. Th
captured skill bytes match `origin/main` exactly (the rebasing branch left every
SKILL.md untouched). `bun test` is green again.

## Token-reduction follow-ups (Phase B, filed via /plan-eng-review on the plan-ceo-review carve)

### P3: Carve the always-loaded `{{PREAMBLE}}` reference blocks into an on-demand doc

**What:** The per-skill section carves (`/ship` v1.54, `/plan-ceo-review` v1.56) yield
real but bounded wins (-42% to -59% on the carved skill) because the shared
`{{PREAMBLE}}` (~40-50KB on every tier-3/4 skill) is the dominant always-loaded cost
and stays inline. Move the rarely-needed preamble REFERENCE blocks (the AskUserQuestion
split-rules and the CJK / lone-surrogate escaping reference) into an on-demand
section-style doc the agent reads only when it hits those edge cases, leaving the hot
path (voice, completeness principle, recommendation format) inline.

**Why:** Highest-ROI remaining token target. One preamble carve helps EVERY tier-≥2
skill at once, not one skill per PR. The eng-review on the plan-ceo carve flagged that
per-skill carves stay modest precisely because the preamble dominates the always-loaded
surface.

**Pros:** A single change reduces always-loaded cost across the whole skill pack.
**Cons:** The preamble is load-bearing and shared; a botched carve regresses every skill.
Needs the same union-parity + per-push freshness guards the section carves use, applied
corpus-wide.

**Context:** Builds on the v2 section pipeline (`scripts/resolvers/sections.ts`,
`{{SECTION:id}}` / `{{SECTION_INDEX}}`). The preamble source is
`scripts/resolvers/preamble.ts`. Measure which sub-blocks are cold (escaping reference,
split-rules) vs hot (voice, recommendation format) before cutting. Validate on one skill,
then roll corpus-wide.

**Effort estimate:** L (human team) → M (CC+gstack)
**Priority:** P3
**Depends on / blocked by:** The section pipeline (shipped v1.54). No hard blocker.

## gbrowser memory follow-ups (filed via /plan-eng-review + /codex on the v1.49 leak-fix PR)

These four items came out of the memory-leak investigation that shipped
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
-1.55.1.0
+1.56.0.0
25 changes: 6 additions & 19 deletions autoplan/SKILL.md
@@ -371,25 +371,12 @@ so split chains are never AUTO_DECIDE-eligible — the user's option set is sacr
**Full rule + worked examples + Hold/dependency semantics:** see
`docs/askuserquestion-split.md` in the gstack repo. Read on demand when N>4.

-**Non-ASCII characters — write directly, never \u-escape.** When any
-string field (question, option label, option description) contains
-Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text, emit
-the literal UTF-8 characters in the JSON string. **Never escape them
-as `\uXXXX`.** Claude Code's tool parameter pipe is UTF-8 native
-and passes characters through unchanged. Manually escaping requires
-recalling each codepoint from training, which is unreliable for long
-CJK strings — the model regularly emits the wrong codepoint (e.g.
-writes `\u3103` thinking it is 管 U+7BA1, but `\u3103` is
-actually ㄃, so the user sees `管理工具` rendered as `㄃3用箱`).
-The trigger is long, multi-line questions with hundreds of CJK
-characters: that is exactly when reflexive escaping kicks in and
-exactly when miscoding is most damaging. Long ≠ escape. Keep
-characters literal.
-
-Wrong: `"question": "請選擇\uXXXX\uXXXX\uXXXX\uXXXX"`
-Right: `"question": "請選擇管理工具"`
-
-Only JSON-mandatory escapes remain allowed: `\n`, `\t`, `\"`, `\\`.
+**Non-ASCII characters — write directly, never \u-escape.** When any string
+field contains Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text,
+emit the literal UTF-8 characters; never escape them as `\uXXXX` (the pipe is
+UTF-8 native, and manual escaping miscodes long CJK strings). Only `\n`,
+`\t`, `\"`, `\\` remain allowed. Full rationale + worked example: see
+`docs/askuserquestion-cjk.md`. Read on demand when a question contains CJK.
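
The rule above can be checked mechanically. The sketch below is an illustration only (the helper name `findUnicodeEscapes` is hypothetical and not part of gstack): it scans the raw, unparsed JSON text of a payload for `\uXXXX` sequences, and shows that `JSON.stringify` already emits CJK characters literally, so no manual escaping is needed.

```typescript
// Finds \uXXXX escape sequences in raw JSON text. A backslash consumes the
// character after it, so an escaped backslash followed by "u7BA1" is not
// misreported as a unicode escape.
function findUnicodeEscapes(rawJson: string): string[] {
  const found: string[] = [];
  for (let i = 0; i < rawJson.length; i++) {
    if (rawJson.charAt(i) !== "\\") continue;
    const candidate = rawJson.slice(i, i + 6);
    if (/^\\u[0-9a-fA-F]{4}$/.test(candidate)) found.push(candidate);
    i++; // skip the character this backslash escapes
  }
  return found;
}

// JSON.stringify keeps CJK characters literal; nothing needs to be escaped by hand.
const literalQuestion = JSON.stringify({ question: "請選擇管理工具" });
console.log(literalQuestion);
console.log(findUnicodeEscapes(literalQuestion).length);

// A hand-escaped payload is flagged. \u7BA1 is 管, the codepoint named in the rule text.
const escapedQuestion = '{"question":"請選擇\\u7BA1"}';
console.log(findUnicodeEscapes(escapedQuestion));
```

A lint of this kind only reports that escapes are present; it cannot tell whether an escaped codepoint was the intended character, which is why the rule asks for literal characters in the first place.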

### Self-check before emitting

25 changes: 6 additions & 19 deletions canary/SKILL.md
@@ -363,25 +363,12 @@ so split chains are never AUTO_DECIDE-eligible — the user's option set is sacr
**Full rule + worked examples + Hold/dependency semantics:** see
`docs/askuserquestion-split.md` in the gstack repo. Read on demand when N>4.

-**Non-ASCII characters — write directly, never \u-escape.** When any
-string field (question, option label, option description) contains
-Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text, emit
-the literal UTF-8 characters in the JSON string. **Never escape them
-as `\uXXXX`.** Claude Code's tool parameter pipe is UTF-8 native
-and passes characters through unchanged. Manually escaping requires
-recalling each codepoint from training, which is unreliable for long
-CJK strings — the model regularly emits the wrong codepoint (e.g.
-writes `\u3103` thinking it is 管 U+7BA1, but `\u3103` is
-actually ㄃, so the user sees `管理工具` rendered as `㄃3用箱`).
-The trigger is long, multi-line questions with hundreds of CJK
-characters: that is exactly when reflexive escaping kicks in and
-exactly when miscoding is most damaging. Long ≠ escape. Keep
-characters literal.
-
-Wrong: `"question": "請選擇\uXXXX\uXXXX\uXXXX\uXXXX"`
-Right: `"question": "請選擇管理工具"`
-
-Only JSON-mandatory escapes remain allowed: `\n`, `\t`, `\"`, `\\`.
+**Non-ASCII characters — write directly, never \u-escape.** When any string
+field contains Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text,
+emit the literal UTF-8 characters; never escape them as `\uXXXX` (the pipe is
+UTF-8 native, and manual escaping miscodes long CJK strings). Only `\n`,
+`\t`, `\"`, `\\` remain allowed. Full rationale + worked example: see
+`docs/askuserquestion-cjk.md`. Read on demand when a question contains CJK.

### Self-check before emitting

25 changes: 6 additions & 19 deletions codex/SKILL.md
@@ -366,25 +366,12 @@ so split chains are never AUTO_DECIDE-eligible — the user's option set is sacr
**Full rule + worked examples + Hold/dependency semantics:** see
`docs/askuserquestion-split.md` in the gstack repo. Read on demand when N>4.

-**Non-ASCII characters — write directly, never \u-escape.** When any
-string field (question, option label, option description) contains
-Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text, emit
-the literal UTF-8 characters in the JSON string. **Never escape them
-as `\uXXXX`.** Claude Code's tool parameter pipe is UTF-8 native
-and passes characters through unchanged. Manually escaping requires
-recalling each codepoint from training, which is unreliable for long
-CJK strings — the model regularly emits the wrong codepoint (e.g.
-writes `\u3103` thinking it is 管 U+7BA1, but `\u3103` is
-actually ㄃, so the user sees `管理工具` rendered as `㄃3用箱`).
-The trigger is long, multi-line questions with hundreds of CJK
-characters: that is exactly when reflexive escaping kicks in and
-exactly when miscoding is most damaging. Long ≠ escape. Keep
-characters literal.
-
-Wrong: `"question": "請選擇\uXXXX\uXXXX\uXXXX\uXXXX"`
-Right: `"question": "請選擇管理工具"`
-
-Only JSON-mandatory escapes remain allowed: `\n`, `\t`, `\"`, `\\`.
+**Non-ASCII characters — write directly, never \u-escape.** When any string
+field contains Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text,
+emit the literal UTF-8 characters; never escape them as `\uXXXX` (the pipe is
+UTF-8 native, and manual escaping miscodes long CJK strings). Only `\n`,
+`\t`, `\"`, `\\` remain allowed. Full rationale + worked example: see
+`docs/askuserquestion-cjk.md`. Read on demand when a question contains CJK.

### Self-check before emitting

25 changes: 6 additions & 19 deletions context-restore/SKILL.md
@@ -367,25 +367,12 @@ so split chains are never AUTO_DECIDE-eligible — the user's option set is sacr
**Full rule + worked examples + Hold/dependency semantics:** see
`docs/askuserquestion-split.md` in the gstack repo. Read on demand when N>4.

-**Non-ASCII characters — write directly, never \u-escape.** When any
-string field (question, option label, option description) contains
-Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text, emit
-the literal UTF-8 characters in the JSON string. **Never escape them
-as `\uXXXX`.** Claude Code's tool parameter pipe is UTF-8 native
-and passes characters through unchanged. Manually escaping requires
-recalling each codepoint from training, which is unreliable for long
-CJK strings — the model regularly emits the wrong codepoint (e.g.
-writes `\u3103` thinking it is 管 U+7BA1, but `\u3103` is
-actually ㄃, so the user sees `管理工具` rendered as `㄃3用箱`).
-The trigger is long, multi-line questions with hundreds of CJK
-characters: that is exactly when reflexive escaping kicks in and
-exactly when miscoding is most damaging. Long ≠ escape. Keep
-characters literal.
-
-Wrong: `"question": "請選擇\uXXXX\uXXXX\uXXXX\uXXXX"`
-Right: `"question": "請選擇管理工具"`
-
-Only JSON-mandatory escapes remain allowed: `\n`, `\t`, `\"`, `\\`.
+**Non-ASCII characters — write directly, never \u-escape.** When any string
+field contains Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text,
+emit the literal UTF-8 characters; never escape them as `\uXXXX` (the pipe is
+UTF-8 native, and manual escaping miscodes long CJK strings). Only `\n`,
+`\t`, `\"`, `\\` remain allowed. Full rationale + worked example: see
+`docs/askuserquestion-cjk.md`. Read on demand when a question contains CJK.

### Self-check before emitting

25 changes: 6 additions & 19 deletions context-save/SKILL.md
@@ -366,25 +366,12 @@ so split chains are never AUTO_DECIDE-eligible — the user's option set is sacr
**Full rule + worked examples + Hold/dependency semantics:** see
`docs/askuserquestion-split.md` in the gstack repo. Read on demand when N>4.

-**Non-ASCII characters — write directly, never \u-escape.** When any
-string field (question, option label, option description) contains
-Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text, emit
-the literal UTF-8 characters in the JSON string. **Never escape them
-as `\uXXXX`.** Claude Code's tool parameter pipe is UTF-8 native
-and passes characters through unchanged. Manually escaping requires
-recalling each codepoint from training, which is unreliable for long
-CJK strings — the model regularly emits the wrong codepoint (e.g.
-writes `\u3103` thinking it is 管 U+7BA1, but `\u3103` is
-actually ㄃, so the user sees `管理工具` rendered as `㄃3用箱`).
-The trigger is long, multi-line questions with hundreds of CJK
-characters: that is exactly when reflexive escaping kicks in and
-exactly when miscoding is most damaging. Long ≠ escape. Keep
-characters literal.
-
-Wrong: `"question": "請選擇\uXXXX\uXXXX\uXXXX\uXXXX"`
-Right: `"question": "請選擇管理工具"`
-
-Only JSON-mandatory escapes remain allowed: `\n`, `\t`, `\"`, `\\`.
+**Non-ASCII characters — write directly, never \u-escape.** When any string
+field contains Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text,
+emit the literal UTF-8 characters; never escape them as `\uXXXX` (the pipe is
+UTF-8 native, and manual escaping miscodes long CJK strings). Only `\n`,
+`\t`, `\"`, `\\` remain allowed. Full rationale + worked example: see
+`docs/askuserquestion-cjk.md`. Read on demand when a question contains CJK.

### Self-check before emitting

25 changes: 6 additions & 19 deletions cso/SKILL.md
@@ -369,25 +369,12 @@ so split chains are never AUTO_DECIDE-eligible — the user's option set is sacr
**Full rule + worked examples + Hold/dependency semantics:** see
`docs/askuserquestion-split.md` in the gstack repo. Read on demand when N>4.

-**Non-ASCII characters — write directly, never \u-escape.** When any
-string field (question, option label, option description) contains
-Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text, emit
-the literal UTF-8 characters in the JSON string. **Never escape them
-as `\uXXXX`.** Claude Code's tool parameter pipe is UTF-8 native
-and passes characters through unchanged. Manually escaping requires
-recalling each codepoint from training, which is unreliable for long
-CJK strings — the model regularly emits the wrong codepoint (e.g.
-writes `\u3103` thinking it is 管 U+7BA1, but `\u3103` is
-actually ㄃, so the user sees `管理工具` rendered as `㄃3用箱`).
-The trigger is long, multi-line questions with hundreds of CJK
-characters: that is exactly when reflexive escaping kicks in and
-exactly when miscoding is most damaging. Long ≠ escape. Keep
-characters literal.
-
-Wrong: `"question": "請選擇\uXXXX\uXXXX\uXXXX\uXXXX"`
-Right: `"question": "請選擇管理工具"`
-
-Only JSON-mandatory escapes remain allowed: `\n`, `\t`, `\"`, `\\`.
+**Non-ASCII characters — write directly, never \u-escape.** When any string
+field contains Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text,
+emit the literal UTF-8 characters; never escape them as `\uXXXX` (the pipe is
+UTF-8 native, and manual escaping miscodes long CJK strings). Only `\n`,
+`\t`, `\"`, `\\` remain allowed. Full rationale + worked example: see
+`docs/askuserquestion-cjk.md`. Read on demand when a question contains CJK.

### Self-check before emitting
