feat: gstack v1 — simpler prompts + real LOC receipts (v1.0.0.0)#1039
Merged
Canonical record of the /plan-tune v1 design: typed question registry, per-question explicit preferences, inline tune: feedback with user-origin gate, dual-track profile (declared + inferred separately), and plain-English inspection skill. Captures every decision with pros/cons, what's deferred to v2 with explicit acceptance criteria, and what was rejected entirely. Codex review drove a substantial scope rollback from the initial CEO EXPANSION plan. 15+ legitimate findings (substrate claim was false without a typed registry; E4/E6/clamp logical contradiction; profile poisoning attack surface; LANDED preamble side effect; implementation order) shaped the final shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/question-registry.ts declares 53 recurring AskUserQuestion categories across 15 skills (ship, review, office-hours, plan-ceo-review, plan-eng-review, plan-design-review, plan-devex-review, qa, investigate, land-and-deploy, cso, gstack-upgrade, preamble, plan-tune, autoplan). Each entry has: stable kebab-case id, skill owner, category (approval | clarification | routing | cherry-pick | feedback-loop), door_type (one-way | two-way), optional stable option keys, optional psychographic signal_key, and a one-line description. 12 of 53 are one-way doors (destructive ops, architecture/data forks, security/compliance). These are ALWAYS asked regardless of user preference. Helpers: getQuestion(id), getOneWayDoorIds(), getAllRegisteredIds(), getRegistryStats(). No binary or resolver wiring yet — this is the schema substrate the rest of /plan-tune builds on. Ad-hoc question_ids (not registered) still log but skip psychographic signal attribution. Future /plan-tune skill surfaces frequently-firing ad-hoc ids as candidates for registry promotion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
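A minimal sketch of what a registry entry and the four helpers could look like, assuming the field names listed above. The first sample id appears later in this PR as a one-way door; the second id and all description strings are invented for illustration, and the real file may be organized differently.

```typescript
// Hypothetical sketch of the registry shape; not the actual scripts/question-registry.ts.
type Category = "approval" | "clarification" | "routing" | "cherry-pick" | "feedback-loop";
type DoorType = "one-way" | "two-way";

interface QuestionEntry {
  id: string;          // stable kebab-case id, prefixed with the skill name
  skill: string;       // owning skill
  category: Category;
  door_type: DoorType;
  options?: string[];  // optional stable option keys
  signal_key?: string; // optional psychographic signal
  description: string; // one line
}

// Two illustrative entries (the second id and both descriptions are assumptions).
const QUESTIONS: Record<string, QuestionEntry> = {
  "ship-test-failure-triage": {
    id: "ship-test-failure-triage",
    skill: "ship",
    category: "approval",
    door_type: "one-way",
    description: "How to handle failing tests before shipping",
  },
  "ship-changelog-polish": {
    id: "ship-changelog-polish",
    skill: "ship",
    category: "cherry-pick",
    door_type: "two-way",
    description: "Whether to polish the changelog entry",
  },
};

function getQuestion(id: string): QuestionEntry | undefined {
  return QUESTIONS[id];
}

function getAllRegisteredIds(): string[] {
  return Object.keys(QUESTIONS);
}

function getOneWayDoorIds(): string[] {
  return getAllRegisteredIds().filter((id) => QUESTIONS[id].door_type === "one-way");
}

function getRegistryStats(): { total: number; one_way: number; two_way: number } {
  const oneWay = getOneWayDoorIds().length;
  const total = getAllRegisteredIds().length;
  return { total, one_way: oneWay, two_way: total - oneWay };
}
```

Keeping door_type on the entry itself means the safety decision travels with the question id and cannot be overridden by a stored preference.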
20 tests validating the question registry:
Schema (7 tests):
- Every entry has required fields
- All ids are kebab-case and start with their skill name
- No duplicate ids
- Categories are from the allowed set
- door_type is one-way | two-way
- Options arrays are well-formed
- Descriptions are short and single-line
Helpers (5 tests):
- getQuestion returns entry for known id, undefined for unknown
- getOneWayDoorIds includes destructive questions, excludes two-way
- getAllRegisteredIds count matches QUESTIONS keys
- getRegistryStats totals are internally consistent
One-way door safety (2 tests):
- Every critical question (test failure, SQL safety, LLM trust boundary,
  security scan, merge confirm, rollback, fix apply, premise revise, arch
  finding, privacy gate, user challenge) is declared one-way
- At least 10 one-way doors exist (catches regression if declarations are
  accidentally dropped)
Registry breadth (3 tests):
- 11 high-volume skills each have >= 1 registered question
- Preamble one-time prompts are registered
- /plan-tune's own questions are registered
Signal map references (1 test):
- signal_key values are typed kebab-case strings
Template coverage (2 tests, informational):
- AskUserQuestion usage across templates is non-trivial (>20)
- Registry spans >= 10 skills
20 pass, 0 fail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/one-way-doors.ts — secondary keyword-pattern classifier that catches
destructive questions even when the registry doesn't have an entry for them.
The registry's door_type field (from scripts/question-registry.ts) is the
PRIMARY safety gate. This classifier is the fallback for ad-hoc question_ids
that agents generate at runtime.
Classification priority:
1. Registry lookup by question_id → use declared door_type
2. Skill:category fallback (cso:approval, land-and-deploy:approval)
3. Keyword pattern match against question_summary
4. Default: treat as two-way (safer to log the miss than auto-decide unsafely)
Covers 21 destructive patterns across:
- File system (rm -rf, delete, wipe, purge, truncate)
- Database (drop table/database/schema, delete from)
- Git/VCS (force-push, reset --hard, checkout --, branch -D)
- Deploy/infra (kubectl delete, terraform destroy, rollback)
- Credentials (revoke/reset/rotate API key|token|secret|password)
- Architecture (breaking change, schema migration, data model change)
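The four-step priority above can be sketched roughly as follows. This is an illustration only: the function name classifyDoor, the input shape, the one-entry registry stand-in, and the six regexes are assumptions that cover a handful of the 21 patterns, not the contents of scripts/one-way-doors.ts.

```typescript
// Hypothetical sketch of the classification priority; names and regexes are illustrative.
type DoorType = "one-way" | "two-way";

interface AdHocQuestion {
  skill: string;
  category?: string;
  question_id: string;
  question_summary: string;
}

// Stand-in for the registry's declared door types (primary gate).
const REGISTRY_DOOR_TYPES: Record<string, DoorType> = {
  "ship-test-failure-triage": "one-way",
};

// Skill:category pairs that are always treated as one-way.
const ONE_WAY_SKILL_CATEGORIES = new Set(["cso:approval", "land-and-deploy:approval"]);

// A small subset of destructive phrasings; filler words are tolerated
// between the verb and the credential noun ("rotate the API key").
const DESTRUCTIVE_PATTERNS: RegExp[] = [
  /\brm\s+-rf\b/i,
  /\bdrop\s+(table|database|schema)\b/i,
  /\bforce[- ]push\b/i,
  /\breset\s+--hard\b/i,
  /\bterraform\s+destroy\b/i,
  /\b(revoke|reset|rotate)\b(\s+\w+){0,3}?\s+(api\s+key|token|secret|password)\b/i,
];

function classifyDoor(q: AdHocQuestion): DoorType {
  // 1. Registry lookup wins when the id is registered.
  const declared = REGISTRY_DOOR_TYPES[q.question_id];
  if (declared !== undefined) return declared;
  // 2. Skill:category fallback.
  if (q.category !== undefined && ONE_WAY_SKILL_CATEGORIES.has(`${q.skill}:${q.category}`)) {
    return "one-way";
  }
  // 3. Keyword match against the summary.
  if (DESTRUCTIVE_PATTERNS.some((p) => p.test(q.question_summary))) return "one-way";
  // 4. Default.
  return "two-way";
}
```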
7 new tests in test/plan-tune.test.ts covering: registry-first lookup,
unknown-id fallthrough, keyword matching on destructive phrasings including
embedded filler words ("rotate the API key"), skill-category fallback,
benign questions defaulting to two-way, pattern-list non-empty.
27 pass, 0 fail. 1270 expect() calls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/psychographic-signals.ts — hand-crafted {signal_key, user_choice} →
{dimension, delta} map. Version 0.1.0. Conservative deltas (±0.03 to ±0.06
per event). Covers 9 signal keys: scope-appetite, architecture-care,
code-quality-care, test-discipline, detail-preference, design-care,
devex-care, distribution-care, session-mode.
Helpers: applySignal() mutates running totals, newDimensionTotals() creates
empty starting state, normalizeToDimensionValue() sigmoid-clamps accumulated
delta to [0,1] (0 → 0.5 neutral), validateRegistrySignalKeys() checks that
every signal_key in the registry has a SIGNAL_MAP entry.
In v1 the signal map is used ONLY to compute inferred dimension values for
/plan-tune inspection output. No skill behavior adapts to these signals
until v2.
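A minimal sketch of how the signal map and its helpers could fit together. The single map entry, the exact delta values (picked from the stated ±0.03 to ±0.06 range), and the sigmoid steepness are assumptions; only the helper names and the "0 maps to 0.5" behavior come from the description above.

```typescript
// Hypothetical sketch; the real SIGNAL_MAP covers 9 signal keys.
interface SignalEffect {
  dimension: string;
  delta: number;
}

const SIGNAL_MAP: Record<string, Record<string, SignalEffect>> = {
  "scope-appetite": {
    expand: { dimension: "scope_appetite", delta: 0.06 },
    reduce: { dimension: "scope_appetite", delta: -0.06 },
  },
};

type DimensionTotals = Record<string, number>;

function newDimensionTotals(): DimensionTotals {
  return {};
}

// Mutates the running totals; returns false when the key/choice pair is unknown.
function applySignal(totals: DimensionTotals, signalKey: string, userChoice: string): boolean {
  const effect = SIGNAL_MAP[signalKey]?.[userChoice];
  if (effect === undefined) return false;
  totals[effect.dimension] = (totals[effect.dimension] ?? 0) + effect.delta;
  return true;
}

// Logistic squash of the accumulated delta into (0, 1); a total of 0 is neutral (0.5).
// The steepness constant is an assumption.
function normalizeToDimensionValue(total: number, steepness = 4): number {
  return 1 / (1 + Math.exp(-steepness * total));
}

// Usage: five "expand" choices nudge scope_appetite above neutral.
const totals = newDimensionTotals();
for (let i = 0; i < 5; i++) applySignal(totals, "scope-appetite", "expand");
const scopeAppetite = normalizeToDimensionValue(totals["scope_appetite"] ?? 0);
```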
scripts/archetypes.ts — 8 named archetypes + Polymath fallback:
- Cathedral Builder (boil-the-ocean + architecture-first)
- Ship-It Pragmatist (small scope + fast)
- Deep Craft (detail-verbose + principled)
- Taste Maker (intuitive, overrides recommendations)
- Solo Operator (high-autonomy, delegates)
- Consultant (hands-on, consulted on everything)
- Wedge Hunter (narrow scope aggressively)
- Builder-Coach (balanced steering)
- Polymath (fallback when no archetype matches)
matchArchetype() uses L2 distance scaled by tightness, with a 0.55 threshold
below which we return Polymath. v1 ships the model stable; v2 narrative/vibe
commands wire it into user-facing output.
14 new tests: signal map consistency vs registry, applySignal behavior for
known/unknown keys, normalization bounds, archetype schema validity, name
uniqueness, matchArchetype correctness for each reference profile, Polymath
fallback for outliers.
41 pass, 0 fail total in test/plan-tune.test.ts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Append-only JSONL log at ~/.gstack/projects/{SLUG}/question-log.jsonl.
Schema: {skill, question_id, question_summary, category?, door_type?,
options_count?, user_choice, recommended?, followed_recommendation?,
session_id?, ts}
Validates:
- skill is kebab-case
- question_id is kebab-case, <= 64 chars
- question_summary non-empty, <= 200 chars, newlines flattened
- category is one of approval/clarification/routing/cherry-pick/feedback-loop
- door_type is one-way or two-way
- options_count is integer in [1, 26]
- user_choice non-empty string, <= 64 chars
Injection defense on question_summary rejects the same patterns as
gstack-learnings-log (ignore previous instructions, system:, override:,
do not report, etc).
followed_recommendation is auto-computed when both user_choice and
recommended are present.
ts auto-injected as ISO 8601 if missing.
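The validation rules above could be sketched like this. The function name validateLogRecord, the regexes, and the choice to truncate long summaries after flattening are assumptions for illustration; the injection list shows only the four phrases named above.

```typescript
// Hypothetical sketch of the question-log validator; not the actual binary.
const KEBAB = /^[a-z0-9]+(-[a-z0-9]+)*$/;
const CATEGORIES = ["approval", "clarification", "routing", "cherry-pick", "feedback-loop"];
const INJECTION_PATTERNS = [
  /ignore previous instructions/i,
  /\bsystem:/i,
  /\boverride:/i,
  /do not report/i,
];

function requireString(value: unknown, field: string): string {
  if (typeof value !== "string" || value.length === 0) {
    throw new Error(`${field} must be a non-empty string`);
  }
  return value;
}

function validateLogRecord(input: Record<string, unknown>): Record<string, unknown> {
  const skill = requireString(input.skill, "skill");
  if (!KEBAB.test(skill)) throw new Error("skill must be kebab-case");

  const questionId = requireString(input.question_id, "question_id");
  if (!KEBAB.test(questionId) || questionId.length > 64) {
    throw new Error("question_id must be kebab-case and <= 64 chars");
  }

  // Flatten newlines, then cap at 200 chars.
  const summary = requireString(input.question_summary, "question_summary")
    .replace(/\s*[\r\n]+\s*/g, " ")
    .slice(0, 200);
  if (INJECTION_PATTERNS.some((p) => p.test(summary))) {
    throw new Error("question_summary rejected: injection pattern");
  }

  if (input.category !== undefined && !CATEGORIES.includes(input.category as string)) {
    throw new Error("invalid category");
  }
  if (input.door_type !== undefined && input.door_type !== "one-way" && input.door_type !== "two-way") {
    throw new Error("invalid door_type");
  }
  const n = input.options_count;
  if (n !== undefined && (typeof n !== "number" || !Number.isInteger(n) || n < 1 || n > 26)) {
    throw new Error("options_count must be an integer in [1, 26]");
  }

  const userChoice = requireString(input.user_choice, "user_choice");
  if (userChoice.length > 64) throw new Error("user_choice must be <= 64 chars");

  const record: Record<string, unknown> = {
    ...input,
    skill,
    question_id: questionId,
    question_summary: summary,
    user_choice: userChoice,
    ts: typeof input.ts === "string" ? input.ts : new Date().toISOString(),
  };
  // Auto-computed only when both fields are present.
  if (typeof input.recommended === "string") {
    record.followed_recommendation = userChoice === input.recommended;
  }
  return record;
}
```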
21 tests covering: valid payloads, full field preservation, auto-followed
computation, appending, long-summary truncation, newline flattening,
invalid JSON, missing fields, bad case, oversized ids, invalid enum
values, out-of-range options_count, and 6 injection attack patterns.
21 pass, 0 fail, 43 expect() calls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bin/gstack-developer-profile supersedes bin/gstack-builder-profile. The old
binary becomes a one-line legacy shim delegating to --read for /office-hours
backward compat.
Subcommands:
  --read            legacy KEY:VALUE output (tier, session_count, etc)
  --migrate         folds ~/.gstack/builder-profile.jsonl into
                    ~/.gstack/developer-profile.json. Atomic (temp + rename),
                    idempotent (no-op when target exists or source absent),
                    archives source as .migrated-YYYY-MM-DD-HHMMSS
  --derive          recomputes inferred dimensions from question-log.jsonl
                    using the signal map in scripts/psychographic-signals.ts
  --profile         full profile JSON
  --gap             declared vs inferred diff JSON
  --trace <dim>     event-level trace of what contributed to a dimension
  --check-mismatch  flags dimensions where declared and inferred disagree by
                    > 0.3 (requires >= 10 events first)
  --vibe            archetype name + description from scripts/archetypes.ts
  --narrative       (v2 stub)
Auto-migration on first read: if legacy file exists and new file doesn't,
migrate before reading. Creates a neutral (all-0.5) stub if nothing exists.
Unified schema (see docs/designs/PLAN_TUNING_V0.md §Architecture):
{identity, declared, inferred: {values, sample_size, diversity},
gap, overrides, sessions, signals_accumulated, schema_version}
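As an illustration of the --check-mismatch rule described above (disagreement greater than 0.3, only once at least 10 events exist), here is a small sketch. The function name and the use of inferred.sample_size as the event count are assumptions; the real subcommand may count events differently.

```typescript
// Hypothetical sketch of the mismatch check over the unified profile shape.
interface InferredTrack {
  values: Record<string, number>;
  sample_size: number;
}

function checkMismatch(
  declared: Record<string, number>,
  inferred: InferredTrack,
  threshold = 0.3,
  minEvents = 10,
): string[] {
  // Too little data: report nothing rather than flag noise.
  if (inferred.sample_size < minEvents) return [];
  return Object.keys(declared).filter((dim) => {
    const inferredValue = inferred.values[dim];
    return inferredValue !== undefined && Math.abs(declared[dim] - inferredValue) > threshold;
  });
}
```

With declared scope_appetite at 0.9 and inferred at 0.5, the dimension is flagged only after the sample size reaches 10.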
25 new tests across subcommand behaviors:
- --read defaults + stub creation
- --migrate: 3 sessions preserved with signal tallies, idempotency, archival
- Tier calculation: welcome_back / regular / inner_circle boundaries
- --derive: neutral-when-empty, upward nudge on 'expand', downward on 'reduce',
recomputable (same input → same output), ad-hoc unregistered ids ignored
- --trace: contributing events, empty for untouched dims, error without arg
- --gap: empty when no declared, correctly computed otherwise
- --vibe: returns archetype name + description
- --check-mismatch: threshold behavior, 10+ sample requirement
- Unknown subcommand errors
25 pass, 0 fail, 60 expect() calls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…igin gate
Subcommands:
--check <id> → ASK_NORMALLY | AUTO_DECIDE (decides if a registered
question should be auto-decided by the agent)
--write '{…}' → set a preference (requires user-origin source)
--read → dump preferences JSON
--clear [id] → clear one or all
--stats → short counts summary
Preference values: always-ask | never-ask | ask-only-for-one-way.
Stored at ~/.gstack/projects/{SLUG}/question-preferences.json.
Safety contract (the core of Codex finding #16, profile-poisoning defense
from docs/designs/PLAN_TUNING_V0.md §Security model):
1. One-way doors ALWAYS return ASK_NORMALLY from --check, regardless of
user preference. User's never-ask is overridden with a visible safety
note so the user knows why their preference didn't suppress the prompt.
2. --write requires an explicit `source` field:
- Allowed: "plan-tune", "inline-user"
- REJECTED with exit code 2: "inline-tool-output", "inline-file",
"inline-file-content", "inline-unknown"
Rejection is explicit ("profile poisoning defense") so the caller can
log and surface the attempt.
3. free_text on --write is sanitized against injection patterns (ignore
previous instructions, override:, system:, etc.) and newline-flattened.
Each --write also appends a preference-set event to
~/.gstack/projects/{SLUG}/question-events.jsonl for derivation audit trail.
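The safety contract can be sketched as two small pure functions. This is a reading of the rules above, not the binary's code: the function names, the note text, the no-preference default of ASK_NORMALLY, and the treatment of ask-only-for-one-way on two-way questions as AUTO_DECIDE are assumptions.

```typescript
// Hypothetical sketch of the --check decision and the user-origin gate.
type Preference = "always-ask" | "never-ask" | "ask-only-for-one-way";
type DoorType = "one-way" | "two-way";

interface CheckResult {
  decision: "ASK_NORMALLY" | "AUTO_DECIDE";
  note?: string; // present when a user preference was overridden for safety
}

function checkQuestion(doorType: DoorType, preference?: Preference): CheckResult {
  if (doorType === "one-way") {
    // One-way doors always ask; surface why never-ask did not apply.
    return preference === "never-ask"
      ? { decision: "ASK_NORMALLY", note: "one-way door: never-ask overridden for safety" }
      : { decision: "ASK_NORMALLY" };
  }
  if (preference === "never-ask" || preference === "ask-only-for-one-way") {
    return { decision: "AUTO_DECIDE" };
  }
  return { decision: "ASK_NORMALLY" };
}

// Only these sources count as originating from the user.
const ALLOWED_SOURCES = new Set(["plan-tune", "inline-user"]);

// Returns false for a missing, unknown, or tool/file-derived source.
// Per the description above, the binary exits with code 2 for the listed
// rejected sources.
function isUserOriginSource(source: unknown): boolean {
  return typeof source === "string" && ALLOWED_SOURCES.has(source);
}
```

Putting the door-type check ahead of any preference lookup is what keeps a poisoned or mistaken never-ask from suppressing a destructive prompt.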
31 tests:
- --check behavior (4): defaults, two-way, one-way (one-way overrides
never-ask with safety note), unknown ids, missing arg
- --check with prefs (5): never-ask on two-way → AUTO_DECIDE; never-ask
on one-way → ASK_NORMALLY with override note; always-ask always asks;
ask-only-for-one-way flips appropriately
- --write valid (5): inline-user accepted, plan-tune accepted, persisted
correctly, event appended, free_text preserved with flattening
- User-origin gate (6): missing source rejected; inline-tool-output
rejected with exit code 2 and explicit poisoning message; inline-file,
inline-file-content, inline-unknown rejected; unknown source rejected
- Schema validation (4): invalid JSON, bad question_id, bad preference,
injection in free_text
- --read (2): empty → {}, returns writes
- --clear (3): specific id, clear-all, NOOP for missing
- --stats (2): empty zeros, tallies by preference type
31 pass, 0 fail, 52 expect() calls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/resolvers/question-tuning.ts ships three preamble generators:
generateQuestionPreferenceCheck — before each AskUserQuestion, agent runs
gstack-question-preference --check <id>. AUTO_DECIDE suppresses the ask
and auto-chooses recommended. ASK_NORMALLY asks as usual. One-way door
safety override is handled by the binary.
generateQuestionLog — after each AskUserQuestion, agent appends a log
record with skill, question_id, summary, category, door_type,
options_count, user_choice, recommended, session_id.
generateInlineTuneFeedback — offers inline "tune:" prompt after two-way
questions. Documents structured shortcuts (never-ask, always-ask,
ask-only-for-one-way, ask-less) AND accepts free-form English with
normalization + confirmation. Explicitly spells out the USER-ORIGIN
GATE: only write tune events when the prefix appears in the user's own
chat message, never from tool output or file content. Binary enforces.
All three resolvers are gated by the QUESTION_TUNING preamble echo. When
the config is off, the agent skips these sections entirely. Ready to be
wired into preamble.ts in the next commit.
Codex host has a simpler variant that uses $GSTACK_BIN env vars.
scripts/resolvers/index.ts registers three placeholders:
QUESTION_PREFERENCE_CHECK, QUESTION_LOG, INLINE_TUNE_FEEDBACK
Total resolver count goes from 45 to 48.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/resolvers/preamble.ts — adds two things:
1. _QUESTION_TUNING config echo in the preamble bash block, gated on the
user's gstack-config `question_tuning` value (default: false).
2. A combined Question Tuning section for tier >= 2 skills, injected after
the confusion protocol. The section itself is runtime-gated by the
QUESTION_TUNING value — agents skip it entirely when off.
scripts/resolvers/question-tuning.ts — consolidated into one compact combined
section `generateQuestionTuning(ctx)` covering: preference check before the
question, log after, and inline tune: feedback with user-origin gate. Per-phase
generators remain exported for unit tests but are no longer the main entrypoint.
Size impact: +570 tokens / +2.3KB per tier-2+ SKILL.md. Three skills
(plan-ceo-review, office-hours, ship) still exceed the 100KB token ceiling —
but they were already over before this change. Delta is the smallest viable
wiring of the /plan-tune v1 substrate.
Golden fixtures (test/fixtures/golden/claude-ship, codex-ship, factory-ship)
regenerated to match the new baseline.
Full test run: 1149 pass, 0 fail, 113 skip across 28 files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bun run gen:skill-docs --host all after wiring the QUESTION_TUNING preamble section. Every tier >= 2 skill now includes the combined Question Tuning guidance. Runtime-gated — agents skip the section when question_tuning is off in gstack-config (default). Golden fixtures (claude-ship, codex-ship, factory-ship) updated to the new baseline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
plan-tune/SKILL.md.tmpl: the user-facing skill for /plan-tune v1. Routes
plain-English intent to one of 8 flows:
- Enable + setup (first-time): 5 declaration questions mapping to the
5 psychographic dimensions (scope_appetite, risk_tolerance,
detail_preference, autonomy, architecture_care). Writes to
developer-profile.json declared.*.
- Inspect profile: plain-English rendering of declared + inferred + gap.
Uses word bands (low/balanced/high) not raw floats. Shows vibe archetype
when calibration gate is met.
- Review question log: top-20 question frequencies with follow/override
counts. Highlights override-heavy questions as candidates for never-ask.
- Set a preference: normalizes "stop asking me about X" → never-ask, etc.
Confirms ambiguous phrasings before writing via gstack-question-preference.
- Edit declared profile: interprets free-form ("more boil-the-ocean") and
CONFIRMS before mutating declared.* (trust boundary per Codex #15).
- Show gap: declared vs inferred diff with plain-English severity bands
(close / drift / mismatch). Never auto-updates declared from the gap.
- Stats: preference counts + diversity/calibration status.
- Enable / disable: gstack-config set question_tuning true|false.
Design constraints enforced:
- Plain English everywhere. No CLI subcommand syntax required. Shortcuts
(`profile`, `vibe`, `stats`, `setup`) exist but optional.
- user-origin gate on tune: writes. source: "plan-tune" for user-invoked
/plan-tune; source: "inline-user" for inline tune: from other skills.
- One-way doors override never-ask (safety, surfaced to user).
- No behavior adaptation in v1 — this skill inspects and configures only.
Generates plan-tune/SKILL.md at ~11.6k tokens, well under the 100KB ceiling.
Generated for all hosts via `bun run gen:skill-docs --host all`.
Full free test suite: 1149 pass, 0 fail, 113 skip across 28 files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Added 6 tests to test/plan-tune.test.ts:
Preamble injection (3 tests):
- tier 2+ includes Question Tuning section with preference check, log,
and user-origin gate language ('profile-poisoning defense', 'inline-user')
- tier 1 does NOT include the prose section (QUESTION_TUNING bash echo
still fires since it's in the bash block all tiers share)
- codex host swaps binDir references to $GSTACK_BIN
End-to-end pipeline (3 tests) — real binaries working together, not mocks:
- Log 5 expand choices → --derive → profile shows scope_appetite > 0.5
(full log → registry lookup → signal map → normalization round-trip)
- --write source: inline-tool-output rejected; --read confirms no pref
was persisted (the profile-poisoning defense actually works end-to-end)
- Migrate a 3-session legacy file; confirm legacy gstack-builder-profile
shim still returns SESSION_COUNT: 3, TIER: welcome_back, CROSS_PROJECT: true
test/plan-tune.test.ts now has 47 tests total.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test/skill-e2e-plan-tune.test.ts — verifies /plan-tune correctly routes
plain-English intent ("review the questions I've been asked") to the
Review question log section without requiring CLI subcommand syntax.
Seeds a synthetic question-log.jsonl with 3 entries exercising:
- override behavior (user chose expand over recommended selective)
- one-way door respect (user followed ship-test-failure-triage recommendation)
- two-way override (user skipped recommended changelog polish)
Invokes the skill via `claude -p` and asserts:
- Agent surfaces >= 2 of 3 logged question_ids in output
- Agent notices override/skip behavior from the log
- Exit reason is success or error_max_turns (not agent-crash)
Gate-tier because the core v1 DX promise is plain-English intent routing.
If it requires memorized subcommands or breaks on natural language, that's
a regression of the defining feature.
Registered in test/helpers/touchfiles.ts with dependencies:
- plan-tune/** (skill template + generated md)
- scripts/question-registry.ts (required for log lookup)
- scripts/psychographic-signals.ts, scripts/one-way-doors.ts (derive path)
- bin/gstack-question-log, gstack-question-preference, gstack-developer-profile
Skipped when EVALS_ENABLED is not set; runs on `bun run test:evals`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships /plan-tune as observational substrate: typed question registry, dual-track developer profile (declared + inferred), explicit per-question preferences with user-origin gate, inline tune: feedback across every tier >= 2 skill, unified developer-profile.json with migration from builder-profile.jsonl. Scope rolled back from initial CEO EXPANSION plan after outside-voice review (Codex). 6 deferrals tracked as P0 TODOs with explicit acceptance criteria: E1 substrate wiring, E3 narrative/vibe, E4 blind-spot coach, E5 LANDED celebration, E6 auto-adjustment, E7 psychographic auto-decide. See docs/designs/PLAN_TUNING_V0.md for the full design record. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
E2E Evals: ✅ PASS | 61/61 tests passed | $6.77 total cost | 12 parallel runners
12x ubicloud-standard-2 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite
Conflicts resolved:
- VERSION: keep 0.19.0.0 (our MINOR bump stays above main's new 0.18.2.0)
- package.json: bumped to 0.19.0.0 to match VERSION
  (test/gen-skill-docs.test.ts asserts they match)
- CHANGELOG.md: keep both entries, our v0.19.0.0 on top and main's v0.18.2.0
  below it, preserving the full version history
Main added v0.18.2.0 (#1030): context-rot defense for /ship (subagent
isolation for coverage/plan-completion/greptile/docs steps + clean integer
step numbering).
Regenerated all SKILL.md files after merge so they reflect both main's ship
template changes AND our preamble additions (question-tuning section).
Full free test suite: 1158 pass, 0 fail, 113 skip across 28 files, 7855
expect() calls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The CI image build failed with:
  E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/... Connection failed [IP: 91.189.92.22 80]
  ERROR: process "/bin/sh -c apt-get update && apt-get install ..." did not complete successfully: exit code: 100
archive.ubuntu.com periodically returns "connection refused" on individual
regional mirrors. Without retry logic a single failed fetch nukes the whole
Docker build.
Three defenses, layered:
1. /etc/apt/apt.conf.d/80-retries: apt fetches each package up to 5 times
   with a 30s timeout. Handles per-package flakes.
2. Shell-loop retry around the whole apt-get step (x3, 10s sleep): handles
   the case where apt-get update itself can't reach any mirror.
3. --retry 5 --retry-delay 5 --retry-connrefused on all curl fetches (bun
   install script, GitHub CLI keyring, NodeSource setup script).
Applied to every apt-get and curl call in the Dockerfile. No behavior change
on happy path; only kicks in when mirrors blip.
Fixes the build-image job that was blocking CI on the /plan-tune PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Conflicts resolved:
- VERSION / package.json: keep 0.19.0.0 (our MINOR bump stays above main's
  new 0.18.3.0; community wave v0.18.3.0 + our plan-tune v0.19.0.0 both ship,
  ours on top).
- CHANGELOG.md: preserved both entries in order, v0.19.0.0 (plan-tune) above
  v0.18.3.0 (community wave). No version gaps.
- .github/docker/Dockerfile.ci: main's Hetzner-mirror swap is a better root
  cause fix than my retry-only patch (route-local for Ubicloud runners, avoids
  archive.ubuntu.com entirely). Combined: main's mirror swap PLUS my
  defense-in-depth layers on top (apt retries config, --retry-connrefused on
  curl, and outer shell-loop retries for apt-get update). Mirror swap solves
  the root cause; retries handle the rare case where even Hetzner blips.
Main added:
- v0.18.3.0 (#1028): community wave. Windows cookie import, OpenCode install,
  permission-prompt cleanup, $B server persistence across Bash calls, cookie
  picker fix, OpenClaw frontmatter fix.
- Dockerfile.ci Hetzner mirror swap (from the same wave).
Regenerated all SKILL.md files after merge so they reflect main's design-*
template changes AND our question-tuning preamble additions.
Full free test suite: 1162 pass, 0 fail, 113 skip across 29 files, 7903
expect() calls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the V1 design (ELI10 writing + LOC reframe) in docs/designs/PLAN_TUNING_V1.md and the extracted V1.1 pacing-overhaul plan in docs/designs/PACING_UPDATES_V0.md. V1 scope was reduced from the original bundled pacing + writing-style plan after three engineering-review passes revealed structural gaps in the pacing workstream that couldn't be closed via plan-text editing. TODOS.md P0 entry links to V1.1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Repo-owned list of ~50 high-frequency technical terms (idempotent, race condition, N+1, backpressure, etc.) that gstack glosses on first use in tier-≥2 skill output. Baked into generated SKILL.md prose at gen-skill-docs time. Terms not on this list are assumed plain-English enough. Contributions via PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion prompt
Adds a new Writing Style section to tier-≥2 preamble output composing with
the existing AskUserQuestion Format section. Six rules: jargon glossed on
first use per skill invocation (from scripts/jargon-list.json), outcome-
framed questions, short sentences, decisions close with user impact,
gloss-on-first-use even if user pasted term, user-turn override for "be
terse" requests. Baked conditionally (skip if EXPLAIN_LEVEL: terse).
Adds EXPLAIN_LEVEL preamble echo using \${binDir} (host-portable matching
V0 QUESTION_TUNING pattern). Adds WRITING_STYLE_PENDING echo reading a
flag file written by the V0→V1 upgrade migration; on first post-upgrade
skill run, the agent fires a one-time AskUserQuestion offering terse mode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds explain_level: default|terse to the annotated config header with
a one-line description. Whitelists valid values; on set of an unknown
value, prints a specific warning ("explain_level '\$VALUE' not
recognized. Valid values: default, terse. Using default.") and writes
the default value. Matches V1 preamble's EXPLAIN_LEVEL echo expectation.
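The whitelist behavior can be illustrated with a short sketch. It is written in TypeScript for consistency with the rest of the repo's scripts; the function name and return shape are assumptions, and gstack-config itself may implement this differently. The warning text is the one quoted above.

```typescript
// Hypothetical sketch of explain_level validation on set.
const VALID_EXPLAIN_LEVELS = ["default", "terse"] as const;
type ExplainLevel = (typeof VALID_EXPLAIN_LEVELS)[number];

function normalizeExplainLevel(value: string): { value: ExplainLevel; warning?: string } {
  if ((VALID_EXPLAIN_LEVELS as readonly string[]).includes(value)) {
    return { value: value as ExplainLevel };
  }
  // Unknown value: warn specifically and fall back to the default.
  return {
    value: "default",
    warning: `explain_level '${value}' not recognized. Valid values: default, terse. Using default.`,
  };
}
```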
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New migration script following existing v0.15.2.0.sh / v0.16.2.0.sh pattern. Writes a .writing-style-prompt-pending flag file on first run post-upgrade. The preamble's migration-prompt block reads the flag and fires a one-time AskUserQuestion offering the user a choice between the new default writing style and restoring V0 prose via \`gstack-config set explain_level terse\`. Idempotent via flag files; if the user has already set explain_level explicitly, counts as answered and skips. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…scc installer
Three new scripts:
- scripts/garry-output-comparison.ts: enumerates Garry-authored commits in
  2013 + 2026 on public repos, extracts ADDED lines from git diff, classifies
  as logical SLOC via scc --stdin (regex fallback if scc missing). Writes
  docs/throughput-2013-vs-2026.json with per-language breakdown + explicit
  caveats (public repos only, commit-style drift, private-work exclusion).
- scripts/update-readme-throughput.ts: reads the JSON if present, replaces
  the README's <!-- GSTACK-THROUGHPUT-PLACEHOLDER --> anchor with the computed
  multiple (preserving the anchor for future runs). If JSON missing, writes
  GSTACK-THROUGHPUT-PENDING marker that CI rejects, forcing the build to run
  before commit.
- scripts/setup-scc.sh: standalone OS-detecting installer for scc. Not a
  package.json dependency (95% of users never run throughput). Brew on macOS,
  apt on Linux, GitHub releases link on Windows.
Two-string anchor pattern (PLACEHOLDER vs PENDING) prevents the pipeline from
destroying its own update path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
V1 reorders the /retro summary table to lead with features shipped, then commits + weighted commits (commits × files-touched capped at 20), then PRs merged, then logical SLOC added as the primary code-volume metric. Raw LOC stays present but is demoted to context. Rationale inline in the template: ten lines of a good fix is not less shipping than ten thousand lines of scaffold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
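One way to read "commits × files-touched capped at 20" is that each commit contributes its files-touched count, capped at 20, so a single sweeping commit cannot dominate. The sketch below assumes that reading; the function name is hypothetical and the actual /retro template may compute the weight differently.

```typescript
// Hypothetical sketch: per-commit weight = min(files touched, 20), summed.
function weightedCommits(filesTouchedPerCommit: number[], cap = 20): number {
  return filesTouchedPerCommit.reduce((sum, files) => sum + Math.min(files, cap), 0);
}

// Three commits touching 1, 5, and 300 files: 1 + 5 + 20 = 26.
const weighted = weightedCommits([1, 5, 300]);
```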
…ump to 1.0.0.0
README.md:
- Hero removes "600,000+ lines of production code" framing; replaces with the
  computed 2013-vs-2026 pro-rata multiple (via
  <!-- GSTACK-THROUGHPUT-PLACEHOLDER --> anchor, filled by the
  update-readme-throughput build step).
- Hiring callout: "ship real products at AI-coding speed" instead of "10K+
  LOC/day."
- New Writing Style section (~80 words) between Quick start and Install:
  "v1 prompts = simpler" framing, outcome-language example, terse-mode
  opt-out, pointer to /plan-tune.
CLAUDE.md: one-paragraph Writing style (V1) note under project conventions,
linking to preamble resolver + V1 design docs.
CHANGELOG.md: V1 entry on top of v0.19.0.0 with user-facing narrative (what
changes, how to opt out, for-contributors notes). Mentions scope reduction:
pacing overhaul ships in V1.1.
CONTRIBUTING.md: one-paragraph note on jargon-list.json maintenance (PR to
add/remove terms; regenerate via gen:skill-docs).
VERSION + package.json: bump to 1.0.0.0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mechanical regeneration from the updated templates in prior commits:
- Writing Style section now appears in tier-≥2 skill output.
- EXPLAIN_LEVEL + WRITING_STYLE_PENDING echoes in preamble bash.
- V1 migration-prompt block fires conditionally on first upgrade.
- Jargon list inlined into preamble prose at gen time.
- Retro template's logical SLOC + weighted commits order applied.
Regenerated for all 8 hosts via bun run gen:skill-docs --host all. Golden
ship-skill fixtures refreshed from regenerated outputs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…igration + dormancy
Six new gate-tier test files:
- test/writing-style-resolver.test.ts: asserts Writing Style section is
  injected into tier-≥2 preamble, all 6 rules present, jargon list inlined,
  terse-mode gate condition present, Codex output uses \$GSTACK_BIN (not
  ~/.claude/), tier-1 does NOT get the section, migration-prompt block present.
- test/explain-level-config.test.ts: gstack-config set/get round-trip for
  default + terse, unknown-value warns + defaults to default, header documents
  the key, round-trip across set→set→get.
- test/jargon-list.test.ts: shape + ~50 terms + no duplicates
  (case-insensitive) + includes canonical high-signal terms.
- test/v0-dormancy.test.ts: 5D dimension names + archetype names forbidden in
  default-mode tier-≥2 SKILL.md output, except for plan-tune and office-hours
  where they're load-bearing.
- test/readme-throughput.test.ts: script replaces anchor with number on happy
  path, writes PENDING marker when JSON missing, CI gate asserts committed
  README contains no PENDING string.
- test/upgrade-migration-v1.test.ts: fresh run writes pending flag, idempotent
  after user-answered, pre-existing explain_level counts as answered.
All 95 V1 test-expect() calls pass. Full suite: 0 failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ran scripts/garry-output-comparison.ts across all 15 public garrytan/* repos.
Aggregated results into docs/throughput-2013-vs-2026.json and ran
scripts/update-readme-throughput.ts to replace the README placeholder.
2013 public activity: 2 commits, 2,384 logical lines added across 1 week, in
1 repo (zurb-foundation-wysihtml5 upstream contribution).
2026 public activity: 279 commits, 310,484 logical lines added across 17
active weeks, in 3 repos (gbrain, gstack, resend_robot).
Multiples (public repos only, apples-to-apples):
- Logical SLOC: 130.2×
- Commits per active week: 8.2×
- Raw lines added: 134.4×
Private work in both eras (2013 Bookface at YC, Posterous-era code, 2026
internal tools) is excluded from this comparison.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-ran scripts/garry-output-comparison.ts across all 41 repos under
garrytan/* (15 public + 26 private), including Bookface (YC's internal social
network, 2013-era work).
2013 activity: 71 commits, 5,143 logical lines, 4 active repos (bookface,
delicounter, tandong, zurb-foundation-wysihtml5)
2026 activity: 350 commits, 1,064,818 logical lines, 15 active repos (gbrain,
gstack, gbrowser, tax-app, kumo, tenjin, autoemail, kitsune,
easy-chromium-compiles, conductor-playground, garryslist-agent, baku,
gstack-website, resend_robot, garryslist-brain)
Multiples:
- Logical SLOC: 207× (up from 130.2× when including private work)
- Raw lines: 223×
- Commits/active-week: 3.4×
Stopped committing docs/throughput-2013-vs-2026.json; the analysis is a local
artifact, not repo state. Added docs/throughput-*.json to .gitignore. Full
markdown analysis at ~/throughput-analysis-2026-04-18.md (local-only).
README multiple is now hardcoded; re-run the script and edit manually when you
want to refresh it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two separate numbers in the README hero:
- Run rate: ~700× (9,859 logical lines/day in 2026 vs 14/day in 2013)
- Year-to-date: 207× (2026 through April 18 already exceeds 2013 full year by 207×)
Previous "207× pro-rata" framing mixed full-year 2013 vs partial-year 2026. Run rate is the apples-to-apples normalization; YTD is the "already produced" total. Both are honest; both are compelling; they measure different things.
Analysis at ~/throughput-analysis-2026-04-18.md (local-only).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Enhanced scripts/garry-output-comparison.ts so both calculations come out of a single run instead of being reassembled ad-hoc in bash.
PerYearResult now includes:
- days_elapsed: 365 for past years, day-of-year for current
- is_partial: flags the current (in-progress) year
- per_day_rate: logical/raw/commits normalized by calendar day
- annualized_projection: per_day_rate × 365
Output JSON's `multiples` now has two sibling blocks:
- multiples.to_date: raw volume ratios (2026-YTD / 2013-full-year)
- multiples.run_rate: per-day pace ratios (apples-to-apples)
Back-compat: multiples.logical_lines_added still aliases to_date for older consumers reading the JSON.
Updated README hero to cite both (picking up brain/* repo that was missed in the earlier aggregation pass):
2026 run rate: ~880× my 2013 pace (12,382 vs 14 logical lines/day)
2026 YTD: 260× the entire 2013 year
Stderr summary now prints both multiples at the end of each run.
Full analysis at ~/throughput-analysis-2026-04-18.md (local-only).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
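To make the two multiples concrete, here is a minimal sketch of how the per-year fields and the to_date / run_rate ratios relate. Field names follow the commit message; the function names, the logical-lines-only scope, and the UTC date handling are assumptions, and the real PerYearResult also carries raw lines and commits. The inputs are the totals reported earlier in this PR (5,143 logical lines for 2013; 1,064,818 for 2026 through April 18).

```typescript
interface PerYearResult {
  year: number;
  logical_lines_added: number;
  days_elapsed: number; // 365 for past years, day-of-year for the current year
  is_partial: boolean; // true only for the in-progress year
  per_day_rate: number; // logical lines per calendar day
  annualized_projection: number; // per_day_rate × 365
}

// 1-based day of year, computed in UTC to avoid timezone drift.
function dayOfYear(d: Date): number {
  const start = Date.UTC(d.getUTCFullYear(), 0, 1);
  const today = Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate());
  return Math.floor((today - start) / 86400000) + 1;
}

function summarizeYear(year: number, logical: number, asOf: Date): PerYearResult {
  const isPartial = year === asOf.getUTCFullYear();
  const days = isPartial ? dayOfYear(asOf) : 365;
  const perDay = logical / days;
  return {
    year,
    logical_lines_added: logical,
    days_elapsed: days,
    is_partial: isPartial,
    per_day_rate: perDay,
    annualized_projection: perDay * 365,
  };
}

// to_date compares raw volume; run_rate compares per-day pace.
function computeMultiples(base: PerYearResult, current: PerYearResult) {
  return {
    to_date: current.logical_lines_added / base.logical_lines_added,
    run_rate: current.per_day_rate / base.per_day_rate,
  };
}

const asOf = new Date(Date.UTC(2026, 3, 18)); // April 18, 2026
const y2013 = summarizeYear(2013, 5143, asOf);
const y2026 = summarizeYear(2026, 1064818, asOf);
const m = computeMultiples(y2013, y2026);
console.log(Math.round(m.to_date), Math.round(m.run_rate));
```

With those inputs the two ratios round to 207 and 700, the figures cited for the earlier README framing. The run rate is larger because the 2026 total is divided by 108 elapsed days rather than a full 365.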
Long-form response to the "LOC is a meaningless vanity metric" critique.
Covers:
- The three branches of the LOC critique and which are right
- Why logical SLOC (NCLOC) beats raw LOC as the honest measurement
- Full method: author-scoped git diff, regex-classified added lines, aggregated across 41 public + private garrytan/* repos
- Both calculations: to-date (260x) and run-rate (879x)
- Steelman of the critics (greenfield-vs-maintenance, survivorship bias, quality-adjusted productivity, time-to-first-user)
- Reproduction instructions
Linked from README hero via a blockquote directly below the number.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tax-app's history is one commit of 104K logical lines: an initial import of a codebase, not authored work. Removing it to keep the comparison honest.
Changes:
- scripts/garry-output-comparison.ts: added EXCLUDED_REPOS constant with tax-app + a one-line rationale. The script now skips excluded repos with a stderr note and deletes any stale output JSON so aggregation loops don't pick up pre-exclusion numbers.
- README hero: updated to 810× run rate + 240× YTD (were 880×/260×). Wording updated to "40 public + private repos ... after excluding repos dominated by imported code."
- docs/ON_THE_LOC_CONTROVERSY.md: updated all numbers, added an "Exclusions" paragraph explaining tax-app, removed tax-app from the "shipped not WIP" example list.
New numbers (2026 through day 108, without tax-app):
- To-date: 240× logical SLOC (1,233,062 vs 5,143)
- Run rate: 810× per-day pace (11,417 vs 14 logical/day)
- Annualized: ~4.2M logical lines projected
Future re-runs automatically skip tax-app. Add more exclusions to EXCLUDED_REPOS at the top of the script with a one-line rationale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tax-app is a demo app I built for an upcoming YC channel video,
not an "import-dominated history" as the previous commit claimed.
Excluded because it's not production shipping work, not because
of an import commit.
Updated rationale in scripts/garry-output-comparison.ts's
EXCLUDED_REPOS constant, in docs/ON_THE_LOC_CONTROVERSY.md's
method section + conclusion, and in the README hero wording
("one demo repo" vs the earlier "repos dominated by imported code").
Numbers unchanged — the exclusion itself is the same, just the
reason.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
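As an illustration of the exclusion mechanism, the sketch below shows one way a repo list could be filtered against an EXCLUDED_REPOS table with a per-repo rationale. Only the constant's name and the tax-app entry come from the commits above; the Map shape, the selectRepos helper, and the note wording are assumptions, and the stale-JSON cleanup the script also performs is left out.

```typescript
// Repo name -> one-line rationale. Shape is an assumption; the rationale
// paraphrases the corrected commit message.
const EXCLUDED_REPOS = new Map<string, string>([
  ["tax-app", "demo app for a video, not production shipping work"],
]);

interface RepoSelection {
  included: string[]; // repos that go into the aggregation
  notes: string[]; // one human-readable line per skipped repo
}

// Hypothetical helper: split the input into analyzed and skipped repos.
function selectRepos(repos: string[]): RepoSelection {
  const included: string[] = [];
  const notes: string[] = [];
  for (const repo of repos) {
    const reason = EXCLUDED_REPOS.get(repo);
    if (reason !== undefined) {
      notes.push(`skipping ${repo}: ${reason}`);
      continue;
    }
    included.push(repo);
  }
  return { included, notes };
}

const selection = selectRepos(["gstack", "tax-app", "gbrain"]);
// Skip notes go to stderr so they do not pollute JSON written to stdout.
for (const note of selection.notes) console.error(note);
```

Keeping the rationale next to the repo name means a later change of reason, like the one in this commit, touches a single line.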
Reframes the thesis as "engineers can fly now" (amplification, not replacement) and fortifies the soft spots critics will attack.
Added:
- Flight-thesis opener: pilot vs walker, leverage not replacement.
- Second deflation layer for AI verbosity (on top of NCLOC). Headline moves from 810x to 408x after generous 2x AI-boilerplate cut, with explicit sensitivity analysis showing the number is still large under pessimistic priors (5x → 162x, 10x → 81x, 100x impossible).
- Weekly distribution check (kills "you had one burst week" attack).
- Revert rate (2.0%) and post-merge fix rate (6.3%) with OSS comparables (K8s/Rails/Django band). Addresses "where are your error rates" directly.
- Named production adoption signals (gstack 1000+ installs, gbrain beta, resend_robot paying API) with explicit concession that "shipped != used at scale" for most of the corpus.
- Harder steelman: 5 specific concessions with quantified pivot points (e.g., "if 2013 baseline was 3.5x higher, 810x → 228x, still high").
Removed factual error: Posterous acquisition paragraph (Garry had already left Posterous by 2011, so the "Twitter bought our private repos" excuse for the 2013 corpus gap doesn't apply).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gstack: "1,000+ distinct project installations" → "tens of thousands of daily active users" (telemetry-reported, community tier, opt-in). gbrain: "small set of beta testers" → "hundreds of beta testers running it live." Both are the accurate current numbers. The concession paragraph below (about shipped != adopted at scale for the long-tail repos) still reads correctly since it's about the corpus as a whole, not gstack/gbrain specifically. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
"You'd need access to my private repos" → "Bookface and Posthaven are private, but gstack and gbrain are open-sourced with tens of thousands of GitHub stars and tens of thousands of confirmed regular users, among the most-used OSS projects in the world that didn't exist three months ago." Keeps the `gh repo list` command at the end for the actual reproducibility instruction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Lead with concession (LOC is garbage, do the math anyway)
- Preempt 14 lines/day meme with historical baselines (Brooks, Jones, McConnell)
- Remove 'neckbeard' language throughout
- Add slop-scan story (Ben Vinegar, 5.24 → 1.96, 62% cut)
- David Cramer GUnit joke
- Add testing philosophy section (the real unlock)
- ASCII weekly distribution chart
- gstack telemetry section with real numbers (15K installs, 305K invocations, 95.2% success)
- Top skills usage chart
- Pick-your-priors paragraph moved earlier (the killer)
- Sharper close: run the script, show me your numbers
1. Citation fix. Kernighan didn't say anything about LOC-as-metric
(that's the famous "aircraft building by weight" quote, commonly
misattributed but actually Bill Gates). Replaced "Kernighan implied
it before that" with the real Dijkstra quote ("lines produced" vs
"lines spent" from EWD1036, with direct link) + the Gates quote.
Verified via web search.
2. Slop-scan direction clarified. "(highest on his benchmark)" was
ambiguous — could read as a brag. Now: "Higher score = more slop.
He ran it on gstack and we scored 5.24, the worst he'd measured
at the time." Then the 62% cut lands as an actual win.
3. Prose/chart skill-usage ordering now matches. Added /plan-eng-review
(28,014) to the prose list so it doesn't conflict with the chart
below it.
4. Cut the "David — I owe you one / GUnit" insider joke. Most readers
won't connect Cramer → Sentry → GUnit naming. Ends the slop-scan
paragraph on the stronger line: "Run `bun test` and watch 2,000+
tests pass."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Conflicts resolved:
- VERSION / package.json: keep 1.0.0.0 (our v1 milestone bump from commit 18fc95c, supersedes main's 0.18.4.0).
- CHANGELOG.md: preserved both entries in chronological order, with v1.0.0.0 (writing-style + throughput receipts) above v0.18.4.0 (codex + Apple Silicon hardening). No gaps in sequence: 1.0.0.0 → 0.19.0.0 → 0.18.4.0 → 0.18.3.0 → ...
Main added v0.18.4.0 (#1056): codex + Apple Silicon hardening wave.
Regenerated all SKILL.md files after merge so they pick up both main's template changes AND our v1 writing-style + question-tuning additions.
Full free test suite: 1288 pass, 0 fail, 113 skip across 37 files, 8555 expect() calls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. Bill Gates quote: flagged as folklore-grade. Was "Bill Gates put it more memorably" (firm attribution). Now "The old line (widely attributed to Bill Gates, sourcing murky) puts it more memorably." The quote stands; honesty about attribution avoids the same misattribution trap we just fixed for Kernighan.
2. Capers Jones: "15-50 across thousands of projects" → "roughly 16-38 LOC/day across thousands of projects". Matches his actual published measurements (which also report as 325-750 LOC/month).
3. Steve McConnell: "10-50 for finished, tested, delivered code" was folklore. Replaced with his actual project-size-dependent range from Code Complete: "20-125 LOC/day for small projects (10K LOC) down to 1.5-25 for large projects (10M LOC) — it's size-dependent, not a single number."
4. Revert rate comparison: "Kubernetes, Rails, and Django historically run 1.5-3%" was unsourced. Replaced with "mature OSS codebases typically run 1-3%" + "run the same command on whatever you consider the bar and compare." No false specificity about which repos.
Net: every quantitative citation in the post now matches primary-source figures or is explicitly flagged as folklore. Neckbeards can verify.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Was sitting in prime real estate between Quick start and Install. It is an internal implementation detail, not something users need up-front.
Existing coverage is enough:
- Upgrade migration prompt notifies users on first post-upgrade run
- CLAUDE.md has the contributor note
- docs/designs/PLAN_TUNING_V1.md has the full design
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 2 was three separate code blocks: setup --team, then team-init, then git add/commit. Mirrors Step 1's style now — one shell one-liner that does all three. Subshell (cd && ./setup --team) keeps the user in their repo pwd so team-init + git commit land in the right place. "Swap required for optional" moved to a one-liner below. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "Contributing or need full history?" note is for contributors, not for someone following the README install flow. Moved into CONTRIBUTING's Quick start section where it fits next to the existing clone command, with a tip to upgrade an existing shallow clone via `git fetch --unshallow`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
This branch ships two layers of change that together represent gstack v1.
Layer 1: /plan-tune observational substrate (from earlier commits on this branch)
- `/plan-tune` skill: conversational inspection + per-question preferences + inline `tune:` feedback
- Typed question registry (`scripts/question-registry.ts`) with CI enforcement for 53 recurring AskUserQuestion categories across 15 skills
- `builder-profile.jsonl` → `developer-profile.json`
Layer 2: ELI10 writing + LOC reframe (this session)
- `gstack-config set explain_level terse` restores V0 prose.
- `scripts/jargon-list.json` (~50 terms, baked at `gen-skill-docs` time).
- `<!-- GSTACK-THROUGHPUT-PLACEHOLDER -->` anchor that `scripts/update-readme-throughput.ts` replaces with a computed 2013-vs-2026 pro-rata multiple from `scripts/garry-output-comparison.ts` (scc-backed logical SLOC + caveats). Two-string pattern (PLACEHOLDER + PENDING) prevents the pipeline from destroying its own update path.
- `/retro` metric reorder. Features shipped → commits + weighted commits → PRs merged → logical SLOC → raw LOC demoted.
- `gstack-upgrade/migrations/v1.0.0.0.sh`. One-time, flag-file-gated, offers terse opt-out.
Deferred to V1.1
Review pacing overhaul: ranking findings, auto-accepting two-way doors, max 3 prompts per phase, Silent Decisions block with flip mechanism. Extracted to `docs/designs/PACING_UPDATES_V0.md` after three eng-review passes + two Codex passes surfaced 10+ structural gaps unfixable via plan-text editing. TODOS.md P0 entry links to the V1.1 design doc.
Review History
Full design captured in `docs/designs/PLAN_TUNING_V1.md`.
Test Coverage
All new V1 code has dedicated test coverage:
- `test/writing-style-resolver.test.ts`: preamble Writing Style section injected, all 6 rules present, jargon list inlined, terse-mode gate present, host-aware paths (Codex output contains no `.claude/`), tier-1 excluded, migration-prompt block present
- `test/explain-level-config.test.ts`: gstack-config `set`/`get` round-trip for `default` + `terse`, unknown-value warns + defaults, annotated header documents the key
- `test/jargon-list.test.ts`: JSON shape, ~50 terms, no duplicates, canonical high-signal terms included
- `test/v0-dormancy.test.ts`: 5D dimension names + 8 archetype names forbidden in default-mode tier-≥2 SKILL.md output (except `plan-tune` + `office-hours` where they're load-bearing)
- `test/readme-throughput.test.ts`: update-readme-throughput happy path, PENDING marker on missing JSON, CI gate asserts committed README contains no PENDING string
- `test/upgrade-migration-v1.test.ts`: migration script writes pending flag on fresh state, idempotent after user-answered, pre-existing `explain_level` counts as answered
- `test/gen-skill-docs.test.ts`: extended to cover V1 preamble injection + retro template changes
Plus pre-existing V0 tests (question-log, question-preference, developer-profile, plan-tune flows, registry coverage, one-way-door classifier) unchanged and passing.
Result: 0 test failures across the full suite. Host smoke tests pass for all 8 generated hosts. Golden ship-skill fixtures refreshed.
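The `explain_level` behavior covered by the config test (two valid values, unknown input warns and falls back to `default`) can be modeled as below. This is only a sketch of the rule: the resolveExplainLevel name, the return shape, and the warning text are hypothetical, and the real validation in bin/gstack-config is not shown in this PR and may be implemented quite differently.

```typescript
type ExplainLevel = "default" | "terse";

const VALID_LEVELS: readonly ExplainLevel[] = ["default", "terse"];

interface ResolvedLevel {
  level: ExplainLevel;
  warning?: string; // set only when the input was not recognized
}

// Hypothetical helper: accept a known level, otherwise warn and use "default".
function resolveExplainLevel(raw: string): ResolvedLevel {
  const match = VALID_LEVELS.find((candidate) => candidate === raw);
  if (match !== undefined) {
    return { level: match };
  }
  return {
    level: "default",
    warning: `unknown explain_level "${raw}", using "default"`,
  };
}

const resolved = resolveExplainLevel("garbage");
if (resolved.warning !== undefined) console.error(resolved.warning);
```

Returning the warning as data instead of printing inside the function keeps the rule easy to assert on in a round-trip test.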
Version
`0.19.0.0` → `1.0.0.0`: major bump reflects the default-behavior flip. Upgrade migration script offers users who prefer V0 prose a one-click opt-out on first skill run post-upgrade.
Docs
Files changed
78 files, +12,971 / -152 (includes regenerated SKILL.md for all 8 hosts — most are mechanical gen outputs from the preamble template change).
Key source changes:
`scripts/resolvers/preamble.ts` (+71 lines for Writing Style + EXPLAIN_LEVEL + migration prompt), `bin/gstack-config` (+13 lines for validation), `scripts/jargon-list.json` (new, ~50 terms), `retro/SKILL.md.tmpl` (metric reorder), `gstack-upgrade/migrations/v1.0.0.0.sh` (new), `scripts/garry-output-comparison.ts` + `scripts/update-readme-throughput.ts` + `scripts/setup-scc.sh` (new, LOC tooling).
Test plan
- `bun test`: 0 failures across the full suite
- `bun run gen:skill-docs --host all`: all 8 hosts regenerated successfully
- `gstack-config set explain_level terse` / `gstack-config get explain_level` round-trip works
- `gstack-config set explain_level garbage` warns + defaults to `default`
Post-merge:
- `scripts/garry-output-comparison.ts` across all public `garrytan/*` repos to produce the real 2013-vs-2026 multiple and replace the README placeholder
- `/autoplan` on a Louise-style plan and report whether writing feels made-for-them
- `docs/designs/PACING_UPDATES_V0.md` to a full plan with CEO + Codex + DX + Eng reviews
🤖 Generated with Claude Code