
v1.33.1.0 fix(learnings): token-OR query + task-shaped retrieval in 3 long skills#1442

Merged

garrytan merged 5 commits into main from garrytan/santiago-v1 on May 12, 2026
Conversation

@garrytan (Owner) commented May 12, 2026

Summary

For the last 30+ versions, every gstack skill loaded learnings the same way — gstack-learnings-search --limit 10 at the top of the skill, generic top-10, no query, no refresh. Short skills are fine; long skills (/investigate, /qa, /ship) drift away from whatever was loaded at minute zero. This PR fixes that for the three long skills, plus a root-cause fix to the binary's --query semantics that was silently broken.

Headline changes:

  • bin/gstack-learnings-search --query now uses token-OR matching. Multi-word queries split on whitespace; a learning matches if ANY token appears in ANY of key/insight/files. The old whole-string substring behavior was a silent footgun — --query "debug investigation" returned nothing on real-world stores.
  • {{LEARNINGS_SEARCH:query=KEYWORD}} macro — parameterized resolver with shell-injection guard. Query value must match ^[A-Za-z0-9 _-]+$ at gen-skill-docs time, so $(), backticks, semicolons, quotes throw a build error.
  • /investigate, /qa, /ship — top-of-skill pull keyed to their theme (e.g. ship → release ship version changelog merge pr). Each gains a mid-flow refresh at the phase boundary (/investigate Phase 1→2, /qa triage→fix-loop, /ship pre-VERSION) with a concrete keyword recipe + worked examples that enforce alphanumeric-only keywords to sidestep shell quoting.
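As a rough sketch of the token-OR semantics (the key/insight/files field names come from this PR; the `Learning` type, the function name, and case-insensitive matching are assumptions, not the repo's actual code):

```typescript
// Sketch of token-OR matching as described above. Field names follow the
// PR text; the Learning type, matchesQuery, and case-folding are assumptions.
interface Learning {
  key: string;
  insight: string;
  files: string[];
}

function matchesQuery(learning: Learning, query: string): boolean {
  // Split on whitespace; a whitespace-only query yields no tokens and
  // falls through to no-query behavior (match everything).
  const tokens = query.trim().split(/\s+/).filter(Boolean);
  if (tokens.length === 0) return true;
  const haystack = [learning.key, learning.insight, ...learning.files]
    .join("\n")
    .toLowerCase();
  // OR semantics: ANY token appearing in ANY field is a match.
  return tokens.some((token) => haystack.includes(token.toLowerCase()));
}
```

Under the old whole-string behavior the equivalent check was effectively `haystack.includes(query)`, which is why multi-word queries silently matched nothing.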

The other 13 skills are byte-identical in their generated SKILL.md output — backwards-compat verified via diff.

Numbers (real measurements on this project's learnings.jsonl, 35 entries)

| Query | Before (substring) | After (token-OR) | Δ |
| --- | --- | --- | --- |
| "debug investigation root cause" | 0 | 5 | +5 |
| "qa testing bug regression" | 0 | 2 | +2 |
| "release ship version changelog" | 0 | 8 | +8 |
| "skill resolver" | 0 | 12 | +12 |

Without the binary fix, the rest of this PR would have shipped silently broken — the bash would run, the binary would exit 0 with no output, and the skill would render the same empty section it always rendered.

Pre-Landing Review

Eng Review (PLAN) ran on 2026-05-11: CLEAR, 2 issues found (D1 superseded by D4 root-cause binary fix, D2 covered by 5 test assertions), 0 critical gaps, 0 unresolved.

Codex Outside Voice (plan-time) caught the load-bearing substring-vs-token-OR semantics flaw that invalidated the original D1 reasoning. Adopted as D4 (root-cause binary fix).

Codex Adversarial (diff-time, this PR): caught a shell-injection vector in the resolver — queryArg was being interpolated into bash without escaping. Fixed by adding QUERY_SAFE_RE = /^[A-Za-z0-9 _-]+$/ validation at gen-skill-docs time. New test assertion covers $(whoami), backticks, ;, &, ", \, $x payloads. Recommendation upgraded from "block" to "ship".
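A minimal sketch of that guard, assuming the surrounding function shape (QUERY_SAFE_RE is the name given above; the wrapper function and its error message are illustrative):

```typescript
// QUERY_SAFE_RE is named in the review note above; the wrapper function
// and error message are illustrative assumptions.
const QUERY_SAFE_RE = /^[A-Za-z0-9 _-]+$/;

function assertSafeQuery(queryArg: string): string {
  // Whitelist check at gen-skill-docs time: anything outside
  // [A-Za-z0-9 _-] (e.g. $(), backticks, ;, quotes) fails the build
  // instead of being interpolated into generated bash.
  if (!QUERY_SAFE_RE.test(queryArg)) {
    throw new Error(`Unsafe LEARNINGS_SEARCH query: ${JSON.stringify(queryArg)}`);
  }
  return queryArg;
}
```

A whitelist is the right shape here: rather than enumerating dangerous metacharacters, it rejects everything not explicitly known to be inert in double-quoted bash.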

Test Coverage

  • test/gstack-learnings-search.test.ts (NEW): 3 assertions on token-OR semantics — multi-token, single-token, no-query backwards compat.
  • test/gen-skill-docs.test.ts: 5 NEW assertions on the resolver — no-args, claude+query=foo bar (both bash branches), codex host variant, empty value falls through, shell-injection payloads throw.
  • Golden ship snapshot tests pass (claude + codex + factory hosts).
  • Full bun test suite passes.

Plan Completion

All 8 plan items DONE (verified against /Users/garrytan/.claude/plans/system-instruction-you-are-working-immutable-sunrise.md):

  1. ✅ Binary token-OR (Step 1) — commit a0539b5
  2. ✅ Resolver args parameter (Step 2) — commit 7a4f714
  3. ✅ 3 long skill queries (Step 3) — commit 513c966
  4. ✅ Mid-flow refresh checkpoints with concrete recipe (Step 4) — commit 513c966
  5. ✅ 5 resolver assertions (Step 5) — commit 7a4f714 (4 originally planned + 1 injection guard added at ship time)
  6. ✅ 3 binary assertions (Step 6) — commit a0539b5
  7. ✅ Regenerate SKILL.md (Step 7) — commit 513c966 + 7968290 (golden refresh)
  8. ✅ VERSION + CHANGELOG bump (Step 8) — commit accc864

Scope Drift

Clean. Plan intent: token-shaped learnings retrieval for long skills + binary fix. Delivered: exactly that, plus one defense-in-depth shell-injection guard added during ship-time Codex review.

For contributors

Contributed by @Fergtic (chronicle-write-up) who flagged the load-once + no-refresh pattern and the spend-per-success data point that motivated this work.

Test plan

  • bun test test/gstack-learnings-search.test.ts — 3 pass
  • bun test test/gen-skill-docs.test.ts -t "LEARNINGS_SEARCH" — 14 pass (5 LEARNINGS_SEARCH + 9 related)
  • bun test test/host-config.test.ts -t "golden" — 3 pass
  • Manual smoke: bin/gstack-learnings-search --query "skill resolver" --limit 5 matches 12 learnings via token-OR (output capped at 5 by --limit)
  • Backwards-compat: 13 short skills' generated SKILL.md unchanged (verified via git diff after bun run gen:skill-docs --host all)

🤖 Generated with Claude Code



garrytan and others added 5 commits May 11, 2026 18:58
Split the query on whitespace into tokens; a learning matches if ANY
token appears as a substring in ANY of key/insight/files. Previously
the whole query was a single substring, so multi-word queries like
"debug investigation" only matched learnings whose insight contained
that exact contiguous phrase, which is usually nothing.

Whitespace-only query falls through to no-query (matches today's no-flag
behavior). Single-word queries behave exactly as before.

Adds test/gstack-learnings-search.test.ts: 3 assertions covering
multi-token, single-token, and no-query backwards compat.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…uard

The {{LEARNINGS_SEARCH}} macro now accepts a query=KEYWORD argument that
gets interpolated as --query "<keyword>" into the generated bash. Empty
value falls through to no-query (principle of least surprise: a stray
{{LEARNINGS_SEARCH:query=}} placeholder gets today's behavior, not a
build failure). Pattern reuses the parameterized-macro parsing from
composition.ts. The 13 templates that don't pass a query stay
byte-identical in their generated SKILL.md output.

Shell-injection guard: the query value is whitelisted to
^[A-Za-z0-9 _-]+$ at gen-skill-docs time. Any $(), backticks,
semicolons, or quotes throw a loud build error instead of emitting
executable bash. Static template queries are safe by inspection;
this defends against future contributors writing dangerous values.

Adds 5 assertions to test/gen-skill-docs.test.ts covering no-args,
claude+query=foo bar on both cross-project and project-scoped branches,
codex host variant, empty value semantics, and shell-injection payloads
($(whoami), backticks, ;, &, ", \, $x) throwing build errors.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
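A hedged sketch of the resolver flow this commit describes — the emitted command line and the function name are assumptions; only the empty-value fallthrough and the whitelist semantics come from the commit text:

```typescript
// Illustrative resolver: empty/missing query falls through to the bare
// command; unsafe values fail the build. The exact emitted bash line is
// an assumption, not the repo's real output.
const QUERY_SAFE_RE = /^[A-Za-z0-9 _-]+$/;

function renderLearningsSearch(queryArg?: string): string {
  if (!queryArg || queryArg.trim() === "") {
    // {{LEARNINGS_SEARCH}} and a stray {{LEARNINGS_SEARCH:query=}} both
    // get today's no-query behavior rather than a build failure.
    return "gstack-learnings-search --limit 10";
  }
  if (!QUERY_SAFE_RE.test(queryArg)) {
    throw new Error(`Unsafe LEARNINGS_SEARCH query: ${queryArg}`);
  }
  // Safe by construction: the whitelist above excludes every shell
  // metacharacter, so plain double-quoting is sufficient.
  return `gstack-learnings-search --query "${queryArg}" --limit 10`;
}
```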
…/qa /ship

The three long skills now pull learnings keyed to their theme at the
top, then re-pull at phase boundaries as work shifts to new sub-tasks.

Top-of-skill queries (5-6 token unions, token-OR matched):
- investigate: "debug investigation root cause hypothesis bug fix"
- qa: "qa testing bug regression flake fixture"
- ship: "release ship version changelog merge pr"

Mid-flow refresh blocks (concrete keyword recipe + worked examples):
- investigate: between Phase 1 (hypothesis) and Phase 2 (analysis),
  keyed to the hypothesis noun. Examples: auth-cookie, session-expiry.
- qa: between Phase 7 (triage) and Phase 8 (fix loop), keyed to the
  buggy component name. Examples: checkout-button, signup-form.
- ship: just before Step 12 (VERSION bump), keyed to the headline
  feature. Examples: learnings-search, pacing, worktree-ship.

Keyword recipe enforces alphanumeric+hyphen only (no quotes, slashes,
dots, colons) so dynamic queries cannot inject shell metacharacters.

The other 13 short-lived skills keep the bare {{LEARNINGS_SEARCH}} form.
Backwards-compat verified via diff: their generated SKILL.md output is
byte-identical to before this change.

Golden ship fixtures regenerated to match the new ship/SKILL.md output.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
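The keyword recipe's constraint can be sketched as a simple predicate — the regex is inferred from "alphanumeric+hyphen only" in the commit above, and the function name is illustrative:

```typescript
// Inferred from the recipe above: keywords are alphanumeric plus hyphen,
// with no quotes, slashes, dots, or colons. Name and regex are assumptions.
const KEYWORD_RE = /^[A-Za-z0-9-]+$/;

function isSafeKeyword(keyword: string): boolean {
  return KEYWORD_RE.test(keyword);
}
```

Restricting dynamic, agent-chosen keywords to this alphabet means the worked examples (auth-cookie, checkout-button, learnings-search) pass, while anything that could break out of bash quoting fails before it reaches a shell.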
Follow-up to 513c966 — the codex and factory host outputs needed
regeneration too, missed in the initial commit because gen:skill-docs
was only run for the claude host. Now matches gen:skill-docs --host all.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

E2E Evals: ✅ PASS

13/13 tests passed | $2.13 total cost | 12 parallel runners

| Suite | Result | Cost |
| --- | --- | --- |
| e2e-qa-workflow | 2/2 | $0.7 |
| e2e-review | 2/2 | $0.46 |
| e2e-workflow | 2/2 | $0.17 |
| llm-judge | 5/5 | $0.1 |
| e2e-qa-workflow | 2/2 | $0.7 |

12x ubicloud-standard-2 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite

@garrytan garrytan merged commit 1a4f0c9 into main May 12, 2026
23 of 24 checks passed
Willardgmoore added a commit to Willardgmoore/gstack that referenced this pull request May 12, 2026
Brings in:
- v1.33.2.0: setup guard against Conductor worktree pollution (garrytan#1446)
- v1.33.1.0: learnings token-OR query + task-shaped retrieval (garrytan#1442)
- v1.33.0.0: /sync-gbrain memory stage batch-import refactor (garrytan#1432)

VERSION stays at 1.34.0.0 (no collision — queue-aware check confirms).
CHANGELOG: our [1.34.0.0] entry at top, three new main entries below.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
