perf(agent): tighten commandments + framework prompts for scale and quality by kelsonpw · Pull Request #799 · amplitude/wizard

kelsonpw · 2026-05-15T22:33:29Z

Summary

Three load-bearing rules added to the wizard's always-on commandments, plus a contradiction fix in the supplement skill that was causing agents to attempt dashboard work the universal rule forbids.

Changes shipped (ranked by ROI)

1. Strategy retry cap (3-approach ceiling)

Motivating signal: ~5pt completion→activation drop. Agents finish nominally but have looped on the same broken approach. The existing "retry budget 1, after 2 failures STOP" rule covers tactical retries; this caps strategic retry at 3 distinct approaches per goal and forces escalation into the setup report's "Known limitations" section.
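Here is a minimal sketch of the 3-approach ceiling as bookkeeping. It assumes an approach can be identified by a short label. StrategyLedger, next(), and the wording of the escalation note are hypothetical. In this PR the cap is enforced by prompt text, not by code.

```typescript
// Hypothetical bookkeeping for the 3-approach strategy ceiling.
const MAX_DISTINCT_APPROACHES = 3;

type StrategyDecision =
  | { kind: "try"; approachesUsed: number }
  | { kind: "escalate"; knownLimitation: string };

class StrategyLedger {
  private readonly approaches = new Map<string, Set<string>>();

  // Repeating an approach already on record is a tactical retry (covered by
  // the separate retry budget) and does not use up a strategic slot.
  next(goal: string, approach: string): StrategyDecision {
    const seen = this.approaches.get(goal) ?? new Set<string>();
    if (!seen.has(approach) && seen.size >= MAX_DISTINCT_APPROACHES) {
      return {
        kind: "escalate",
        knownLimitation: `${goal}: stopped after ${seen.size} approaches (${Array.from(seen).join("; ")})`,
      };
    }
    seen.add(approach);
    this.approaches.set(goal, seen);
    return { kind: "try", approachesUsed: seen.size };
  }
}
```

The escalate branch returns a string that would go under "Known limitations" in the setup report. The agent writes that entry instead of starting a fourth approach.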

2. Destructive bash pre-emption

Motivating signal: Bash Policy denials peaked at ~80/day. src/lib/safety-scanner.ts already blocks rm -rf, git reset --hard, git push --force (but not --force-with-lease), git checkout ., broad git restore, git clean -f, and curl … | sh. A separate bash allowlist blocks install -g, publish, and sudo. The prompt never mentioned any of this. Agents discovered each deny rule by hitting it and paid a full retry cycle each time. Pre-emption names every blocked shape upfront.
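The blocked shapes can be approximated as a pre-flight check. This is a hypothetical sketch. The regexes are illustrative and are not the rules in src/lib/safety-scanner.ts. The source labels follow the Bugbot review below, which says install -g, publish, and sudo are caught by a separate bash allowlist rather than by the scanner.

```typescript
// Hypothetical pre-flight check. Patterns approximate the shapes the
// commandment names; they are NOT copied from src/lib/safety-scanner.ts.
type BlockSource = "safety-scanner" | "bash-allowlist";

interface BlockedShape {
  shape: string;
  pattern: RegExp;
  source: BlockSource;
}

const BLOCKED_SHAPES: readonly BlockedShape[] = [
  { shape: "rm -rf", pattern: /\brm\s+-[a-z]*(?:rf|fr)[a-z]*\b/i, source: "safety-scanner" },
  { shape: "git reset --hard", pattern: /\bgit\s+reset\s+--hard\b/, source: "safety-scanner" },
  // --force is blocked; --force-with-lease is intentionally allowed.
  { shape: "git push --force", pattern: /\bgit\s+push\b.*--force(?!-with-lease)/, source: "safety-scanner" },
  { shape: "git checkout .", pattern: /\bgit\s+checkout\s+\.(?:\s|$)/, source: "safety-scanner" },
  // "Broad" restore is approximated here as restoring ".".
  { shape: "git restore", pattern: /\bgit\s+restore\b.*\s\.(?:\s|$)/, source: "safety-scanner" },
  { shape: "git clean -f", pattern: /\bgit\s+clean\s+-[a-zA-Z]*f/, source: "safety-scanner" },
  { shape: "curl ... | sh", pattern: /\b(?:curl|wget)\b[^|]*\|\s*(?:ba|z)?sh\b/, source: "safety-scanner" },
  {
    shape: "install -g",
    pattern: /\b(?:npm\s+(?:install|i)|pnpm\s+add)\b.*\s(?:-g|--global)\b|\byarn\s+global\s+add\b/,
    source: "bash-allowlist",
  },
  { shape: "publish", pattern: /\b(?:npm|pnpm|yarn)\s+publish\b/, source: "bash-allowlist" },
  { shape: "sudo", pattern: /\bsudo\b/, source: "bash-allowlist" },
];

// Returns the first blocked shape a command matches, or undefined if none.
function preflight(command: string): BlockedShape | undefined {
  return BLOCKED_SHAPES.find((b) => b.pattern.test(command));
}
```

For example, preflight("git push --force-with-lease origin main") returns undefined. This matches the scanner's intentional allowance.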

3. Monorepo scope clamp

Motivating signal: Large-monorepo failure mode where 1-event runs turned into 30-file blast-radius PRs. The rule restricts default work to the install-dir subtree and forces a wizard_feedback confirmation hop when the install dir is itself a workspace root with ambiguous intent (single package vs whole workspace).
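The scope clamp can be sketched as a decision function. This sketch assumes POSIX paths. It also assumes a workspace root can be recognized by a pnpm-workspace.yaml file or a "workspaces" field in package.json. checkEdit, its return values, and the intentConfirmed flag are hypothetical. The PR implements the clamp as prompt text and uses wizard_feedback as the confirmation channel.

```typescript
import * as path from "node:path";

// Hypothetical model of the decision the scope clamp describes.
type ScopeDecision = "allow" | "out-of-scope" | "confirm-via-wizard_feedback";

interface PackageJsonLike {
  workspaces?: string[] | { packages?: string[] };
}

// Approximate workspace-root detection: pnpm-workspace.yaml, or a
// package.json that declares "workspaces" (npm / yarn).
function isWorkspaceRoot(rootFiles: ReadonlySet<string>, pkg?: PackageJsonLike): boolean {
  return rootFiles.has("pnpm-workspace.yaml") || pkg?.workspaces !== undefined;
}

// True when `target` (absolute, or relative to installDir) stays inside installDir.
function isInsideInstallDir(installDir: string, target: string): boolean {
  const root = path.posix.resolve(installDir);
  const rel = path.posix.relative(root, path.posix.resolve(root, target));
  return rel === "" || (rel !== ".." && !rel.startsWith("../") && !path.posix.isAbsolute(rel));
}

function checkEdit(
  installDir: string,
  target: string,
  installDirIsWorkspaceRoot: boolean,
  intentConfirmed: boolean,
): ScopeDecision {
  if (!isInsideInstallDir(installDir, target)) return "out-of-scope";
  // A workspace-root install dir is ambiguous (one package vs. the whole
  // workspace), so confirm intent via wizard_feedback before editing.
  if (installDirIsWorkspaceRoot && !intentConfirmed) return "confirm-via-wizard_feedback";
  return "allow";
}
```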

4. Supplement skill contradiction fix (deduplication)

post-instrumentation-events-and-dashboard.md still said "create 4–6 charts and a dashboard via the Amplitude MCP, then call record_dashboard" — directly contradicting the universal commandment forbidding chart/dashboard tools in this run (DEFER_DASHBOARD_PLAN PR 4). Reconciled.

5. Setup Report receipt quality

setup-report-requirements.md now demands a structured Files Changed table with +/- line counts and an Events table with file + line metadata, lifting the receipts ledger above narrative prose.
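Here is one way to render the two receipts tables. It assumes markdown output and uses the hypothetical FileChange and TrackedEvent shapes defined below. setup-report-requirements.md is not quoted in this PR, so the column names are illustrative.

```typescript
// Hypothetical record shapes; the requirements doc's exact schema is not shown here.
interface FileChange {
  path: string;
  added: number;
  removed: number;
}

interface TrackedEvent {
  name: string;
  file: string;
  line: number;
  properties: string[];
}

// Files Changed receipt: one row per file with +/- line counts.
function filesChangedTable(changes: FileChange[]): string {
  const rows = changes.map((c) => `| ${c.path} | +${c.added} | -${c.removed} |`);
  return ["| File | + | - |", "| --- | ---: | ---: |", ...rows].join("\n");
}

// Events receipt: one row per track() call with file:line and properties.
function eventsTable(events: TrackedEvent[]): string {
  const rows = events.map(
    (e) => `| ${e.name} | ${e.file}:${e.line} | ${e.properties.join(", ") || "(none)"} |`,
  );
  return ["| Event | Location | Properties |", "| --- | --- | --- |", ...rows].join("\n");
}
```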

Token delta

Assembled commandments (per-turn, prompt-cached):

Mode	Before	After	Δ
Universal (mobile/server/generic)	~4972 tok	~5405 tok	+433 (+8.7%)
Browser (Next.js, Vue, React Router, …)	~6160 tok	~6593 tok	+433 (+7.0%)

Each added rule trades ~150 cached tokens per turn for one avoided retry/loop cycle (1500–5000 uncached tokens per loop). The trade is net positive at expected occurrence rates. The Bash Policy denials alone (~80/day × ~3000 tokens per avoidance, roughly 240k tokens/day) dwarf the per-turn overhead.
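Here is the arithmetic behind the table and the break-even claim. It uses only the PR's own figures: 433 tokens of added prompt per turn, 1500–5000 tokens per avoided loop, and ~80 denials/day at ~3000 tokens each. It compares raw token counts and ignores the cache-read discount, so it understates the case for the cached overhead.

```typescript
// Percent change rounded to one decimal place, as in the table.
function tokenDelta(before: number, after: number): { abs: number; pct: number } {
  const abs = after - before;
  return { abs, pct: Math.round((abs / before) * 1000) / 10 };
}

const universal = tokenDelta(4972, 5405); // { abs: 433, pct: 8.7 }
const browser = tokenDelta(6160, 6593); // { abs: 433, pct: 7 }

// How many turns of the added prompt one avoided loop pays for (raw tokens).
function turnsCoveredPerLoop(loopTokens: number, perTurnOverhead: number): number {
  return Math.floor(loopTokens / perTurnOverhead);
}

const low = turnsCoveredPerLoop(1500, universal.abs); // 3
const high = turnsCoveredPerLoop(5000, universal.abs); // 11

// Daily tokens avoided from Bash Policy denials alone, per the PR's estimates.
const dailyDenySavings = 80 * 3000; // 240000
```

In raw tokens, one avoided loop covers roughly 3 to 11 turns of the added prompt.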

Regression test

The "scale + safety guardrails commandments" tests in src/lib/__tests__/commandments.test.ts pin:

Strategy cap text + Known limitations escalation sentinel
The destructive-bash shapes by exact substring (scanner rules from safety-scanner.ts, plus the install -g, publish, and sudo shapes that the separate bash allowlist blocks)
Monorepo scope language + the wizard_feedback escalation hop

3 new tests, 38 total in the file (was 35). All 4294 tests pass on the branch.

Top 3 deferred improvements

Consolidate the inline prompt in src/frameworks/generic/generic-wizard-agent.ts (11KB). It overlaps heavily with both the commandments and the browser-sdk-init-defaults supplement reference. This would save ~1KB on the generic-fallback path. It needs a careful pass to preserve the CSP / Netlify-redirect guidance specific to static sites.
Move the browser SDK init-defaults supplement (4156 bytes) to JIT-only loading. It is currently pre-staged on every browser run, but only the init phase reads it. This would save ~1KB on the pre-staged menu but requires changes to the staging pipeline.
Add a central glossary for the tool descriptions in wizard-tools.ts. Several descriptions repeat phrasing the commandments already establish (e.g. check_env_keys vs. the universal "never use Bash to verify env vars" rule). Small gain (~100 tokens) for high test-coverage churn.

Test plan

pnpm tsc --noEmit clean
pnpm lint clean
pnpm test (4294 tests, all passing)
src/utils/wizard-abort.ts untouched
src/ui/tui/ untouched
Worktree at /tmp/prompt-audit will be removed after the PR is opened
Confirm token-delta math against a live --agent mode dry-run on a sample Next.js repo
Confirm Bash Policy denies rate drops in production telemetry over the week after merge

🤖 Generated with Claude Code

Note

Medium Risk
Medium risk because it changes the wizard’s core prompt/guardrails and adds strict test sentinels, which can alter agent behavior across all runs and cause brittle test failures on copy edits.

Overview
Tightens the wizard’s universal commandment prompt with three new guardrails: a 3-approach strategy retry cap that forces escalation to a setup report Known limitations section, explicit pre-emption of destructive bash command shapes, and a monorepo scope clamp that defaults work to the install-directory subtree and requires wizard_feedback before cross-workspace edits.

Updates the prompt supplement to clarify .amplitude/events.json ownership/shape (optionally adding a per-event file pointer) and to defer all chart/dashboard work to the separate amplitude-wizard dashboard flow, including removing dashboard link expectations from the setup report.
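Here is a hedged sketch of validating .amplitude/events.json with the optional per-event file pointer. The top-level array and the name/description fields are assumptions. The PR only says that the supplement clarifies ownership and shape and may add a per-event file pointer.

```typescript
// Hypothetical entry shape; only the optional file pointer is grounded in the PR.
interface PlannedEvent {
  name: string;
  description?: string;
  file?: string; // optional pointer to the file that instruments this event
}

function parseEventsJson(raw: string): PlannedEvent[] {
  const data: unknown = JSON.parse(raw);
  if (!Array.isArray(data)) throw new Error("events.json: expected an array of events");
  return data.map((entry: unknown, i: number) => {
    if (typeof entry !== "object" || entry === null) {
      throw new Error(`events.json[${i}]: expected an object`);
    }
    const { name, description, file } = entry as Record<string, unknown>;
    if (typeof name !== "string" || name.length === 0) {
      throw new Error(`events.json[${i}]: "name" must be a non-empty string`);
    }
    if (description !== undefined && typeof description !== "string") {
      throw new Error(`events.json[${i}]: "description" must be a string when present`);
    }
    if (file !== undefined && typeof file !== "string") {
      throw new Error(`events.json[${i}]: "file" must be a string when present`);
    }
    return { name, description, file };
  });
}
```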

Strengthens setup report requirements by mandating receipts-style tables (files changed with +/- line counts; per-track() call event table with file+line+properties), and adds regression tests that pin the new guardrail language to prevent drift.

Reviewed by Cursor Bugbot for commit c4f6e6f. Bugbot is set up for automated code reviews on this repo.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix prepared fixes for both issues found in the latest run.

✅ Fixed: Commandment falsely claims --force-with-lease is scanner-blocked
- Removed `/ --force-with-lease` from the commandment's blocked list, since the safety scanner intentionally allows it via a negative lookahead.
✅ Fixed: Test misses three safety-scanner rules it claims to cover
- Added the three missing scanner shapes (git checkout ., git restore, git clean -f) and split the test arrays into scannerShapes and allowlistShapes with an accurate comment.

Or push these changes by commenting:

@cursor push 7da65a036c

Preview (7da65a036c)

diff --git a/src/lib/__tests__/commandments.test.ts b/src/lib/__tests__/commandments.test.ts
--- a/src/lib/__tests__/commandments.test.ts
+++ b/src/lib/__tests__/commandments.test.ts
@@ -615,18 +615,20 @@
 
   it('pre-empts destructive bash commands by name', () => {
     // The agent should learn about these from the prompt, not by tripping
-    // safety-scanner.ts and burning a retry cycle. Each pattern matches a
-    // rule in `src/lib/safety-scanner.ts`.
-    const blockedShapes = [
+    // safety-scanner.ts and burning a retry cycle. Each pattern in
+    // `scannerShapes` matches a rule in `src/lib/safety-scanner.ts`;
+    // `allowlistShapes` are blocked by the bash allowlist, not the scanner.
+    const scannerShapes = [
       'rm -rf',
       'git reset --hard',
       'git push --force',
+      'git checkout .',
+      'git restore',
+      'git clean -f',
       'curl ... | sh',
-      'install -g',
-      'publish',
-      'sudo',
     ];
-    for (const shape of blockedShapes) {
+    const allowlistShapes = ['install -g', 'publish', 'sudo'];
+    for (const shape of [...scannerShapes, ...allowlistShapes]) {
       expect(
         text,
         `commandments should pre-empt "${shape}" so the agent never burns a retry discovering it's blocked.`,

diff --git a/src/lib/commandments.ts b/src/lib/commandments.ts
--- a/src/lib/commandments.ts
+++ b/src/lib/commandments.ts
@@ -71,7 +71,7 @@
 
   // Motivated by Bash Policy denies at ~80/day peak. These commands are
   // hard-blocked by safety-scanner.ts; pre-empting saves a retry cycle.
-  'NEVER attempt these destructive bash commands — they are pre-blocked by the safety scanner and no rephrasing changes the outcome: `rm -rf` (any form), `git reset --hard`, `git push --force` / `--force-with-lease`, `git checkout .` / broad `git restore`, `git clean -f`, `curl ... | sh` / `wget ... | bash` (any pipe-to-shell), `npm install -g` / `pnpm add -g` / `yarn global add`, `npm publish` / `pnpm publish` / `yarn publish`, `sudo` (anything). The wizard never needs these. If a workflow seems to require one, the workflow itself is wrong — note it in the setup report and proceed without.',
+  'NEVER attempt these destructive bash commands — they are pre-blocked by the safety scanner and no rephrasing changes the outcome: `rm -rf` (any form), `git reset --hard`, `git push --force`, `git checkout .` / broad `git restore`, `git clean -f`, `curl ... | sh` / `wget ... | bash` (any pipe-to-shell), `npm install -g` / `pnpm add -g` / `yarn global add`, `npm publish` / `pnpm publish` / `yarn publish`, `sudo` (anything). The wizard never needs these. If a workflow seems to require one, the workflow itself is wrong — note it in the setup report and proceed without.',
 
   'When a wizard tool returns a structured error payload (`{"success": false, "error": ..., "guidance": ..., "suggestedTool": ..., "suggestedArgs": ..., "context": ...}`), READ the `guidance` field and follow it. If `suggestedTool` / `suggestedArgs` are present, call THAT tool with THOSE args next — do NOT retry the failing tool with the same args. The same shape comes back for PreToolUse denials (Bash policy, denied paths, denied event-plan / dashboard writes). Treating structured errors as recovery instructions is the difference between a 1-turn fix and a 5-turn loop that trips the consecutive-deny circuit breaker.',
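The structured-error commandment in the diff context above describes a small recovery protocol. The following is a minimal sketch of acting on it. The field names come from the commandment text. The field types, NextStep, and nextStepAfter are hypothetical.

```typescript
// Field names come from the commandment; their types here are assumptions.
interface WizardToolError {
  success: false;
  error: string;
  guidance?: string;
  suggestedTool?: string;
  suggestedArgs?: Record<string, unknown>;
  context?: unknown;
}

type NextStep =
  | { kind: "call"; tool: string; args: Record<string, unknown> }
  | { kind: "follow-guidance"; guidance: string }
  | { kind: "stop"; reason: string };

function isWizardToolError(value: unknown): value is WizardToolError {
  if (typeof value !== "object" || value === null) return false;
  const v = value as { success?: unknown; error?: unknown };
  return v.success === false && typeof v.error === "string";
}

// Never retry the failing tool with the same args: prefer the suggested
// call, then the guidance text, otherwise stop and report the error.
function nextStepAfter(payload: WizardToolError): NextStep {
  if (payload.suggestedTool) {
    return { kind: "call", tool: payload.suggestedTool, args: payload.suggestedArgs ?? {} };
  }
  if (payload.guidance) {
    return { kind: "follow-guidance", guidance: payload.guidance };
  }
  return { kind: "stop", reason: payload.error };
}
```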


Reviewed by Cursor Bugbot for commit 957305b.

cursor · 2026-05-15T22:39:34Z

+
+  // Motivated by Bash Policy denies at ~80/day peak. These commands are
+  // hard-blocked by safety-scanner.ts; pre-empting saves a retry cycle.
+  'NEVER attempt these destructive bash commands — they are pre-blocked by the safety scanner and no rephrasing changes the outcome: `rm -rf` (any form), `git reset --hard`, `git push --force` / `--force-with-lease`, `git checkout .` / broad `git restore`, `git clean -f`, `curl ... | sh` / `wget ... | bash` (any pipe-to-shell), `npm install -g` / `pnpm add -g` / `yarn global add`, `npm publish` / `pnpm publish` / `yarn publish`, `sudo` (anything). The wizard never needs these. If a workflow seems to require one, the workflow itself is wrong — note it in the setup report and proceed without.',


Commandment falsely claims --force-with-lease is scanner-blocked

Medium Severity

The destructive-bash commandment lists `git push --force` / `--force-with-lease` as "pre-blocked by the safety scanner and no rephrasing changes the outcome." However, safety-scanner.ts explicitly allows --force-with-lease via a (?!-with-lease) negative lookahead, and the scanner's own deny message suggests it as a safer alternative. This creates a bidirectional contradiction: the prompt claims something is hard-blocked that is in fact intentionally permitted, and if the agent ever encounters the scanner's suggestion to use --force-with-lease, the prompt has already told it that's also impossible. The / separator in the listing unambiguously reads as "both variants are blocked," matching the pattern used elsewhere in the same sentence.
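The contradiction the review describes can be caught mechanically. Every command the prompt calls blocked should actually match a scanner rule. The sketch below shows this check. FORCE_PUSH is a hypothetical stand-in for the scanner's (?!-with-lease) rule, and the real pattern in safety-scanner.ts may differ.

```typescript
// Hypothetical stand-in for the scanner's force-push rule.
const FORCE_PUSH = /\bgit\s+push\b.*--force(?!-with-lease)/;

// Returns every command the prompt claims is blocked that no rule rejects.
function falselyClaimedBlocked(claimed: readonly string[], rules: readonly RegExp[]): string[] {
  return claimed.filter((cmd) => !rules.some((rule) => rule.test(cmd)));
}

const mismatches = falselyClaimedBlocked(
  ["git push --force", "git push --force-with-lease"],
  [FORCE_PUSH],
);
```

Here mismatches contains only "git push --force-with-lease". That is the entry the original commandment wrongly listed as pre-blocked.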


cursor · 2026-05-15T22:39:34Z

+        text,
+        `commandments should pre-empt "${shape}" so the agent never burns a retry discovering it's blocked.`,
+      ).toContain(shape);
+    }


Test misses three safety-scanner rules it claims to cover

Low Severity

The test comment says "Each pattern matches a rule in src/lib/safety-scanner.ts" but the blockedShapes array omits three actual safety-scanner rules — git checkout (broad), git restore (broad), and git clean -f — while including three shapes (install -g, publish, sudo) that have no corresponding safety-scanner rule (they're blocked by a separate bash allowlist). Future edits removing those scanner-matched shapes from the commandments won't be caught by this test.


…uality

Three load-bearing rules added to the wizard's always-on commandments, plus a contradiction fix in the supplement skill that was causing agents to attempt dashboard work that the universal rule forbids.

Motivating dashboard signals:

1. Strategy retry cap (3-approach ceiling)
Closes the ~5pt completion→activation gap. Agents finish nominally but have looped on the same broken approach. Caps strategic retry at three attempts per goal and forces escalation into the setup report's "Known limitations" section instead of silent looping.

2. Destructive bash pre-emption
Bash Policy denies peaked ~80/day. The safety scanner already blocks `rm -rf`, `git reset --hard`, `git push --force`, `curl … | sh`, `install -g`, `publish`, and `sudo`, but the prompt never said so, so agents discovered the deny rule by hitting it. Pre-emption saves ~one full retry cycle per occurrence.

3. Monorepo scope clamp
Large monorepos saw cross-package edits the user never asked for. The rule restricts default work to the install-dir subtree, and forces a `wizard_feedback` confirmation when the install dir is itself a workspace root with ambiguous intent.

Also (deduplication / contradiction fix):

4. Supplement skill no longer instructs agents to create dashboards
The `post-instrumentation-events-and-dashboard.md` reference still said "create 4–6 charts and a dashboard via the Amplitude MCP, then call `record_dashboard`", directly contradicting the universal commandment forbidding chart/dashboard tools in this run (DEFER_DASHBOARD_PLAN PR 4). Reconciled to match.

5. Setup Report requirements now demand a structured Files Changed table with +/- line counts and an Events table with file + line metadata, raising receipt quality.

Token delta (assembled commandments, per-turn cached):
Universal: 4972 → 5405 tokens (+433, +8.7%)
Browser: 6160 → 6593 tokens (+433, +7.0%)

Each added rule trades ~150 cached tokens for one avoided retry/loop cycle (1500–5000 tokens). Net positive at expected occurrence rates.

Regression test: `scale + safety guardrails commandments` in `src/lib/__tests__/commandments.test.ts` pins all three new sentinels (strategy cap, destructive-bash list, monorepo scope language).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kelsonpw · 2026-05-16T00:05:46Z

@cursor push 7da65a0

cursor · 2026-05-16T00:05:51Z

Could not push Autofix changes. The PR branch may have changed since the Autofix ran, or the Autofix commit may no longer exist.

kelsonpw requested a review from a team as a code owner May 15, 2026 22:33

cursor Bot reviewed May 15, 2026


kelsonpw force-pushed the perf/agent-prompts branch from 957305b to c4f6e6f May 15, 2026 23:54

kelsonpw mentioned this pull request May 17, 2026

refactor: dedupe framework configs via framework-shared #810

Merged

5 tasks

kelsonpw removed the request for review from a team May 18, 2026 18:04

kelsonpw merged commit 5dcc524 into main May 22, 2026
14 checks passed

amplitude-release-bot Bot mentioned this pull request May 22, 2026

chore(main): release wizard 1.18.0 #694

Open

kelsonpw merged 1 commit into main from perf/agent-prompts
