Session permanently wedged by "Invalid \`signature\` in \`thinking\` block" CAPIError after background sub-agent completes — no auto-recovery, no rewind affordance #3407

## Describe the bug

A long-running session became **permanently wedged** with three identical `CAPIError: 400` failures carrying the message ``messages.5.content.2: Invalid `signature` in `thinking` block``. Each one was raised by the parent turn that started within ~1 second of a background research sub-agent completing. After the first failure, every subsequent attempt to `continue` produced the same error against the same `messages.5.content.2` slot, because the bad thinking block sits in the conversation history and is re-sent on every retry. The session had to be abandoned: there is no in-product way to repair, prune, or rewind past the corrupted thinking block, and no way for the user to see *which* message is the corrupted one.

The trigger pattern is consistent across the events log: every error follows, by roughly 4.5–6.5 s, a `subagent.completed` for a `research` sub-agent and the corresponding `system.notification` that splices the completion into the parent loop. The parent then issues a brand-new `assistant.turn_start` → `assistant.turn_end` (very fast, ~4–5.5 s, suggesting the API call returned an error rather than streaming a real response) → `session.error`.

The parent agent was running `claude-opus-4.7-1m-internal`; the sub-agents were `research` agents (the most recent one ran on `claude-opus-4.6-1m`). One plausible root cause is that thinking-block signatures issued for one Anthropic model are being kept in the parent's history when a sub-agent on a *different* model is integrated — Anthropic's API ties thinking-block signatures to the specific (model, turn) that produced them, and rejects the request if they are presented out of context.
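If that hypothesis holds, one possible guard is to record which model produced each assistant message and to drop signed thinking blocks from any other model when serializing a request. The sketch below is illustrative only: `StoredMessage`, its `produced_by` field, and `serialize_for` are hypothetical names, not the CLI's actual data model, and I have not verified that the API accepts the filtered history in every case (for example in the middle of a tool-use loop).

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

THINKING_TYPES = ("thinking", "redacted_thinking")


@dataclass
class StoredMessage:
    role: str
    content: List[Dict[str, Any]]
    # Hypothetical bookkeeping field: the model that generated this message.
    produced_by: Optional[str] = None


def serialize_for(model: str, history: List[StoredMessage]) -> List[Dict[str, Any]]:
    """Build a request payload, dropping thinking blocks signed by other models."""
    payload = []
    for msg in history:
        # Only assistant messages carry signed thinking blocks.
        foreign = msg.role == "assistant" and msg.produced_by != model
        content = [
            block
            for block in msg.content
            if not (foreign and block.get("type") in THINKING_TYPES)
        ]
        # Skip messages left empty by the filter. A real implementation would
        # also need to keep the user/assistant alternation valid afterwards.
        if content:
            payload.append({"role": msg.role, "content": content})
    return payload
```

Calling `serialize_for("claude-opus-4.7-1m-internal", history)` would keep the parent's own thinking blocks and strip any that were recorded as coming from `claude-opus-4.6-1m`.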

What makes this worse than a normal transient API error:

- **No automatic recovery.** The CLI surfaces the error verbatim and stops the turn, but does not strip / regenerate / quarantine the offending thinking block, so the *next* user prompt re-sends the same poisoned history and gets the same 400.
- **No user-visible repair affordance.** There is no `/rewind` to a known-good turn, no "drop the corrupted thinking block" option, no diagnostic telling the user "your conversation history has been wedged by a stale thinking-block signature; start a new session".
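- **The natural in-session fallbacks do not help.** Sending `continue` again, retrying the prompt, or sending a new prompt all fail as long as the bad block stays in the cached history. Killing the process and `/resume`-ing eventually worked for me, but apparently only by luck (see Workaround).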

## Affected version

```
GitHub Copilot CLI 1.0.49
```

The stack trace below was captured under `1.0.48` (`app.js`):

```
file:///home/jaytau/.copilot/pkg/universal/1.0.48/app.js:1254:1046  t.fromAPIError
file:///home/jaytau/.copilot/pkg/universal/1.0.48/app.js:3439:15527 vmt.getCompletionWithTools
file:///home/jaytau/.copilot/pkg/universal/1.0.48/app.js:3472:2751  O3e.getCompletionWithTools
file:///home/jaytau/.copilot/pkg/universal/1.0.48/app.js:4483:4797  t.runAgenticLoop
file:///home/jaytau/.copilot/pkg/universal/1.0.48/app.js:4481:12744 t.processQueuedItems
file:///home/jaytau/.copilot/pkg/universal/1.0.48/app.js:4481:3688  t.processQueue
file:///home/jaytau/.copilot/pkg/universal/1.0.48/app.js:4479:4392  t.send
```

(The session was started under 1.0.48 and the binary auto-updated to 1.0.49 between the failure and the time of this report; the failing code path is the same.)

## Steps to reproduce the behavior

I can't synthetically reproduce this on demand yet, but the repro signature from the events log is:

1. Open a long session on `claude-opus-4.7-1m-internal`.
2. Launch one or more background `research` sub-agents that run on a *different* Claude model (in my case `claude-opus-4.6-1m`) — e.g. via the `task` tool with `mode: "background"` and `agent_type: "research"`.
3. Continue working in the parent session while the sub-agents are running (turns that produce thinking blocks).
4. Let one or more of the sub-agents complete *between* parent turns, so the completion notification is delivered while the parent is preparing its next turn.
5. The next parent turn issues an API request whose serialized history has a thinking block at some `messages.N.content.M` with a `signature` value the API no longer accepts → `400 invalid_request_error: Invalid signature in thinking block`.
6. Every subsequent `continue` (or any user prompt) reproduces the *identical* error against the same `messages.N.content.M` until you abandon the session.

In my case three different `research` sub-agents (`tier1-r5-docs`, `tier1-r5-code`, `tier1-r5-tax-law`) each triggered the same failure when they reported back into the parent loop:

```
2026-05-18T21:21:00.502Z  subagent.completed     tier1-r5-docs
2026-05-18T21:21:00.521Z  system.notification    Agent "tier1-r5-docs" (research) has completed successfully…
2026-05-18T21:21:00.859Z  assistant.turn_start
2026-05-18T21:21:05.107Z  assistant.turn_end
2026-05-18T21:21:05.163Z  session.error          CAPIError: 400 … messages.5.content.2: Invalid `signature` in `thinking` block  (request_id: req_011CbAjZdRJmTT961kRWg8Ws)

2026-05-18T21:21:24.222Z  subagent.completed     tier1-r5-code
2026-05-18T21:21:24.237Z  system.notification    Agent "tier1-r5-code" (research) has completed successfully…
2026-05-18T21:21:24.593Z  assistant.turn_start
2026-05-18T21:21:28.665Z  assistant.turn_end
2026-05-18T21:21:28.723Z  session.error          CAPIError: 400 … messages.5.content.2: Invalid `signature` in `thinking` block  (request_id: req_011CbAjbNMziS4DdrkguBbBG)

2026-05-18T22:10:03.836Z  subagent.completed     tier1-r5-tax-law
2026-05-18T22:10:03.856Z  system.notification    Agent "tier1-r5-tax-law" (research) has completed successfully…
2026-05-18T22:10:04.623Z  assistant.turn_start
2026-05-18T22:10:10.120Z  assistant.turn_end
2026-05-18T22:10:10.178Z  session.error          CAPIError: 400 … messages.5.content.2: Invalid `signature` in `thinking` block  (request_id: req_vrtx_011CbAoJhWv6GbNY7j8vdC93)
```

All three errors point at the **same** `messages.5.content.2` slot — i.e. a single corrupted thinking block at a fixed position in the cached history is wedging every retry.
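To check this claim against a longer events log, a small script along these lines can pair each `session.error` with the preceding `subagent.completed` and report the delay and the rejected slot. It assumes the whitespace-separated `timestamp  event  detail` layout of the excerpt above; `summarize` and `parse_ts` are names I made up for the sketch.

```python
import re
from datetime import datetime
from typing import List, Optional, Tuple

SLOT_RE = re.compile(r"messages\.(\d+)\.content\.(\d+)")


def parse_ts(ts: str) -> datetime:
    # Timestamps in the excerpt look like 2026-05-18T21:21:00.502Z.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def summarize(log_text: str) -> List[Tuple[str, float, str]]:
    """Return (sub_agent, seconds from completion to error, slot) per failure."""
    failures = []
    pending: Optional[Tuple[str, datetime]] = None  # latest completed sub-agent
    for line in log_text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue  # blank or malformed line
        ts, event = parts[0], parts[1]
        detail = parts[2] if len(parts) > 2 else ""
        if event == "subagent.completed":
            pending = (detail.strip(), parse_ts(ts))
        elif event == "session.error" and pending is not None:
            match = SLOT_RE.search(detail)
            slot = match.group(0) if match else "unknown"
            delay = (parse_ts(ts) - pending[1]).total_seconds()
            failures.append((pending[0], round(delay, 3), slot))
            pending = None
    return failures
```

If every row reports the same slot, that supports the reading that one fixed block in the history is being rejected rather than a fresh block on each turn.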

The error messages as they appeared in the TUI were exactly:

```
✗ Execution failed: CAPIError: 400 {"type":"error","error":{"type":"invalid_request_error","message":"messages.5.content.2: Invalid `signature` in `thinking` block"},"request_id":"req_011CbAjZdRJmTT961kRWg8Ws"} (Request ID: 603E:D8F5A:C58A2B:D7672B:6A0B82BF)
● Background agent "Tier-1 round-5 code review" (research) completed
  └ You are a Tier-1 reviewer for PR #132 of github.com/jay-tau/ibkr-fa, a Python…
✗ Execution failed: CAPIError: 400 {"type":"error","error":{"type":"invalid_request_error","message":"messages.5.content.2: Invalid `signature` in `thinking` block"},"request_id":"req_011CbAjbNMziS4DdrkguBbBG"} (Request ID: 603E:D8F5A:C605AF:D7EF6C:6A0B82D6)
● Background agent "Tier-1 round-5 tax-law review" (research) completed
  └ You are a Tier-1 reviewer for PR #132 of github.com/jay-tau/ibkr-fa, a Python…
✗ Execution failed: CAPIError: 400 {"type":"error","error":{"type":"invalid_request_error","message":"messages.5.content.2: Invalid `signature` in `thinking` block"},"request_id":"req_vrtx_011CbAoJhWv6GbNY7j8vdC93"} (Request ID: 08FF:29E602:F6CB52:10E625B:6A0B8E3F)
```
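Everything needed to pinpoint the bad block is already in these lines. As a rough sketch of what the CLI (or a triage script) could extract from one of them, assuming the `CAPIError: <status> <json> (Request ID: <id>)` shape shown above stays stable; `ParsedError` and `parse_error_line` are hypothetical names:

```python
import json
import re
from dataclasses import dataclass
from typing import Optional

LINE_RE = re.compile(r"CAPIError: (\d{3}) (\{.*\}) \(Request ID: ([0-9A-Fa-f:]+)\)")
SLOT_RE = re.compile(r"messages\.(\d+)\.content\.(\d+)")


@dataclass
class ParsedError:
    status: int
    error_type: str
    message: str
    provider_request_id: str
    capi_request_id: str
    message_index: Optional[int]  # N in messages.N.content.M, if present
    content_index: Optional[int]  # M in messages.N.content.M, if present


def parse_error_line(line: str) -> Optional[ParsedError]:
    """Parse one TUI error line; return None if it does not match the shape."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    body = json.loads(match.group(2))
    error = body.get("error", {})
    message = error.get("message", "")
    slot = SLOT_RE.search(message)
    return ParsedError(
        status=int(match.group(1)),
        error_type=error.get("type", ""),
        message=message,
        provider_request_id=body.get("request_id", ""),
        capi_request_id=match.group(3),
        message_index=int(slot.group(1)) if slot else None,
        content_index=int(slot.group(2)) if slot else None,
    )
```

With `message_index` and `content_index` in hand, the TUI could at least tell the user which message in the history was rejected instead of printing the raw payload.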

## Expected behavior

The CLI should **not let a single corrupted thinking-block signature permanently brick a session.** It should do at least one, and ideally several, of the following:

1. **Detect and recover from `invalid_request_error: Invalid signature in thinking block` automatically** by stripping the offending thinking block from the cached history and retrying. As I understand Anthropic's extended-thinking documentation, thinking blocks from earlier assistant turns may be omitted from a follow-up request; the main exception is an in-progress tool-use loop, where the latest assistant turn's thinking blocks must be sent back unmodified. (`redacted_thinking` blocks are issued by the API itself, so the client cannot convert a bad block into one.)
2. **Guarantee thinking-block signatures stay paired with the model that produced them.** When a sub-agent runs under a different Claude model than the parent (e.g. `opus-4.6` sub-agent under an `opus-4.7` parent), the integration step should *not* leave any of the sub-agent's signed thinking blocks in the parent's serialized history. (And vice-versa — the parent's signed thinking blocks must not leak into the sub-agent's request.)
3. **Surface a user-actionable error**, not a verbatim CAPI dump. Something like: *"This session's conversation history was rejected by the model (corrupted reasoning signature at message 5). I've removed the corrupted block; please try again."* — or, if recovery isn't possible, *"…please run `/rewind` to roll back to turn N, or `/new` to start fresh."*
4. **Expose `/rewind` (or similar) as a recovery option in the error message itself**, since today the in-session options the user can guess at (`continue`, sending the same prompt again, sending a new prompt) all fail identically because they all re-send the same poisoned history, and killing the process to `/resume` is neither obvious nor reliable.
5. **Telemetry**: a session-level counter / health indicator that flips when `400 invalid_request_error` is encountered, so the TUI can render a "session corrupted — start fresh" affordance instead of looking indistinguishable from a working session.
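As a rough illustration of item 1, the recovery loop could look something like the sketch below. Everything here is hypothetical: `ApiError` stands in for the CLI's `CAPIError`, `send` is whatever function issues the request, and the code assumes that the `messages.N.content.M` indices in the error map one-to-one onto the client's history list. Whether the API accepts the stripped history will depend on where the block sits (see the tool-use caveat in item 1), so a real fix would need a fallback when the retry also fails.

```python
import copy
import re
from typing import Any, Callable, Dict, List, Tuple

Message = Dict[str, Any]
SIG_RE = re.compile(
    r"messages\.(\d+)\.content\.(\d+): Invalid `signature` in `thinking` block"
)


class ApiError(Exception):
    """Stand-in for the CLI's CAPIError; hypothetical, for illustration only."""

    def __init__(self, status: int, message: str) -> None:
        super().__init__(message)
        self.status = status
        self.message = message


def send_with_signature_recovery(
    send: Callable[[List[Message]], Any],
    history: List[Message],
    max_repairs: int = 3,
) -> Tuple[Any, List[Message]]:
    """Send `history`; on a bad-signature 400, drop that block and retry.

    Returns the response plus the repaired history, which the caller should
    persist so later turns do not re-send the rejected block.
    """
    messages = copy.deepcopy(history)  # never mutate the caller's copy
    repairs = 0
    while True:
        try:
            return send(messages), messages
        except ApiError as err:
            match = SIG_RE.search(err.message) if err.status == 400 else None
            if match is None or repairs >= max_repairs:
                raise  # unrelated error, or repair budget exhausted
            msg_i, block_i = int(match.group(1)), int(match.group(2))
            content = messages[msg_i].get("content") if msg_i < len(messages) else None
            if (
                not isinstance(content, list)
                or block_i >= len(content)
                or content[block_i].get("type") != "thinking"
            ):
                raise  # indices do not point at a thinking block; give up
            del content[block_i]
            repairs += 1
```

Bounding the number of repairs keeps a history with many bad blocks from turning into an unbounded retry loop, and returning the repaired history makes the fix stick for subsequent turns.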

## Additional context

- **OS**: Linux
- **Parent model**: `claude-opus-4.7-1m-internal`
- **Sub-agent model**: `claude-opus-4.6-1m` (running as a `research` sub-agent)
- **Workspace**: `/home/jaytau/temp/ibkr-fa` (commit `8923d5fe`)
- **Session ID**: `96bf5b50-79d8-4de9-b01a-8b57e95b3eaf`
- **Provider request IDs** (one per failure, in order):
  - `req_011CbAjZdRJmTT961kRWg8Ws` → CAPI Request ID `603E:D8F5A:C58A2B:D7672B:6A0B82BF`
  - `req_011CbAjbNMziS4DdrkguBbBG` → CAPI Request ID `603E:D8F5A:C605AF:D7EF6C:6A0B82D6`
  - `req_vrtx_011CbAoJhWv6GbNY7j8vdC93` → CAPI Request ID `08FF:29E602:F6CB52:10E625B:6A0B8E3F`
- The session was using background `research` sub-agents extensively (a 5-round multi-tier code/docs/tax-law review workflow), so triggering this required nothing exotic: just sustained parallel `task(mode: "background", agent_type: "research")` usage from a parent agent juggling several lines of work at once.

### Workaround

The only workaround I found was to:

1. End the wedged `copilot` process.
2. `/resume` the session. The corrupted thinking block from the assistant's last turn was presumably still in the history, but the very next `continue` after the resume *happened to* succeed in my case. Possible explanations: the resume path re-serializes the history slightly differently, or the sub-agent results that were causing the conflict had by then been fully consumed and dropped from the live cache.

No in-CLI recovery (`continue`, retrying the same prompt, sending a new prompt, switching models with `/model`) helped before the resume.

### Related (not duplicates)

- #1274 — "CLI constantly getting 400 errors for invalid request body" (also a 400, but for a GPT-5 reasoning shape on a code-review prompt, not Anthropic thinking-block signatures from sub-agent integration).
- #3371 — "CLI silently hangs on stalled HTTPS sockets…" (different class — network-layer hang with no error; this one *does* surface an error but offers no recovery).

