Session permanently wedged by "Invalid `signature` in `thinking` block" CAPIError after background sub-agent completes: no auto-recovery, no rewind affordance #3407

@jay-tau

Describe the bug

A long-running session became permanently wedged with three identical failures of the form CAPIError: 400, "messages.5.content.2: Invalid `signature` in `thinking` block", each one raised by a turn that started within ~1 second of a background research sub-agent completing. After the first failure, every subsequent attempt to `continue` produced the same error against the same `messages.5.content.2` slot, because the bad thinking block sits permanently in the conversation history and is re-sent on every retry. The session had to be abandoned: there is no in-product way to repair, prune, or rewind past the corrupted thinking block, and no way for the user to see which message is the corrupted one.

The trigger pattern is very consistent in the events log: every error follows (within about 6 s) a subagent.completed for a research sub-agent and the corresponding system.notification that splices the completion into the parent loop. The parent then issues a brand-new assistant.turn_start → assistant.turn_end (very fast, ~4–5.5 s, meaning the API call returned an error rather than streaming a real response) → session.error.

The parent agent was running claude-opus-4.7-1m-internal; the sub-agents were research agents (the most recent one ran on claude-opus-4.6-1m). One plausible root cause is that thinking-block signatures issued for one Anthropic model are being kept in the parent's history when a sub-agent on a different model is integrated — Anthropic's API ties thinking-block signatures to the specific (model, turn) that produced them, and rejects the request if they are presented out of context.
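
One way to picture this hypothesis is a filter that runs before the history is serialized. The sketch below assumes, purely for illustration, that the client records which model produced each assistant message in a `produced_by` field. That field is hypothetical: it is not part of the Anthropic API and is not confirmed to exist in Copilot CLI. The function drops signed `thinking` blocks whose producing model differs from the model the request targets.

```python
from typing import Any

Message = dict[str, Any]


def drop_foreign_thinking(history: list[Message], target_model: str) -> list[Message]:
    """Return a copy of history without thinking blocks signed by another model.

    `produced_by` is a hypothetical client-side annotation, not an API field.
    """
    cleaned: list[Message] = []
    for msg in history:
        content = msg.get("content")
        # Plain-string content and user messages carry no thinking blocks.
        if msg.get("role") != "assistant" or not isinstance(content, list):
            cleaned.append(msg)
            continue
        # Messages without the annotation are treated as native to the target.
        foreign = msg.get("produced_by") not in (None, target_model)
        kept = [
            block for block in content
            if not (foreign and block.get("type") == "thinking")
        ]
        cleaned.append({**msg, "content": kept})
    return cleaned
```

The input list is left untouched, so the original history remains available for diagnostics. A real implementation would also need to remove the `produced_by` annotation before sending and decide what to do with a message whose content becomes empty.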

What makes this worse than a normal transient API error:

  • No automatic recovery. The CLI surfaces the error verbatim and stops the turn, but does not strip / regenerate / quarantine the offending thinking block, so the next user prompt re-sends the same poisoned history and gets the same 400.
  • No user-visible repair affordance. There is no /rewind to a known-good turn, no "drop the corrupted thinking block" option, no diagnostic telling the user "your conversation history has been wedged by a stale thinking-block signature; start a new session".
  • The natural fallback (closing and /resume-ing the session, or sending continue again) does not help as long as the bad block stays in the cached history.

Affected version

GitHub Copilot CLI 1.0.49

Stack trace references the same code path in 1.0.48 (app.js).

file:///home/jaytau/.copilot/pkg/universal/1.0.48/app.js:1254:1046  t.fromAPIError
file:///home/jaytau/.copilot/pkg/universal/1.0.48/app.js:3439:15527 vmt.getCompletionWithTools
file:///home/jaytau/.copilot/pkg/universal/1.0.48/app.js:3472:2751  O3e.getCompletionWithTools
file:///home/jaytau/.copilot/pkg/universal/1.0.48/app.js:4483:4797  t.runAgenticLoop
file:///home/jaytau/.copilot/pkg/universal/1.0.48/app.js:4481:12744 t.processQueuedItems
file:///home/jaytau/.copilot/pkg/universal/1.0.48/app.js:4481:3688  t.processQueue
file:///home/jaytau/.copilot/pkg/universal/1.0.48/app.js:4479:4392  t.send

(The session was started under 1.0.48 and the binary auto-updated to 1.0.49 between the failure and the time of this report; the failing code path is the same.)

Steps to reproduce the behavior

I can't synthetically reproduce this on demand yet, but the repro signature from the events log is:

  1. Open a long session on claude-opus-4.7-1m-internal.
  2. Launch one or more background research sub-agents that run on a different Claude model (in my case claude-opus-4.6-1m) — e.g. via the task tool with mode: "background" and agent_type: "research".
  3. Continue working in the parent session while the sub-agents are running (turns that produce thinking blocks).
  4. Let one or more of the sub-agents complete between parent turns, so the completion notification is delivered while the parent is preparing its next turn.
  5. The next parent turn issues an API request whose serialized history has a thinking block at some messages.N.content.M with a signature value the API no longer accepts → 400 invalid_request_error: Invalid signature in thinking block.
  6. Every subsequent continue (or any user prompt) reproduces the identical error against the same messages.N.content.M until you abandon the session.

In my case three different research sub-agents (tier1-r5-docs, tier1-r5-code, tier1-r5-tax-law) each triggered the same failure when they reported back into the parent loop:

2026-05-18T21:21:00.502Z  subagent.completed     tier1-r5-docs
2026-05-18T21:21:00.521Z  system.notification    Agent "tier1-r5-docs" (research) has completed successfully…
2026-05-18T21:21:00.859Z  assistant.turn_start
2026-05-18T21:21:05.107Z  assistant.turn_end
2026-05-18T21:21:05.163Z  session.error          CAPIError: 400 … messages.5.content.2: Invalid `signature` in `thinking` block  (request_id: req_011CbAjZdRJmTT961kRWg8Ws)

2026-05-18T21:21:24.222Z  subagent.completed     tier1-r5-code
2026-05-18T21:21:24.237Z  system.notification    Agent "tier1-r5-code" (research) has completed successfully…
2026-05-18T21:21:24.593Z  assistant.turn_start
2026-05-18T21:21:28.665Z  assistant.turn_end
2026-05-18T21:21:28.723Z  session.error          CAPIError: 400 … messages.5.content.2: Invalid `signature` in `thinking` block  (request_id: req_011CbAjbNMziS4DdrkguBbBG)

2026-05-18T22:10:03.836Z  subagent.completed     tier1-r5-tax-law
2026-05-18T22:10:03.856Z  system.notification    Agent "tier1-r5-tax-law" (research) has completed successfully…
2026-05-18T22:10:04.623Z  assistant.turn_start
2026-05-18T22:10:10.120Z  assistant.turn_end
2026-05-18T22:10:10.178Z  session.error          CAPIError: 400 … messages.5.content.2: Invalid `signature` in `thinking` block  (request_id: req_vrtx_011CbAoJhWv6GbNY7j8vdC93)

All three errors point at the same messages.5.content.2 slot — i.e. a single corrupted thinking block at a fixed position in the cached history is wedging every retry.
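
The correlation between sub-agent completion and the error can be checked mechanically. The following sketch assumes the events log has been exported in the three-column layout shown in the excerpt above (ISO timestamp, event type, free-text detail); the real on-disk format may differ. The names `parse_events` and `find_wedge_patterns` and the 10-second window are illustrative choices, not part of the CLI.

```python
import re
from datetime import datetime, timedelta

# Timestamp ending in "Z", then the event type, then optional detail text.
LINE_RE = re.compile(r"^(\S+Z)\s+(\S+)\s*(.*)$")


def parse_events(text: str) -> list[tuple[datetime, str, str]]:
    """Parse excerpt-style lines into (timestamp, event_type, detail) tuples."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # Skip blank or unrecognised lines.
        ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%fZ")
        events.append((ts, m.group(2), m.group(3)))
    return events


def find_wedge_patterns(events, window=timedelta(seconds=10)):
    """Pair each session.error with the latest subagent.completed inside the window.

    Returns a list of (sub_agent_detail, seconds_from_completion_to_error).
    """
    hits = []
    last_completed = None
    for ts, kind, detail in events:
        if kind == "subagent.completed":
            last_completed = (ts, detail)
        elif kind == "session.error" and last_completed is not None:
            delay = ts - last_completed[0]
            if delay <= window:
                hits.append((last_completed[1], delay.total_seconds()))
            last_completed = None
    return hits
```

Run over the excerpt above, this pairs each of the three errors with the sub-agent that completed immediately before it.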

The error messages as they appeared in the TUI were exactly:

✗ Execution failed: CAPIError: 400 {"type":"error","error":{"type":"invalid_request_error","message":"messages.5.content.2: Invalid `signature` in `thinking` block"},"request_id":"req_011CbAjZdRJmTT961kRWg8Ws"} (Request ID: 603E:D8F5A:C58A2B:D7672B:6A0B82BF)
● Background agent "Tier-1 round-5 code review" (research) completed
  └ You are a Tier-1 reviewer for PR #132 of github.com/jay-tau/ibkr-fa, a Python…
✗ Execution failed: CAPIError: 400 {"type":"error","error":{"type":"invalid_request_error","message":"messages.5.content.2: Invalid `signature` in `thinking` block"},"request_id":"req_011CbAjbNMziS4DdrkguBbBG"} (Request ID: 603E:D8F5A:C605AF:D7EF6C:6A0B82D6)
● Background agent "Tier-1 round-5 tax-law review" (research) completed
  └ You are a Tier-1 reviewer for PR #132 of github.com/jay-tau/ibkr-fa, a Python…
✗ Execution failed: CAPIError: 400 {"type":"error","error":{"type":"invalid_request_error","message":"messages.5.content.2: Invalid `signature` in `thinking` block"},"request_id":"req_vrtx_011CbAoJhWv6GbNY7j8vdC93"} (Request ID: 08FF:29E602:F6CB52:10E625B:6A0B8E3F)
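
The error text already carries enough information to locate the offending block. A minimal sketch, assuming the TUI string format shown above (a JSON payload embedded in the message, followed by a parenthesised Request ID); `locate_bad_block` is a hypothetical helper, not part of the CLI.

```python
import json
import re

PATH_RE = re.compile(r"messages\.(\d+)\.content\.(\d+)")


def locate_bad_block(error_text: str):
    """Extract (message_index, content_index, request_id) from a CAPIError string.

    Returns None if the text does not describe an invalid thinking signature.
    """
    start = error_text.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses the leading JSON object and ignores trailing text.
        payload, _ = json.JSONDecoder().raw_decode(error_text[start:])
    except json.JSONDecodeError:
        return None
    message = payload.get("error", {}).get("message", "")
    m = PATH_RE.search(message)
    if m is None or "signature" not in message:
        return None
    return int(m.group(1)), int(m.group(2)), payload.get("request_id")
```

With indices in hand, a client could show the user which turn is affected instead of printing the raw payload.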

Expected behavior

The CLI should not let a single corrupted thinking-block signature permanently brick a session. At least one, and ideally several, of the following should hold:

  1. Detect and recover from invalid_request_error: Invalid signature in thinking block automatically by stripping/redacting the offending thinking block from the cached history and retrying (Anthropic's API explicitly allows you to omit thinking blocks when not using extended thinking on a follow-up turn, and to mark blocks as redacted_thinking when their signatures can't be regenerated).
  2. Guarantee thinking-block signatures stay paired with the model that produced them. When a sub-agent runs under a different Claude model than the parent (e.g. opus-4.6 sub-agent under an opus-4.7 parent), the integration step should not leave any of the sub-agent's signed thinking blocks in the parent's serialized history. (And vice-versa — the parent's signed thinking blocks must not leak into the sub-agent's request.)
  3. Surface a user-actionable error, not a verbatim CAPI dump. Something like: "This session's conversation history was rejected by the model (corrupted reasoning signature at message 5). I've removed the corrupted block; please try again." — or, if recovery isn't possible, "…please run /rewind to roll back to turn N, or /new to start fresh."
  4. Expose /rewind (or similar) as a recovery option in the error message itself, since today the only options the user can guess at — continue, closing the terminal and /resume-ing, sending the same prompt again — all fail identically because they all re-send the same poisoned history.
  5. Telemetry: a session-level counter / health indicator that flips when 400 invalid_request_error is encountered, so the TUI can render a "session corrupted — start fresh" affordance instead of looking indistinguishable from a working session.
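
Items 1, 3, and 5 could be combined in a small retry wrapper. The sketch below is not the CLI's implementation: `send` stands in for whatever function posts the history to the model, `InvalidSignatureError` is a hypothetical exception carrying the indices parsed from the 400 response, and `SessionHealth` is an illustrative counter. It also assumes, without having tested it, that the provider accepts the history once the flagged block is removed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Message = dict[str, Any]


class InvalidSignatureError(Exception):
    """Hypothetical error raised when the API rejects a thinking-block signature."""

    def __init__(self, message_index: int, content_index: int):
        super().__init__(f"messages.{message_index}.content.{content_index}")
        self.message_index = message_index
        self.content_index = content_index


@dataclass
class SessionHealth:
    signature_errors: int = 0
    notices: list[str] = field(default_factory=list)


def send_with_repair(
    history: list[Message],
    send: Callable[[list[Message]], Any],
    health: SessionHealth,
    max_repairs: int = 3,
) -> tuple[Any, list[Message]]:
    """Call send(history); on a signature error, drop the flagged block and retry.

    Returns (result, repaired_history) so the caller can persist the repair.
    """
    repairs = 0
    while True:
        try:
            return send(history), history
        except InvalidSignatureError as err:
            health.signature_errors += 1
            if repairs >= max_repairs:
                raise
            msg = history[err.message_index]
            content = msg.get("content")
            # Only touch the history if the indices really name a thinking block.
            if (
                not isinstance(content, list)
                or err.content_index >= len(content)
                or content[err.content_index].get("type") != "thinking"
            ):
                raise
            kept = [b for i, b in enumerate(content) if i != err.content_index]
            # Build a new list so the caller's original history is not mutated.
            history = (
                history[: err.message_index]
                + [{**msg, "content": kept}]
                + history[err.message_index + 1 :]
            )
            repairs += 1
            health.notices.append(
                f"Removed a rejected reasoning block at message {err.message_index}; retrying."
            )
```

Returning the repaired history matters here: if the caller keeps re-sending the original list, the next turn hits the same 400. Whether plain removal is accepted depends on the provider's rules for thinking blocks, so a real fix would need to verify this against the API documentation.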

Additional context

  • OS: Linux
  • Parent model: claude-opus-4.7-1m-internal
  • Sub-agent model: claude-opus-4.6-1m (running as a research sub-agent)
  • Workspace: /home/jaytau/temp/ibkr-fa (commit 8923d5fe)
  • Session ID: 96bf5b50-79d8-4de9-b01a-8b57e95b3eaf
  • Provider request IDs (one per failure, in order):
    • req_011CbAjZdRJmTT961kRWg8Ws → CAPI Request ID 603E:D8F5A:C58A2B:D7672B:6A0B82BF
    • req_011CbAjbNMziS4DdrkguBbBG → CAPI Request ID 603E:D8F5A:C605AF:D7EF6C:6A0B82D6
    • req_vrtx_011CbAoJhWv6GbNY7j8vdC93 → CAPI Request ID 08FF:29E602:F6CB52:10E625B:6A0B8E3F
  • The session was using background research sub-agents extensively (a 5-round multi-tier code/docs/tax-law review workflow), so triggering this required nothing exotic, just sustained parallel task(mode: "background", agent_type: "research") usage on a thinking-heavy parent agent.

Workaround

The only workaround I found was to:

  1. End the wedged copilot process.
  2. /resume the session. The corrupted thinking block from the assistant's last turn should still have been in the history, but the very next user continue after the resume happened to succeed in my case. Possible explanations: the resume path re-serializes the history slightly differently, or the sub-agent results that were causing the conflict had by then been fully consumed and dropped from the live cache.

No in-CLI recovery (continue, retrying the same prompt, sending a new prompt, switching models with /model) helped before the resume.

Labels: area:agents, area:models, area:sessions