
SDK: orphan tool_use left mid-conversation after hard kill + resume causes persistent messages.N: tool_use ids were found without tool_result blocks (400) #3183

@ulugbekna


Affected package: @github/copilot (./sdk/index.js)
Version observed: 1.0.39

Edit: This issue was originally written assuming the orphan was caused by subagent messages interleaving with main-agent messages. After a closer look at the persisted state, that's not the case — these are concurrent main-agent interactions (different interactionIds on top-level events with no agentId/parentToolCallId). I've rewritten the body below with the corrected analysis. The user-visible bug and the proposed fix to the orphan repair are unchanged.

Symptom

Sending a message to a resumed session produces a 400 from the upstream API:

CAPIError: 400 messages.465: `tool_use` ids were found without `tool_result` blocks immediately after: toolu_0158b9C6JwY4ZzCHZdgnibXw.
Each `tool_use` block must have a corresponding `tool_result` block in the next message.

At session resume the SDK logs "Completing 1 orphaned tool calls.", but the failure persists across retries, meaning the in-memory chat history still contains an unrepaired orphan deeper in the conversation. Once a session enters this state, every subsequent send fails with the same 400 and the session is effectively bricked.

(Surface: reproducible from VS Code Insiders' Agents window with a Copilot CLI session, but the bug is in the SDK chat-history reconstruction / persistence, not in the embedder.)

What's actually in the persisted state

Looking at the affected session's events.jsonl, the relevant region (timestamps + interactionIds) shows two concurrent top-level interactions writing into the same event log after a hard kill + session.resume:

| Line | Timestamp | Event | interactionId | Notes |
|------|-----------|-------|---------------|-------|
| 2107 | 2026-05-06T15:01:29.857Z | assistant.message | e103a9a8 | tool_use toolu_014VYJyECct5io issued |
| 2108 | 2026-05-06T15:01:29.858Z | tool.execution_start | | toolu_014… starts |
| 2109 | 2026-05-06T15:01:29.858Z | hook.start / hook.end | | last write before kill |
| 2111 | 2026-05-06T15:22:12.945Z | session.resume | | 21-min gap; no preceding session.shutdown (IDE killed mid-tool) |
| 2116 | 2026-05-06T15:22:17.811Z | user.message "Continue" | d75f8db4 | NEW interaction |
| 2119 | 2026-05-06T15:22:18.781Z | assistant.turn_start | d75f8db4 | new agentic loop starts |
| 2120 | 2026-05-06T15:22:57.574Z | assistant.message | d75f8db4 | new tool_use |
| 2131 | 2026-05-06T15:23:39.383Z | tool.execution_complete | e103a9a8 | OLD interaction's tool completes after resume |
| 2133 | 2026-05-06T15:23:39.385Z | assistant.turn_start | e103a9a8 | turn=39 OLD interaction continues with fresh CAPI requestId |
| 2143 | 2026-05-06T15:23:57.716Z | assistant.message | e103a9a8 | new model call (requestId=A0F1:1CDDAB:30A4DC:33FAC3:69FB5CFC) |
| | | | | events from e103a9a8 and d75f8db4 continue interleaving |
| 2161 | 2026-05-06T15:24:15.621Z | assistant.message | e103a9a8 | tool_use toolu_0158b9C6JwY4ZzCHZdgnibXw ← the eventual culprit |
| 2185 | 2026-05-06T15:24:48.426Z | tool.execution_complete | e103a9a8 | toolu_0158… finally completes, but ~24 events of d75f8db4 activity sit in between |

Concretely:

  • events.jsonl timestamps are monotonically non-decreasing (verified across the whole file; adjacent events can share a millisecond, as lines 2108 and 2109 do), so these aren't out-of-order writes.
  • The e103a9a8 events after session.resume are real new model calls — they have fresh CAPI requestIds (A0F1:…, D2E0:…).
  • Top-level events from both interactions have no agentId and no data.parentToolCallId, so the SDK's yY() filter (return t.agentId ? t.agentId : t.data?.parentToolCallId) does not skip them when rebuilding _chatMessages via processEventForState.

So when _chatMessages is rebuilt from this stream and sent to the API, the tool_use from e103a9a8 (toolu_0158…, L2161) is followed by d75f8db4's assistant.messages instead of the matching tool_result. CAPI rejects it: messages.465: tool_use ids were found without tool_result blocks immediately after: toolu_0158….
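To make the failure mode concrete, here is a minimal sketch of a checker that flags any tool call whose result does not sit in the run of tool messages directly after the assistant message. It assumes OpenAI-style chat messages (role, tool_calls[].id, tool_call_id), as used by the helper quoted later in this issue; the function name and the sample ids are hypothetical, and the exact pairing rule CAPI applies may differ.

```javascript
// Sketch: report tool calls whose results do not directly follow the assistant message.
// Assumes messages shaped like { role, tool_calls: [{ id }], tool_call_id }.
function findUnpairedToolCalls(messages) {
    const unpaired = [];
    for (let i = 0; i < messages.length; i++) {
        const m = messages[i];
        if (m.role !== "assistant" || !m.tool_calls?.length) continue;
        // Ids answered by the contiguous run of tool messages right after m.
        const answered = new Set();
        for (let j = i + 1; j < messages.length && messages[j].role === "tool"; j++) {
            answered.add(messages[j].tool_call_id);
        }
        for (const tc of m.tool_calls) {
            if (!answered.has(tc.id)) unpaired.push({ index: i, id: tc.id });
        }
    }
    return unpaired;
}

// Shortened stand-in for the interleaving in the event log (ids are made up).
const interleaved = [
    { role: "user", content: "Continue" },
    { role: "assistant", tool_calls: [{ id: "call_old" }] }, // old interaction's tool_use
    { role: "assistant", tool_calls: [{ id: "call_new" }] }, // other interaction in between
    { role: "tool", tool_call_id: "call_new", content: "ok" },
    { role: "tool", tool_call_id: "call_old", content: "late" },
];
console.log(findUnpairedToolCalls(interleaved));
```

Here only call_old is reported (at index 1): its result exists, but not directly after the message that issued it.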

What I'm not yet sure about

I haven't fully traced how the SDK ended up running two concurrent agentic loops for the same session after a hard kill + resume. It could be:

  • the SDK resuming the previously in-flight interaction (e103a9a8) and also processing the new queued user message (d75f8db4) concurrently, or
  • some path through the embedder that revives old in-flight state on resume.

Either way, what's clear from the persisted event log is that two top-level interactions wrote to the same session log post-resume, and the SDK had no defense in _chatMessages rebuild for that.
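For anyone wanting to check their own session for the same pattern, a rough sketch follows. The field names type, timestamp, and interactionId are assumptions about the events.jsonl schema (I have not confirmed them against the SDK), and summarizePostResume is a hypothetical helper, not an SDK function.

```javascript
// Sketch only: `type`, `timestamp`, and `interactionId` are ASSUMED field names,
// not confirmed against the real events.jsonl schema.
function summarizePostResume(events) {
    let monotonic = true;
    let lastResume = -1;
    for (let i = 0; i < events.length; i++) {
        // Flag any timestamp that goes backwards relative to the previous event.
        if (i > 0 && Date.parse(events[i].timestamp) < Date.parse(events[i - 1].timestamp)) {
            monotonic = false;
        }
        if (events[i].type === "session.resume") lastResume = i;
    }
    // Distinct interactionIds that wrote events after the last session.resume
    // (or across the whole log if there was no resume).
    const active = new Set();
    for (const ev of events.slice(lastResume + 1)) {
        if (ev.interactionId) active.add(ev.interactionId);
    }
    return { monotonic, interactionsAfterResume: [...active] };
}

// A few rows from the table above, reduced to the assumed fields.
const sample = [
    { timestamp: "2026-05-06T15:01:29.857Z", type: "assistant.message", interactionId: "e103a9a8" },
    { timestamp: "2026-05-06T15:22:12.945Z", type: "session.resume" },
    { timestamp: "2026-05-06T15:22:17.811Z", type: "user.message", interactionId: "d75f8db4" },
    { timestamp: "2026-05-06T15:23:39.383Z", type: "tool.execution_complete", interactionId: "e103a9a8" },
];
console.log(summarizePostResume(sample));
```

More than one entry in interactionsAfterResume, together with monotonic being true, would match what this session shows: two interactions writing in order into the same log after the resume.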

Why the SDK's existing repair doesn't catch it

In node_modules/@github/copilot/sdk/index.js (around line 3888), the orphan-repair helper (minified name dns) walks _chatMessages backwards from the end. Once it has seen an assistant message, it stops at the first message that is not a tool-calling assistant message, so it never looks past the tail:

function dns(t) {
    if (t.length === 0) return t;
    let e = [];          // orphaned tool_use ids
    let r = new Set();   // tool_call_ids that already have a result
    let n = false;       // hasSeenAssistant
    for (let a = t.length - 1; a >= 0; a--) {
        let l = t[a];
        if (l.role === "assistant") n = true;
        if (l.role === "assistant" && "tool_calls" in l && l.tool_calls && l.tool_calls.length > 0) {
            for (let c of l.tool_calls)
                if (!r.has(c.id)) e.push(c.id);
        } else {
            if (n) break;     // ← only repairs orphans at the tail
            if (l.role === "tool" && l.tool_call_id) r.add(l.tool_call_id);
        }
    }
    if (e.length === 0) return t;
    let o = "The execution of this tool, or a previous tool was interrupted.";
    U.info(`Completing ${e.length} orphaned tool calls.`);
    let s = e.map(a => ({ role: "tool", tool_call_id: a, content: o }));
    return [...t, ...s];
}

Call sites: case "abort" and case "session.resume" in the same file (~line 3892).

Because the backwards loop breaks at the first message that is not a tool-calling assistant message once it has seen an assistant message, it can only:

  • collect tool-results that come after the last assistant message, and
  • mark orphans on that last assistant message only.

Any orphan further back (e.g. a tool_use whose matching tool_result got separated by another interaction's assistant.message) is never discovered or repaired. That's exactly the configuration the persisted state above produces.
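The limitation is easy to reproduce in isolation. The sketch below is my readable port of the minified helper (logging dropped, names invented, and it always returns a new array), run against a made-up history with one orphan that is not at the tail.

```javascript
// Readable port of the minified helper quoted above; names are mine, logging omitted.
function repairTailOrphans(messages) {
    const orphanIds = [];
    const resolved = new Set();
    let seenAssistant = false;
    for (let i = messages.length - 1; i >= 0; i--) {
        const m = messages[i];
        if (m.role === "assistant") seenAssistant = true;
        if (m.role === "assistant" && m.tool_calls?.length > 0) {
            for (const tc of m.tool_calls) if (!resolved.has(tc.id)) orphanIds.push(tc.id);
        } else {
            if (seenAssistant) break; // stops here, so nothing deeper is inspected
            if (m.role === "tool" && m.tool_call_id) resolved.add(m.tool_call_id);
        }
    }
    const content = "The execution of this tool, or a previous tool was interrupted.";
    return [...messages, ...orphanIds.map(id => ({ role: "tool", tool_call_id: id, content }))];
}

// Hypothetical history: call_deep never got a result, call_tail did.
const history = [
    { role: "user", content: "start" },
    { role: "assistant", tool_calls: [{ id: "call_deep" }] },
    { role: "user", content: "Continue" },
    { role: "assistant", tool_calls: [{ id: "call_tail" }] },
    { role: "tool", tool_call_id: "call_tail", content: "ok" },
];
const repaired = repairTailOrphans(history);
console.log(repaired.length === history.length); // true: nothing was appended
```

The walk resolves call_tail, hits the "Continue" user message, and breaks, so call_deep is left without a result.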

Suggested fix

Whatever the upstream cause of the concurrent interactions writing into the same log, the in-memory _chatMessages reconstruction should defensively repair every orphan, not just the tail. A whole-history walk:

function repairAllOrphans(messages) {
    // Pass 1: collect every tool_call_id that has a result anywhere in the history.
    const resolvedToolCallIds = new Set();
    for (const m of messages) {
        if (m.role === "tool" && m.tool_call_id) {
            resolvedToolCallIds.add(m.tool_call_id);
        }
    }
    // Pass 2: copy the history, inserting a synthetic result right after any
    // assistant message whose tool calls were never resolved.
    const out = [];
    for (const m of messages) {
        out.push(m);
        if (m.role === "assistant" && m.tool_calls?.length) {
            const orphans = m.tool_calls.filter(tc => !resolvedToolCallIds.has(tc.id));
            if (orphans.length === 0) continue;
            const synthetic = orphans.map(tc => ({
                role: "tool",
                tool_call_id: tc.id,
                content: "The execution of this tool, or a previous tool was interrupted."
            }));
            out.push(...synthetic);
            for (const o of orphans) resolvedToolCallIds.add(o.id);
        }
    }
    return out;
}

This is a strict generalisation of the existing helper: it still repairs tail orphans, and it repairs the deeper-orphan case that fires messages.<N>: tool_use ids were found without tool_result blocks after a hard kill + resume.
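One caveat that may be worth checking: the whole-history walk treats a tool call as resolved if a matching tool message exists anywhere in the history. The API error is about results missing immediately after the tool_use, so if the late result is still present in _chatMessages but separated from its call (I have not confirmed whether it is), that walk would leave the pair untouched. A position-aware variant could look like the sketch below; repairToolPairing is a hypothetical name, and it assumes tool call ids are unique across the history.

```javascript
// Sketch: rebuild the history so every tool call is followed directly by its result.
// Late results are moved next to their call; missing ones get a synthetic placeholder.
// Note: tool messages that match no tool call are dropped by this approach.
function repairToolPairing(messages) {
    const placeholder = "The execution of this tool, or a previous tool was interrupted.";
    // Index the first result seen for each tool_call_id.
    const resultById = new Map();
    for (const m of messages) {
        if (m.role === "tool" && m.tool_call_id && !resultById.has(m.tool_call_id)) {
            resultById.set(m.tool_call_id, m);
        }
    }
    const out = [];
    for (const m of messages) {
        if (m.role === "tool") continue; // results are re-emitted right after their call
        out.push(m);
        if (m.role === "assistant" && m.tool_calls?.length) {
            for (const tc of m.tool_calls) {
                out.push(resultById.get(tc.id) ?? { role: "tool", tool_call_id: tc.id, content: placeholder });
            }
        }
    }
    return out;
}

// Made-up history: call_old's result arrives late, call_lost never gets one.
const messy = [
    { role: "assistant", tool_calls: [{ id: "call_old" }] },
    { role: "assistant", tool_calls: [{ id: "call_new" }] },
    { role: "tool", tool_call_id: "call_new", content: "ok" },
    { role: "tool", tool_call_id: "call_old", content: "late" },
    { role: "assistant", tool_calls: [{ id: "call_lost" }] },
];
console.log(repairToolPairing(messy).map(m => m.tool_call_id ?? m.role));
```

Moving a late result changes the order the model sees, so replacing it with the placeholder instead may be the safer choice; that trade-off is a judgment call for the maintainers.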

Separately (and arguably more important), it'd be worth investigating why two top-level interactions for the same session can write to the same event log post-resume — the orphan-repair fix above is a safety net, not a root-cause fix for that.

Reproduction notes

  1. Run a session and let the main agent reach a long-running tool call (e.g. a slow bash command).
  2. Kill the IDE / process while the tool is still in flight (no graceful session.shutdown).
  3. Reopen the IDE; the session resumes (session.resume fires).
  4. Send any user message → 400 from CAPI.

Versions

  • @github/copilot 1.0.39 (./sdk/index.js)
  • Surfaced from VS Code Insiders' Agents window using a Copilot CLI session

Labels: area:sessions, area:tools
