SDK: orphan `tool_use` left mid-conversation after hard kill + resume causes persistent `messages.N: tool_use ids were found without tool_result blocks` (400)

**Affected package:** `@github/copilot` (`./sdk/index.js`)
**Version observed:** `1.0.39`

> **Edit:** This issue was originally written assuming the orphan was caused by *subagent* messages interleaving with main-agent messages. After a closer look at the persisted state, that's not the case — these are concurrent **main-agent interactions** (different `interactionId`s on top-level events with no `agentId`/`parentToolCallId`). I've rewritten the body below with the corrected analysis. The user-visible bug and the proposed fix to the orphan repair are unchanged.

## Symptom

Sending a message to a resumed session produces a 400 from the upstream API:

```
CAPIError: 400 messages.465: `tool_use` ids were found without `tool_result` blocks immediately after: toolu_0158b9C6JwY4ZzCHZdgnibXw.
Each `tool_use` block must have a corresponding `tool_result` block in the next message.
```

The SDK logs `Completing 1 orphaned tool calls.` at session resume, but the failure persists across retries, which means the in-memory chat history still contains an unrepaired orphan deeper in the conversation. Once a session enters this state, **every subsequent `send` fails with the same 400** and the session is effectively bricked.

(Surface: reproducible from VS Code Insiders' Agents window with a Copilot CLI session, but the bug is in the SDK chat-history reconstruction / persistence, not in the embedder.)

## What's actually in the persisted state

Looking at the affected session's `events.jsonl`, the relevant region (timestamps + interactionIds) shows two **concurrent top-level interactions** writing into the same event log after a hard kill + `session.resume`:

| Line | Timestamp                  | Event                       | `interactionId` | Notes |
|-----:|----------------------------|-----------------------------|-----------------|-------|
| 2107 | 2026-05-06T15:01:29.857Z   | `assistant.message`         | `e103a9a8`      | tool_use `toolu_014VYJyECct5io` issued |
| 2108 | 2026-05-06T15:01:29.858Z   | `tool.execution_start`      | —               | `toolu_014…` starts |
| 2109 | 2026-05-06T15:01:29.858Z   | `hook.start` / `hook.end`   | —               | last write before kill |
| 2111 | 2026-05-06T**15:22:12.945Z** | **`session.resume`**      | —               | 21-min gap; **no preceding `session.shutdown`** (IDE killed mid-tool) |
| 2116 | 2026-05-06T15:22:17.811Z   | `user.message` "Continue"   | `d75f8db4`      | NEW interaction |
| 2119 | 2026-05-06T15:22:18.781Z   | `assistant.turn_start`      | `d75f8db4`      | new agentic loop starts |
| 2120 | 2026-05-06T15:22:57.574Z   | `assistant.message`         | `d75f8db4`      | new tool_use |
| 2131 | 2026-05-06T15:23:39.383Z   | `tool.execution_complete`   | **`e103a9a8`**  | **OLD interaction's tool completes after resume** |
| 2133 | 2026-05-06T15:23:39.385Z   | `assistant.turn_start`      | **`e103a9a8`** turn=39 | **OLD interaction continues** with fresh CAPI requestId |
| 2143 | 2026-05-06T15:23:57.716Z   | `assistant.message`         | `e103a9a8`      | new model call (`requestId=A0F1:1CDDAB:30A4DC:33FAC3:69FB5CFC`) |
| …    | …                          | …                           | …               | events from `e103a9a8` and `d75f8db4` continue interleaving |
| 2161 | 2026-05-06T15:24:15.621Z   | `assistant.message`         | `e103a9a8`      | tool_use **`toolu_0158b9C6JwY4ZzCHZdgnibXw`** ← the eventual culprit |
| 2185 | 2026-05-06T15:24:48.426Z   | `tool.execution_complete`   | `e103a9a8`      | `toolu_0158…` finally completes — but ~24 events of `d75f8db4` activity sit in between |

Concretely:

- `events.jsonl` timestamps never go backwards (verified across the whole file; some adjacent events, e.g. lines 2108 and 2109, share a timestamp), so these aren't out-of-order writes.
- The `e103a9a8` events after `session.resume` are real new model calls — they have **fresh CAPI `requestId`s** (`A0F1:…`, `D2E0:…`).
- Top-level events from both interactions have no `agentId` and no `data.parentToolCallId`, so the SDK's `yY()` filter (`return t.agentId ? t.agentId : t.data?.parentToolCallId`) does **not** skip them when rebuilding `_chatMessages` via `processEventForState`.
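
For anyone who wants to run the same checks on their own session log, here is a minimal sketch. The field names `type` and `timestamp` are assumptions on my part (only `interactionId`, `agentId` and `data.parentToolCallId` are taken from the analysis above), and `summarizeEventLog` is a hypothetical helper, not part of the SDK.

```js
// Hypothetical event shape: { type, timestamp, interactionId?, agentId?, data? }.
// `type` and `timestamp` are assumed field names; adjust them to the real schema.
function summarizeEventLog(jsonlText) {
    const events = jsonlText
        .split("\n")
        .filter(line => line.trim() !== "")
        .map(line => JSON.parse(line));

    // 1. Timestamps must never go backwards (equal neighbours are fine).
    let monotonic = true;
    for (let i = 1; i < events.length; i++) {
        if (Date.parse(events[i].timestamp) < Date.parse(events[i - 1].timestamp)) {
            monotonic = false;
            break;
        }
    }

    // 2. Collapse top-level events into runs of the same interactionId.
    //    Events with agentId / parentToolCallId are subagent events and are skipped,
    //    as are events that carry no interactionId at all.
    const order = [];
    for (const ev of events) {
        const isSubagent = Boolean(ev.agentId || ev.data?.parentToolCallId);
        if (isSubagent || !ev.interactionId) continue;
        if (order[order.length - 1] !== ev.interactionId) order.push(ev.interactionId);
    }

    // An id that shows up in more than one run means two interactions interleaved.
    const interleaved = new Set(order).size < order.length;
    return { monotonic, interleaved, order };
}
```

On the region in the table above, `order` would be expected to alternate between `e103a9a8` and `d75f8db4`, which is what `interleaved: true` flags.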

So when `_chatMessages` is rebuilt from this stream and sent to the API, the `tool_use` from `e103a9a8` (`toolu_0158…`, line 2161 of `events.jsonl`) is followed by `d75f8db4`'s `assistant.message`s instead of the matching `tool_result`. CAPI rejects it: `messages.465: tool_use ids were found without tool_result blocks immediately after: toolu_0158…`.
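
The rule being violated can be checked mechanically. A minimal sketch, assuming the same message shape the SDK's repair helper reads (`role`, `tool_calls[].id`, `tool_call_id`); `findUnpairedToolCalls` is a hypothetical name, and the history below is made up to mirror the interleaving, not copied from the session:

```js
// Reports every tool call whose result is not in the run of `tool` messages
// that directly follows the assistant message that issued it.
function findUnpairedToolCalls(messages) {
    const problems = [];
    for (let i = 0; i < messages.length; i++) {
        const m = messages[i];
        if (m.role !== "assistant" || !m.tool_calls?.length) continue;

        // Collect the ids answered immediately after this assistant message.
        const answered = new Set();
        for (let j = i + 1; j < messages.length && messages[j].role === "tool"; j++) {
            answered.add(messages[j].tool_call_id);
        }
        for (const tc of m.tool_calls) {
            if (!answered.has(tc.id)) problems.push({ index: i, id: tc.id });
        }
    }
    return problems;
}

// Made-up history: the result for toolu_A exists, but another interaction's
// assistant message sits between the call and its result.
const interleavedHistory = [
    { role: "user", content: "Continue" },
    { role: "assistant", content: "", tool_calls: [{ id: "toolu_A" }] },
    { role: "assistant", content: "message from the other interaction" },
    { role: "tool", tool_call_id: "toolu_A", content: "late result" },
];
console.log(findUnpairedToolCalls(interleavedHistory));
// → [ { index: 1, id: 'toolu_A' } ]
```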

## What I'm not yet sure about

I haven't fully traced how the SDK ended up running two concurrent agentic loops for the same session after a hard kill + resume. It could be:
- the SDK resuming the previously in-flight interaction (`e103a9a8`) **and** also processing the new queued user message (`d75f8db4`) concurrently, or
- some path through the embedder that revives old in-flight state on resume.

Either way, what's clear from the persisted event log is that **two top-level interactions wrote to the same session log post-resume**, and the SDK had no defense in `_chatMessages` rebuild for that.

## Why the SDK's existing repair doesn't catch it

In `node_modules/@github/copilot/sdk/index.js` (around line 3888), the orphan-repair helper (minified name `dns`) walks `_chatMessages` backwards from the end and stops at the first message without `tool_calls` that it reaches once it has seen an assistant message:

```js
function dns(t) {
    if (t.length === 0) return t;
    let e = [];          // orphaned tool_use ids
    let r = new Set();   // tool_call_ids that already have a result
    let n = false;       // hasSeenAssistant
    for (let a = t.length - 1; a >= 0; a--) {
        let l = t[a];
        if (l.role === "assistant") n = true;
        if (l.role === "assistant" && "tool_calls" in l && l.tool_calls && l.tool_calls.length > 0) {
            for (let c of l.tool_calls)
                if (!r.has(c.id)) e.push(c.id);
        } else {
            if (n) break;     // ← only repairs orphans at the tail
            if (l.role === "tool" && l.tool_call_id) r.add(l.tool_call_id);
        }
    }
    if (e.length === 0) return t;
    let o = "The execution of this tool, or a previous tool was interrupted.";
    U.info(`Completing ${e.length} orphaned tool calls.`);
    let s = e.map(a => ({ role: "tool", tool_call_id: a, content: o }));
    return [...t, ...s];
}
```

Call sites: `case "abort"` and `case "session.resume"` in the same file (~line 3892).

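Because the loop `break`s at the first message without `tool_calls` that it meets once it has reached the last assistant message (walking backwards), it can only:
- collect tool results that come **after** the last assistant message, and
- mark orphans on the **trailing run of assistant messages with `tool_calls`** (in the usual case, just the last assistant message).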

Any orphan further back (e.g. a `tool_use` whose matching `tool_result` got separated by another interaction's `assistant.message`) is never discovered or repaired. That's exactly the configuration the persisted state above produces.
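
To make the limitation concrete, the sketch below restates the helper with readable names and runs it on a made-up four-message history. `repairTailOrphans` and its variable names are mine, not the SDK's, and the `U.info` log call is omitted; the control flow is intended to match the minified code quoted above.

```js
const INTERRUPTED_TEXT = "The execution of this tool, or a previous tool was interrupted.";

// De-minified restatement of the tail-only repair (logging omitted).
function repairTailOrphans(messages) {
    const orphanIds = [];
    const resolved = new Set();
    let seenAssistant = false;
    for (let i = messages.length - 1; i >= 0; i--) {
        const m = messages[i];
        if (m.role === "assistant") seenAssistant = true;
        if (m.role === "assistant" && m.tool_calls?.length > 0) {
            for (const tc of m.tool_calls) {
                if (!resolved.has(tc.id)) orphanIds.push(tc.id);
            }
        } else {
            if (seenAssistant) break; // everything earlier is never inspected
            if (m.role === "tool" && m.tool_call_id) resolved.add(m.tool_call_id);
        }
    }
    const synthetic = orphanIds.map(id => ({ role: "tool", tool_call_id: id, content: INTERRUPTED_TEXT }));
    return [...messages, ...synthetic];
}

// Made-up history with one orphan deep in the conversation and one at the tail.
const historyWithDeepOrphan = [
    { role: "user", content: "start" },
    { role: "assistant", content: "", tool_calls: [{ id: "toolu_deep" }] }, // never answered
    { role: "user", content: "Continue" },
    { role: "assistant", content: "", tool_calls: [{ id: "toolu_tail" }] },
];
const tailRepaired = repairTailOrphans(historyWithDeepOrphan);
console.log(tailRepaired.map(m => m.tool_call_id ?? m.role));
// → [ 'user', 'assistant', 'user', 'assistant', 'toolu_tail' ]
```

Only `toolu_tail` gets a synthetic result; `toolu_deep` is still unanswered, so the next request would be rejected again.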

## Suggested fix

Whatever the upstream cause of the concurrent interactions writing into the same log, the in-memory `_chatMessages` reconstruction should defensively repair every orphan, not just the tail. A whole-history walk:

```js
function repairAllOrphans(messages) {
    const resolvedToolCallIds = new Set();
    for (const m of messages) {
        if (m.role === "tool" && m.tool_call_id) {
            resolvedToolCallIds.add(m.tool_call_id);
        }
    }
    const out = [];
    for (const m of messages) {
        out.push(m);
        if (m.role === "assistant" && m.tool_calls?.length) {
            const orphans = m.tool_calls.filter(tc => !resolvedToolCallIds.has(tc.id));
            if (orphans.length === 0) continue;
            const synthetic = orphans.map(tc => ({
                role: "tool",
                tool_call_id: tc.id,
                content: "The execution of this tool, or a previous tool was interrupted."
            }));
            out.push(...synthetic);
            for (const o of orphans) resolvedToolCallIds.add(o.id);
        }
    }
    return out;
}
```

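This generalises the existing helper: it still repairs tail orphans, **and** it repairs tool calls deeper in the history that have no `tool_result` at all, inserting the synthetic result directly after the assistant message rather than appending it at the end. One caveat: ids are looked up across the whole history, so a `tool_use` whose real `tool_result` exists but is displaced by other messages (the `toolu_0158…` shape in the table above) counts as resolved and is left where it is. Fully covering the `messages.<N>: tool_use ids were found without tool_result blocks` case after a hard kill + resume therefore also needs an adjacency check.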
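
As a complement, here is a minimal sketch of a variant that enforces adjacency instead of id presence. It moves each real result to sit directly after the assistant message that issued the call and synthesizes a result only when none exists. `repairToolPairing` is a hypothetical name. Whether reordering real results is acceptable, and what to do with results whose call is missing (this sketch drops them), are design decisions for the maintainers.

```js
function repairToolPairing(messages) {
    const interrupted = "The execution of this tool, or a previous tool was interrupted.";

    // Index every real tool result by id (first occurrence wins).
    const resultsById = new Map();
    for (const m of messages) {
        if (m.role === "tool" && m.tool_call_id && !resultsById.has(m.tool_call_id)) {
            resultsById.set(m.tool_call_id, m);
        }
    }

    const out = [];
    for (const m of messages) {
        // Tool results are re-emitted next to their call below, so skip them here.
        if (m.role === "tool") continue;
        out.push(m);
        if (m.role === "assistant" && m.tool_calls?.length) {
            for (const tc of m.tool_calls) {
                out.push(
                    resultsById.get(tc.id) ??
                    { role: "tool", tool_call_id: tc.id, content: interrupted }
                );
            }
        }
    }
    return out;
}

// Made-up history: X's result is displaced behind Y's messages, Z has no result.
const displaced = [
    { role: "user", content: "Continue" },
    { role: "assistant", content: "", tool_calls: [{ id: "toolu_X" }] },
    { role: "assistant", content: "", tool_calls: [{ id: "toolu_Y" }] },
    { role: "tool", tool_call_id: "toolu_Y", content: "real Y" },
    { role: "tool", tool_call_id: "toolu_X", content: "real X" },
    { role: "assistant", content: "", tool_calls: [{ id: "toolu_Z" }] },
];
console.log(repairToolPairing(displaced).map(m => m.tool_call_id ?? m.role));
// → [ 'user', 'assistant', 'toolu_X', 'assistant', 'toolu_Y', 'assistant', 'toolu_Z' ]
```

Here `toolu_X` and `toolu_Y` keep their real results and only `toolu_Z` receives the synthetic "interrupted" message.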

Separately (and arguably more important), it'd be worth investigating why two top-level interactions for the same session can write to the same event log post-resume — the orphan-repair fix above is a safety net, not a root-cause fix for that.

## Reproduction notes

1. Run a session, let the main agent reach a long-running tool call (e.g. a slow `bash`).
2. Kill the IDE / process while the tool is still in flight (no graceful `session.shutdown`).
3. Reopen the IDE; the session resumes (`session.resume` fires).
4. Send any user message → 400 from CAPI.

## Versions

- `@github/copilot` `1.0.39` (`./sdk/index.js`)
- Surfaced from VS Code Insiders' Agents window using a Copilot CLI session

