Describe the bug
During extended sessions with parallel background agents (launched via the task tool), the CLI crashes abruptly mid-execution with what appears to be a TypeScript/JavaScript runtime error. The session terminates without graceful shutdown, losing any in-flight agent work.
Pattern observed across 3 separate sessions:
- Session runs for 30-60 minutes with complex multi-step builds
- 4-7 background agents launched in parallel waves (via `task` tool with `mode: "background"`)
- Context window accumulates significantly (large plan + agent results + file verification)
- CLI suddenly exits with a script/TypeScript error — no clean shutdown, no checkpoint
The crash appears correlated with:
- High parallel agent count (4+ simultaneous background agents)
- Large context window accumulation (no automatic compaction triggered before crash)
- Heavy `events.jsonl` write activity from multiple concurrent agent completions
Environment
- OS: Windows 11 (PowerShell 7)
- CLI Version: GitHub Copilot CLI 1.0.x (latest via winget)
- Models used: Claude Opus 4.6, Claude Sonnet 4.6 (switched mid-session)
- Session duration at crash: ~45-60 minutes
- Parallel agents at time of crash: 4-7 background agents
Steps to reproduce
- Start `copilot` in a large workspace (30+ repos)
- Create a complex plan with 30+ todos tracked in SQLite
- Launch 4+ background agents simultaneously via `task` tool (`mode: "background"`)
- Allow agents to complete and accumulate results in context
- Launch another wave of 3-4 agents without running `/compact`
- CLI crashes within 10-20 minutes of sustained multi-agent activity
Expected behavior
- CLI should gracefully handle parallel agent load or surface a warning when approaching memory limits
- If crash is unavoidable, CLI should checkpoint session state before exiting
- `events.jsonl` should be flushed/synced before crash to enable clean session resume
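
To make the flush/sync expectation concrete, here is a minimal Python sketch of an append that is forced to disk after every event, plus a reader that tolerates a torn final line. It only illustrates the behavior we are asking for and is not the CLI's actual event writer. The function names (`append_event`, `read_events`) and the event fields are hypothetical.

```python
import json
import os


def append_event(path, event):
    """Append one JSON event as a single line and force it to disk."""
    line = json.dumps(event, separators=(",", ":")) + "\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to commit it to storage


def read_events(path):
    """Read events back, skipping any line that is not complete JSON.

    A crash in the middle of a write can leave a truncated last line;
    a resume should ignore it rather than fail.
    """
    events = []
    with open(path, encoding="utf-8") as f:
        for raw in f:
            try:
                events.append(json.loads(raw))
            except json.JSONDecodeError:
                continue  # torn or empty line, ignore
    return events
```

With this shape, a crash loses at most the event being written at that moment, and the file stays readable on resume.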
Workarounds implemented
We've built a resilience layer to mitigate this:
- SQLite as state store (not `events.jsonl`): the todos, agent_log, and wave_checkpoints tables survive a crash
- Capped parallel agents at 4 (down from 7) to reduce memory pressure
- Mandatory `/compact` after each wave to keep context window manageable
- JSON checkpoint files written to `files/` after each wave for crash recovery
- Session documentation enabling full state restoration in a new session
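
For illustration, the state store and the parallelism cap can be sketched roughly as below. This is a simplified stand-in, not our full resilience layer. The table names match the ones listed above, but the column layout is an assumption. `run_wave` and the other function names are hypothetical, and the "agents" are plain Python callables rather than real background agents started by the CLI. The `/compact` step is not shown because it happens inside the CLI session.

```python
import json
import sqlite3
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 4  # the cap from the workaround list (down from 7)

# Column layout is assumed for this sketch; only the table names are ours.
SCHEMA = """
CREATE TABLE IF NOT EXISTS todos (id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE IF NOT EXISTS agent_log (agent TEXT, wave INTEGER, result TEXT);
CREATE TABLE IF NOT EXISTS wave_checkpoints (wave INTEGER PRIMARY KEY, summary TEXT);
"""


def open_state(path):
    """Open (or create) the state database and ensure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def run_wave(conn, wave, jobs):
    """Run jobs (name -> callable), at most MAX_PARALLEL at once, then checkpoint."""
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        futures = {name: pool.submit(fn) for name, fn in jobs.items()}
        results = {name: fut.result() for name, fut in futures.items()}
    with conn:  # one transaction: commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO agent_log (agent, wave, result) VALUES (?, ?, ?)",
            [(name, wave, json.dumps(res)) for name, res in results.items()],
        )
        conn.execute(
            "INSERT OR REPLACE INTO wave_checkpoints (wave, summary) VALUES (?, ?)",
            (wave, json.dumps(sorted(results))),
        )
    return results


def last_completed_wave(conn):
    """Return the highest checkpointed wave number, or None if none finished."""
    return conn.execute("SELECT MAX(wave) FROM wave_checkpoints").fetchone()[0]
```

After a crash, a new session can call `last_completed_wave` and resume from the next wave instead of replaying `events.jsonl`.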
Relationship to existing issues
- CLI crashes randomly after upgrading to v1 GA #1993: same symptom (random crash), but our scenario is specifically triggered by parallel agents + large context
- Copilot CLI crashes with `JavaScript heap out of memory` #1457: JavaScript heap OOM, likely the underlying cause
- Recover queued and pending prompts after agent crash or unexpected exit #1780: request to recover queued prompts after crash; our SQLite approach is a user-space implementation of this
Suggestion
Consider exposing:
- A context window usage threshold that triggers automatic `/compact` before OOM
- A max-parallel-agents limit (configurable via config.json)
- A pre-crash checkpoint hook that saves session state atomically
- WAL mode for `events.jsonl` writes (or switch to SQLite internally for event storage)
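
As a rough idea of what "atomically" could mean for the checkpoint hook, here is a Python sketch of the common write-to-temp-then-rename pattern. It is an assumption about one possible approach, not a description of how the CLI stores session state. `save_checkpoint` and `load_checkpoint` are hypothetical names.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state as JSON so a reader sees either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk before the rename
        os.replace(tmp, path)     # swap the new file into place in one step
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)        # do not leave a half-written temp file behind
        raise


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```

The point of the pattern is that a crash during the write leaves the previous checkpoint intact, so a resume never has to parse a half-written file.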
Additional context
Detailed workaround architecture and crash recovery playbook available if helpful to the team. Happy to provide events.jsonl samples or session state artifacts from crashed sessions.