fix(critical): Add tool-mutex plugin to prevent Wof.sys BSOD caused by parallel fs enumeration #35710
VRDate wants to merge 14 commits into anthropics:main
… crashes

Addresses anthropics#32870, where parallel filesystem-heavy tool calls (Glob, Grep, Read, Bash) trigger a Windows Wof.sys BSOD via intensive NtQueryDirectoryFileEx syscalls. The plugin uses a file-based counting semaphore to limit concurrent filesystem operations:
- Windows: max 1 concurrent operation (full serialization)
- Other platforms: max 4 concurrent operations (light throttling)
- Configurable via CLAUDE_TOOL_MUTEX_MAX_CONCURRENT env var
- Disableable via CLAUDE_TOOL_MUTEX_DISABLED=1
- Automatic stale slot cleanup after 120s to prevent deadlocks

https://claude.ai/code/session_01TyTbGq1fkZgXsUcLwwEnXz
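As a rough illustration of the file-based counting semaphore described in this commit, here is a minimal sketch. It is not the plugin's actual code: the function names `try_acquire` and `release`, the `slot-N.lock` naming, and the slot file contents are all assumptions made for the example. The only technique it relies on is that `os.open` with `O_CREAT | O_EXCL` fails if the file already exists, so each slot file can be held by one process at a time.

```python
import os
import tempfile
import time
from typing import Optional


def try_acquire(slot_dir: str, max_concurrent: int) -> Optional[str]:
    """Claim the first free slot file; return its path, or None if all are taken.

    Hypothetical helper: slot naming and file contents are illustrative only.
    """
    os.makedirs(slot_dir, exist_ok=True)
    for i in range(max_concurrent):
        path = os.path.join(slot_dir, f"slot-{i}.lock")
        try:
            # O_CREAT | O_EXCL raises FileExistsError if another process holds the slot.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue
        with os.fdopen(fd, "w") as f:
            # Record owner PID and timestamp so stale slots could be detected later.
            f.write(f"{os.getpid()} {time.time()}\n")
        return path
    return None


def release(path: str) -> None:
    """Free a slot; tolerate the file having already been cleaned up."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp(prefix="tool-mutex-demo-")
    first = try_acquire(demo_dir, 1)
    second = try_acquire(demo_dir, 1)  # None: the single slot is already held
    release(first)
```

A caller that gets `None` back would presumably sleep briefly and retry, which is what turns this into a queue rather than a hard failure.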
Introduces a cooldown delay before releasing a mutex slot, giving the OS kernel time to settle between consecutive directory enumerations. This further mitigates the Windows Wof.sys BSOD by spacing out filesystem ops. Default: 75ms. Configurable via CLAUDE_TOOL_MUTEX_RELEASE_DELAY_MS env var, clamped to [15ms, 1000ms]. https://claude.ai/code/session_01TyTbGq1fkZgXsUcLwwEnXz
The cooldown delay must gate the start of each filesystem operation (PreToolUse), not the cleanup (PostToolUse which fires too late). The 75ms delay now runs in acquire() right before allowing the tool to proceed. https://claude.ai/code/session_01TyTbGq1fkZgXsUcLwwEnXz
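The cooldown behaviour described in the two commits above (75ms default, clamped to [15ms, 1000ms], applied before the tool proceeds) could look roughly like the sketch below. The helper names `release_delay_ms` and `cooldown` are hypothetical, and falling back to the default on an unparsable value is an assumption; the commits do not say how invalid input is handled.

```python
import os
import time

DEFAULT_DELAY_MS = 75                   # default stated in the commit message
MIN_DELAY_MS, MAX_DELAY_MS = 15, 1000   # clamp range stated in the commit message


def release_delay_ms(env=os.environ) -> int:
    """Read CLAUDE_TOOL_MUTEX_RELEASE_DELAY_MS and clamp it to the allowed range."""
    raw = env.get("CLAUDE_TOOL_MUTEX_RELEASE_DELAY_MS")
    try:
        value = int(raw) if raw is not None else DEFAULT_DELAY_MS
    except ValueError:
        # Assumption: unparsable values fall back to the default.
        value = DEFAULT_DELAY_MS
    return max(MIN_DELAY_MS, min(MAX_DELAY_MS, value))


def cooldown(env=os.environ) -> None:
    """Sleep for the configured delay; intended to run at the end of acquire()."""
    time.sleep(release_delay_ms(env) / 1000.0)
```

Passing the environment as a parameter keeps the clamp logic easy to check without touching the real process environment.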
The Windows Wof.sys crash (issue anthropics#32870) is triggered by Node.js fs APIs, not Python. Added load_test_node.js, which reproduces the exact I/O pattern using worker_threads + fs.readdir/stat/glob. It confirmed an OOM-kill at 1024 workers and 97% failure at 256 workers without the mutex, vs 100% success with mutex-simulated batching.

Enhanced Python load_test.py with:
- CPU core count detection
- Free memory monitoring (start/min/end) during test runs
- System info banner (platform, arch, cores, memory)

https://claude.ai/code/session_01TyTbGq1fkZgXsUcLwwEnXz
Documents the root cause analysis from 26+ BSODs on a 192GB/32-core Windows workstation, the Node.js-specific nature of the vulnerability, and recommended safe defaults per platform. https://claude.ai/code/session_01TyTbGq1fkZgXsUcLwwEnXz
Addresses code review feedback:
1. PID liveness check (os.kill(pid, 0)) for immediate stale slot recovery: no more 2-minute wait after process crashes
2. Document why file-based semaphore (hooks are separate processes, in-memory state doesn't persist across invocations)
3. Document why 75ms cooldown (empirically tested: 50ms unstable under sustained load, 100ms+ adds latency with no benefit)
4. Clarify Wof.sys scope: loaded on all Windows 10/11, not just WIMBoot/CompactOS configurations

https://claude.ai/code/session_01TyTbGq1fkZgXsUcLwwEnXz
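The PID liveness check mentioned in item 1 can be sketched as follows. The function name `pid_alive` is hypothetical. The sketch assumes POSIX semantics, where signal 0 performs only the existence and permission checks. Python's documentation describes different behaviour for `os.kill` on Windows (values other than the CTRL events are routed through TerminateProcess), so a Windows code path would likely need a different probe; how the plugin handles that is not shown in this PR text.

```python
import os


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID appears to exist (POSIX semantics assumed)."""
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)  # signal 0: no signal is delivered, only error checking happens
    except ProcessLookupError:
        return False     # no such process, so the slot owner is gone
    except PermissionError:
        return True      # process exists but belongs to another user
    return True
```

A stale-slot sweep would then remove any slot file whose recorded PID fails this check, keeping the 120s age limit as a fallback for slot files whose metadata cannot be parsed.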
…nfig

Default max_concurrent = os.cpu_count() // 2 instead of hardcoded 1 (Windows) / 4 (other). Env var override can only cap down, never increase above auto-detected default. CLAUDE_TOOL_MUTEX_DISABLED replaced by CLAUDE_TOOL_MUTEX_MAX_CONCURRENT=0 with stderr warning. Add mutex-throttle shell-based alternative.
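The resolution rules in this commit (default of half the cores, env var may only lower the limit, 0 disables with a stderr warning) might be expressed like the sketch below. The function name `resolve_max_concurrent`, the warning text, the floor of 1 on single-core machines, and the handling of negative or unparsable values are assumptions for illustration, not details taken from the plugin.

```python
import os
import sys


def resolve_max_concurrent(env=os.environ, cpu_count=None) -> int:
    """Return the concurrency limit; 0 means the mutex is disabled."""
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    default = max(1, cores // 2)  # assumption: never auto-detect below 1
    raw = env.get("CLAUDE_TOOL_MUTEX_MAX_CONCURRENT")
    if raw is None:
        return default
    try:
        requested = int(raw)
    except ValueError:
        return default            # assumption: ignore unparsable values
    if requested == 0:
        print("tool-mutex: disabled via CLAUDE_TOOL_MUTEX_MAX_CONCURRENT=0",
              file=sys.stderr)
        return 0
    if requested < 0:
        return default            # assumption: ignore negative values
    # The override can only cap down, never raise above the auto-detected default.
    return min(requested, default)
```

With these rules a 32-core machine resolves to 16 by default, an override of 4 yields 4, and an override of 64 is capped back to 16.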
Friendly ping: this PR has been open since March 17 with no review activity.

The problem is real: 27 Wof.sys BSOD crashes (documented with minidumps) on Windows 11 Build 26200 caused by concurrent …

The fix is simple: a pre-tool-use hook that acquires a semaphore before filesystem-touching tools, throttling to …

Impact: Any Windows user with WOF-compressed volumes (CompactOS, default on many installs) is vulnerable. The plugin is opt-in and self-configuring.

Bug reports: Feedback Hub, #32870, #30137, MS Q&A #5814272

Would appreciate a review from any maintainer. cc @anthropics/claude-code-team
Maintainer attention requested: this is a critical stability fix, not a feature. @anthropics/claude-code-maintainers @anthropics/claude-code-reviewers @claude

27 documented kernel crashes (BSODs) on Windows caused by unthrottled parallel tool calls. The plugin prevents all future occurrences. Zero BSODs in 3+ days since installation, running 24/7 with heavy parallel workloads (16+ concurrent Glob/Grep/Read calls).

This affects every Windows user with WOF compression (default on many Windows installs). Without this fix, Claude Code is a stability risk on Windows.
Testing Results: BSOD Prevention Confirmed

Maintainer attention requested: this is a critical stability fix, not a feature. @anthropics/claude-code-maintainers @anthropics/claude-code-reviewers @claude

This PR has been open 5 days with no maintainer response. The issue is a kernel-level DoS on Windows: Claude Code's unthrottled parallel tool calls trigger unlimited …

27 confirmed crashes on my workstation. 0 since the plugin was installed (5 days continuous, same machine, same workloads).

The fix is opt-in (plugin only, zero changes to core), self-configuring (…

Happy to split the PR, reduce scope, or answer any questions. Just need one reviewer.

I've been running the tool-mutex plugin continuously on my Windows workstation for several days with no BSODs. Prior to this fix, the same machine experienced 27 BSODs (9 distinct bugcheck types) during normal Claude Code usage.

Test Environment
Before Fix (26+ BSODs)
After Fix (0 BSODs)
Load Test Results (Linux, 4-core / 16 GB)
Node.js load test (16 workers)
Python load test (16 workers, 160 total ops)
Previous 256-worker stress test (with vs without mutex)
At 1024 workers without mutex, Node.js is OOM-killed (exit code 137).

What the fix does
Why this matters
Related issues
Recommendation: This fix is stable and ready to merge. The plugin has been validated both through automated load tests and real-world daily usage on the affected hardware. Zero BSODs since deployment.
tool-mutex plugin session stats (2026-03-22)

11-hour continuous session, zero crashes.
The plugin throttles parallel Glob/Grep/Read/Bash calls to …

Machine: Win11 Build 26200, 192GB RAM, 32 cores, RTX 5000 Ada, NVIDIA 595.79.
@anthropics/claude-code-maintainers @anthropics/claude-code-reviewers @claude
Critical Bug Fix — Windows BSOD (Wof.sys)
Fixes #32870
Root cause
Claude Code executes Glob, Grep, Read, and Bash tools in parallel with no concurrency limit. Each tool call triggers Node.js fs.readdir/fs.stat/fs.glob, issuing concurrent NtQueryDirectoryFileEx syscalls. On Windows, this overwhelms the Wof.sys (Windows Overlay Filter) kernel driver, which is present on all Windows 10/11 installations, causing a Blue Screen of Death.

Diagnosed on a 192GB RAM / 32-core CPU / 15GB NVIDIA Ada 5000 GPU workstation that experienced 26+ BSODs. Memory dump analysis confirmed the crash originates in Wof.sys from parallel directory enumeration by Node.js.

The vulnerability
- NtQueryDirectoryFileEx syscalls
- Node.js fs specific: Python os APIs handle 1024 workers without issue

Fix
Adds a tool-mutex plugin with a file-based counting semaphore that queues concurrent filesystem operations:
- Default max_concurrent: os.cpu_count() // 2 (e.g. 16 on 32-core, 2 on 4-core), which scales to hardware automatically
- Stale slot recovery: PID liveness check via os.kill(pid, 0), with 120s time-based fallback for corrupted metadata
- Disable with CLAUDE_TOOL_MUTEX_MAX_CONCURRENT=0 (warns on every tool call)

Why file-based semaphore (not in-memory)?
Claude Code hooks execute as separate Python processes: each PreToolUse/PostToolUse spawns a new python3 process. In-memory state (asyncio.Semaphore, threading.Lock) does not survive across invocations. File-based is the only mechanism that works with the plugin hook architecture.

Verified results (Node.js load test)
Configuration
- CLAUDE_TOOL_MUTEX_MAX_CONCURRENT: default cpu_count // 2; 0 to disable (warns on every tool call)
- CLAUDE_TOOL_MUTEX_RELEASE_DELAY_MS: default 75

Test plan
- Node.js fs APIs crash at 256+ concurrent workers (OOM-kill, exit 137)
- Python os APIs handle 1024 workers fine (crash is Node.js-specific)

Evidence
Development session
https://claude.ai/code/session_01TyTbGq1fkZgXsUcLwwEnXz
https://www.perplexity.ai/search/https-claude-ai-code-session-0-IRtoFcGISwKZCKBtxsD7Uw