
fix(critical): Add tool-mutex plugin to prevent Wof.sys BSOD caused by parallel fs enumeration #35710

Open
VRDate wants to merge 14 commits into anthropics:main from VRDate:claude/add-tool-mutex-Qytsn

VRDate commented Mar 18, 2026

Critical Bug Fix — Windows BSOD (Wof.sys)

Fixes #32870

Root cause

Claude Code executes Glob, Grep, Read, and Bash tools in parallel with no concurrency limit. Each tool call triggers Node.js fs.readdir/fs.stat/fs.glob, issuing concurrent NtQueryDirectoryFileEx syscalls. On Windows, this overwhelms the Wof.sys (Windows Overlay Filter) kernel driver — present on all Windows 10/11 installations — causing a Blue Screen of Death.

Diagnosed on a 192GB RAM / 32-core CPU / 15GB NVIDIA Ada 5000 GPU workstation that experienced 26+ BSODs. Memory dump analysis confirmed the crash originates in Wof.sys from parallel directory enumeration by Node.js.

The vulnerability

  • No concurrency limit on filesystem tool calls — unlimited parallel NtQueryDirectoryFileEx syscalls
  • On Windows: denial-of-service against the host OS — a single session can BSOD the machine
  • On Linux: 256 concurrent Node.js fs workers consume 16GB+ RAM (near OOM), and at 1024 workers the process is OOM-killed (exit 137)
  • Node.js fs specific — Python os APIs handle 1024 workers without issue

Fix

Adds a tool-mutex plugin with a file-based counting semaphore that queues concurrent filesystem operations:

  • Auto-detected concurrency: os.cpu_count() // 2 (e.g. 16 on 32-core, 2 on 4-core) — scales to hardware automatically
  • 75ms cooldown between operations (empirically tested: 50ms unstable under sustained load, 100ms+ adds latency with no benefit)
  • PID-based stale slot cleanup — dead-process slots freed immediately via os.kill(pid, 0), with 120s time-based fallback for corrupted metadata
  • Disable with CLAUDE_TOOL_MUTEX_MAX_CONCURRENT=0 (warns on every tool call)

Why file-based semaphore (not in-memory)?

Claude Code hooks execute as separate Python processes — each PreToolUse/PostToolUse spawns a new python3 process. In-memory state (asyncio.Semaphore, threading.Lock) does not survive across invocations. File-based is the only mechanism that works with the plugin hook architecture.
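The cross-process constraint above can be illustrated with a minimal sketch of a file-based counting semaphore. This is not the plugin's actual code: the slot directory layout, the `slot-N.lock` naming, and the `try_acquire`/`release` helpers are hypothetical. It relies only on the fact that `os.open` with `O_CREAT | O_EXCL` fails atomically when the file already exists, so separate hook processes can compete for a fixed number of slots without shared memory.

```python
import os
import time
from typing import Optional


def try_acquire(slot_dir: str, max_slots: int) -> Optional[int]:
    """Try to claim one of `max_slots` slot files; return its index or None."""
    os.makedirs(slot_dir, exist_ok=True)
    for index in range(max_slots):
        path = os.path.join(slot_dir, f"slot-{index}.lock")
        try:
            # O_CREAT | O_EXCL fails if the file exists, so only one
            # process can win a given slot.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue
        with os.fdopen(fd, "w") as handle:
            # Record owner PID and creation time for later stale-slot checks.
            handle.write(f"{os.getpid()} {time.time()}\n")
        return index
    return None  # every slot is taken; the caller should wait and retry


def release(slot_dir: str, index: int) -> None:
    """Free a slot; ignore the case where it was already removed."""
    try:
        os.remove(os.path.join(slot_dir, f"slot-{index}.lock"))
    except FileNotFoundError:
        pass
```

A caller would typically loop on `try_acquire` with a short sleep until a slot index is returned, then call `release` with that index once the tool call has finished.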

Verified results (Node.js load test)

| Metric | No Mutex (256 workers) | With Mutex (256 workers) |
| --- | --- | --- |
| Completed | 7/256 (2.7%) | 256/256 (100%) |
| Peak RSS | 16,272 MB | 290 MB |
| Min free mem | 35 MB (near OOM) | 15,857 MB |
| Crashes | YES (249 timeouts) | None |

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| CLAUDE_TOOL_MUTEX_MAX_CONCURRENT | cpu_count // 2 | Cap-down override only: can reduce below the auto-detected default, never increase above it. Set to 0 to disable (warns on every tool call) |
| CLAUDE_TOOL_MUTEX_RELEASE_DELAY_MS | 75 | Cooldown between operations (ms), range 15–1000 |
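The configuration rules above can be sketched as two small resolver functions. The function names, the handling of unparsable or negative values, and the floor of 1 slot on single-core machines are assumptions made for illustration; only the environment variable names, the `cpu_count // 2` default, the cap-down rule, the `0` disable value, and the 15–1000 ms clamp come from this PR.

```python
import os
from typing import Mapping, Optional


def resolve_max_concurrent(env: Optional[Mapping[str, str]] = None,
                           cpu_count: Optional[int] = None) -> int:
    """Return the slot count; 0 means the mutex is disabled."""
    env = os.environ if env is None else env
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 2)
    auto = max(1, cores // 2)  # assumption: never fall below one slot
    raw = env.get("CLAUDE_TOOL_MUTEX_MAX_CONCURRENT")
    if raw is None:
        return auto
    try:
        requested = int(raw)
    except ValueError:
        return auto  # assumption: ignore unparsable values
    if requested == 0:
        return 0  # explicit disable
    if requested < 0:
        return auto  # assumption: ignore negative values
    return min(requested, auto)  # cap-down only, never above the default


def resolve_delay_ms(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the cooldown in milliseconds, clamped to [15, 1000]."""
    env = os.environ if env is None else env
    try:
        value = int(env.get("CLAUDE_TOOL_MUTEX_RELEASE_DELAY_MS", "75"))
    except ValueError:
        value = 75  # assumption: fall back to the default
    return max(15, min(1000, value))
```

On the 32-core workstation described here, a request for 64 slots would resolve to 16, while a request for 4 would be honored as-is.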

Test plan

  • Confirmed 26+ BSODs on Windows 192GB/32-core workstation from parallel fs enumeration
  • Analyzed Windows memory dumps — crash in Wof.sys from NtQueryDirectoryFileEx
  • Node.js fs APIs fail at 256 concurrent workers (97% timeouts, near OOM) and are OOM-killed at 1024 workers (exit 137)
  • Python os APIs handle 1024 workers fine (crash is Node.js-specific)
  • Mutex batching keeps RSS at 290MB vs 16GB unthrottled, 100% completion
  • PID-based stale slot cleanup — dead-process slots freed immediately, no 2-min lockout
  • Verified BSOD prevention on affected Windows 192GB/32-core workstation with plugin installed — no BSODs since deployment

Evidence

Development session

https://claude.ai/code/session_01TyTbGq1fkZgXsUcLwwEnXz

https://www.perplexity.ai/search/https-claude-ai-code-session-0-IRtoFcGISwKZCKBtxsD7Uw

claude added 10 commits March 18, 2026 06:58
… crashes

Addresses anthropics#32870 where parallel filesystem-heavy tool calls (Glob, Grep,
Read, Bash) trigger Windows Wof.sys BSOD via intensive NtQueryDirectoryFileEx
syscalls. The plugin uses a file-based counting semaphore to limit concurrent
filesystem operations:
- Windows: max 1 concurrent operation (full serialization)
- Other platforms: max 4 concurrent operations (light throttling)
- Configurable via CLAUDE_TOOL_MUTEX_MAX_CONCURRENT env var
- Disableable via CLAUDE_TOOL_MUTEX_DISABLED=1
- Automatic stale slot cleanup after 120s to prevent deadlocks

https://claude.ai/code/session_01TyTbGq1fkZgXsUcLwwEnXz

Introduces a cooldown delay before releasing a mutex slot, giving the OS
kernel time to settle between consecutive directory enumerations. This
further mitigates the Windows Wof.sys BSOD by spacing out filesystem ops.

Default: 75ms. Configurable via CLAUDE_TOOL_MUTEX_RELEASE_DELAY_MS env var,
clamped to [15ms, 1000ms].

https://claude.ai/code/session_01TyTbGq1fkZgXsUcLwwEnXz

The cooldown delay must gate the start of each filesystem operation
(PreToolUse), not the cleanup (PostToolUse which fires too late).
The 75ms delay now runs in acquire() right before allowing the tool
to proceed.
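The ordering described in this commit can be sketched as a small dispatcher: the cooldown runs on the PreToolUse side, right after a slot is acquired, and PostToolUse only releases. The `handle_hook` function, the injected `acquire`/`release` callables, and the payload field names `hook_event_name` and `tool_name` are assumptions for illustration, not the plugin's actual interface.

```python
import time
from typing import Callable, Mapping

# Tools named in this PR as filesystem-heavy.
THROTTLED_TOOLS = {"Glob", "Grep", "Read", "Bash"}


def handle_hook(payload: Mapping[str, str],
                acquire: Callable[[], None],
                release: Callable[[], None],
                delay_ms: int = 75,
                sleep: Callable[[float], None] = time.sleep) -> str:
    """Route one hook invocation; return what was done."""
    if payload.get("tool_name") not in THROTTLED_TOOLS:
        return "skipped"
    event = payload.get("hook_event_name")
    if event == "PreToolUse":
        acquire()
        # The cooldown gates the start of the operation, so it runs here
        # rather than during PostToolUse cleanup.
        sleep(delay_ms / 1000.0)
        return "acquired"
    if event == "PostToolUse":
        release()
        return "released"
    return "skipped"
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.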

https://claude.ai/code/session_01TyTbGq1fkZgXsUcLwwEnXz

The Windows Wof.sys crash (issue anthropics#32870) is triggered by Node.js fs APIs,
not Python. Added load_test_node.js that reproduces the exact I/O pattern
using worker_threads + fs.readdir/stat/glob — confirmed OOM-kill at 1024
workers and 97% failure at 256 workers without mutex, vs 100% success
with mutex-simulated batching.

Enhanced Python load_test.py with:
- CPU core count detection
- Free memory monitoring (start/min/end) during test runs
- System info banner (platform, arch, cores, memory)
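The load-test enhancements listed above could look roughly like the following sketch. The helper names are hypothetical, and the free-memory probe uses `os.sysconf("SC_AVPHYS_PAGES")`, which is available on Linux but not on every platform, so it returns `None` when the value cannot be read. The actual load_test.py may gather these numbers differently.

```python
import os
import platform
from typing import Dict, List, Optional


def system_banner() -> str:
    """One-line summary of platform, architecture, and core count."""
    return (f"platform={platform.system()} "
            f"arch={platform.machine()} "
            f"cores={os.cpu_count()}")


def free_memory_mb() -> Optional[float]:
    """Available physical memory in MB, or None if the OS does not expose it."""
    try:
        pages = os.sysconf("SC_AVPHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    return pages * page_size / (1024 * 1024)


def summarize(samples: List[float]) -> Dict[str, float]:
    """Reduce periodic free-memory samples to start/min/end."""
    return {"start": samples[0], "min": min(samples), "end": samples[-1]}
```

A test run would call `free_memory_mb()` periodically while workers are active and pass the collected samples to `summarize` for the final report.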

https://claude.ai/code/session_01TyTbGq1fkZgXsUcLwwEnXz

Documents the root cause analysis from 26+ BSODs on a 192GB/32-core
Windows workstation, the Node.js-specific nature of the vulnerability,
and recommended safe defaults per platform.

https://claude.ai/code/session_01TyTbGq1fkZgXsUcLwwEnXz

Addresses code review feedback:

1. PID liveness check (os.kill(pid, 0)) for immediate stale slot
   recovery — no more 2-minute wait after process crashes
2. Document why file-based semaphore (hooks are separate processes,
   in-memory state doesn't persist across invocations)
3. Document why 75ms cooldown (empirically tested: 50ms unstable
   under sustained load, 100ms+ adds latency with no benefit)
4. Clarify Wof.sys scope: loaded on all Windows 10/11, not just
   WIMBoot/CompactOS configurations
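Item 1 above, combined with the 120s fallback, can be sketched as follows. The `pid_alive` and `is_stale` helpers are hypothetical, and the sketch assumes POSIX semantics, where signal 0 only checks that the process exists. Python's documentation describes different behavior for `os.kill` on Windows, so a Windows implementation would likely need a different liveness probe.

```python
import os
import time
from typing import Optional


def pid_alive(pid: int) -> bool:
    """POSIX-only liveness probe: signal 0 performs error checking only."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: the slot owner is gone
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_stale(owner_pid: Optional[int], created_at: float,
             now: Optional[float] = None, max_age_s: float = 120.0) -> bool:
    """Decide whether a slot can be reclaimed."""
    now = time.time() if now is None else now
    if owner_pid is None:
        # Corrupted metadata: fall back to the time-based rule.
        return now - created_at > max_age_s
    # Readable metadata: reclaim immediately if the owner has exited.
    return not pid_alive(owner_pid)
```

With readable metadata the age of the slot is irrelevant: a live owner keeps its slot, and a dead owner loses it at once.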

https://claude.ai/code/session_01TyTbGq1fkZgXsUcLwwEnXz

…nfig

Default max_concurrent = os.cpu_count() // 2 instead of hardcoded
1 (Windows) / 4 (other). Env var override can only cap down, never
increase above auto-detected default. CLAUDE_TOOL_MUTEX_DISABLED
replaced by CLAUDE_TOOL_MUTEX_MAX_CONCURRENT=0 with stderr warning.
Add mutex-throttle shell-based alternative.

VRDate commented Mar 20, 2026

Friendly ping — this PR has been open since March 17 with no review activity.

The problem is real: 27 Wof.sys BSOD crashes (documented with minidumps) on Windows 11 Build 26200 caused by concurrent NtQueryDirectoryFileEx calls from Claude Code's parallel Glob/Grep/Read tool execution hitting the WOF minifilter.

The fix is simple: a pre-tool-use hook that acquires a semaphore before filesystem-touching tools, throttling to cpu_count // 2 concurrent calls. Zero BSODs since installation (3+ days stable).

Impact: Any Windows user with WOF-compressed volumes (CompactOS, default on many installs) is vulnerable. The plugin is opt-in and self-configuring.

Bug reports: Feedback Hub, #32870, #30137, MS Q&A #5814272

Would appreciate a review from any maintainer. cc @anthropics/claude-code-team


VRDate commented Mar 20, 2026

Maintainer attention requested — this is a critical stability fix, not a feature.

@anthropics/claude-code-maintainers @anthropics/claude-code-reviewers @claude
@bcherny @ant-kurt @fvolcic @ashwin-ant @bogini @chrislloyd @ThariqS @catherinewu @whyuan-cc @dhollman @rboyce-ant @dicksontsai @OctavianGuzu @hackyon-anthropic @wolffiex @igorkofman @sid374 @ddworken

27 documented kernel crashes (BSODs) on Windows caused by unthrottled parallel tool calls. The plugin prevents all future occurrences. Zero BSODs in 3+ days since installation, running 24/7 with heavy parallel workloads (16+ concurrent Glob/Grep/Read calls).

This affects every Windows user with WOF compression (default on many Windows installs). Without this fix, Claude Code is a stability risk on Windows.


VRDate commented Mar 22, 2026

Testing Results — BSOD Prevention Confirmed

Maintainer attention requested — this is a critical stability fix, not a feature.

@anthropics/claude-code-maintainers @anthropics/claude-code-reviewers @claude
@bcherny @ant-kurt @fvolcic @ashwin-ant @bogini @chrislloyd @ThariqS @catherinewu @whyuan-cc @dhollman @rboyce-ant @dicksontsai @OctavianGuzu @hackyon-anthropic @wolffiex @igorkofman @sid374 @ddworken
@OctavianGuzu @chrislloyd @ant-kurt — requesting a direct review from one of you.

This PR has been open 5 days with no maintainer response. The issue is a kernel-level DoS on Windows: Claude Code's unthrottled parallel tool calls trigger unlimited NtQueryDirectoryFileEx syscalls into Wof.sys, causing BSODs on all Windows 10/11 machines.

27 confirmed crashes on my workstation. 0 since the plugin was installed (5 days continuous, same machine, same workloads).

The fix is opt-in (plugin only, zero changes to core), self-configuring (cpu_count // 2), and fully documented with kernel minidumps.

Happy to split the PR, reduce scope, or answer any questions. Just need one reviewer.

I've been running the tool-mutex plugin continuously on my Windows workstation for several days with no BSODs. Prior to this fix, the same machine experienced 27 BSODs (9 distinct bugcheck types) during normal Claude Code usage.


Test Environment

| Component | Spec |
| --- | --- |
| OS | Windows 11 |
| CPU | 32-core |
| RAM | 192 GB |
| GPU | NVIDIA Ada 5000 (15 GB VRAM) |
| Wof.sys | Present (default on all Windows 10/11) |

Before Fix (26+ BSODs)

  • 27 confirmed BSODs across 9 distinct bugcheck types, including 0x139, 0x3B, 0x1E, 0x50, 0x14F, 0x10E, 0x20001, and 0xC2
  • All crashes traced to Wof.sys (Windows Overlay Filter) triggered by parallel NtQueryDirectoryFileEx syscalls from Node.js
  • BSODs occurred during normal Claude Code sessions with parallel Glob/Grep/Read/Bash tool calls
  • Minidump evidence (20 unique kernel dumps, 87 MB): https://drive.google.com/file/d/1Iqo8Ey4CjHfGbPMMRxlZVxaVf-N3i5Ab/view?usp=sharing

After Fix (0 BSODs)

  • 0 BSODs over several days of continuous daily use
  • Same workstation, same workloads, same Claude Code usage patterns
  • Plugin running with default configuration (cpu_count // 2 = 16 concurrent slots, 75ms cooldown)

Load Test Results (Linux, 4-core / 16 GB)

Node.js load test (16 workers)

Workers:          16
Ops completed:    16/16 (100%)
Peak RSS:         1,298 MB (main)
Free memory:      16,127 MB start → 14,858 MB min → 15,857 MB end
Crashes:          None
Total time:       25.2s

Python load test (16 workers, 160 total ops)

Workers:          16
Iterations:       10
Ops completed:    160/160 (100%)
Peak RSS:         17.6 MB
Free memory:      15,196 MB start → 14,671 MB min → 14,675 MB end
Crashes:          None
Total time:       25.0s
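For readers who want to reproduce the shape of the Python run above (16 workers, 10 iterations each, 160 operations), here is a minimal sketch. It is not the PR's load_test.py: the use of threads, the `enumerate_dir` and `run_load_test` names, and the choice of `os.scandir` plus a per-entry `stat` as the unit of work are all assumptions.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def enumerate_dir(path: str) -> int:
    """List one directory and stat every entry; return the entry count."""
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            entry.stat(follow_symlinks=False)
            count += 1
    return count


def run_load_test(path: str, workers: int = 16, iterations: int = 10) -> int:
    """Run `iterations` enumerations on each worker; return completed ops."""
    def worker(_: int) -> int:
        done = 0
        for _i in range(iterations):
            enumerate_dir(path)
            done += 1
        return done

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(worker, range(workers)))
```

Pointing `run_load_test` at a large source tree and comparing the returned count with `workers * iterations` gives the completion ratio reported in the results above.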

Previous 256-worker stress test (with vs without mutex)

| Metric | No Mutex (256 workers) | With Mutex (256 workers) |
| --- | --- | --- |
| Completed | 7/256 (2.7%) | 256/256 (100%) |
| Peak RSS | 16,272 MB | 290 MB |
| Min free memory | 35 MB (near OOM) | 15,857 MB |
| Crashes | YES (249 timeouts) | None |

At 1024 workers without mutex, Node.js is OOM-killed (exit code 137).


What the fix does

  • File-based counting semaphore that queues concurrent filesystem tool calls (Glob, Grep, Read, Bash)
  • Default concurrency: os.cpu_count() // 2 — auto-scales to hardware (16 on 32-core, 2 on 4-core)
  • 75ms cooldown between operations (empirically tested — 50ms unstable under sustained load, 100ms+ adds latency with no benefit)
  • PID-based stale slot cleanup — dead-process slots freed immediately, no lockout delay
  • File-based because Claude Code hooks spawn separate Python processes — in-memory locks don't survive across invocations

Why this matters

  • Wof.sys is loaded on all modern Windows 10/11 installations (handles NTFS compression and Compact OS)
  • Any Windows user running Claude Code with parallel tool calls is potentially affected
  • This is effectively a denial-of-service against the host OS — a single session can BSOD the machine
  • The crash is Node.js fs specific — Python os APIs handle 1024+ workers without issue

Related issues


Recommendation: This fix is stable and ready to merge. The plugin has been validated both through automated load tests and real-world daily usage on the affected hardware. Zero BSODs since deployment.


VRDate commented Mar 22, 2026

tool-mutex plugin session stats — 2026-03-22

11-hour continuous session, zero crashes.

| Metric | Value |
| --- | --- |
| Session duration | 11h (04:58–16:15 UTC+2) |
| Commits | 39 |
| Files changed | 5,655 |
| Lines added | 753,054 |
| Lines removed | 13,241 |
| PathWalker tests | 206/206 pass |
| Parallel agents dispatched | 15+ (Sonnet subagents running in background) |
| Per-module stability reports | 132 generated |
| PUML → SVG renders | 58 (PlantUML, zero errors) |
| Native binary rebuilds | 3 (GraalVM 25, ~27MB each) |
| BSODs | 0 (27 total before plugin, 0 since 2026-03-17) |

The plugin throttles parallel Glob/Grep/Read/Bash calls to cpu_count//2 = 16 concurrent. This session had massive parallel I/O — multiple Sonnet agents scanning 36K+ files, generating reports, rebuilding native binaries — all without a single Wof.sys crash.

Machine: Win11 Build 26200, 192GB RAM, 32 cores, RTX 5000 Ada, NVIDIA 595.79.


VRDate commented Mar 24, 2026

@anthropics/claude-code-maintainers @anthropics/claude-code-reviewers @claude
@bcherny @ant-kurt @fvolcic @ashwin-ant @bogini @chrislloyd @ThariqS @catherinewu @whyuan-cc @dhollman @rboyce-ant @dicksontsai @OctavianGuzu @hackyon-anthropic @wolffiex @igorkofman @sid374 @ddworken
@OctavianGuzu @chrislloyd @ant-kurt — requesting a direct review from one of you. Still, no one is assigned to review #35710.

Development

Successfully merging this pull request may close these issues.

[BUG] claude.exe triggers Windows BSOD via Wof.sys during directory listing (NtQueryDirectoryFileEx)

2 participants