High CPU on large projects dominated by filesystem syscalls during import resolution (Windows) #3993

## Summary

On a large project (the PyTorch repo: ~2,430 modules / ~1.32M lines), a cold full-project
analysis spends the **majority of its CPU in filesystem syscalls during import/module
resolution**, not in type checking. On Windows this is heavily amplified because every probe
traverses the NTFS minifilter stack (Microsoft Defender + cloud-files / OneDrive filter).

I profiled a locally-built, symbolized `release` binary with an ETW sampling profiler (samply)
and symbolicated against the matching PDB. Sharing the breakdown plus a few mitigation ideas.

## Environment

- OS: Windows 11 (NTFS; Microsoft Defender real-time protection on; OneDrive cloud-files filter active)
- Pyrefly: locally built `target/release` with debug symbols
- Workload: PyTorch repository, ~2,430 modules, ~1,323,529 lines, ~17,760 type errors
- Repro: `pyrefly check` (the same engine the TSP/IDE server runs); the profile is weighted by per-sample CPU time

## Methodology

- `pyrefly check --report-timings` for phase/module timing
- `samply` (ETW, elevated) for a full CPU sampling profile
- Symbolicated `pyrefly.exe` frames against `pyrefly.pdb`; aggregated self/inclusive time using
  per-sample CPU deltas (so idle/blocked samples are not counted)
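
For readers unfamiliar with CPU-weighted aggregation, the idea can be sketched roughly as below. This is an illustrative sketch, not the script used for the numbers in this report; `Sample`, `FrameCost`, and `aggregate` are hypothetical names, and the sample layout is an assumption rather than samply's actual output format.

```rust
use std::collections::{HashMap, HashSet};

/// One profiler sample: the call stack (outermost frame first) and the CPU
/// time consumed since the previous sample on the same thread.
/// Hypothetical layout, chosen only for this illustration.
pub struct Sample {
    pub stack: Vec<&'static str>,
    pub cpu_delta_ms: f64,
}

/// Per-frame totals in CPU-milliseconds.
#[derive(Default, Debug, Clone, Copy, PartialEq)]
pub struct FrameCost {
    pub self_ms: f64,
    pub inclusive_ms: f64,
}

/// Sums self and inclusive CPU time per frame name.
pub fn aggregate(samples: &[Sample]) -> HashMap<&'static str, FrameCost> {
    let mut totals: HashMap<&'static str, FrameCost> = HashMap::new();
    for sample in samples {
        // Idle or blocked samples carry no CPU delta and contribute nothing.
        if sample.cpu_delta_ms <= 0.0 {
            continue;
        }
        // Self time is charged to the innermost (leaf) frame only.
        if let Some(leaf) = sample.stack.last() {
            totals.entry(*leaf).or_default().self_ms += sample.cpu_delta_ms;
        }
        // Inclusive time is charged to every distinct frame on the stack;
        // the set prevents double counting recursive frames.
        let mut seen = HashSet::new();
        for frame in &sample.stack {
            if seen.insert(*frame) {
                totals.entry(*frame).or_default().inclusive_ms += sample.cpu_delta_ms;
            }
        }
    }
    totals
}
```

Weighting by the CPU delta rather than counting samples is what keeps threads that are parked or waiting on I/O completion from inflating the totals.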

## Phase timing (`--report-timings`, thread-time)

- Solutions (type solving): ~28.4s (78%)
- Answers (bindings): ~6.8s (19%)
- Ast (parse): ~1.0s
- Exports: ~0.1s

No single pathological module — heaviest was `torch.testing._internal.common_methods_invocations`
at ~1.6s; cost is a long tail spread across all modules.

## Where the CPU actually goes (sampling self-time, CPU-weighted, ~94 CPU-seconds total)

- **~57%  Filesystem / kernel I/O** — `ntoskrnl` ~28%, `Ntfs.sys` ~12%, `ntdll` ~11%, `FLTMGR` + minifilters
- **~29%  Pyrefly type-checking compute** (solver / bindings / types) — each function <1% self; long tail
- **~8%   Allocator churn** (mimalloc: `Type` clone/drop, alloc/free)
- **~2.7% Antivirus** (Defender `mssecflt.sys` / `WdFilter.sys`)
- remainder: `memcpy`/`memset`, glob matching, path UTF-16 conversion

## Root cause

- **~62% of all CPU time is spent (inclusive) under import/module resolution**, along the path
  `LoaderFindCache::find_import` -> `find_import_internal` -> `find_one_part_in_root` ->
  `std::fs::metadata` / `File::open_native` / `DirEntryCache::file_exists`.
- **~96% of all filesystem self-time is under import resolution.**
- The dominant cost is **per-candidate filesystem probing to locate modules** (stat/open each
  candidate path across each search root), **not** reading file contents (parse is <1s) and **not**
  pure type math.
- Amplifiers:
  - Large search path (project + typeshed + site-packages + many namespace packages) means each
    of ~2,430 modules is probed across many candidate roots.
  - On Windows each `metadata`/`open` traverses NTFS + Defender (`mssecflt`/`WdFilter`) + the
    cloud-files filter (`cldflt`, OneDrive), so kernel/minifilter time dominates per syscall.
- Note: a `DirEntryCache` exists, yet `std::fs::metadata` is still ~52% inclusive — suggesting many
  probes stat per-candidate *before* (or instead of) consulting a cached directory listing.
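
To make the probing pattern concrete, here is a minimal sketch of a per-candidate resolver. It is not Pyrefly's actual code: `candidate_paths`, `resolve_naive`, and the four candidate layouts are assumptions chosen for illustration, and a real resolver likely checks a different set. The point is only that the number of probes grows as roots x candidates per module.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical candidate layouts for a dotted module name under one root.
pub fn candidate_paths(root: &Path, module: &str) -> Vec<PathBuf> {
    // "pkg.mod" -> "<root>/pkg/mod"
    let mut base = root.to_path_buf();
    for part in module.split('.') {
        base.push(part);
    }
    vec![
        base.join("__init__.pyi"),
        base.join("__init__.py"),
        base.with_extension("pyi"),
        base.with_extension("py"),
    ]
}

/// Probes every candidate in every root until one exists. `exists` stands in
/// for the filesystem check; on disk each call is a separate syscall.
pub fn resolve_naive(
    roots: &[PathBuf],
    module: &str,
    mut exists: impl FnMut(&Path) -> bool,
) -> Option<PathBuf> {
    for root in roots {
        for candidate in candidate_paths(root, module) {
            if exists(&candidate) {
                return Some(candidate);
            }
        }
    }
    None
}
```

Called as `resolve_naive(&roots, "pkg.mod", |p| p.is_file())`, a module that is missing from three roots costs 12 probes under these assumptions, and each probe pays the full minifilter traversal on Windows.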

## IDE / TSP amplification

The same workspace opened in VS Code (Pylance spawning `pyrefly tsp`) accumulated **~1,196
CPU-seconds** and ~2.35 GB before going idle, roughly **12.7x** the one-shot CLI run (~94 CPU-s;
1,196 / 94 ≈ 12.7). This appears to be the IDE server pulling far more modules into scope
(workspace indexing for completions/auto-import), i.e. the **same bottleneck amplified by scale**,
not a different one.

## Suggested mitigations

1. **Collapse per-candidate probing into one directory enumeration per directory.** Instead of
   `stat`/`exists` on each candidate file, `readdir` the directory once into an in-memory set and
   answer all candidates from it — resolving N modules in a directory becomes ~1 syscall, not N.
   (The existing `DirEntryCache` seems to be bypassed on the hot path.)
2. **Cache negative lookups** (module-not-found per root) so the same missing candidates are not
   re-probed across many importers.
3. **Windows-specific:** prefer directory enumeration (`FindFirstFile`/`FindNextFile`) over
   per-file `CreateFile`/stat to reduce the number of minifilter/Defender traversals.
4. **Environmental (for users, not a code fix):** Defender real-time scanning and the OneDrive
   cloud-files filter measurably inflate per-syscall cost; analyzing a project outside a
   cloud-synced folder and/or excluding it from real-time scanning reduces wall time noticeably.
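
Mitigations 1 and 2 can be sketched together as a single listing cache. This is a minimal sketch under stated assumptions, not a description of how the existing `DirEntryCache` works: `DirListingCache`, `entry_exists`, and the `enumerations` counter are hypothetical names, the name comparison is exact (case-sensitive), files and directories are not distinguished, and there is no invalidation, which a long-running IDE server would need.

```rust
use std::collections::{HashMap, HashSet};
use std::ffi::OsString;
use std::fs;
use std::path::{Path, PathBuf};

/// Caches one listing per directory. A directory that is missing or
/// unreadable is cached as an empty set, which doubles as a negative cache.
#[derive(Default)]
pub struct DirListingCache {
    listings: HashMap<PathBuf, HashSet<OsString>>,
    /// Number of real directory enumerations performed (for illustration).
    pub enumerations: usize,
}

impl DirListingCache {
    /// Returns whether `path` names an entry of its parent directory,
    /// enumerating that parent at most once for the lifetime of the cache.
    pub fn entry_exists(&mut self, path: &Path) -> bool {
        let (Some(dir), Some(name)) = (path.parent(), path.file_name()) else {
            return false;
        };
        // A bare relative file name has an empty parent; treat it as ".".
        let dir = if dir.as_os_str().is_empty() { Path::new(".") } else { dir };
        if !self.listings.contains_key(dir) {
            self.enumerations += 1;
            let names: HashSet<OsString> = match fs::read_dir(dir) {
                // Entries that fail to read are skipped.
                Ok(entries) => entries.flatten().map(|e| e.file_name()).collect(),
                // Missing or unreadable directory: remember it as empty.
                Err(_) => HashSet::new(),
            };
            self.listings.insert(dir.to_path_buf(), names);
        }
        self.listings[dir].contains(name)
    }
}
```

With this shape, every candidate in a directory after the first is answered from memory, and repeated lookups for a module missing from a root never touch the filesystem again. It should also cover mitigation 3 if, as I believe, `std::fs::read_dir` is implemented on top of the `FindFirstFile` family on Windows.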

## Notes

- Symbol names are from a symbolicated `release` build and may differ slightly from source.
- Happy to share the profile artifacts or re-run with specific flags if useful.

