Summary
On a large project (the PyTorch repo: ~2,430 modules / ~1.32M lines), a cold full-project
analysis spends the majority of its CPU in filesystem syscalls during import/module
resolution, not in type checking. On Windows this is heavily amplified because every probe
traverses the NTFS minifilter stack (Microsoft Defender + cloud-files / OneDrive filter).
I profiled a locally-built, symbolized release binary with an ETW sampling profiler (samply)
and symbolicated against the matching PDB. Sharing the breakdown plus a few mitigation ideas.
Environment
- OS: Windows 11 (NTFS; Microsoft Defender real-time protection on; OneDrive cloud-files filter active)
- Pyrefly: locally built target/release with debug symbols
- Workload: PyTorch repository, ~2,430 modules, ~1,323,529 lines, ~17,760 type errors
- Repro: pyrefly check (same engine the TSP/IDE server runs), CPU-weighted via per-sample CPU time
Methodology
- pyrefly check --report-timings for phase/module timing
- samply (ETW, elevated) for a full CPU sampling profile
- Symbolicated pyrefly.exe frames against pyrefly.pdb; aggregated self/inclusive time using
per-sample CPU deltas (so idle/blocked samples are not counted)
Phase timing (--report-timings, thread-time)
- Solutions (type solving): ~28.4s (78%)
- Answers (bindings): ~6.8s (19%)
- Ast (parse): ~1.0s
- Exports: ~0.1s
No single pathological module — heaviest was torch.testing._internal.common_methods_invocations
at ~1.6s; cost is a long tail spread across all modules.
Where the CPU actually goes (sampling self-time, CPU-weighted, ~94 CPU-seconds total)
- ~57% Filesystem / kernel I/O: ntoskrnl ~28%, Ntfs.sys ~12%, ntdll ~11%, FLTMGR + minifilters
- ~29% Pyrefly type-checking compute (solver / bindings / types); each function <1% self, long tail
- ~8% Allocator churn (mimalloc: Type clone/drop, alloc/free)
- ~2.7% Antivirus (Defender mssecflt.sys / WdFilter.sys)
- Remainder: memcpy/memset, glob matching, path UTF-16 conversion
Root cause
- ~62% of all CPU is inclusive under import/module resolution, along the path
LoaderFindCache::find_import -> find_import_internal -> find_one_part_in_root ->
std::fs::metadata / File::open_native / DirEntryCache::file_exists.
- ~96% of all filesystem self-time is under import resolution.
- The dominant cost is per-candidate filesystem probing to locate modules (stat/open each
candidate path across each search root), not reading file contents (parse is <1s) and not
pure type math.
- Amplifiers:
- Large search path (project + typeshed + site-packages + many namespace packages) means each
of ~2,430 modules is probed across many candidate roots.
- On Windows each metadata/open traverses NTFS + Defender (mssecflt/WdFilter) + the
cloud-files filter (cldflt, OneDrive), so kernel/minifilter time dominates per syscall.
- Note: a DirEntryCache exists, yet std::fs::metadata is still ~52% inclusive, suggesting many
probes stat per-candidate before (or instead of) consulting a cached directory listing.
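To illustrate how per-candidate probing multiplies, here is a rough sketch of the worst-case probe count when every candidate costs one syscall and nothing is cached. The candidate forms and the helper names (candidates, worst_case_probes) are assumptions made for illustration; they are not taken from Pyrefly's source, and the real resolver likely short-circuits on the first hit.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical candidate forms for one module-name component inside `dir`.
/// Assumption: the real candidate list and its order may differ.
pub fn candidates(dir: &Path, part: &str) -> Vec<PathBuf> {
    vec![
        dir.join(part).join("__init__.pyi"), // stub package
        dir.join(part).join("__init__.py"),  // regular package
        dir.join(format!("{part}.pyi")),     // stub module
        dir.join(format!("{part}.py")),      // source module
        dir.join(part),                      // namespace package directory
    ]
}

/// Worst-case number of filesystem probes when each dotted import is tried
/// against every search root, one probe per candidate, with no caching and
/// no early exit.
pub fn worst_case_probes(imports: &[&str], roots: &[&Path]) -> usize {
    let mut probes = 0;
    for import in imports {
        for root in roots {
            let mut dir = root.to_path_buf();
            for part in import.split('.') {
                probes += candidates(&dir, part).len();
                dir.push(part);
            }
        }
    }
    probes
}
```

Under these assumptions the count grows as (name components) x (search roots) x (candidate forms), which is why a long search path hurts even when most probes miss.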
IDE / TSP amplification
The same workspace opened in VS Code (Pylance spawning pyrefly tsp) accumulated ~1,196
CPU-seconds and ~2.35 GB before going idle — roughly 12x the one-shot CLI (94 CPU-s). This
appears to be the IDE server pulling far more modules into scope (workspace indexing for
completions/auto-import), i.e. the same bottleneck amplified by scale, not a different one.
Suggested mitigations
- Collapse per-candidate probing into one directory enumeration per directory. Instead of
stat/exists on each candidate file, readdir the directory once into an in-memory set and
answer all candidates from it — resolving N modules in a directory becomes ~1 syscall, not N.
(The existing DirEntryCache seems to be bypassed on the hot path.)
- Cache negative lookups (module-not-found per root) so the same missing candidates are not
re-probed across many importers.
- Windows-specific: prefer directory enumeration (FindFirstFile/FindNextFile) over
per-file CreateFile/stat to reduce the number of minifilter/Defender traversals.
- Environmental (for users, not a code fix): Defender real-time scanning and the OneDrive
cloud-files filter measurably inflate per-syscall cost; analyzing a project outside a
cloud-synced folder and/or excluding it from real-time scanning reduces wall time noticeably.
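The first two mitigations could be sketched roughly as follows. This is a minimal illustration, not Pyrefly's existing DirEntryCache: the DirListingCache type and its enumerations counter are hypothetical names. It enumerates a directory once with std::fs::read_dir, answers later candidate checks from memory, and remembers missing directories as a negative result.

```rust
use std::collections::{HashMap, HashSet};
use std::ffi::{OsStr, OsString};
use std::fs;
use std::path::{Path, PathBuf};

/// Hypothetical cache: one enumeration per directory, reused for every
/// candidate file name looked up in that directory.
#[derive(Default)]
pub struct DirListingCache {
    /// `None` is a cached negative result (directory missing or unreadable).
    listings: HashMap<PathBuf, Option<HashSet<OsString>>>,
    /// Number of real directory enumerations performed so far.
    pub enumerations: usize,
}

impl DirListingCache {
    /// Returns true if `dir` has an entry named exactly `name`.
    /// Only the first call for a given `dir` touches the filesystem.
    pub fn contains(&mut self, dir: &Path, name: &str) -> bool {
        if !self.listings.contains_key(dir) {
            self.enumerations += 1;
            let listing = fs::read_dir(dir).ok().map(|entries| {
                entries
                    .filter_map(|entry| entry.ok())
                    .map(|entry| entry.file_name())
                    .collect::<HashSet<OsString>>()
            });
            self.listings.insert(dir.to_path_buf(), listing);
        }
        match &self.listings[dir] {
            Some(names) => names.contains(OsStr::new(name)),
            None => false,
        }
    }
}
```

The sketch matches names exactly and never invalidates entries. A real implementation would need a case-handling policy on Windows and an invalidation path for the long-running IDE server, where files change underneath the cache.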
Notes
- Symbol names are from a symbolicated release build and may differ slightly from source.
- Happy to share the profile artifacts or re-run with specific flags if useful.