Skip to content

perf: cache file reads and AST parses during discovery pass#2135

Open
KRRT7 wants to merge 9 commits intomainfrom
perf/discovery-ast-cache
Open

perf: cache file reads and AST parses during discovery pass#2135
KRRT7 wants to merge 9 commits intomainfrom
perf/discovery-ast-cache

Conversation

@KRRT7
Copy link
Copy Markdown
Collaborator

@KRRT7 KRRT7 commented May 7, 2026

Parent PR

#2132 — perf: targeted performance improvements for E2E pipeline hot path

Summary

  • Adds discovery-scoped discovery_cache() context manager that caches file content and parsed ASTs for the duration of a single discovery run
  • read_file_cached() ensures each file is read from disk at most once
  • parse_ast_cached() ensures each file's AST is parsed at most once
  • Removes dead code: _find_all_functions_via_language_support() (never called, had pre-existing type error)

Changes

  • get_functions_to_optimize(): wraps body with discovery_cache()
  • find_all_functions_in_file(): uses read_file_cached
  • inspect_top_level_functions_or_methods(): uses parse_ast_cached
  • get_all_replay_test_functions(): uses parse_ast_cached
  • JS/TS helpers: use read_file_cached
  • New test file: tests/test_discovery_cache.py (12 tests)

Stack

  1. perf: prefilter files by path before read_text() in discovery #2134 (prefilter) → main
  2. This PRperf/discovery-prefilter
  3. perf: read JS/TS files once during discovery export checks #2136 (js-single-read) → this branch

KRRT7 added 5 commits May 7, 2026 17:46
Moves path-based checks (test file detection, ignore paths, submodule
paths, site-packages, outside module-root) to run BEFORE read_text() is
called in get_all_files_and_functions() and get_functions_within_lines().
This avoids unnecessary file I/O for files that would be discarded by
filter_functions() anyway.

Also fixes pre-existing mypy error in _find_all_functions_via_language_support
where discover_functions was called with wrong argument order.

Signature changes (backward-compatible, all new params are optional):
- get_all_files_and_functions: added tests_root, module_root params
- get_functions_within_lines: added tests_root, ignore_paths, module_root
- get_functions_within_git_diff: added tests_root, ignore_paths, module_root
Introduces a discovery-scoped cache (`discovery_cache()` context
manager) that ensures each file is read from disk at most once and
parsed into an AST at most once within a single discovery run.

Key changes:
- `read_file_cached()`: returns cached file content when
  `discovery_cache()` is active, falls back to normal read otherwise
- `parse_ast_cached()`: returns cached `ast.Module` when active
- `get_functions_to_optimize()`: wraps its body with `discovery_cache()`
- `find_all_functions_in_file()`: uses `read_file_cached`
- `inspect_top_level_functions_or_methods()`: uses `parse_ast_cached`
- `get_all_replay_test_functions()`: uses `parse_ast_cached`
- JS/TS export helpers: use `read_file_cached`
- Removed dead code: `_find_all_functions_via_language_support()` (never
  called, had a type error passing Path as source str)

Signature changes: None. All public function signatures are unchanged.
The cache is transparent -- when not inside `discovery_cache()`, all
functions behave identically to before (direct reads/parses).
The _is_js_ts_function_exported and _is_js_ts_function_exists_but_not_exported
helpers each read the file from disk independently, causing up to 2 extra reads
per file during get_functions_to_optimize. Add an optional `source` parameter
to both functions so callers can pass pre-read content. The main call site now
reads the file once and passes it to both helpers.

Also fixes pre-existing mypy error in _find_all_functions_via_language_support
where discover_functions was called with wrong argument order.

Signatures changed:
- _is_js_ts_function_exported(file_path, function_name, source=None)
- _is_js_ts_function_exists_but_not_exported(file_path, function_name, source=None)
The --file path was calling file.read_text() directly even though
find_all_functions_in_file() already primed the discovery cache.
Now uses read_file_cached() to hit the cache with zero disk I/O.
The callers BLOB column was fetched and JSON-decoded for every row
during FunctionRanker initialization, but the resulting data was
never accessed — FunctionRanker.load_function_stats() discards it
as `_callers`. This eliminates O(n) JSON parsing at ranking startup.

Also fixes all pre-existing mypy errors in this file by declaring
pstats.Stats attributes that the type stubs don't expose.
@KRRT7 KRRT7 force-pushed the perf/discovery-prefilter branch from 6bbf072 to 6527780 Compare May 7, 2026 22:47
@KRRT7 KRRT7 force-pushed the perf/discovery-ast-cache branch from beac23d to 93763a3 Compare May 7, 2026 22:47
@KRRT7 KRRT7 marked this pull request as draft May 7, 2026 22:53
@KRRT7 KRRT7 marked this pull request as ready for review May 8, 2026 00:19
@KRRT7 KRRT7 changed the base branch from perf/discovery-prefilter to main May 8, 2026 00:19

class TestFiles(BaseModel):
test_files: list[TestFile]
_seen_paths: set[Path] = PrivateAttr(default_factory=set)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wasn't this merged already?

aseembits93
aseembits93 previously approved these changes May 8, 2026
@KRRT7 KRRT7 dismissed aseembits93’s stale review May 8, 2026 00:28

The merge-base changed after approval.

@KRRT7 KRRT7 enabled auto-merge May 8, 2026 00:35
…decode

perf: skip unused callers JSON decode in ProfileStats
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants