refactor(ingestion): prune vendored dirs in source walk (no rglob descent) #55

Open
cdeust wants to merge 1 commit into main from refactor/ingestion-pruned-walk

@cdeust cdeust commented Jun 9, 2026

Owner

What

codebase_analyze's collect_source_files walked the whole tree via root.rglob("*") and rejected IGNORE_DIRS entries only after enumeration. rglob cannot prune mid-iteration, so a repo carrying a vendored subtree (for example a 154 MB deps/ of ~8K files, node_modules, or site-packages) stalled the walk for minutes. The walk ran on the asyncio event loop, so it also blocked every concurrent tool call.
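For illustration, a minimal sketch of the post-filter pattern described above. It is not the actual collect_source_files code: collect_with_rglob is a hypothetical name, IGNORE_DIRS here is an illustrative subset, and the visited counter exists only to show that ignored subtrees are still fully enumerated before being rejected.

```python
from pathlib import Path

# Illustrative subset; the real IGNORE_DIRS constant lives in the project.
IGNORE_DIRS = frozenset({"deps", "node_modules", "site-packages"})


def collect_with_rglob(root: Path) -> tuple[list[Path], int]:
    """Post-filter walk: rglob yields every entry, the ignore check runs afterwards."""
    kept: list[Path] = []
    visited = 0
    for path in root.rglob("*"):
        visited += 1  # counted even when the entry sits inside an ignored subtree
        if any(part in IGNORE_DIRS for part in path.relative_to(root).parts):
            continue  # rejected, but only after rglob already descended to it
        if path.is_file():
            kept.append(path)
    return kept, visited
```

The returned file list is correct either way; the cost is in visited, which grows with the size of the vendored subtree rather than with the number of files kept.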

This is the ingestion-side counterpart of the wiki-drift hang fixed in 619bf9a (PR #53): same rglob-vs-pruned-walk asymmetry, different code path. It is a latent prod bug — any user repo with a venv / node_modules / deps at the scan root triggers it.

How (root-cause, one source of truth)

  • Extract the canonical pruned-walk idiom — os.walk(followlinks=False) + in-place dirnames[:] filter on IGNORE_DIRS — into handlers/source_walk.py::walk_pruned.
  • Route both _collect_unbounded and _collect_bounded through it; ignored subtrees are now never descended into. _file_matches keeps its IGNORE_DIRS/lang/size checks as defense-in-depth.
  • Replace seed_project_stages' private _walk_pruned with the shared function (removes the duplicate; drops the now-unused os import). IGNORE_DIRS already contained deps/site-packages/node_modules/etc., so no constant change was needed.
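A minimal sketch of the pruned-walk idiom named above, assuming a generator that yields file paths. The real signature of walk_pruned in handlers/source_walk.py may differ, and the IGNORE_DIRS value and ignore_dirs parameter below are illustrative assumptions.

```python
import os
from collections.abc import Iterator
from pathlib import Path

# Illustrative subset; the project's IGNORE_DIRS is the source of truth.
IGNORE_DIRS = frozenset({"deps", "node_modules", "site-packages", ".venv"})


def walk_pruned(root: Path, ignore_dirs: frozenset[str] = IGNORE_DIRS) -> Iterator[Path]:
    """Yield files under root without ever descending into ignored directories."""
    # Top-down walk (the os.walk default) is required for pruning to work.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Slice assignment mutates the list os.walk holds, so the removed
        # names are skipped on the next step of the traversal.
        dirnames[:] = [d for d in dirnames if d not in ignore_dirs]
        for name in filenames:
            yield Path(dirpath) / name
```

The in-place dirnames[:] assignment is the important detail: rebinding the name (dirnames = [...]) would leave os.walk's own list untouched and nothing would be pruned. With followlinks=False, symlinks to directories are listed but not descended into.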

Behavior is preserved — same files returned, the bounded-candidate memory property (ADR-0045 §R2) intact — only the descent is pruned.

Verification

  • New tests_py/handlers/test_source_walk.py: proves ignored subtrees (deps / node_modules / site-packages / .venv, nested + symlinked) are pruned and not followed.
  • 17 passed (new suite + existing test_codebase_analyze_rglob.py collector suite), ruff format + check clean.

Scope note

This is item 3 of 3 of the Phase-5 throttling follow-on (docs/program/phase-5-pool-admission-design.md). The remaining two — asyncio.to_thread offload at the handler boundary and per-tool admission semaphores — are gated on the BLOCKING benchmark suite (§7) and are deferred to a follow-up PR where the DB/benchmark harness is available. This PR is the standalone, fully-verifiable latent-bug fix.
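For context only, a rough sketch of what the two deferred items could look like together. None of this is in this PR, and the actual design is the one in the referenced doc: handle_analyze, collect_files_blocking, and the slots parameter are hypothetical names.

```python
import asyncio
import os
from pathlib import Path


def collect_files_blocking(root: Path) -> list[Path]:
    """Synchronous filesystem walk; stands in for the real collector."""
    return [
        Path(dirpath) / name
        for dirpath, _dirnames, filenames in os.walk(root, followlinks=False)
        for name in filenames
    ]


async def handle_analyze(root: Path, slots: asyncio.Semaphore) -> int:
    """Hypothetical handler: admit a bounded number of walks, run each off the loop."""
    async with slots:  # per-tool admission: cap concurrent walks
        # to_thread keeps the event loop free for other tool calls.
        files = await asyncio.to_thread(collect_files_blocking, root)
    return len(files)
```

The semaphore is passed in rather than created at import time so that it is constructed inside the running event loop.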

🤖 Generated with Claude Code

…cent)

codebase_analyze's collect_source_files walked the whole tree via
root.rglob("*") and rejected IGNORE_DIRS entries only AFTER enumeration.
rglob can't prune mid-iteration, so a repo carrying a vendored subtree (a
154M deps/ of ~8K files, node_modules, site-packages) stalled the walk for
minutes on the event loop — the same asymmetry that caused the wiki_drift
hang (fixed in 619bf9a), here on the ingestion side.

- Extract the canonical pruned-walk idiom (os.walk(followlinks=False) +
  in-place dirnames[:] filter on IGNORE_DIRS) into one module,
  handlers/source_walk.py::walk_pruned — single source of truth.
- Route both _collect_unbounded and _collect_bounded through walk_pruned;
  ignored subtrees are now never descended into. _file_matches keeps its
  IGNORE_DIRS/lang/size checks as defense-in-depth.
- Replace seed_project_stages' private _walk_pruned with the shared one
  (removes the duplicate; drops the now-unused os import).

Preserves behavior: same files returned, bounded-candidate memory property
(ADR-0045 §R2) intact. New tests_py/handlers/test_source_walk.py proves
ignored subtrees (deps/node_modules/site-packages/.venv, nested + symlinked)
are pruned. 17 passed (incl. existing collector suite).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>