Skip to content

Parallelize AST extraction and rewrite Louvain clustering#3

Merged
elbruno merged 1 commit into
elbruno:mainfrom
kenn850:main
May 10, 2026
Merged

Parallelize AST extraction and rewrite Louvain clustering#3
elbruno merged 1 commit into
elbruno:mainfrom
kenn850:main

Conversation

@kenn850
Copy link
Copy Markdown

@kenn850 kenn850 commented May 6, 2026

Two independent hot paths were dominating wall-clock time on large repositories. Both showed the same symptom — long runtime with low CPU utilization (extraction) or sustained single-core saturation (clustering) — and both scaled poorly with node count. ---

  1. Parallelize stage 2 (AST extraction) in PipelineRunner The pipeline previously ran extraction with a sequential foreach:
    foreach (var file in detectedFiles)
    await extractor.ExecuteAsync(file, cancellationToken);
    Each call performs file I/O, so the loop spent most of its time awaiting the disk while a single core sat idle. Wall-clock time grew linearly with file count.
    Replaced with Parallel.ForEachAsync bounded by Environment.ProcessorCount. Extractor only holds a read-only dictionary of language extractors and returns a fresh ExtractionResult per call, so concurrent invocation is safe.
    Concurrency details:
  • Results are collected into a ConcurrentBag and flattened into the existing list after the loop.
  • processed / skipped counters use Interlocked.Increment.
  • Verbose per-file failure messages are buffered into a ConcurrentQueue and flushed sequentially after the loop; TextWriter is not thread-safe and interleaved writes from parallel workers would corrupt output.
  • Cancellation flows through ParallelOptions.CancellationToken. Stage 2b (LLM-backed semantic extraction) is intentionally left sequential — LLM providers throttle aggressive concurrency and that stage needs a separate, lower parallelism cap if it's parallelized later.
    Expected impact: near-linear speedup on AST extraction up to disk throughput, typically 4–10x on large codebases.

  1. Rewrite Louvain community detection in ClusterEngine Stage 4 ("Detecting communities") was the dominant cost on graphs with thousands of nodes. The root cause was inside CalculateModularityGain (and its subgraph twin): every gain evaluation walked all N nodes and re-summed each one's edge weights:
    foreach (var otherNode in graph.GetNodes())
    otherDegree = graph.GetEdges(otherNode.Id).Sum(e => e.Weight);
    That made a single gain evaluation O(N · Ē). The outer Louvain loop calls it once per (node, neighboring-community) per iteration, so the overall cost was effectively O(iterations · N² · Ē). On a 10k-node / 50k-edge graph this is billions of operations — exactly why CPU was pinned and wall-clock high.
    Rewrote both the main Louvain pass and SplitCommunity to use standard incremental aggregates:
  • Build a per-node weighted adjacency dictionary once at the start of DetectCommunities (collapses parallel edges into a single neighbor → summed-weight entry).
  • Compute each node's weighted degree once.
  • Maintain communityTotalDegree[c] incrementally: when a node moves from a → b, subtract its degree from a and add it to b.
  • For each move evaluation, walk only the moving node's own neighbors to bucket weight per neighboring community in one pass — O(degree) instead of O(N · Ē).
  • Reuse the edgesToCommunity dictionary across iterations (.Clear() instead of allocating). The same shape is applied inside SplitCommunity, with a sub-adjacency restricted to the induced subgraph for the oversized community. The now-redundant CalculateModularityGain and
    CalculateSubgraphModularityGain helpers are removed; the gain math is inlined where it is used. The public CalculateModularity and CalculateCohesion static methods are unchanged. The modularity gain formula is preserved bit-for-bit so community assignments stay equivalent to the previous implementation.
    Total cost drops from O(iterations · N² · Ē) to O(iterations · E), roughly an N× speedup on large graphs.

Validation

  • All 573 unit tests in Graphify.Tests pass (10 ClusterEngine tests, plus full suite).
  • No public API changes.
  • No behavioral changes to graph output (community ids may differ only by the ordering tie-break already present in the original implementation).

Two independent hot paths were dominating wall-clock time on large
repositories. Both showed the same symptom — long runtime with low CPU
utilization (extraction) or sustained single-core saturation
(clustering) — and both scaled poorly with node count.
---
1. Parallelize stage 2 (AST extraction) in PipelineRunner
The pipeline previously ran extraction with a sequential foreach:
    foreach (var file in detectedFiles)
        await extractor.ExecuteAsync(file, cancellationToken);
Each call performs file I/O, so the loop spent most of its time
awaiting the disk while a single core sat idle. Wall-clock time grew
linearly with file count.
Replaced with Parallel.ForEachAsync bounded by Environment.ProcessorCount.
Extractor only holds a read-only dictionary of language extractors and
returns a fresh ExtractionResult per call, so concurrent invocation is
safe.
Concurrency details:
- Results are collected into a ConcurrentBag<ExtractionResult> and
  flattened into the existing list after the loop.
- processed / skipped counters use Interlocked.Increment.
- Verbose per-file failure messages are buffered into a
  ConcurrentQueue<string> and flushed sequentially after the loop;
  TextWriter is not thread-safe and interleaved writes from parallel
  workers would corrupt output.
- Cancellation flows through ParallelOptions.CancellationToken.
Stage 2b (LLM-backed semantic extraction) is intentionally left
sequential — LLM providers throttle aggressive concurrency and that
stage needs a separate, lower parallelism cap if it's parallelized
later.
Expected impact: near-linear speedup on AST extraction up to disk
throughput, typically 4–10x on large codebases.
---
2. Rewrite Louvain community detection in ClusterEngine
Stage 4 ("Detecting communities") was the dominant cost on graphs
with thousands of nodes. The root cause was inside
CalculateModularityGain (and its subgraph twin): every gain evaluation
walked all N nodes and re-summed each one's edge weights:
    foreach (var otherNode in graph.GetNodes())
        otherDegree = graph.GetEdges(otherNode.Id).Sum(e => e.Weight);
That made a single gain evaluation O(N · Ē). The outer Louvain loop
calls it once per (node, neighboring-community) per iteration, so the
overall cost was effectively O(iterations · N² · Ē). On a 10k-node /
50k-edge graph this is billions of operations — exactly why CPU was
pinned and wall-clock high.
Rewrote both the main Louvain pass and SplitCommunity to use standard
incremental aggregates:
- Build a per-node weighted adjacency dictionary once at the start of
  DetectCommunities (collapses parallel edges into a single
  neighbor → summed-weight entry).
- Compute each node's weighted degree once.
- Maintain communityTotalDegree[c] incrementally: when a node moves
  from a → b, subtract its degree from a and add it to b.
- For each move evaluation, walk only the moving node's own neighbors
  to bucket weight per neighboring community in one pass — O(degree)
  instead of O(N · Ē).
- Reuse the edgesToCommunity dictionary across iterations
  (.Clear() instead of allocating).
The same shape is applied inside SplitCommunity, with a sub-adjacency
restricted to the induced subgraph for the oversized community.
The now-redundant CalculateModularityGain and
CalculateSubgraphModularityGain helpers are removed; the gain math is
inlined where it is used. The public CalculateModularity and
CalculateCohesion static methods are unchanged. The modularity gain
formula is preserved bit-for-bit so community assignments stay
equivalent to the previous implementation.
Total cost drops from O(iterations · N² · Ē) to O(iterations · E),
roughly an N× speedup on large graphs.
---
Validation
- All 573 unit tests in Graphify.Tests pass (10 ClusterEngine tests,
  plus full suite).
- No public API changes.
- No behavioral changes to graph output (community ids may differ
  only by the ordering tie-break already present in the original
  implementation).
@elbruno
Copy link
Copy Markdown
Owner

elbruno commented May 10, 2026

Validation update: I checked out this PR and ran the full solution tests locally (614 passed, 0 failed). I’m adding a focused regression test for the new Louvain weighted-adjacency/parallel-edge path, then re-running tests before final merge recommendation.

@elbruno elbruno merged commit 1b905ba into elbruno:main May 10, 2026
@elbruno
Copy link
Copy Markdown
Owner

elbruno commented May 10, 2026

Final update: I validated PR #3 end-to-end, then merged it.\n\n✅ Validation: full solution tests passed locally\n- Before merge: 615/615 passing\n- After merge on main: 615/615 passing\n\n✅ Follow-up hardening: I added a regression test for parallel-edge weighted-adjacency clustering behavior in ClusterEngineTests and pushed it to main in commit 64cd03c.\n\nThis PR is now merged/closed and main is green from local verification.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants