fix(chunker): stop overlap-stride crawl that shatters long-prose files by jdubdevs · Pull Request #41 · devwhodevs/engraph

jdubdevs · 2026-06-02T00:24:12Z

Problem

smart_chunk applies the overlap window even when the chunk it just emitted is smaller than the overlap. In that case cut_offset - overlap_chars lands before the chunk's own start, and the .max(start_offset + 1) guard advances the start by only a single character. The next pass re-selects the same nearby high-score break point, so the chunker crawls forward one character at a time.

On long, heading-dense prose this shatters a file into hundreds of near-duplicate empty-heading micro-chunks. Observed on a real 4.5k-word note: 923 chunks, 907 of them 1-character-offset shrapnel (e.g. "…ystem currently authorizing…", "…stem currently authorizing…", "…tem currently authorizing…").
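The crawl can be illustrated with a small sketch of the stride arithmetic. This is a hypothetical reconstruction built only from the two expressions quoted above; the function names, the use of saturating_sub, and the simplification that the same cut is re-selected on every pass are assumptions, not the actual smart_chunk code.

```rust
/// Hypothetical pre-fix stride step: always step back by the overlap,
/// then clamp so the start moves by at least one character.
fn next_start_buggy(start_offset: usize, cut_offset: usize, overlap_chars: usize) -> usize {
    // saturating_sub is an assumption that keeps the sketch panic-free
    // when the overlap exceeds the cut offset.
    cut_offset.saturating_sub(overlap_chars).max(start_offset + 1)
}

/// Counts how many chunks are emitted before the start finally reaches the
/// cut, assuming the same break point is re-selected on every pass.
fn crawl_steps(mut start_offset: usize, cut_offset: usize, overlap_chars: usize) -> usize {
    let mut steps = 0;
    while start_offset < cut_offset {
        start_offset = next_start_buggy(start_offset, cut_offset, overlap_chars);
        steps += 1;
    }
    steps
}
```

With a 40-character chunk (start 1000, cut 1040) and an illustrative 200-character overlap, next_start_buggy returns 1001, and crawl_steps reports 40 passes to cover those 40 characters. Each pass emits a chunk that starts one character later than the previous one, which matches the shrapnel pattern described above.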

Impact:

Retrieval breaks — the file's signal is split below threshold across ~900 tiny fragments, so it never enters the candidate set (a unique-phrase search couldn't surface a document that is a near-perfect match for it).
Index bloat — in one vault, 451 files (11%) held 91% of all chunks (260k of 286k). Re-chunking with this fix dropped the index 286k → 31k and cut full-rebuild time ~3.5×.

Fix

Only step back by the overlap window when the emitted chunk is larger than that window; otherwise advance fully to the cut. Guarantees forward progress, eliminates the crawl. ~6 lines in smart_chunk.
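A minimal sketch of that rule, assuming the same three quantities the description names (start_offset, cut_offset, overlap_chars). The function name and its exact shape are hypothetical; the real change lives inline in smart_chunk and may be written differently.

```rust
/// Hypothetical post-fix stride step: apply the overlap only when the
/// emitted chunk is strictly larger than the overlap window.
fn next_start_fixed(start_offset: usize, cut_offset: usize, overlap_chars: usize) -> usize {
    // Length of the chunk that was just emitted; saturating_sub guards
    // against a caller passing cut_offset < start_offset.
    let emitted = cut_offset.saturating_sub(start_offset);
    if emitted > overlap_chars {
        // Stepping back stays strictly past start_offset here, because
        // cut_offset - overlap_chars > start_offset.
        cut_offset - overlap_chars
    } else {
        // Chunk is no larger than the overlap: skip the overlap, jump to the cut.
        cut_offset
    }
}
```

In both branches the returned offset is strictly greater than start_offset whenever cut_offset is, which is why the .max(start_offset + 1) clamp never has to rescue the stride. For example, next_start_fixed(1000, 1040, 200) returns 1040 rather than 1001.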

Validation

The reproducing note re-chunks 923 → 27; all chunks vectorize; a unique-phrase search returns it at rank 1.
Adds test_smart_chunk_no_overlap_crawl (bounded chunk count + no degenerate micro-chunks). All 20 chunker tests pass.
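The two properties the regression test checks (a bounded chunk count and no degenerate micro-chunks) can be sketched against a toy chunker. Everything here is hypothetical: toy_chunk, has_crawl, and the "first break point in the window wins" selection rule are stand-ins for illustration and do not reproduce smart_chunk or the body of test_smart_chunk_no_overlap_crawl.

```rust
/// Toy chunker over character offsets. Cuts at the first break point inside
/// the target window (a crude stand-in for a high-score break), else at the
/// window end. Returns (start, end) pairs. Requires target > 0.
fn toy_chunk(len: usize, target: usize, overlap: usize, breaks: &[usize]) -> Vec<(usize, usize)> {
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < len {
        let window_end = (start + target).min(len);
        let cut = breaks
            .iter()
            .copied()
            .filter(|&b| b > start && b <= window_end)
            .min()
            .unwrap_or(window_end);
        chunks.push((start, cut));
        if cut == len {
            break;
        }
        // Overlap only when the emitted chunk is larger than the window.
        start = if cut - start > overlap { cut - overlap } else { cut };
    }
    chunks
}

/// True if any two consecutive chunks start at most one character apart.
fn has_crawl(chunks: &[(usize, usize)]) -> bool {
    chunks.windows(2).any(|w| w[1].0 - w[0].0 <= 1)
}
```

On a 5000-character input with a break point every 50 characters (a rough model of heading-dense prose), a 1500-character target, and a 200-character overlap, this toy emits exactly 100 chunks, one per break interval, and has_crawl returns false.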

🤖 Generated with Claude Code

smart_chunk applied the overlap window even when the emitted chunk was smaller than the overlap. cut_offset - overlap_chars then landed before the chunk's own start, and the .max(start_offset + 1) guard advanced the start by a single character, re-selecting the same nearby high-score break point and crawling forward one char at a time.

Long, heading-dense prose files were shattered into hundreds of near-duplicate empty-heading micro-chunks (a 4.5k-word note produced 923 chunks; 907 of them 1-char-offset shrapnel), which (a) made the file unretrievable, because its signal split below threshold so it never entered the candidate set, and (b) bloated the index ~10x (451 files held 91% of all chunks).

Fix: only step back by the overlap window when the chunk is larger than it; otherwise advance fully to the cut. Guarantees forward progress, no crawl.

Validated: the note re-chunks 923 -> 28; all chunks vectorize; a unique-phrase search returns it at rank 1. Adds test_smart_chunk_no_overlap_crawl regression.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jdubdevs wants to merge 1 commit into devwhodevs:main from jdubdevs:fix/chunker-overlap-stride
