fix(chunker): stop overlap-stride crawl that shatters long-prose files#41

Open
jdubdevs wants to merge 1 commit into devwhodevs:main from jdubdevs:fix/chunker-overlap-stride

@jdubdevs jdubdevs commented Jun 2, 2026

Problem

smart_chunk applies the overlap window even when the chunk it just emitted is smaller than the overlap. cut_offset - overlap_chars then lands before the chunk's own start, and the .max(start_offset + 1) guard advances the start by a single character — which re-selects the same nearby high-score break point and crawls forward one char at a time.

On long, heading-dense prose this shatters a file into hundreds of near-duplicate empty-heading micro-chunks. Observed on a real 4.5k-word note: 923 chunks, 907 of them 1-character-offset shrapnel (e.g. "…ystem currently authorizing…", "…stem currently authorizing…", "…tem currently authorizing…").
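The crawl can be reproduced in a few lines. The following is a minimal sketch, not the real `smart_chunk`: `pick_cut` is a hypothetical stand-in for break-point scoring (the nearest break point inside the target window always wins), `chunk_spans` is a hypothetical driver loop, and `saturating_sub` is an assumption about how the subtraction avoids underflow. Only the `.max(start_offset + 1)` clamp is taken from the description above.

```rust
/// Hypothetical reduction of the start-advance step described above.
/// Always steps back by the overlap, then clamps to at least `start_offset + 1`.
fn next_start_unconditional(start_offset: usize, cut_offset: usize, overlap_chars: usize) -> usize {
    cut_offset.saturating_sub(overlap_chars).max(start_offset + 1)
}

/// Toy cut selection (an assumption, not the real scoring): the nearest break
/// point after `start` within `target` chars wins; otherwise cut at the target
/// size or at the end of the text. `breaks` must be sorted ascending.
fn pick_cut(start: usize, breaks: &[usize], target: usize, len: usize) -> usize {
    breaks
        .iter()
        .copied()
        .find(|&b| b > start && b <= start + target)
        .unwrap_or((start + target).min(len))
}

/// Returns the (start, cut) span of every chunk the toy loop emits.
fn chunk_spans(len: usize, breaks: &[usize], target: usize, overlap: usize) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut start = 0;
    while start < len {
        let cut = pick_cut(start, breaks, target, len);
        spans.push((start, cut));
        if cut >= len {
            break;
        }
        // The emitted chunk may be shorter than `overlap`; stepping back then
        // lands before `start`, and the clamp moves forward by one character.
        start = next_start_unconditional(start, cut, overlap);
    }
    spans
}
```

With a 1000-character text, one break point at offset 20, a 500-character target, and a 100-character overlap, `chunk_spans(1000, &[20], 500, 100)` emits the spans `(0, 20)`, `(1, 20)`, `(2, 20)`, and so on up to `(19, 20)` before it can move past the break point: twenty near-duplicate chunks where one would do.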

Impact:

  1. Retrieval breaks — the file's signal is split below threshold across ~900 tiny fragments, so it never enters the candidate set (a unique-phrase search couldn't surface a document that is a near-perfect match for it).
  2. Index bloat — in one vault, 451 files (11%) held 91% of all chunks (260k of 286k). Re-chunking with this fix dropped the index 286k → 31k and cut full-rebuild time ~3.5×.

Fix

Only step back by the overlap window when the emitted chunk is larger than that window; otherwise advance fully to the cut. Guarantees forward progress, eliminates the crawl. ~6 lines in smart_chunk.
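As a rough illustration of that rule, here is a sketch of the start-advance step with the overlap made conditional. The function name `next_start` and its signature are hypothetical, and the actual diff may be structured differently; the sketch assumes `cut_offset > start_offset`.

```rust
/// Hypothetical sketch of the conditional overlap step.
/// Assumes `cut_offset > start_offset`, so the result is always past `start_offset`.
fn next_start(start_offset: usize, cut_offset: usize, overlap_chars: usize) -> usize {
    let emitted = cut_offset.saturating_sub(start_offset);
    if emitted > overlap_chars {
        // The chunk is larger than the overlap window: step back by the overlap.
        // The result stays strictly greater than `start_offset`.
        cut_offset - overlap_chars
    } else {
        // The chunk is no larger than the overlap window: advance fully to the cut.
        cut_offset
    }
}
```

For a 20-character chunk and a 100-character overlap, `next_start(0, 20, 100)` returns 20 rather than 1, so the next chunk begins at the cut. For a 500-character chunk, `next_start(20, 520, 100)` returns 420, and the usual overlap is preserved.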

Validation

  • The reproducing note re-chunks 923 → 27; all chunks vectorize; a unique-phrase search returns it at rank 1.
  • Adds test_smart_chunk_no_overlap_crawl (bounded chunk count + no degenerate micro-chunks). All 20 chunker tests pass.
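The body of `test_smart_chunk_no_overlap_crawl` is not shown in this description. The sketch below is one way such checks could be written, under the assumption that chunk texts are available as string slices; all three helper names are hypothetical. It treats a chunk as crawl shrapnel when it equals the previous chunk with its first character removed, which is the pattern in the "…ystem currently authorizing…" example.

```rust
/// True when `next` equals `prev` with its first character removed.
fn is_one_char_shift(prev: &str, next: &str) -> bool {
    let mut chars = prev.chars();
    // Consume the first character; `as_str` then yields the remainder.
    chars.next().is_some() && chars.as_str() == next
}

/// Counts adjacent chunk pairs that differ only by a one-character shift.
fn count_shift_pairs(chunks: &[&str]) -> usize {
    chunks
        .windows(2)
        .filter(|w| is_one_char_shift(w[0], w[1]))
        .count()
}

/// Panics if the chunk list exceeds `max_chunks` or contains shift pairs.
fn assert_no_overlap_crawl(chunks: &[&str], max_chunks: usize) {
    assert!(
        chunks.len() <= max_chunks,
        "expected at most {} chunks, got {}",
        max_chunks,
        chunks.len()
    );
    assert_eq!(count_shift_pairs(chunks), 0, "found one-character-shift chunks");
}
```

A test could call `assert_no_overlap_crawl` on the output of the chunker for a heading-dense fixture, with `max_chunks` chosen generously above the expected count so the bound catches a crawl without being brittle.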

🤖 Generated with Claude Code

smart_chunk applied the overlap window even when the emitted chunk was
smaller than the overlap. cut_offset - overlap_chars then landed before the
chunk's own start, and the .max(start_offset + 1) guard advanced the start by
a single character — re-selecting the same nearby high-score break point and
crawling forward one char at a time. Long, heading-dense prose files were
shattered into hundreds of near-duplicate empty-heading micro-chunks (a
4.5k-word note produced 923 chunks; 907 of them 1-char-offset shrapnel),
which (a) made the file unretrievable — its signal split below threshold so
it never entered the candidate set — and (b) bloated the index ~10x (451
files held 91% of all chunks).

Fix: only step back by the overlap window when the chunk is larger than it;
otherwise advance fully to the cut. Guarantees forward progress, no crawl.

Validated: the note re-chunks 923 -> 28; all chunks vectorize; a unique-phrase
search returns it at rank 1. Adds test_smart_chunk_no_overlap_crawl regression.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>