feat(ai): add 50-token overlap between embedding splits #35367
Now overlaps 15-20%
The change is in EmbeddingsRunner.java:62-101. Here's what changed:
- Before: on each split, the buffer was cleared entirely; the new chunk started with only the sentence that triggered the overflow.
- After: on each split, the code walks backward through the sentence list to collect trailing sentences totaling ≥50 tokens, carries those into the next chunk as overlap, then resets totalTokens to the overlap size.
ref: #35366
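The before/after behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual EmbeddingsRunner code: the class name `OverlapChunker`, the `chunk` signature, and the whitespace-based `countTokens` stand-in are all assumptions made for the example. Only the `sentences`, `tokenCounts`, `totalTokens`, `overlapStart`, `overlapCount`, and `overlapTokens` names follow the PR discussion.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of sentence chunking with trailing overlap. */
final class OverlapChunker {

    /** Stand-in token counter (assumption): whitespace-separated words. */
    static int countTokens(String sentence) {
        String trimmed = sentence.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /**
     * Splits sentences into chunks of at most roughly maxTokens tokens, carrying
     * trailing sentences worth at least overlapTokens into the next chunk.
     */
    static List<String> chunk(List<String> input, int maxTokens, int overlapTokens) {
        List<String> chunks = new ArrayList<>();
        List<String> sentences = new ArrayList<>();
        List<Integer> tokenCounts = new ArrayList<>();
        int totalTokens = 0;

        for (String sentence : input) {
            int count = countTokens(sentence);

            // Split before adding a sentence that would overflow the chunk.
            if (!sentences.isEmpty() && totalTokens + count > maxTokens) {
                chunks.add(String.join(" ", sentences));

                // Walk backward until the trailing sentences total >= overlapTokens.
                int overlapStart = sentences.size();
                int overlapCount = 0;
                while (overlapStart > 0 && overlapCount < overlapTokens) {
                    overlapStart--;
                    overlapCount += tokenCounts.get(overlapStart);
                }

                // Copy the tail before clearing; subList is only a view.
                List<String> overlap =
                        new ArrayList<>(sentences.subList(overlapStart, sentences.size()));
                List<Integer> overlapCounts =
                        new ArrayList<>(tokenCounts.subList(overlapStart, tokenCounts.size()));
                sentences.clear();
                tokenCounts.clear();
                sentences.addAll(overlap);
                tokenCounts.addAll(overlapCounts);
                totalTokens = overlapCount; // reset to the overlap size, not zero
            }

            sentences.add(sentence);
            tokenCounts.add(count);
            totalTokens += count;
        }

        if (!sentences.isEmpty()) {
            chunks.add(String.join(" ", sentences));
        }
        return chunks;
    }
}
```

With three 3-token sentences, `maxTokens = 6`, and `overlapTokens = 2`, this sketch emits two chunks, and the second chunk begins with the last sentence of the first.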
Claude finished @wezell's task in 2m 14s
PR Review: feat(ai): add 50-token overlap between embedding splits
Issues
1. Hard-coded magic number —
Minimal fix: add an
2. Final chunk may be purely duplicate content
After a split,
// After computing overlap and resetting:
sentences.addAll(overlap); // only overlap — no new content yet
...
// Loop ends, saves overlap-only chunk ← duplicate of end of previous chunk
Fix: track whether any new sentence was added after the last split before saving the final chunk. A simple boolean flag
3. Overshoot is unbounded, not just "~50"
while (overlapStart > 0 && overlapCount < overlapTokens) {
overlapStart--;
overlapCount += tokenCounts.get(overlapStart); // adds full sentence
}
The loop exits only after adding an entire sentence. If the first sentence traversed backward has 800 tokens,
4. Operational: existing embeddings are now stale
This changes the chunking algorithm. Contentlets indexed before this PR have non-overlapping chunks; those indexed after it have overlapping ones. Mixed embedding stores will give inconsistent retrieval quality until content is fully reindexed. This should be called out in the release notes or migration guide.
Minor
Overall the logic is sound and the overlap algorithm is correct for the happy path. The two actionable issues are the missing
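Issues 2 and 3 from the review could be addressed along the lines of the sketch below. It is illustrative only. The class `OverlapFixes`, the `hasNewContent` flag name, and the `maxOverlapTokens` cap are hypothetical, and the cap in particular is an assumed policy, since the review text here does not show a proposed fix for the overshoot. Token counts are passed in directly so that no tokenizer is assumed, and the loop splits after adding a sentence, which is the arrangement in which an overlap-only final chunk can occur.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of two fixes: bounded overlap and no overlap-only final chunk. */
final class OverlapFixes {

    /**
     * Walks backward like the reviewed loop, but refuses to include a sentence
     * that would push the overlap past maxOverlapTokens (an assumed cap).
     * Returns the index at which the overlap starts.
     */
    static int boundedOverlapStart(List<Integer> tokenCounts, int overlapTokens, int maxOverlapTokens) {
        int overlapStart = tokenCounts.size();
        int overlapCount = 0;
        while (overlapStart > 0 && overlapCount < overlapTokens) {
            int next = tokenCounts.get(overlapStart - 1);
            if (overlapCount + next > maxOverlapTokens) {
                break; // a huge sentence would overshoot; stop with what we have
            }
            overlapStart--;
            overlapCount += next;
        }
        return overlapStart;
    }

    /** Chunks sentences, splitting after the sentence that reaches maxTokens. */
    static List<String> chunk(List<String> input, List<Integer> counts,
                              int maxTokens, int overlapTokens, int maxOverlapTokens) {
        List<String> chunks = new ArrayList<>();
        List<String> sentences = new ArrayList<>();
        List<Integer> tokenCounts = new ArrayList<>();
        int totalTokens = 0;
        boolean hasNewContent = false; // true once a sentence follows the last split

        for (int i = 0; i < input.size(); i++) {
            sentences.add(input.get(i));
            tokenCounts.add(counts.get(i));
            totalTokens += counts.get(i);
            hasNewContent = true;

            if (totalTokens >= maxTokens) {
                chunks.add(String.join(" ", sentences));
                int start = boundedOverlapStart(tokenCounts, overlapTokens, maxOverlapTokens);
                // Copy the tails; subList returns a view of the original list.
                sentences = new ArrayList<>(sentences.subList(start, sentences.size()));
                tokenCounts = new ArrayList<>(tokenCounts.subList(start, tokenCounts.size()));
                totalTokens = 0;
                for (int c : tokenCounts) {
                    totalTokens += c;
                }
                hasNewContent = false; // buffer now holds only overlap
            }
        }

        // Save the trailing buffer only if it contains something new.
        if (hasNewContent) {
            chunks.add(String.join(" ", sentences));
        }
        return chunks;
    }
}
```

The sketch assumes `maxOverlapTokens` is smaller than `maxTokens`. Dropping the overlap entirely when the last sentence exceeds the cap is only one possible policy; truncating that sentence to the cap would be another.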
Summary
- Update EmbeddingsRunner to carry a ~50-token overlap from the end of each chunk into the start of the next
- Replace the StringBuilder buffer with parallel List<String> sentences / List<Integer> tokenCounts lists to enable backward traversal for overlap computation
- Add the ArrayList import
Motivation
Without overlap, context at chunk boundaries is lost: when related sentences fall on opposite sides of a split, neither chunk carries the full surrounding context. A 50-token trailing overlap preserves semantic continuity between consecutive embedding chunks.
Test plan
Closes #35366
🤖 Generated with Claude Code