
feat(ai): add 50-token overlap between embedding splits#35367

Open
wezell wants to merge 1 commit into main from
issue-35366-overlap-embedding-tokens

Conversation

@wezell
Member

@wezell wezell commented Apr 17, 2026

Summary

  • Modifies EmbeddingsRunner to carry a ~50-token overlap from the end of each chunk into the start of the next
  • Replaces the single StringBuilder buffer with parallel List<String> sentences / List<Integer> tokenCounts lists to enable backward traversal for overlap computation
  • Adds ArrayList import

Motivation

Without overlap, context at chunk boundaries is lost — a sentence split across two chunks has neither half with full surrounding context. A 50-token trailing overlap ensures semantic continuity between consecutive embedding chunks.

Test plan

  • Verify embeddings are generated without errors for a contentlet with long text (>splitAtTokens)
  • Confirm adjacent chunks share ~50 tokens of trailing text from the previous chunk
  • Confirm no regression for short content (fewer tokens than splitAtTokens) — still saved as a single chunk

Closes #35366

🤖 Generated with Claude Code

Now overlaps 15-20%
The change is in EmbeddingsRunner.java:62-101. Here's what changed:

  - Before: on each split, the buffer was cleared entirely, so the new chunk started with only the sentence that triggered the overflow.
  - After: on each split, the code walks backward through the sentence list to collect trailing sentences totaling ≥50 tokens, carries those into the next chunk as overlap, then resets totalTokens to the overlap size.
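The before/after behavior can be sketched roughly like this. This is a simplified stand-in for illustration, not the actual EmbeddingsRunner code: sentences arrive pre-tokenized, and splitAtTokens / overlapTokens are plain parameters here.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the overlap-carrying split, not the real EmbeddingsRunner.
public class OverlapChunker {

    // Splits sentences (with precomputed token counts) into chunks of at most
    // splitAtTokens tokens, carrying >= overlapTokens of trailing sentences
    // from each chunk into the start of the next.
    public static List<List<String>> chunk(List<String> sentences,
                                           List<Integer> tokenCounts,
                                           int splitAtTokens,
                                           int overlapTokens) {
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        List<Integer> counts = new ArrayList<>();
        int totalTokens = 0;

        for (int i = 0; i < sentences.size(); i++) {
            int tokens = tokenCounts.get(i);
            if (totalTokens + tokens > splitAtTokens && !current.isEmpty()) {
                chunks.add(new ArrayList<>(current));
                // Walk backward to collect trailing sentences totaling >= overlapTokens.
                int start = current.size();
                int overlap = 0;
                while (start > 0 && overlap < overlapTokens) {
                    start--;
                    overlap += counts.get(start);
                }
                // Carry the trailing sentences into the next chunk as overlap.
                current = new ArrayList<>(current.subList(start, current.size()));
                counts = new ArrayList<>(counts.subList(start, counts.size()));
                totalTokens = overlap;  // reset to the overlap size
            }
            current.add(sentences.get(i));
            counts.add(tokens);
            totalTokens += tokens;
        }
        if (!current.isEmpty()) {
            chunks.add(current);
        }
        return chunks;
    }
}
```

With four 30-token sentences, splitAtTokens = 60, and overlapTokens = 20, each chunk repeats the last sentence of the previous one: [s1,s2], [s2,s3], [s3,s4].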

ref: #35366
@claude
Contributor

claude Bot commented Apr 17, 2026

Claude finished @wezell's task in 2m 14s


PR Review: feat(ai): add 50-token overlap between embedding splits

  • Read changed file
  • Analyze implementation
  • Post review

Issues

1. Hard-coded magic number — overlapTokens = 50 (line 64)

splitAtTokens is configurable via AppKeys.EMBEDDINGS_SPLIT_AT_TOKENS (default 512). The overlap size has no equivalent key and cannot be tuned. At 512 tokens, 50-token overlap is ~10% — reasonable. At 100 tokens, it's 50% — wasteful and semantically weird. These two values are coupled and should be configured together.

Minimal fix: add an AppKeys entry (e.g., EMBEDDINGS_OVERLAP_TOKENS, default "50") and read it the same way splitAtTokens is read.
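A sketch of the shape of that fix, with a plain Map standing in for dotCMS app config. The getConfigInt helper and the half-chunk cap are illustrative additions, not actual dotCMS API; the real code should mirror however splitAtTokens is already read from AppKeys.

```java
import java.util.Map;

// Hypothetical sketch of making the overlap configurable alongside splitAtTokens.
// The Map stands in for app config; key names follow the suggestion above.
public class OverlapConfig {

    static int getConfigInt(Map<String, String> config, String key, int defaultValue) {
        final String raw = config.get(key);
        return raw == null ? defaultValue : Integer.parseInt(raw);
    }

    public static int overlapTokens(Map<String, String> config) {
        final int splitAtTokens = getConfigInt(config, "EMBEDDINGS_SPLIT_AT_TOKENS", 512);
        final int overlap = getConfigInt(config, "EMBEDDINGS_OVERLAP_TOKENS", 50);
        // Illustrative extra guard: keep the overlap under half the chunk size
        // so a small splitAtTokens can't make chunks mostly duplicate content.
        return Math.min(overlap, splitAtTokens / 2);
    }
}
```

With defaults this yields 50; with splitAtTokens = 80 the cap pulls the overlap down to 40 instead of the "50% — wasteful" case described above.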

2. Final chunk may be purely duplicate content

After a split, sentences is reset to the overlap. If the BreakIterator loop ends immediately (no further sentences), the if (!sentences.isEmpty()) check at line 106 passes, and we save a chunk consisting entirely of sentences already saved in the previous chunk. The embedding is redundant and increases storage/query noise.

// After computing overlap and resetting:
sentences.addAll(overlap);   // only overlap — no new content yet
...
// Loop ends, saves overlap-only chunk  ← duplicate of end of previous chunk

Fix: track whether any new sentence was added after the last split before saving the final chunk. Set a flag, boolean hasNewContentAfterSplit = false, to true each time a sentence is added after a reset, then guard the final save with if (!sentences.isEmpty() && hasNewContentAfterSplit).

3. Overshoot is unbounded, not just "~50"

while (overlapStart > 0 && overlapCount < overlapTokens) {
    overlapStart--;
    overlapCount += tokenCounts.get(overlapStart);  // adds full sentence
}

The loop exits only after adding an entire sentence. If the first sentence traversed backward has 800 tokens, overlapCount becomes 800. The comment says "~50 tokens" but in practice it's "the minimum number of trailing sentences whose combined tokens ≥ 50." For content with long sentences this can be a significantly larger overlap than intended. This is low severity but worth knowing — and a reason to cap or log the actual overlap size.
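One way to cap the overshoot, sketched with a hypothetical maxOverlapTokens parameter that is not in the PR: refuse to take a trailing sentence that would push the overlap past the cap.

```java
import java.util.List;

// Sketch of bounding the backward walk. maxOverlapTokens is an illustrative
// cap, not something the PR implements.
public class CappedOverlap {

    // Returns the index in tokenCounts where the overlap should start,
    // stopping before any sentence that would overshoot the cap.
    public static int overlapStart(List<Integer> tokenCounts,
                                   int overlapTokens,
                                   int maxOverlapTokens) {
        int start = tokenCounts.size();
        int count = 0;
        while (start > 0 && count < overlapTokens) {
            int next = tokenCounts.get(start - 1);
            if (count + next > maxOverlapTokens) {
                break;  // taking this sentence would overshoot the cap
            }
            start--;
            count += next;
        }
        return start;
    }
}
```

The trade-off: a trailing 800-token sentence now produces no overlap at all rather than an 800-token one, which is why logging the actual overlap size is worth doing either way.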

4. Operational: existing embeddings are now stale

This changes the chunking algorithm. Any contentlet indexed before this PR produces non-overlapping chunks; after this PR they produce overlapping ones. Mixed embedding stores will give inconsistent retrieval quality until content is fully reindexed. This should be called out in the release notes or migration guide.


Minor

  • The old final-save guard buffer.toString().split("\\s+").length > 0 was always true even for empty buffers (Java's "".split(...) returns [""]). The new !sentences.isEmpty() is correct — good fix.
  • Removing the StringBuilder in favour of parallel List<String>/List<Integer> is cleaner and necessary for backward traversal. No issues there.
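The old guard's flaw is easy to demonstrate in isolation: Java's String.split on an empty string returns a one-element array containing "", so the length check could never fail.

```java
// Demonstrates why the old final-save guard was always true:
// "".split("\\s+") yields {""}, an array of length 1, so .length > 0 holds
// even when the buffer is empty.
public class SplitGuardDemo {
    public static int oldGuardLength(String buffer) {
        return buffer.split("\\s+").length;
    }
}
```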

Overall the logic is sound and the overlap algorithm is correct for the happy path. The two actionable issues are the missing AppKeys entry (#1) and the duplicate-tail-chunk edge case (#2).

@github-actions github-actions Bot added the Area : Backend PR changes Java/Maven backend code label Apr 17, 2026

Labels

AI: Safe To Rollback Area : Backend PR changes Java/Maven backend code


Development

Successfully merging this pull request may close these issues.

Improve our embedding splitting

2 participants