
feat(ai): add 50-token overlap between embedding splits#35367

Open
wezell wants to merge 1 commit into main from
issue-35366-overlap-embedding-tokens

Conversation

@wezell
Member

@wezell wezell commented Apr 17, 2026

Summary

  • Modifies EmbeddingsRunner to carry a ~50-token overlap from the end of each chunk into the start of the next
  • Replaces the single StringBuilder buffer with parallel List<String> sentences / List<Integer> tokenCounts lists to enable backward traversal for overlap computation
  • Adds ArrayList import

Motivation

Without overlap, context at chunk boundaries is lost — a sentence split across two chunks has neither half with full surrounding context. A 50-token trailing overlap ensures semantic continuity between consecutive embedding chunks.

Test plan

  • Verify embeddings are generated without errors for a contentlet with long text (>splitAtTokens)
  • Confirm adjacent chunks share ~50 tokens of trailing text from the previous chunk
  • Confirm no regression for short content (fewer tokens than splitAtTokens) — still saved as a single chunk

Closes #35366

🤖 Generated with Claude Code

Now overlaps 15-20%
The change is in EmbeddingsRunner.java:62-101. Here's what changed:

  - Before: on each split, the buffer was cleared entirely, so the new chunk started with only the sentence that triggered the overflow.
  - After: on each split, the code walks backward through the sentence list to collect trailing sentences totaling ≥50 tokens, carries those into the next chunk as overlap, then resets totalTokens to the overlap size.
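The before/after behavior can be sketched roughly like this. This is a simplified stand-in for illustration, not the actual EmbeddingsRunner code: sentences arrive pre-tokenized, and splitAtTokens / overlapTokens are plain parameters here.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the overlap-carrying split, not the real EmbeddingsRunner.
public class OverlapChunker {

    // Splits sentences (with precomputed token counts) into chunks of at most
    // splitAtTokens tokens, carrying >= overlapTokens of trailing sentences
    // from each chunk into the start of the next.
    public static List<List<String>> chunk(List<String> sentences,
                                           List<Integer> tokenCounts,
                                           int splitAtTokens,
                                           int overlapTokens) {
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        List<Integer> counts = new ArrayList<>();
        int totalTokens = 0;

        for (int i = 0; i < sentences.size(); i++) {
            int tokens = tokenCounts.get(i);
            if (totalTokens + tokens > splitAtTokens && !current.isEmpty()) {
                chunks.add(new ArrayList<>(current));
                // Walk backward to collect trailing sentences totaling >= overlapTokens.
                int start = current.size();
                int overlap = 0;
                while (start > 0 && overlap < overlapTokens) {
                    start--;
                    overlap += counts.get(start);
                }
                // Carry the trailing sentences into the next chunk as overlap.
                current = new ArrayList<>(current.subList(start, current.size()));
                counts = new ArrayList<>(counts.subList(start, counts.size()));
                totalTokens = overlap;  // reset to the overlap size
            }
            current.add(sentences.get(i));
            counts.add(tokens);
            totalTokens += tokens;
        }
        if (!current.isEmpty()) {
            chunks.add(current);
        }
        return chunks;
    }
}
```

With four 30-token sentences, splitAtTokens = 60, and overlapTokens = 20, each chunk repeats the last sentence of the previous one: [s1,s2], [s2,s3], [s3,s4].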

ref: #35366
@claude
Contributor

claude Bot commented Apr 17, 2026

Claude finished @wezell's task in 2m 14s


PR Review: feat(ai): add 50-token overlap between embedding splits

  • Read changed file
  • Analyze implementation
  • Post review

Issues

1. Hard-coded magic number — overlapTokens = 50 (line 64)

splitAtTokens is configurable via AppKeys.EMBEDDINGS_SPLIT_AT_TOKENS (default 512). The overlap size has no equivalent key and cannot be tuned. At 512 tokens, 50-token overlap is ~10% — reasonable. At 100 tokens, it's 50% — wasteful and semantically weird. These two values are coupled and should be configured together.

Minimal fix: add an AppKeys entry (e.g., EMBEDDINGS_OVERLAP_TOKENS, default "50") and read it the same way splitAtTokens is read.
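A sketch of the shape of that fix, with a plain Map standing in for dotCMS app config. The getConfigInt helper and the half-chunk cap are illustrative additions, not actual dotCMS API; the real code should mirror however splitAtTokens is already read from AppKeys.

```java
import java.util.Map;

// Hypothetical sketch of making the overlap configurable alongside splitAtTokens.
// The Map stands in for app config; key names follow the suggestion above.
public class OverlapConfig {

    static int getConfigInt(Map<String, String> config, String key, int defaultValue) {
        final String raw = config.get(key);
        return raw == null ? defaultValue : Integer.parseInt(raw);
    }

    public static int overlapTokens(Map<String, String> config) {
        final int splitAtTokens = getConfigInt(config, "EMBEDDINGS_SPLIT_AT_TOKENS", 512);
        final int overlap = getConfigInt(config, "EMBEDDINGS_OVERLAP_TOKENS", 50);
        // Illustrative extra guard: keep the overlap under half the chunk size
        // so a small splitAtTokens can't make chunks mostly duplicate content.
        return Math.min(overlap, splitAtTokens / 2);
    }
}
```

With defaults this yields 50; with splitAtTokens = 80 the cap pulls the overlap down to 40 instead of the "50% — wasteful" case described above.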

2. Final chunk may be purely duplicate content

After a split, sentences is reset to the overlap. If the BreakIterator loop ends immediately (no further sentences), the if (!sentences.isEmpty()) check at line 106 passes, and we save a chunk consisting entirely of sentences already saved in the previous chunk. The embedding is redundant and increases storage/query noise.

// After computing overlap and resetting:
sentences.addAll(overlap);   // only overlap — no new content yet
...
// Loop ends, saves overlap-only chunk  ← duplicate of end of previous chunk

Fix: track whether any new sentence was added after the last split before saving the final chunk. Set a flag, boolean hasNewContentAfterSplit = false, to true each time a sentence is added after a reset, then guard the final save with if (!sentences.isEmpty() && hasNewContentAfterSplit).

3. Overshoot is unbounded, not just "~50"

while (overlapStart > 0 && overlapCount < overlapTokens) {
    overlapStart--;
    overlapCount += tokenCounts.get(overlapStart);  // adds full sentence
}

The loop exits only after adding an entire sentence. If the first sentence traversed backward has 800 tokens, overlapCount becomes 800. The comment says "~50 tokens" but in practice it's "the minimum number of trailing sentences whose combined tokens ≥ 50." For content with long sentences this can be a significantly larger overlap than intended. This is low severity but worth knowing — and a reason to cap or log the actual overlap size.
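One way to cap the overshoot, sketched with a hypothetical maxOverlapTokens parameter that is not in the PR: refuse to take a trailing sentence that would push the overlap past the cap.

```java
import java.util.List;

// Sketch of bounding the backward walk. maxOverlapTokens is an illustrative
// cap, not something the PR implements.
public class CappedOverlap {

    // Returns the index in tokenCounts where the overlap should start,
    // stopping before any sentence that would overshoot the cap.
    public static int overlapStart(List<Integer> tokenCounts,
                                   int overlapTokens,
                                   int maxOverlapTokens) {
        int start = tokenCounts.size();
        int count = 0;
        while (start > 0 && count < overlapTokens) {
            int next = tokenCounts.get(start - 1);
            if (count + next > maxOverlapTokens) {
                break;  // taking this sentence would overshoot the cap
            }
            start--;
            count += next;
        }
        return start;
    }
}
```

The trade-off: a trailing 800-token sentence now produces no overlap at all rather than an 800-token one, which is why logging the actual overlap size is worth doing either way.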

4. Operational: existing embeddings are now stale

This changes the chunking algorithm. Any contentlet indexed before this PR produces non-overlapping chunks; after this PR they produce overlapping ones. Mixed embedding stores will give inconsistent retrieval quality until content is fully reindexed. This should be called out in the release notes or migration guide.


Minor

  • The old final-save guard buffer.toString().split("\\s+").length > 0 was always true even for empty buffers (Java's "".split(...) returns [""]). The new !sentences.isEmpty() is correct — good fix.
  • Removing the StringBuilder in favour of parallel List<String>/List<Integer> is cleaner and necessary for backward traversal. No issues there.
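The old guard's flaw is easy to demonstrate in isolation: Java's String.split on an empty string returns a one-element array containing "", so the length check could never fail.

```java
// Demonstrates why the old final-save guard was always true:
// "".split("\\s+") yields {""}, an array of length 1, so .length > 0 holds
// even when the buffer is empty.
public class SplitGuardDemo {
    public static int oldGuardLength(String buffer) {
        return buffer.split("\\s+").length;
    }
}
```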

Overall the logic is sound and the overlap algorithm is correct for the happy path. The two actionable issues are the missing AppKeys entry (#1) and the duplicate-tail-chunk edge case (#2).

@github-actions github-actions Bot added the Area : Backend PR changes Java/Maven backend code label Apr 17, 2026

Labels

AI: Safe To Rollback Area : Backend PR changes Java/Maven backend code


Development

Successfully merging this pull request may close these issues.

Improve our embedding splitting

2 participants