feat(salience): dedup near-duplicate memories by source_key / title #152
Merged
Caught 2026-05-21 on first real prod-vault scoring output: 3 of the
auto-selected top-10 were the same memory ("System-wide deploy
changelog" at ids 600, 617, 644 — same title, content edited
iteratively across sessions, three different content_hashes).
Bare content-hash dedup misses this — those memories were
near-duplicates not byte-duplicates. The right dedup primitive is
mnemon's source_key (post-rc16 canonical identity via
store.save's upsert-by-slug) falling back to title for pre-rc16
saves.
Dedup key priority:
1. source_key — post-rc16 canonical identity
2. title — lowercased + stripped, catches pre-rc16 dupes
3. id — no-title memories or genuinely unique titles
Keep most recent (highest id) per dedup key — has the most current
title / confidence / content metadata.
Verified against prod snapshot (/tmp/mnemon-prod-snap.sqlite):
BEFORE: Live memories: 2084 (3 of top-10 are the same memory)
AFTER: Live memories: 1872 (deduped 212 across whole vault)
The 3 deploy-changelog entries collapsed to id 644. Auto-selected
top-10 now picks 9 more distinct candidates instead of triple-counting
the same fact.
Full suite: 801 passed. Harness: 13/13.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Caught on first real prod-vault scoring output: 3 of the auto-selected top-10 were the same memory (`System-wide deploy changelog` at ids 600, 617, 644: same title, content edited iteratively across sessions, three different content_hashes).

Bare content-hash dedup misses this: those memories are near-duplicates, not byte-duplicates. The right dedup primitive is mnemon's `source_key` (post-rc16 canonical identity from `store.save`'s upsert-by-slug), falling back to `title` for pre-rc16 saves.

Dedup key priority

1. `source_key`: post-rc16 canonical identity (rc16 / PR #122, "fix(mirror): upsert by stable slug + bump 0.6.0rc16 (P0)")
2. `title`: lowercased + stripped, catches pre-rc16 dupes
3. `id`: no-title memories or genuinely unique titles (never dedupes with another)

Keep the most recent (highest id) per dedup key; it has the most current title/confidence/content metadata.
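For reviewers who want the priority rule in executable form, here is a minimal sketch of the key selection and keep-highest-id pass. It is not the code from this PR: the dict-shaped memory records and the function names `dedup_key` and `dedup_memories` are hypothetical, and tagging each key with its kind (so a `source_key` never collides with an identical title string) is an assumption of this sketch.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def dedup_key(memory: dict) -> Tuple[str, Hashable]:
    """Pick the dedup key for one memory: source_key, then title, then id."""
    source_key = memory.get("source_key")
    if source_key:
        return ("source_key", source_key)
    # Normalize the title: strip surrounding whitespace and lowercase it.
    title = (memory.get("title") or "").strip().lower()
    if title:
        return ("title", title)
    # No source_key and no usable title: the id is unique, so this
    # memory never collapses into another one.
    return ("id", memory["id"])


def dedup_memories(memories: Iterable[dict]) -> List[dict]:
    """Keep the highest-id memory per dedup key, returned in id order."""
    newest: Dict[Tuple[str, Hashable], dict] = {}
    for memory in memories:
        key = dedup_key(memory)
        kept = newest.get(key)
        if kept is None or memory["id"] > kept["id"]:
            newest[key] = memory
    return sorted(newest.values(), key=lambda m: m["id"])


# Hypothetical sample rows modelled on the deploy-changelog case.
sample = [
    {"id": 600, "title": "System-wide deploy changelog", "source_key": None},
    {"id": 617, "title": " system-wide deploy changelog ", "source_key": None},
    {"id": 644, "title": "System-wide deploy changelog", "source_key": None},
    {"id": 700, "title": None, "source_key": None},
    {"id": 701, "title": "Other", "source_key": "other"},
    {"id": 702, "title": "Other (renamed)", "source_key": "other"},
]
print([m["id"] for m in dedup_memories(sample)])  # [644, 700, 702]
```

In this sketch a memory that has a `source_key` is never merged with an older title-only memory of the same title; whether the real implementation bridges those two cases is not stated here.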
Verified against prod snapshot
212 collapsed — not just the 3 we knew about. The dedup uncovered substantial historical noise.
The 3 deploy-changelog entries collapsed to id 644 (most recent). Auto-selected top-10 now picks 9 more distinct candidates instead of triple-counting the same fact.
Tests
801/801 pytest, 13/13 harness — no regressions.
After merge
🤖 Generated with Claude Code