Problem
Paraphrase accumulation is the most common visible failure mode in the production Qdrant index. The same underlying claim, restated across conversations, produces a cluster of near-duplicate rows.
The extractor already runs a write-time dedup gate: _is_duplicate in kai.memory_extraction is a top-1 semantic-similarity check that fires when the nearest existing fact under the same user_id has cosine similarity >= 0.9 against the candidate. The gate catches the high-similarity case. It misses the band below 0.9, which is where paraphrases tend to score. For example:
"User's GitHub username is <gh_user>" alongside "GitHub: <gh_user>" alongside "User goes by <gh_user> on GitHub" (these often score 0.75-0.88 pairwise depending on phrasing)
"User is in EST timezone" alongside "User is in Eastern time" alongside "User's timezone is Eastern"
"Home workspace is at <path>" alongside "User's home_workspace is configured to <path>"
These clusters bloat retrieval results (multiple near-identical hits crowd out distinct content), and they dilute the per-fact provenance signal #437 ships.
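A minimal sketch of the top-1 check described above, to make the gap concrete. It is not the code in kai.memory_extraction: the function names, the in-memory list standing in for the Qdrant lookup, and the toy 2-D vectors are all assumptions chosen so the scores are easy to read.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_duplicate(candidate, existing, threshold=0.9):
    """Top-1 check: fire when the nearest existing vector scores >= threshold."""
    if not existing:
        return False
    best = max(cosine(candidate, vec) for vec in existing)
    return best >= threshold


# Hypothetical toy embeddings, not real model output.
stored = [(1.0, 0.0)]
near_copy = (0.95, 0.10)    # scores roughly 0.99 against the stored vector
paraphrase = (0.85, 0.53)   # scores roughly 0.85, inside the 0.75-0.88 band

print(is_duplicate(near_copy, stored))                  # caught at 0.9
print(is_duplicate(paraphrase, stored))                 # slips past 0.9
print(is_duplicate(paraphrase, stored, threshold=0.8))  # caught at 0.8
```

With the threshold at 0.9 only the near-copy is dropped; the paraphrase is written as a new row, which is the accumulation this issue targets.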
Mechanism
Tune the existing _is_duplicate gate. Lift its hardcoded threshold=0.9 into a configurable env var. The code default stays at 0.9 for safety; production sets the value the threshold sweep picks (expected band: 0.80-0.88). Reuse the existing call site, the existing outcome string (dropped_duplicate), and the existing log shape. Add the cosine score and the truncated candidate text to the log line so post-merge audit is possible.
This is NOT a new gate. It is a calibration of an existing gate that ships at a conservative threshold and has not been retuned since #321 restored it. The sub is "make the threshold env-tunable and pick a better default," not "build a second semantic gate."
v1 scope: drop, not merge
Keep the drop-on-fire semantics the existing gate already has: existing fact wins, new content is discarded. Rationale:
Merge would require a per-field metadata reconciliation rule (speaker, confidence, tags union, confirmation_quote preservation, updated_at bump, created_at retention) that is its own design surface.
Drop is what the current gate already does; this sub is calibration, not behavior change.
The dropped content's loss is bounded: by definition it is a paraphrase of an existing fact already in the index, so retrieval of the underlying claim is preserved.
Merge is filed as a v2 follow-up rather than scoped into this sub.
Integration point
src/kai/memory_extraction.py:
_is_duplicate(content, user_id, threshold=0.9) at line ~1609: signature change to accept threshold from config rather than defaulting; or call-site change to pass the config value in.
Call site at line ~2326 (if _is_duplicate(content, user_id):): passes config.memory_duplicate_threshold.
Log emit at line ~2332 (outcome="dropped_duplicate"): extends to include the truncated candidate text and the cosine score for the gate-fire so post-merge audit can scan log lines without correlation.
src/kai/config.py: new env var MEMORY_DUPLICATE_THRESHOLD (float, validated [0.0, 1.0] with the operator-floor recommendation noted in the spec).
src/kai/_emit_intent_log (or wherever the emit helper lives): may need extension for the new cosine and content_preview fields; spec resolves.
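The config side of the list above could look roughly like this. It is a sketch under assumptions: the loader name and the error messages are hypothetical, and how src/kai/config.py actually reads and validates env vars is not shown in this issue. Only the variable name, the 0.9 default, and the [0.0, 1.0] range come from the text.

```python
import os

DEFAULT_DUPLICATE_THRESHOLD = 0.9  # code default stays conservative


def load_duplicate_threshold(env=None):
    """Read MEMORY_DUPLICATE_THRESHOLD, falling back to the default when unset."""
    env = os.environ if env is None else env
    raw = env.get("MEMORY_DUPLICATE_THRESHOLD")
    if raw is None or raw.strip() == "":
        return DEFAULT_DUPLICATE_THRESHOLD
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(
            f"MEMORY_DUPLICATE_THRESHOLD must be a float, got {raw!r}"
        ) from None
    # The chained comparison also rejects NaN, which fails both bounds.
    if not 0.0 <= value <= 1.0:
        raise ValueError(
            f"MEMORY_DUPLICATE_THRESHOLD must be in [0.0, 1.0], got {value}"
        )
    return value
```

The call site would then pass the loaded value through, for example _is_duplicate(content, user_id, threshold=config.memory_duplicate_threshold), so that per-deployment retuning needs no code change.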
Acceptance criteria
On a sandbox replay over a representative chat-history window with the production-candidate threshold, the post-replay sandbox has no pair of facts with cosine similarity > T (the production threshold) to each other.
Gate fires are observable: every fire emits one memory.consolidate.intent log line with outcome=dropped_duplicate, carrying user_id, the surviving fact's id, a truncated candidate text, and the cosine value.
Threshold flows from config, and setting MEMORY_DUPLICATE_THRESHOLD=1.0 disables the gate. Under the existing >= comparison an exact duplicate (cosine 1.0) would still fire at T=1.0, so either the comparison becomes strict-greater (no cosine can strictly exceed 1.0) or the disable value becomes a sentinel; spec resolves.
Threshold sweep at {0.75, 0.80, 0.85, 0.88} against a fresh sandbox per value produces a fire-rate gradient (lower T = more fires).
Operator audit at the production T (sample N <= 10 borderline fires) shows acceptable false-merge rate; spec specifies the bar.
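The kill-switch criterion above turns on the comparison operator, so here is a small sketch of the two options the spec has to choose between. Both helper names are hypothetical, and the clamp to 1.0 is an added assumption: floating-point rounding can push a computed cosine marginally above 1.0, which would otherwise defeat a strict-greater kill switch.

```python
def gate_fires(score, threshold, strict=False):
    """Compare a top-1 cosine score against the threshold.

    strict=False mirrors the existing >= semantics; strict=True is the
    strict-greater variant floated in the acceptance criteria.
    """
    score = min(score, 1.0)  # assumption: clamp rounding overshoot
    return score > threshold if strict else score >= threshold


def gate_fires_with_sentinel(score, threshold):
    """Alternative: keep >= and treat threshold=None as 'gate disabled'."""
    if threshold is None:
        return False
    return min(score, 1.0) >= threshold


# An exact re-statement of a stored fact scores 1.0.
print(gate_fires(1.0, 1.0))                # >= : still fires at T=1.0
print(gate_fires(1.0, 1.0, strict=True))   # >  : T=1.0 disables the gate
print(gate_fires_with_sentinel(1.0, None)) # sentinel: disabled explicitly
```

The strict-greater option changes behavior at exact equality for every T, while the sentinel option leaves the comparison untouched and adds one config state; which trade-off is preferable is left to the spec.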
Eval methodology
Replay uses kai.eval.replay from #464 (Memory extraction: replace exclusion stanzas with positive-criterion test), over a 7-14 day window.
Threshold sweep: run replay at T in {0.75, 0.80, 0.85, 0.88} (the band currently letting paraphrases through), against fresh sandbox user_ids per T. Plot gate-fire rate per T. T values >= 0.9 reproduce current production behavior (no change) and are not part of the sweep.
Pairwise-cosine cluster count post-replay: for each sandbox, count fact pairs with cosine > T_audit, where T_audit is held fixed (typically 0.85) regardless of the run's T. Headline: cluster-density reduction vs the T=0.9 baseline (current production).
Audit sample: operator scans N <= 10 borderline fires (cosine in [T-0.03, T+0.03]) at the production-candidate T and judges per-fire correctness. This follows the rubric from #467 (feat: positive-criterion extraction prompt): no large-N bucket classification.
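The three measurements above can be sketched end to end on toy data. This is an illustration, not kai.eval.replay: the in-memory list stands in for a fresh sandbox, every function name is hypothetical, and the candidates are unit vectors at hand-picked angles rather than real embeddings.

```python
import math
from itertools import combinations


def unit(deg):
    """2-D unit vector at the given angle; stands in for an embedding."""
    rad = math.radians(deg)
    return (math.cos(rad), math.sin(rad))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def replay(candidates, threshold):
    """Replay candidates into a fresh index; return (kept, fire_scores)."""
    kept, fires = [], []
    for vec in candidates:
        best = max((cosine(vec, k) for k in kept), default=-1.0)
        if best >= threshold:
            fires.append(best)   # gate fired: candidate dropped
        else:
            kept.append(vec)     # no near neighbor: candidate stored
    return kept, fires


def pair_count(vectors, t_audit=0.85):
    """Count stored pairs whose cosine exceeds the fixed audit threshold."""
    return sum(1 for a, b in combinations(vectors, 2) if cosine(a, b) > t_audit)


def borderline(fires, threshold, width=0.03, limit=10):
    """Fires within +/- width of T, capped at the audit sample size.

    Fires always score >= T, so in practice they land in the upper half.
    """
    band = [s for s in fires if threshold - width <= s <= threshold + width]
    return band[:limit]


# Hypothetical replay stream: two loose clusters around 0 and 180 degrees.
CANDIDATES = [unit(d) for d in (0, 5, 27, -30, 180, 215, 140)]

for t in (0.90, 0.88, 0.85, 0.80, 0.75):  # 0.90 is the baseline, not a sweep point
    kept, fires = replay(CANDIDATES, t)
    print(t, len(fires) / len(CANDIDATES), pair_count(kept), borderline(fires, t))
```

Each T gets its own replay because a dropped candidate never enters the index, which changes what later candidates are compared against; scoring a single recorded run at several thresholds would not reproduce that.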
Observability
Existing log line: memory.consolidate.intent {..., outcome: "dropped_duplicate", ...}. Extended to include cosine (the float similarity) and content_preview (truncated candidate text, ~80-120 chars).
No new log surface, no new outcome string.
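A sketch of what the extended line might carry, assuming a JSON-per-line log. The helper names, the "event" key, the 100-character preview length (picked from the ~80-120 range), and the 4-decimal rounding are all assumptions; the real emit helper and its field layout are for the spec to confirm.

```python
import json
import logging

logger = logging.getLogger("kai.memory")  # hypothetical logger name


def content_preview(text, limit=100):
    """Collapse whitespace and truncate to `limit` characters."""
    flat = " ".join(text.split())
    if len(flat) <= limit:
        return flat
    return flat[: limit - 3] + "..."


def build_drop_record(user_id, surviving_id, candidate, cosine):
    """Assemble the fields named in this issue for one gate fire."""
    return {
        "event": "memory.consolidate.intent",
        "outcome": "dropped_duplicate",
        "user_id": user_id,
        "id": surviving_id,               # the surviving fact's id
        "cosine": round(float(cosine), 4),
        "content_preview": content_preview(candidate),
    }


def emit_drop(user_id, surviving_id, candidate, cosine):
    """Emit one log line per gate fire and return the record for testing."""
    record = build_drop_record(user_id, surviving_id, candidate, cosine)
    logger.info(json.dumps(record, sort_keys=True))
    return record
```

Keeping the score and the preview on the same line is what lets an operator scan borderline fires without joining against another log or the index.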
Prerequisites
None. The embedding model, memory.search primitive, and the _is_duplicate call site all exist. The work is calibration plus config-plumbing.
Open spec questions
Exact production threshold T (sweep answers this).
Whether to add a strict-greater comparison alongside lowering the threshold (current code uses >=, which makes the kill-switch shape via T=1.0 require attention).
Whether the dropped candidate's tags should be UNIONED into the surviving fact (preserves topical coverage at zero metadata cost) or fully discarded. v1 default is fully discarded.
Counter persistence: log-line-only vs metadata bump on the surviving row (paraphrase_hit_count++). v1 default is log-only.
Risks
Per-deployment threshold variance: 0.9 was a default; a lower retuned T might be too aggressive on a different operator's conversational style. Mitigation: sweep methodology + audit sample + env-var-tunable so per-deployment retuning is possible without code change.
False-merge content loss: a candidate that is actually a new fact but happens to be paraphrase-close to an existing one is silently dropped. Mitigation: audit sample, observable log surface, drop-not-merge (preserves the existing fact rather than overwriting).
Hot-path cost: unchanged. The existing gate already pays one memory.search per intent="new" fact. This sub does not add a second search call.
Parent: #436 (Sub 2 of the memory-quality epic). Sibling subs: #437 speaker attribution (merged), #464 positive-criterion prompt (merged), #466 janitor decay (filed).