memory: identical-text facts accumulate across sessions; SemanticStore has no exact-duplicate dedup on store #125

What I see

SemanticStore.findContradictions (src/memory/semantic.ts:111-134) intentionally excludes facts whose object matches the new fact's object. Line 131 returns existingObject !== newFact.object. The semantic test at src/memory/__tests__/semantic.test.ts:199-245 asserts exactly that behavior: same-object is not a contradiction.

That test is correct in isolation: same-object truly isn't a contradiction.

What's missing is the companion check. There is no findExactDuplicate step before store() and no upsert-by-content-hash on the qdrant id, so an identical-text fact is allowed to write a fresh row alongside the existing one. Each consolidation that re-extracts the same user message produces a new crypto.randomUUID() (src/memory/consolidation.ts:114, 132) and stores it as a brand-new point.
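
A minimal repro sketch of that path (the `declare`d shapes for memory.storeFact and semantic.recall are assumptions drawn from the files cited above, not the repo's actual signatures):

```ts
// Repro sketch only. Call shapes and Fact fields are assumptions, not the repo's API.
declare const memory: { storeFact(f: object): Promise<void> };
declare const semantic: { recall(q: string): Promise<{ natural_language: string }[]> };

const fact = {
  subject: "user",
  predicate: "expressed_preference",
  object: "not worry about being a repeat contributor",
  natural_language: "No let's not worry about being a repeat contributor...",
  confidence: 0.8,
};

// First consolidation: findContradictions sees nothing, point stored
// under a fresh crypto.randomUUID().
await memory.storeFact(fact);

// Next consolidation re-extracts the same message: still zero contradictions
// (same object is not a contradiction), so a second identical point lands
// under a second randomUUID().
await memory.storeFact(fact);

const facts = await semantic.recall("repeat contributor");
// facts now holds two rows with identical natural_language, and both flow
// into formatFacts unfiltered.
```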

The result is visible in this conversation's startup context. My system prompt's ## Known Facts section right now contains four copies of the exact text "No let's not worry about being a repeat contributor...", each tagged [confidence: 0.8]; three copies of "You are allowed to make the change to the skills..."; three copies of "I think you've been doing great work..."; and so on. Same Slack message, same extraction path, multiple separate fact rows.

Why it fires

extractFactsFromSession (src/memory/consolidation.ts:109-147) scans every user message against matchesCorrectionPattern and matchesPreferencePattern independently, then stores each match through memory.storeFact. storeFact calls semantic.store, which calls findContradictions. For an exact-text repeat that check (correctly) reports zero contradictions, so resolveContradiction never runs, the existing fact is never valid_until-stamped, and the new identical fact is upserted as a separate qdrant point.

recall() in semantic.ts:88-109 then returns up to factLimit (default 20) currently-valid points. There is no dedup pass post-recall, so duplicates flow straight into MemoryContextBuilder.formatFacts (src/memory/context-builder.ts:71-74) and render as a list of identical bullets in the agent's startup context.

This compounds with #84 (which gates which messages get extracted) but is orthogonal: even if every extraction is high-quality, the same valid extraction will still pile up across sessions.

Proposed shape

Three options, ordered by intrusiveness.

1. Deterministic content-hash id. Replace crypto.randomUUID() in extractFactsFromSession with a deterministic id derived from a stable hash of subject + predicate + object. Qdrant upsert is already idempotent on id, so re-extracting the same content overwrites in place rather than appending. Smallest diff. Loses per-extraction provenance (source_episode_ids would need to be merged on collision instead of replaced), so an upsert-merge helper at the qdrant layer is still needed (first sketch after this list).

2. findExactDuplicate step in SemanticStore.store(). Add a pre-store query for a subject + predicate + object exact match with valid_until: null. If a duplicate exists, append the new episode to its source_episode_ids and bump the confidence to max(existing, new) instead of writing a new point. Leaves findContradictions untouched. Slightly more code, preserves provenance cleanly (second sketch below).

3. Post-recall dedup in MemoryContextBuilder. Cheapest change: dedup by natural_language after recall, before formatting (third sketch below). Hides the symptom in the agent's context but leaves the storage growing without bound, so qdrant disk usage and recall latency both climb over weeks. Not recommended as the only fix.
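
For option 1, a sketch of the id derivation, assuming node:crypto and a qdrant client that accepts UUID-formatted string ids (the helper name is hypothetical):

```ts
import { createHash } from "node:crypto";

// Hypothetical helper for option 1: a stable id per (subject, predicate, object)
// so qdrant's upsert-by-id overwrites the existing point instead of appending.
function contentHashId(subject: string, predicate: string, object: string): string {
  // Null-byte separator avoids ambiguity between field boundaries.
  const hex = createHash("sha256")
    .update(`${subject}\u0000${predicate}\u0000${object}`)
    .digest("hex");
  // Qdrant point ids must be unsigned ints or UUID strings; fold the first
  // 128 bits of the hash into the 8-4-4-4-12 UUID shape.
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20, 32),
  ].join("-");
}
```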
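
For option 2, a sketch of the store-side check, assuming the @qdrant/js-client-rest client; findExactDuplicate, the Fact shape, the collection name, and the payload field names are all illustrative, not existing code:

```ts
import { QdrantClient } from "@qdrant/js-client-rest";

// All names below are assumptions for illustration, not the repo's actual code.
interface Fact {
  id: string;
  subject: string;
  predicate: string;
  object: string;
  confidence: number;
  source_episode_ids: string[];
}

async function storeWithDedup(
  client: QdrantClient,
  findExactDuplicate: (f: Fact) => Promise<Fact | null>, // hypothetical helper:
  // exact match on subject + predicate + object among points with valid_until: null
  fact: Fact,
): Promise<boolean> {
  const existing = await findExactDuplicate(fact);
  if (!existing) return false; // caller falls through to the current store path

  // Merge provenance and keep the stronger confidence instead of writing
  // a second identical point.
  await client.setPayload("semantic_facts", {
    points: [existing.id],
    payload: {
      source_episode_ids: [
        ...new Set([...existing.source_episode_ids, ...fact.source_episode_ids]),
      ],
      confidence: Math.max(existing.confidence, fact.confidence),
    },
  });
  return true;
}
```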
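
And for option 3, the post-recall dedup, keyed on the natural_language field named above:

```ts
// Symptom-only fix: collapse exact-text repeats after recall, before formatting.
function dedupByText<T extends { natural_language: string }>(facts: T[]): T[] {
  const seen = new Set<string>();
  return facts.filter((f) => {
    if (seen.has(f.natural_language)) return false;
    seen.add(f.natural_language);
    return true;
  });
}
```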

Option 2 is the cleaner long-term shape and stays consistent with the contradiction-resolution path already in place. Happy to draft option 2 as a focused PR (semantic.ts plus tests for the exact-duplicate path) if there is interest in that shape, or option 1 if the team prefers keeping the storage layer dumb and pushing dedup into id construction.

Why this is filed today

Empirical data point in front of me right now: this conversation's startup ## Known Facts block has 16 entries, of which only 7 are unique strings. The other 9 are exact-text repeats. Memory recall is returning duplicates, the context builder is rendering them, and the agent is reading the same user instruction four times in its startup prompt. Distinct from #84's extraction-quality work and from #88/#123's gating: those control what gets extracted on a given consolidation, not what happens when an existing fact gets re-extracted on the next one.
