v4.10.0 — Personal-notes link-discovery audit (V4 #43)
MINOR. The read-only complement to v4.9.0's vault lint. Where vault_lint.py checks the agent-shaped AgentMemory/ entries and skips the operator's free-form personal notes, this audits those skipped notes for missing connections between them — "these two notes look related but aren't [[linked]]." The personal-notes corpus is richly written but essentially ungraphed (a handful of ~390 notes carry tags, one has a wikilink, frontmatter is just title/created/updated), so relatedness is content-based. It is strictly read-only (DC-1) and strictly personal↔personal (DC-2) — a personal note is never link-suggested to an AgentMemory/ entry, enforced by excluding AgentMemory/ from the corpus as both source and target. The operator applies suggestions by hand (A3 — these are his notes). Single-repo release; crickets untouched.
Added
harness/skills/memory/scripts/notes_link_discovery.py— read-only missing-link audit. Two independent relatedness signals over the personal-notes corpus (the Obsidian root excludingAgentMemory/,.obsidian/,.trash,.git): TF-IDF lexical overlap (always on) and embedding semantic similarity (opt-in--embeddings). The TF-IDF path hand-rolls sublinear-tf × IDF over title+body (title double-weighted), L2-normalizes, and cosine-scores via an inverted index (never an O(n²) cross product); the embedding path embeds each note with the memory skill's local BGE model (embed.py) and full-pairwise-cosines cached vectors. Already-[[linked]]pairs (by stem or relative path) are excluded; CLI--vault,--format json|text,--top,--min-score,--embeddings,--mode,--embed-min-score.--reportmode. Writes an operator-review markdown report to<vault>/_meta/notes-links-<date>.md(mirroringvault-lint-<date>.md) — ranked pairs with folder/title, the top shared distinctive terms (the why), the score, and paste-ready bidirectional[[wikilinks]]. The report renders both signals: a "Shared-vocabulary (TF-IDF)" section flagging pairs embeddings also confirm (✓ also semantically related), plus a "Semantically related (embedding — TF-IDF missed these)" section for the new coverage. The report file is the only write the audit ever makes; it refuses any--outoutside the agent-controlled vault or onto a personal note.- Separate personal-notes embedding cache at
<vault>/_meta/notes-embeddings.json— content-hash keyed so re-runs only re-embed changed notes (live: ~27s cold → ~2s warm on 392 notes), deliberately never the AgentMemoryvec-index.db(DC-2). Graceful-skips to TF-IDF-only whensentence-transformersis absent.
Internal
- Clip-noise cleaning (live-dogfood driven). The personal notes are largely pasted HTML, so the first dogfood saw hex colors (
fffaa5), CSS tokens (serif), and image refs (image1) dominating the shared-terms. The tokenizer now strips<style>/<script>blocks, HTML tags,{…}CSS rules,#hexcolors, image embeds, and URLs, drops hex-id/media tokens, and carries a compact Spanish stopword set (the corpus is bilingual) — kept conservative so common words that double as CSS keywords (family/width/color/times) are not stopworded and a family-history corpus keeps its own vocabulary. - Three adversarial reviews, three real fixes. (1) A
max_df-band collapse that silently dropped a genuine pair's shared terms on tiny corpora; (2) a severe--outhole that would have overwritten a personal note (destructive DC-1 violation) — now guarded; (3) a stale-dimension embedding-cache reuse (the documentedEMBEDDING_DIM384→1024 upgrade / model swap) that_cosine_unit'szipwould truncate into false-positive scores — now re-embeds stale-dim entries so the vector set is always uniform (mirrorsvec_index.py's dim-rebuild). Each fix shipped with a regression test. - Live dogfood (392 notes): 40 TF-IDF + 16 embedding-only suggestions; 25/40 cross-confirmed. Standout embedding-only finds TF-IDF structurally cannot make: the same Stake Conference talk in
[SP]and[EN](cross-language, cosine 0.988) and a person's Birth/Baptism certificates (same person, different docs). - +36 tests (
scripts/test_notes_link_discovery.py; the engine lives in the skill dir, tests inscripts/so CI'sunittest discoverruns them). 406 → 426. The how-to page runs 14 words over the 600-word soft ceiling — kept intact as a load-bearing worked recipe (two optional signals + the read-only/paste-by-hand flow).
Cross-references
- agentm v4.9.0 — prior release (the
AgentMemory/-entry lint this complements). - ROADMAP-V4 item #43 (bucket ① V4-finish). Deferred follow-ups: real-time / watcher re-embedding (re-run is fine at this corpus size); packaging into the crickets personal-notes bundle (V4 ④).