
v4.10.0 — Personal-notes link-discovery audit (V4 #43)


@alexherrero alexherrero released this 30 May 07:39
· 244 commits to main since this release

MINOR. The read-only complement to v4.9.0's vault lint. Where vault_lint.py checks the agent-shaped AgentMemory/ entries and skips the operator's free-form personal notes, this release audits those skipped notes for missing connections between them: "these two notes look related but aren't [[linked]]." The personal-notes corpus is richly written but essentially ungraphed (only a handful of the ~390 notes carry tags, one has a wikilink, and frontmatter is just title/created/updated), so relatedness has to be inferred from content. The audit is strictly read-only (DC-1) and strictly personal↔personal (DC-2): a personal note is never suggested as a link to or from an AgentMemory/ entry, which is enforced by excluding AgentMemory/ from the corpus as both source and target. The operator applies suggestions by hand (A3: these are his notes). Single-repo release; crickets untouched.
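As a rough illustration of the DC-2 exclusion described above, the corpus walk might look something like the sketch below. The directory names come from these notes; the function name `collect_personal_notes` and the `.md`-only filter are assumptions made for illustration, not the script's actual code.

```python
from pathlib import Path

# Directory names are taken from the release notes; everything else in this
# sketch (function name, ".md"-only filter) is a hypothetical illustration.
EXCLUDED_DIRS = {"AgentMemory", ".obsidian", ".trash", ".git"}


def collect_personal_notes(vault: Path) -> list[Path]:
    """Return markdown notes under `vault`, skipping excluded directories."""
    notes = []
    for path in sorted(vault.rglob("*.md")):
        ancestors = path.relative_to(vault).parts[:-1]
        # A note is dropped if any directory above it is excluded, so it can
        # never appear as either the source or the target of a suggestion.
        if any(part in EXCLUDED_DIRS for part in ancestors):
            continue
        notes.append(path)
    return notes
```

Because every later stage only sees the paths returned here, excluding a directory once at collection time is enough to keep it out of both sides of every suggested pair.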

Added

  • harness/skills/memory/scripts/notes_link_discovery.py: read-only missing-link audit. It computes two independent relatedness signals over the personal-notes corpus (the Obsidian root excluding AgentMemory/, .obsidian/, .trash, .git): TF-IDF lexical overlap (always on) and embedding semantic similarity (opt-in via --embeddings). The TF-IDF path hand-rolls sublinear-tf × IDF over title+body (title double-weighted), L2-normalizes, and cosine-scores via an inverted index (never an O(n²) cross product). The embedding path embeds each note with the memory skill's local BGE model (embed.py) and computes full pairwise cosines over the cached vectors. Already-[[linked]] pairs (matched by stem or relative path) are excluded. CLI flags: --vault, --format json|text, --top, --min-score, --embeddings, --mode, --embed-min-score.
  • --report mode. Writes an operator-review markdown report to <vault>/_meta/notes-links-<date>.md (mirroring vault-lint-<date>.md) — ranked pairs with folder/title, the top shared distinctive terms (the why), the score, and paste-ready bidirectional [[wikilinks]]. The report renders both signals: a "Shared-vocabulary (TF-IDF)" section flagging pairs embeddings also confirm (✓ also semantically related), plus a "Semantically related (embedding — TF-IDF missed these)" section for the new coverage. The report file is the only write the audit ever makes; it refuses any --out outside the agent-controlled vault or onto a personal note.
  • Separate personal-notes embedding cache at <vault>/_meta/notes-embeddings.json: content-hash keyed so re-runs only re-embed changed notes (live: ~27s cold → ~2s warm on 392 notes), and deliberately never the AgentMemory vec-index.db (DC-2). Falls back gracefully to TF-IDF-only when sentence-transformers is absent.
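The TF-IDF path described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the script's code: the function names, the smoothed IDF formula, and the default `min_score` are all assumptions, since the notes only state "sublinear-tf × IDF", L2 normalization, and inverted-index scoring.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Map each doc id to an L2-normalized sparse {term: weight} vector."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        # Sublinear tf (1 + log tf) times a smoothed IDF. The exact IDF
        # variant is an assumption; the release notes do not specify it.
        vec = {
            t: (1.0 + math.log(c)) * math.log((1 + n) / (1 + df[t]))
            for t, c in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors[doc_id] = {t: w / norm for t, w in vec.items()} if norm else {}
    return vectors


def related_pairs(vectors, min_score=0.1):
    """Cosine-score only pairs that share a term, via an inverted index."""
    postings = defaultdict(list)  # term -> [(doc_id, weight), ...]
    for doc_id, vec in vectors.items():
        for term, weight in vec.items():
            if weight > 0.0:
                postings[term].append((doc_id, weight))
    scores = defaultdict(float)
    for plist in postings.values():
        for i, (a, wa) in enumerate(plist):
            for b, wb in plist[i + 1:]:
                # Vectors are unit length, so summing weight products over
                # shared terms gives the cosine similarity directly.
                scores[(a, b) if a < b else (b, a)] += wa * wb
    return sorted(((s, p) for p, s in scores.items() if s >= min_score), reverse=True)
```

In this sketch, title double-weighting would amount to passing `title_tokens * 2 + body_tokens` as a note's token list. Pairs with no term in common never enter `scores`, which is what keeps the work below a full cross product on a sparse corpus.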

Internal

  • Clip-noise cleaning (live-dogfood driven). The personal notes are largely pasted HTML, so the first dogfood run saw hex colors (fffaa5), CSS tokens (serif), and image refs (image1) dominating the shared terms. The tokenizer now strips <style>/<script> blocks, HTML tags, {…} CSS rules, #hex colors, image embeds, and URLs, drops hex-id/media tokens, and carries a compact Spanish stopword set (the corpus is bilingual). The stopword handling is kept conservative: common words that double as CSS keywords (family/width/color/times) are not stopworded, so a family-history corpus keeps its own vocabulary.
  • Three adversarial reviews, three real fixes, each shipped with a regression test. (1) A max_df-band collapse that silently dropped a genuine pair's shared terms on tiny corpora. (2) A severe --out hole that would have overwritten a personal note (a destructive DC-1 violation); now guarded. (3) Reuse of stale-dimension embedding-cache entries after the documented EMBEDDING_DIM 384→1024 upgrade or a model swap: _cosine_unit's zip would truncate the longer vector and yield false-positive scores. The cache now re-embeds stale-dim entries so the vector set is always uniform (mirroring vec_index.py's dim-rebuild).
  • Live dogfood (392 notes): 40 TF-IDF + 16 embedding-only suggestions; 25/40 cross-confirmed. Standout embedding-only finds TF-IDF structurally cannot make: the same Stake Conference talk in [SP] and [EN] (cross-language, cosine 0.988) and a person's Birth/Baptism certificates (same person, different docs).
  • +36 tests (scripts/test_notes_link_discovery.py; the engine lives in the skill dir, tests in scripts/ so CI's unittest discover runs them). 406 → 426. The how-to page runs 14 words over the 600-word soft ceiling — kept intact as a load-bearing worked recipe (two optional signals + the read-only/paste-by-hand flow).
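The stale-dimension hazard from fix (3) is easy to reproduce in isolation. The sketch below is a hedged illustration only: `cosine_unit` is a stand-in written for this example (the real `_cosine_unit` is not shown in these notes), and `split_stale` is a hypothetical helper name for the kind of dimension check that decides which cached entries need re-embedding.

```python
def cosine_unit(u: list[float], v: list[float]) -> float:
    """Dot product of two unit vectors. zip() stops at the shorter input,
    so mismatched dimensions are silently truncated rather than rejected."""
    return sum(a * b for a, b in zip(u, v))


def split_stale(cache: dict[str, list[float]], expected_dim: int):
    """Partition cached vectors into (fresh, stale_keys) by dimension.

    Hypothetical helper: stale keys would be re-embedded before scoring so
    that every vector in the comparison set has the same length.
    """
    fresh = {k: v for k, v in cache.items() if len(v) == expected_dim}
    stale = sorted(k for k, v in cache.items() if len(v) != expected_dim)
    return fresh, stale


# A 2-dim leftover and a 4-dim vector still produce a confident score,
# because only the first two components are ever compared.
old_vec = [1.0, 0.0]
new_vec = [1.0, 0.0, 0.0, 0.0]
truncated_score = cosine_unit(old_vec, new_vec)

fresh, stale = split_stale({"old-note": old_vec, "new-note": new_vec}, expected_dim=4)
```

Checking the dimension up front, rather than inside the scoring function, keeps the hot loop simple and means a model swap costs one re-embed pass instead of producing misleading pairs.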

Cross-references

  • agentm v4.9.0 — prior release (the AgentMemory/-entry lint this complements).
  • ROADMAP-V4 item #43 (bucket ① V4-finish). Deferred follow-ups: real-time / watcher re-embedding (re-run is fine at this corpus size); packaging into the crickets personal-notes bundle (V4 ④).