Skip to content

feat: memory rating — closed-loop self-rating of reflections + semantic memories#52

Merged
emp3thy merged 44 commits into
mainfrom
worktree-memory-rating
May 12, 2026
Merged

feat: memory rating — closed-loop self-rating of reflections + semantic memories#52
emp3thy merged 44 commits into
mainfrom
worktree-memory-rating

Conversation

@emp3thy
Copy link
Copy Markdown
Owner

@emp3thy emp3thy commented May 12, 2026

Summary

Closed-loop self-rating of reflections and semantic memories. The LLM credits memories opportunistically mid-session via the new memory.credit MCP tool; at session end, a Stop hook injects a decision:block directive that triggers the rate-session-memories skill to classify whatever wasn't credited. Rating data (useful_count, times_misled) prepends ORDER BY useful_count DESC to retrieval queries so useful memories surface first.

Highlights:

  • New SQLite migration 0009 adds session_memory_exposure, counter columns on reflections / semantic_memories, and rating_diagnostics.
  • New MemoryRatingService with credit_one (single-row, atomic) and apply_session_ratings (batched, one SAVEPOINT, wholesale validation).
  • Three new MCP tools: memory.credit, memory.list_session_exposures, memory.apply_session_ratings. All resolve CLAUDE_SESSION_ID server-side from env.
  • New symlinked rate-session-memories skill + extended session_close hook with verified Stop-block directive shape.
  • UI: useful-count badge + drawer rows on reflection + semantic; Useful-only filter; /diagnostics Recent ratings panel + session_id_missing counter.
  • 95 new tests across unit / hook subprocess / MCP dispatch / UI / end-to-end integration. Suite: 860 passed, 23 skipped (up from 765 baseline).
  • track_exposure=False opt-out prevents bootstrap+retrieve double-write.
  • Pyright clean on every touched file.

Architecture: spec at docs/superpowers/specs/2026-05-10-memory-rating-design.md, plan at docs/superpowers/plans/2026-05-11-memory-rating.md. Both committed early on this branch.

Final reviewer verdict: Approved with minor revisions (all applied in commit f81475c). No Critical findings. Asymmetry between memory.credit (graceful skip on missing env) and memory.apply_session_ratings (raises) is documented in the tool description.

Test plan

  • python -m pytest tests/ --ignore=tests/ui/test_browser_* — expect 860 passed
  • python -m pyright better_memory — expect 0 errors on touched files
  • Run `bm install-hooks` and confirm `~/.claude/skills/rate-session-memories` symlink lands
  • Open a real Claude Code session in this repo, retrieve a memory, end the session, verify the LLM takes an extra rating turn and `session_memory_exposure.rated_at` gets stamped (the Task 12 Step 8 manual smoke test — the only production behavior pytest cannot drive)
  • Browse `/reflections` and `/semantic` — confirm useful badge renders, useful-only filter works
  • Browse `/diagnostics` — confirm Recent ratings panel and session_id_missing counter render

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

emp3thy and others added 30 commits May 10, 2026 22:38
Design for closed-loop memory rating: the LLM self-classifies which
reflections and semantic memories were cited, shaped, ignored, or
misled per session, and the system records useful_count / times_misled
that feed retrieval ranking.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two rating paths now:
- memory_credit: per-tool-use credit when LLM uses a memory (fresh signal)
- session-end sweep: catches the rest, defaulting to ignored

Verified Stop hook force-continue mechanism (decision: block +
hookSpecificOutput.additionalContext); confirmed semantic_memories has
no status column; dropped PreCompact for v1 since decision:block there
prevents compaction without forcing an LLM turn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MCP server is stdio-spawned per Claude session, so CLAUDE_SESSION_ID
in env is reliable for the server's whole lifetime (across compacts,
sub-agents, tool calls). Codebase already trusts this pattern in 5+
call sites.

All three new MCP tools (list_session_exposures, apply_session_ratings,
memory_credit) drop session_id from their input schema and resolve
server-side from env. Best-effort skip when env is missing (rare
non-Claude contexts) plus a /diagnostics counter so we notice if
Anthropic ever renames the var.

No residual verification items.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
18 TDD tasks: schema migration (0009) with session_memory_exposure +
rating_diagnostics tables and counter columns; MemoryRatingService with
credit_one + apply_session_ratings; exposure tracking on bootstrap and
mid-session retrieve paths; ORDER BY useful_count in retrieval; three
new MCP tools (list/apply/credit); rate-session-memories skill + symlink
installer; session_close hook extension with Stop-block directive;
opportunistic crediting reminders in memory-retrieve / CLAUDE.snippet;
UI badges and drawer lines on reflections + semantic; /diagnostics
Recent ratings panel + session_id_missing counter; end-to-end integration
test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-verified the Stop hook JSON shape against the docs and found two
fields the prior draft was missing: 'reason' (top-level) and
'hookEventName: Stop' inside hookSpecificOutput. Updated the test
assertions and the implementation example.

Added:
- Step 0 spike (delete-on-merge) to eyeball the JSON shape before TDD
- Step 6 cleanup to remove the spike
- Step 8 manual end-to-end smoke test in a real Claude session
  (the only way to confirm the mechanism produces an LLM turn in production)

Confidence lifted from ~90% to ~98% (the remaining 2% is the
real-session smoke that can't be unit-tested).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_one

Replace fetchone() with fetchall() in _apply_one step-1 so a memory with
two exposure rows (bootstrap + mid-session retrieve, spec §4.1/§5.3) is
not falsely skipped as already_rated when only one row is rated. Add four
new tests covering superseded status, cross-table kind miss, and the
multi-row one-rated/all-rated edge cases. Also: comment _VALID_CLASSES,
remove unused os/patch imports, and tighten pytest.raises match strings.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ssion_ratings validation

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…emories

Adds _record_exposure to SessionBootstrapService and wires it into
bootstrap() so every reflection and semantic memory actually rendered into
additionalContext gets an exposure row with source='bootstrap'. The render
helpers (_render_semantic, _render_reflection_bucket) now return (html, ids)
tuples. SessionBootstrapService gains an injectable clock and an optional
project kwarg for test ergonomics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add track_exposure: bool = True to retrieve_reflections and
list_for_project. SessionBootstrapService.bootstrap passes
track_exposure=False to both so only its own _record_exposure
source='bootstrap' rows are written, preventing duplicate exposure
rows per memory when CLAUDE_SESSION_ID is set. Add 4 tests covering
the opt-out flag and the single-row-per-memory invariant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Register memory.list_session_exposures tool that returns unrated
session_memory_exposure rows for the current Claude session (resolved
server-side from CLAUDE_SESSION_ID env). Also wires MemoryRatingService
into create_server() for use by Tasks 9 and 10, and adds
_dispatch_for_tests helper for direct handler invocation in tests.
…tests

os was already imported at module level; the line-1165 error was benign.
For line 1249: assert isinstance(..., CallToolResult) narrows the
discriminated-union ServerResult, then cast() aligns the ContentBlock
return with the list[TextContent] annotation.
- Adds test_returns_content_key_for_semantic_exposure: seeds a
  semantic_memories row and a memory_kind='semantic' exposure, asserts
  the response item carries "content" not "title". This was the only
  untested branch of the conditional dict spread in the handler.
- Adds _seed_semantic helper.
- Tightens test_returns_unrated_for_current_session to assert "title"
  present and "content" absent on reflection items.
- Adds docstring to that test explaining the two-connection model
  (_dispatch_for_tests opens its own DB connection; commits are required).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds .claude/skills/rate-session-memories/SKILL.md with the 4-step
rating protocol. Extends install_hooks with _resolve_user_skills_dir()
and install_skill_symlinks() (idempotent, degrades gracefully on Windows
without Developer Mode). Test skips automatically when the process lacks
symlink privilege.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
emp3thy and others added 13 commits May 12, 2026 14:18
Extends the Stop hook to query session_memory_exposure for unrated rows
and emit a decision:block JSON payload directing the LLM to run the
rate-session-memories skill before closing. Falls back silently (exits 0,
writes spool marker) if the DB is absent or the query fails.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. Replace char-based truncation with byte-safe slice in
   _emit_rating_directive_if_unrated. When directive exceeds CAP_BYTES,
   encode to UTF-8, slice at the byte boundary, and decode with
   errors='ignore' to handle mid-multibyte-character cuts.

2. Fix test_empty_unrated_writes_marker_no_directive: set
   BETTER_MEMORY_HOME to tmp_memory_db.parent so cfg.memory_db
   resolves to the migrated DB. Previous test never exercised the
   empty-rows path (bailed at DB existence check). Now spool dir is
   under the same home and marker assertion added.

3. Add marker-file assertion to test_db_error_falls_back_to_marker:
   verify spool/marker creation even when DB is absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add useful_count, last_useful_at, times_misled, last_misled_at to
ReflectionFull and expose them via delegation properties on ReflectionDetail.
The reflection_detail SELECT now fetches the four rating columns; the drawer
template renders them in the meta <dl> with misled highlighted only when > 0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mirrors Task 14/15 reflection changes for semantic memories: adds
useful_count, last_useful_at, times_misled, last_misled_at to the
SemanticMemory dataclass and list_for_project SELECT; updates the
inline drawer query in app.py; adds the useful badge to semantic_row.html
and the Useful/Misled dl block to semantic_drawer.html.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…counter

- Bump rating_diagnostics.session_id_missing in retrieve_reflections and
  list_for_project when track_exposure=True and CLAUDE_SESSION_ID is unset
- Add recent_ratings query and rating_diagnostics dict to /diagnostics route
- Add Recent ratings table + Rating diagnostics dl to diagnostics.html
- Add 5 tests covering counter bump, track_exposure=False guard, and panel content

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Conflicts:
#	better_memory/skills/memory-retrieve.md
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Claude BugBot Analysis

Found 2 potential bugs in this PR.

high: 1 | medium: 1

Two concrete bugs: (1) the session-end spool marker is written to disk even when the Stop hook emits a 'decision:block' directive, causing premature session-end processing and a duplicate spool event on the second hook invocation; (2) memory.list_session_exposures returns per-row results including duplicate (kind, id) pairs for memories exposed via both bootstrap and retrieve paths, which causes apply_session_ratings to reject the whole batch with a duplicate-key validation error when the model rates them individually.

Comment thread better_memory/hooks/session_close.py Outdated
Comment thread better_memory/mcp/server.py
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Claude BugBot Analysis

Found 1 potential bug in this PR.

medium: 1

One medium-severity logic bug: when the Stop hook defers session end by emitting decision:block for memory rating, it still unconditionally writes the session_end spool file. This triggers premature synthesis (before memories are rated) and, when the hook fires again after rating completes, writes a second spool file with a distinct hash, causing the synthesis service to run twice for the same session.

Comment thread better_memory/hooks/session_close.py
…edupe exposure list

Three BugBot review findings on PR #52:

1. HIGH: session_close.py wrote the session_end spool marker
   unconditionally after emitting decision:block, causing premature
   synthesis runs and duplicate marker writes on the second Stop fire.
   Fix: _emit_rating_directive_if_unrated now returns bool; main()
   exits early when a block was emitted. Marker lands only on the
   final (non-blocking) Stop fire.

2. MEDIUM (dup): same root cause as #1.

3. MEDIUM: memory.list_session_exposures and the session_close
   directive query both returned duplicate (kind, id) entries when a
   memory had two exposure rows (bootstrap + retrieve). The LLM would
   submit two ratings for the same (kind, id) pair and
   apply_session_ratings would reject the whole batch.
   Fix: GROUP BY (memory_kind, memory_id) with MIN(exposed_at) in
   both queries. Matches _apply_one's contract (one rating stamps all
   unrated rows for a (kind, id)).

Tests:
- Updated test_non_empty_unrated_emits_decision_block to assert no
  marker is written when block is emitted.
- New test_multi_row_exposure_dedupes_in_directive verifies dedupe
  in session_close directive.
- New test_dedupes_multi_row_exposure verifies dedupe in MCP
  list_session_exposures tool.

862 tests pass (was 860, +2 new).
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Claude BugBot Analysis

Both previously reported bugs are fixed: list_session_exposures now deduplicates by (memory_kind, memory_id) using GROUP BY, and main() in session_close.py calls sys.exit(0) before the spool write when a block directive is emitted. No new defects were found in the added or modified lines.

No bugs were detected in this PR.

@emp3thy emp3thy merged commit eceed6a into main May 12, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant