feat: memory rating — closed-loop self-rating of reflections + semantic memories#52
Conversation
Design for closed-loop memory rating: the LLM self-classifies which reflections and semantic memories were cited, shaped, ignored, or misled per session, and the system records useful_count / times_misled that feed retrieval ranking. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two rating paths now: - memory_credit: per-tool-use credit when LLM uses a memory (fresh signal) - session-end sweep: catches the rest, defaulting to ignored Verified Stop hook force-continue mechanism (decision: block + hookSpecificOutput.additionalContext); confirmed semantic_memories has no status column; dropped PreCompact for v1 since decision:block there prevents compaction without forcing an LLM turn. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MCP server is stdio-spawned per Claude session, so CLAUDE_SESSION_ID in env is reliable for the server's whole lifetime (across compacts, sub-agents, tool calls). Codebase already trusts this pattern in 5+ call sites. All three new MCP tools (list_session_exposures, apply_session_ratings, memory_credit) drop session_id from their input schema and resolve server-side from env. Best-effort skip when env is missing (rare non-Claude contexts) plus a /diagnostics counter so we notice if Anthropic ever renames the var. No residual verification items. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
18 TDD tasks: schema migration (0009) with session_memory_exposure + rating_diagnostics tables and counter columns; MemoryRatingService with credit_one + apply_session_ratings; exposure tracking on bootstrap and mid-session retrieve paths; ORDER BY useful_count in retrieval; three new MCP tools (list/apply/credit); rate-session-memories skill + symlink installer; session_close hook extension with Stop-block directive; opportunistic crediting reminders in memory-retrieve / CLAUDE.snippet; UI badges and drawer lines on reflections + semantic; /diagnostics Recent ratings panel + session_id_missing counter; end-to-end integration test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-verified the Stop hook JSON shape against the docs and found two fields the prior draft was missing: 'reason' (top-level) and 'hookEventName: Stop' inside hookSpecificOutput. Updated the test assertions and the implementation example. Added: - Step 0 spike (delete-on-merge) to eyeball the JSON shape before TDD - Step 6 cleanup to remove the spike - Step 8 manual end-to-end smoke test in a real Claude session (the only way to confirm the mechanism produces an LLM turn in production) Confidence lifted from ~90% to ~98% (the remaining 2% is the real-session smoke that can't be unit-tested). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_one Replace fetchone() with fetchall() in _apply_one step-1 so a memory with two exposure rows (bootstrap + mid-session retrieve, spec §4.1/§5.3) is not falsely skipped as already_rated when only one row is rated. Add four new tests covering superseded status, cross-table kind miss, and the multi-row one-rated/all-rated edge cases. Also: comment _VALID_CLASSES, remove unused os/patch imports, and tighten pytest.raises match strings. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ssion_ratings validation Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…emories Adds _record_exposure to SessionBootstrapService and wires it into bootstrap() so every reflection and semantic memory actually rendered into additionalContext gets an exposure row with source='bootstrap'. The render helpers (_render_semantic, _render_reflection_bucket) now return (html, ids) tuples. SessionBootstrapService gains an injectable clock and an optional project kwarg for test ergonomics. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… dead imports, missing test)
Add track_exposure: bool = True to retrieve_reflections and list_for_project. SessionBootstrapService.bootstrap passes track_exposure=False to both so only its own _record_exposure source='bootstrap' rows are written, preventing duplicate exposure rows per memory when CLAUDE_SESSION_ID is set. Add 4 tests covering the opt-out flag and the single-row-per-memory invariant. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Register memory.list_session_exposures tool that returns unrated session_memory_exposure rows for the current Claude session (resolved server-side from CLAUDE_SESSION_ID env). Also wires MemoryRatingService into create_server() for use by Tasks 9 and 10, and adds _dispatch_for_tests helper for direct handler invocation in tests.
…tests os was already imported at module level; the line-1165 error was benign. For line 1249: assert isinstance(..., CallToolResult) narrows the discriminated-union ServerResult, then cast() aligns the ContentBlock return with the list[TextContent] annotation.
- Adds test_returns_content_key_for_semantic_exposure: seeds a semantic_memories row and a memory_kind='semantic' exposure, asserts the response item carries "content" not "title". This was the only untested branch of the conditional dict spread in the handler. - Adds _seed_semantic helper. - Tightens test_returns_unrated_for_current_session to assert "title" present and "content" absent on reflection items. - Adds docstring to that test explaining the two-connection model (_dispatch_for_tests opens its own DB connection; commits are required). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds .claude/skills/rate-session-memories/SKILL.md with the 4-step rating protocol. Extends install_hooks with _resolve_user_skills_dir() and install_skill_symlinks() (idempotent, degrades gracefully on Windows without Developer Mode). Test skips automatically when the process lacks symlink privilege. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extends the Stop hook to query session_memory_exposure for unrated rows and emit a decision:block JSON payload directing the LLM to run the rate-session-memories skill before closing. Falls back silently (exits 0, writes spool marker) if the DB is absent or the query fails. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. Replace char-based truncation with byte-safe slice in _emit_rating_directive_if_unrated. When directive exceeds CAP_BYTES, encode to UTF-8, slice at the byte boundary, and decode with errors='ignore' to handle mid-multibyte-character cuts. 2. Fix test_empty_unrated_writes_marker_no_directive: set BETTER_MEMORY_HOME to tmp_memory_db.parent so cfg.memory_db resolves to the migrated DB. Previous test never exercised the empty-rows path (bailed at DB existence check). Now spool dir is under the same home and marker assertion added. 3. Add marker-file assertion to test_db_error_falls_back_to_marker: verify spool/marker creation even when DB is absent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add useful_count, last_useful_at, times_misled, last_misled_at to ReflectionFull and expose them via delegation properties on ReflectionDetail. The reflection_detail SELECT now fetches the four rating columns; the drawer template renders them in the meta <dl> with misled highlighted only when > 0. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mirrors Task 14/15 reflection changes for semantic memories: adds useful_count, last_useful_at, times_misled, last_misled_at to the SemanticMemory dataclass and list_for_project SELECT; updates the inline drawer query in app.py; adds the useful badge to semantic_row.html and the Useful/Misled dl block to semantic_drawer.html. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…counter - Bump rating_diagnostics.session_id_missing in retrieve_reflections and list_for_project when track_exposure=True and CLAUDE_SESSION_ID is unset - Add recent_ratings query and rating_diagnostics dict to /diagnostics route - Add Recent ratings table + Rating diagnostics dl to diagnostics.html - Add 5 tests covering counter bump, track_exposure=False guard, and panel content Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Conflicts: # better_memory/skills/memory-retrieve.md
There was a problem hiding this comment.
🔴 Claude BugBot Analysis
Found 2 potential bugs in this PR.
high: 1 | medium: 1
Two concrete bugs: (1) the session-end spool marker is written to disk even when the Stop hook emits a 'decision:block' directive, causing premature session-end processing and a duplicate spool event on the second hook invocation; (2) memory.list_session_exposures returns per-row results including duplicate (kind, id) pairs for memories exposed via both bootstrap and retrieve paths, which causes apply_session_ratings to reject the whole batch with a duplicate-key validation error when the model rates them individually.
There was a problem hiding this comment.
🔴 Claude BugBot Analysis
Found 1 potential bug in this PR.
medium: 1
One medium-severity logic bug: when the Stop hook defers session end by emitting decision:block for memory rating, it still unconditionally writes the session_end spool file. This triggers premature synthesis (before memories are rated) and, when the hook fires again after rating completes, writes a second spool file with a distinct hash, causing the synthesis service to run twice for the same session.
…edupe exposure list Three BugBot review findings on PR #52: 1. HIGH: session_close.py wrote the session_end spool marker unconditionally after emitting decision:block, causing premature synthesis runs and duplicate marker writes on the second Stop fire. Fix: _emit_rating_directive_if_unrated now returns bool; main() exits early when a block was emitted. Marker lands only on the final (non-blocking) Stop fire. 2. MEDIUM (dup): same root cause as #1. 3. MEDIUM: memory.list_session_exposures and the session_close directive query both returned duplicate (kind, id) entries when a memory had two exposure rows (bootstrap + retrieve). The LLM would submit two ratings for the same (kind, id) pair and apply_session_ratings would reject the whole batch. Fix: GROUP BY (memory_kind, memory_id) with MIN(exposed_at) in both queries. Matches _apply_one's contract (one rating stamps all unrated rows for a (kind, id)). Tests: - Updated test_non_empty_unrated_emits_decision_block to assert no marker is written when block is emitted. - New test_multi_row_exposure_dedupes_in_directive verifies dedupe in session_close directive. - New test_dedupes_multi_row_exposure verifies dedupe in MCP list_session_exposures tool. 862 tests pass (was 860, +2 new).
There was a problem hiding this comment.
🟢 Claude BugBot Analysis
Both previously reported bugs are fixed: list_session_exposures now deduplicates by (memory_kind, memory_id) using GROUP BY, and main() in session_close.py calls sys.exit(0) before the spool write when a block directive is emitted. No new defects were found in the added or modified lines.
No bugs were detected in this PR.
Summary
Closed-loop self-rating of reflections and semantic memories. The LLM credits memories opportunistically mid-session via the new
memory.creditMCP tool; at session end, aStophook injects adecision:blockdirective that triggers therate-session-memoriesskill to classify whatever wasn't credited. Rating data (useful_count,times_misled) prependsORDER BY useful_count DESCto retrieval queries so useful memories surface first.Highlights:
0009addssession_memory_exposure, counter columns onreflections/semantic_memories, andrating_diagnostics.MemoryRatingServicewithcredit_one(single-row, atomic) andapply_session_ratings(batched, one SAVEPOINT, wholesale validation).memory.credit,memory.list_session_exposures,memory.apply_session_ratings. All resolveCLAUDE_SESSION_IDserver-side from env.rate-session-memoriesskill + extendedsession_closehook with verified Stop-block directive shape./diagnosticsRecent ratings panel +session_id_missingcounter.track_exposure=Falseopt-out prevents bootstrap+retrieve double-write.Architecture: spec at
docs/superpowers/specs/2026-05-10-memory-rating-design.md, plan atdocs/superpowers/plans/2026-05-11-memory-rating.md. Both committed early on this branch.Final reviewer verdict: Approved with minor revisions (all applied in commit
f81475c). No Critical findings. Asymmetry betweenmemory.credit(graceful skip on missing env) andmemory.apply_session_ratings(raises) is documented in the tool description.Test plan
python -m pytest tests/ --ignore=tests/ui/test_browser_*— expect 860 passedpython -m pyright better_memory— expect 0 errors on touched filesCo-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com