feat: memory rating — closed-loop self-rating of reflections + semantic memories by emp3thy · Pull Request #52 · emp3thy/better-memory

emp3thy · 2026-05-12T15:09:57Z

Summary

Closed-loop self-rating of reflections and semantic memories. The LLM credits memories opportunistically mid-session via the new memory.credit MCP tool; at session end, a Stop hook injects a decision:block directive that triggers the rate-session-memories skill to classify whatever wasn't credited. Rating data (useful_count, times_misled) prepends ORDER BY useful_count DESC to retrieval queries so useful memories surface first.

Highlights:

New SQLite migration 0009 adds session_memory_exposure, counter columns on reflections / semantic_memories, and rating_diagnostics.
New MemoryRatingService with credit_one (single-row, atomic) and apply_session_ratings (batched, one SAVEPOINT, wholesale validation).
Three new MCP tools: memory.credit, memory.list_session_exposures, memory.apply_session_ratings. All resolve CLAUDE_SESSION_ID server-side from env.
New symlinked rate-session-memories skill + extended session_close hook with verified Stop-block directive shape.
UI: useful-count badge + drawer rows on reflection + semantic; Useful-only filter; /diagnostics Recent ratings panel + session_id_missing counter.
95 new tests across unit / hook subprocess / MCP dispatch / UI / end-to-end integration. Suite: 860 passed, 23 skipped (up from 765 baseline).
track_exposure=False opt-out prevents bootstrap+retrieve double-write.
Pyright clean on every touched file.

Architecture: spec at docs/superpowers/specs/2026-05-10-memory-rating-design.md, plan at docs/superpowers/plans/2026-05-11-memory-rating.md. Both committed early on this branch.

Final reviewer verdict: Approved with minor revisions (all applied in commit f81475c). No Critical findings. Asymmetry between memory.credit (graceful skip on missing env) and memory.apply_session_ratings (raises) is documented in the tool description.

Test plan

python -m pytest tests/ --ignore=tests/ui/test_browser_* — expect 860 passed
python -m pyright better_memory — expect 0 errors on touched files
Run `bm install-hooks` and confirm `~/.claude/skills/rate-session-memories` symlink lands
Open a real Claude Code session in this repo, retrieve a memory, end the session, verify the LLM takes an extra rating turn and `session_memory_exposure.rated_at` gets stamped (the Task 12 Step 8 manual smoke test — the only production behavior pytest cannot drive)
Browse `/reflections` and `/semantic` — confirm useful badge renders, useful-only filter works
Browse `/diagnostics` — confirm Recent ratings panel and session_id_missing counter render

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

Design for closed-loop memory rating: the LLM self-classifies which reflections and semantic memories were cited, shaped, ignored, or misled per session, and the system records useful_count / times_misled that feed retrieval ranking. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two rating paths now: - memory_credit: per-tool-use credit when LLM uses a memory (fresh signal) - session-end sweep: catches the rest, defaulting to ignored Verified Stop hook force-continue mechanism (decision: block + hookSpecificOutput.additionalContext); confirmed semantic_memories has no status column; dropped PreCompact for v1 since decision:block there prevents compaction without forcing an LLM turn. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

MCP server is stdio-spawned per Claude session, so CLAUDE_SESSION_ID in env is reliable for the server's whole lifetime (across compacts, sub-agents, tool calls). Codebase already trusts this pattern in 5+ call sites. All three new MCP tools (list_session_exposures, apply_session_ratings, memory_credit) drop session_id from their input schema and resolve server-side from env. Best-effort skip when env is missing (rare non-Claude contexts) plus a /diagnostics counter so we notice if Anthropic ever renames the var. No residual verification items. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

18 TDD tasks: schema migration (0009) with session_memory_exposure + rating_diagnostics tables and counter columns; MemoryRatingService with credit_one + apply_session_ratings; exposure tracking on bootstrap and mid-session retrieve paths; ORDER BY useful_count in retrieval; three new MCP tools (list/apply/credit); rate-session-memories skill + symlink installer; session_close hook extension with Stop-block directive; opportunistic crediting reminders in memory-retrieve / CLAUDE.snippet; UI badges and drawer lines on reflections + semantic; /diagnostics Recent ratings panel + session_id_missing counter; end-to-end integration test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Re-verified the Stop hook JSON shape against the docs and found two fields the prior draft was missing: 'reason' (top-level) and 'hookEventName: Stop' inside hookSpecificOutput. Updated the test assertions and the implementation example. Added: - Step 0 spike (delete-on-merge) to eyeball the JSON shape before TDD - Step 6 cleanup to remove the spike - Step 8 manual end-to-end smoke test in a real Claude session (the only way to confirm the mechanism produces an LLM turn in production) Confidence lifted from ~90% to ~98% (the remaining 2% is the real-session smoke that can't be unit-tested). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…aths

…_one Replace fetchone() with fetchall() in _apply_one step-1 so a memory with two exposure rows (bootstrap + mid-session retrieve, spec §4.1/§5.3) is not falsely skipped as already_rated when only one row is rated. Add four new tests covering superseded status, cross-table kind miss, and the multi-row one-rated/all-rated edge cases. Also: comment _VALID_CLASSES, remove unused os/patch imports, and tighten pytest.raises match strings. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…owing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ssion_ratings validation Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…emories Adds _record_exposure to SessionBootstrapService and wires it into bootstrap() so every reflection and semantic memory actually rendered into additionalContext gets an exposure row with source='bootstrap'. The render helpers (_render_semantic, _render_reflection_bucket) now return (html, ids) tuples. SessionBootstrapService gains an injectable clock and an optional project kwarg for test ergonomics. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… dead imports, missing test)

… vars

Add track_exposure: bool = True to retrieve_reflections and list_for_project. SessionBootstrapService.bootstrap passes track_exposure=False to both so only its own _record_exposure source='bootstrap' rows are written, preventing duplicate exposure rows per memory when CLAUDE_SESSION_ID is set. Add 4 tests covering the opt-out flag and the single-row-per-memory invariant. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Register memory.list_session_exposures tool that returns unrated session_memory_exposure rows for the current Claude session (resolved server-side from CLAUDE_SESSION_ID env). Also wires MemoryRatingService into create_server() for use by Tasks 9 and 10, and adds _dispatch_for_tests helper for direct handler invocation in tests.

…tests os was already imported at module level; the line-1165 error was benign. For line 1249: assert isinstance(..., CallToolResult) narrows the discriminated-union ServerResult, then cast() aligns the ContentBlock return with the list[TextContent] annotation.

….cast

- Adds test_returns_content_key_for_semantic_exposure: seeds a semantic_memories row and a memory_kind='semantic' exposure, asserts the response item carries "content" not "title". This was the only untested branch of the conditional dict spread in the handler. - Adds _seed_semantic helper. - Tightens test_returns_unrated_for_current_session to assert "title" present and "content" absent on reflection items. - Adds docstring to that test explaining the two-connection model (_dispatch_for_tests opens its own DB connection; commits are required). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds .claude/skills/rate-session-memories/SKILL.md with the 4-step rating protocol. Extends install_hooks with _resolve_user_skills_dir() and install_skill_symlinks() (idempotent, degrades gracefully on Windows without Developer Mode). Test skips automatically when the process lacks symlink privilege. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Extends the Stop hook to query session_memory_exposure for unrated rows and emit a decision:block JSON payload directing the LLM to run the rate-session-memories skill before closing. Falls back silently (exits 0, writes spool marker) if the DB is absent or the query fails. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

1. Replace char-based truncation with byte-safe slice in _emit_rating_directive_if_unrated. When directive exceeds CAP_BYTES, encode to UTF-8, slice at the byte boundary, and decode with errors='ignore' to handle mid-multibyte-character cuts. 2. Fix test_empty_unrated_writes_marker_no_directive: set BETTER_MEMORY_HOME to tmp_memory_db.parent so cfg.memory_db resolves to the migrated DB. Previous test never exercised the empty-rows path (bailed at DB existence check). Now spool dir is under the same home and marker assertion added. 3. Add marker-file assertion to test_db_error_falls_back_to_marker: verify spool/marker creation even when DB is absent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add useful_count, last_useful_at, times_misled, last_misled_at to ReflectionFull and expose them via delegation properties on ReflectionDetail. The reflection_detail SELECT now fetches the four rating columns; the drawer template renders them in the meta <dl> with misled highlighted only when > 0. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Mirrors Task 14/15 reflection changes for semantic memories: adds useful_count, last_useful_at, times_misled, last_misled_at to the SemanticMemory dataclass and list_for_project SELECT; updates the inline drawer query in app.py; adds the useful badge to semantic_row.html and the Useful/Misled dl block to semantic_drawer.html. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…counter - Bump rating_diagnostics.session_id_missing in retrieve_reflections and list_for_project when track_exposure=True and CLAUDE_SESSION_ID is unset - Add recent_ratings query and rating_diagnostics dict to /diagnostics route - Add Recent ratings table + Rating diagnostics dl to diagnostics.html - Add 5 tests covering counter bump, track_exposure=False guard, and panel content Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…c clarity)

# Conflicts: # better_memory/skills/memory-retrieve.md

github-actions

🔴 Claude BugBot Analysis

Found 2 potential bugs in this PR.

high: 1 | medium: 1

Two concrete bugs: (1) the session-end spool marker is written to disk even when the Stop hook emits a 'decision:block' directive, causing premature session-end processing and a duplicate spool event on the second hook invocation; (2) memory.list_session_exposures returns per-row results including duplicate (kind, id) pairs for memories exposed via both bootstrap and retrieve paths, which causes apply_session_ratings to reject the whole batch with a duplicate-key validation error when the model rates them individually.

github-actions

🔴 Claude BugBot Analysis

Found 1 potential bug in this PR.

medium: 1

One medium-severity logic bug: when the Stop hook defers session end by emitting decision:block for memory rating, it still unconditionally writes the session_end spool file. This triggers premature synthesis (before memories are rated) and, when the hook fires again after rating completes, writes a second spool file with a distinct hash, causing the synthesis service to run twice for the same session.

…edupe exposure list Three BugBot review findings on PR #52: 1. HIGH: session_close.py wrote the session_end spool marker unconditionally after emitting decision:block, causing premature synthesis runs and duplicate marker writes on the second Stop fire. Fix: _emit_rating_directive_if_unrated now returns bool; main() exits early when a block was emitted. Marker lands only on the final (non-blocking) Stop fire. 2. MEDIUM (dup): same root cause as #1. 3. MEDIUM: memory.list_session_exposures and the session_close directive query both returned duplicate (kind, id) entries when a memory had two exposure rows (bootstrap + retrieve). The LLM would submit two ratings for the same (kind, id) pair and apply_session_ratings would reject the whole batch. Fix: GROUP BY (memory_kind, memory_id) with MIN(exposed_at) in both queries. Matches _apply_one's contract (one rating stamps all unrated rows for a (kind, id)). Tests: - Updated test_non_empty_unrated_emits_decision_block to assert no marker is written when block is emitted. - New test_multi_row_exposure_dedupes_in_directive verifies dedupe in session_close directive. - New test_dedupes_multi_row_exposure verifies dedupe in MCP list_session_exposures tool. 862 tests pass (was 860, +2 new).

github-actions

🟢 Claude BugBot Analysis

Both previously reported bugs are fixed: list_session_exposures now deduplicates by (memory_kind, memory_id) using GROUP BY, and main() in session_close.py calls sys.exit(0) before the spool write when a block directive is emitted. No new defects were found in the added or modified lines.

No bugs were detected in this PR.

emp3thy and others added 30 commits May 10, 2026 22:38

feat(rating): migration 0009 — session_memory_exposure + counter columns

280045a

fix(rating): add missing test coverage for migration 0009

2862407

feat(rating): MemoryRatingService.credit_one with skip / validation p…

0f5139a

…aths

feat(rating): apply_session_ratings batch with single SAVEPOINT

4fe0e0d

fix(rating): tighten MemoryRatingService return type for Pyright narr…

f26da53

…owing

fix(rating): use TypedDict + local-var narrowing for Pyright compliance

d6b7527

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test(rating): add invalid kind rejection test for apply_session_ratings

6568bee

fix(rating): distinguish missing-field from invalid-value in apply_se…

0fc0610

…ssion_ratings validation Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix(rating): address bootstrap exposure reviewer findings (docstring,…

7842bbf

… dead imports, missing test)

feat(rating): record exposure on retrieve_reflections

b338426

test(rating): revert unnecessary _seed_reflection status, remove dead…

f66a0df

… vars

feat(rating): record exposure on semantic list_for_project

6983e68

feat(rating): ORDER BY useful_count DESC in retrieval queries

5287526

test(rating): add semantic ranking tests + docstring update

dbc9746

fix(rating): cast result.root.content to list[TextContent] via typing…

f396cc5

….cast

feat(rating): MCP tool memory.apply_session_ratings

d0a183b

fix(rating): _dispatch_for_tests re-raises ValueError on isError

00390f4

feat(rating): MCP tool memory.credit (per-tool-use credit)

54b44ca

emp3thy and others added 13 commits May 12, 2026 14:18

fix(install): symlink better-memory-synthesize too + clearer docstring

fd65c00

docs(rating): credit memories opportunistically when used

448ab0e

feat(rating): useful badge + useful-only filter on Reflections tab

950c429

fix(test): narrow Optional[ReflectionDetail] before attribute access

f8ef3f8

test(rating): end-to-end bootstrap → credit → sweep round-trip

0311e91

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(rating): pre-merge polish from final review (schema minItems + do…

f81475c

…c clarity)

Merge remote-tracking branch 'origin/main' into worktree-memory-rating

e18ef5d

# Conflicts: # better_memory/skills/memory-retrieve.md

fix(rating): tighten apply_session_ratings return type for Pyright (CI)

1b28df2

github-actions Bot reviewed May 12, 2026

View reviewed changes

Comment thread better_memory/hooks/session_close.py Outdated

Comment thread better_memory/mcp/server.py

github-actions Bot reviewed May 12, 2026

View reviewed changes

Comment thread better_memory/hooks/session_close.py

github-actions Bot reviewed May 12, 2026

View reviewed changes

emp3thy merged commit eceed6a into main May 12, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: memory rating — closed-loop self-rating of reflections + semantic memories#52

feat: memory rating — closed-loop self-rating of reflections + semantic memories#52
emp3thy merged 44 commits into
mainfrom
worktree-memory-rating

emp3thy commented May 12, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

Uh oh!

Uh oh!

github-actions Bot left a comment

Uh oh!

Uh oh!

github-actions Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

emp3thy commented May 12, 2026

Summary

Test plan

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

🔴 Claude BugBot Analysis

Uh oh!

Uh oh!

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

🔴 Claude BugBot Analysis

Uh oh!

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

🟢 Claude BugBot Analysis

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant