feat(memory): add stage-2 episode generation (#385) #387
Conversation
Adds a second memory type alongside facts: episodes, which capture "what happened, what we learned" via a Sophia-shaped record. The two-stage flow piggybacks on the existing Haiku extraction subprocess for classification, then spawns a separate fire-and-forget subprocess on positives to generate the episode.

- Stage 1: `_FACT_SCHEMA` gains a `has_episode` boolean. A single extra output token per call; no new subprocess. `_run_extractor` returns an `ExtractionResult` dataclass so every early-exit path defaults `has_episode=False` (a flaky extraction can never falsely fire stage 2).
- Stage 2: new `_generate_episode` task spawned via `asyncio.create_task` AFTER `_store_facts` returns. Strong refs held in `_pending_episode_tasks` plus a done-callback to defend against asyncio's weak-ref-only Task tracking. Per-user `Semaphore(1)` cache, independent of stage 1 (a concurrent stage-1 call for the same user while stage 2 is in flight is desirable). Single `memory.episode` log line per call with outcome/cost_usd/duration_ms/memory_id.
- Retrieval: `_SOURCE_WEIGHTS` and `_SOURCE_SHORT` gain `"episode"` entries. `format_context` renders episodes as `- (YYYY-MM-DD, episode, <quality>) <goal>. Outcome: <outcome>` using the Sophia moderate-relevance form; the remaining fields are stored but not rendered inline in v1.
- Config: 3 new tunables (model, budget, timeout). The episode model defaults to `memory_extraction_model`, so an operator who only set MEMORY_EXTRACTION_MODEL also moves stage 2 onto the new model.
- Wizard: 3 new prompts gated on `memory_extraction_enabled`. Episode keys are popped on non-claude backends, mirroring the extraction-key cleanup.
- Tests: 31 new tests across schema/classifier (8.1), trigger (8.2), isolation (8.3), storage/telemetry (8.4), retrieval (8.4 cont'd), and subprocess assembly. Plus an opt-in classifier eval against 32 hand-labeled exchanges (16/16 split), gated by RUN_CLASSIFIER_EVAL=1 so it does not run in CI or burn API tokens during `make test`.
- Stage-2 failures cannot affect stage-1 facts: stage 2 is scheduled AFTER `_store_facts` returns, runs as a fire-and-forget task wrapped in a broad try/except, and emits a single log line on every path.
- Eventual consistency between episode storage and retrieval: a user message arriving within 1-30s of an episode-worthy turn may not see the new episode. Acceptable because episodes are about cumulative pattern recall over many sessions, not per-turn recall.

Fixes #385
Review by Kai

The file on disk is pre-merge. The diff is the authority here. Here is the review:

PR #387 Review: feat(memory): add stage-2 episode generation

Overall: well-engineered. The code is clearly structured, the failure-isolation strategy is sound, the strong-ref pattern for background tasks is correct, and the test coverage is thorough. Three real issues below. Warning:
Three findings from the review:
1. is_error envelope (e.g. budget exhaustion) was being mislabeled as
outcome=parse_error in _generate_episode. The branch table only
matched timeout/exit_*/invalid_json explicitly; everything else
collapsed to parse_error. Now subprocess-level faults (exit_* AND
is_error) collapse to subprocess_error, so operators triaging by
outcome do not see budget burns mixed with genuine JSON-parse
failures. New test covers the is_error mapping.
2. Pre-initialize stored=replaced=skipped=0 before the try block in
extract_and_store. Pre-PR the no-facts path returned early, so the
variables were always assigned before the post-try summary log; the
stage-2 restructure removed that early-return guarantee. The
pre-acquire init is a defensive fallback - the in-try assignment
still wins on every successful path.
3. memory_episode_model dataclass default changed from a model name
literal to "" so the literal accurately reflects "this gets
overridden by load_config()". Inheritance happens in load_config
via `os.environ.get("MEMORY_EPISODE_MODEL", "").strip() or
memory_extraction_model`. Test fixtures that construct Config
directly already set this explicitly.
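A minimal sketch of the corrected outcome mapping from finding 1. The envelope shape, field names, and helper name are assumptions for illustration, not the PR's actual code; only the collapse of exit_* and is_error into `subprocess_error` follows the description above.

```python
def classify_outcome(envelope: dict) -> str:
    """Map a stage-2 subprocess result envelope to a memory.episode outcome.

    Hypothetical helper: the real branch table lives in _generate_episode.
    """
    status = envelope.get("status", "")
    if status == "timeout":
        return "timeout"
    # Subprocess-level faults: non-zero exits AND is_error envelopes
    # (e.g. budget exhaustion) collapse to subprocess_error, so budget
    # burns are not mixed in with genuine JSON-parse failures.
    if status.startswith("exit_") or envelope.get("is_error"):
        return "subprocess_error"
    # Everything that survives the subprocess but fails to parse.
    return "parse_error"
```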
The float-equality flag for delta-from-default emission in install.py
is consistent with the existing stage-1 pattern; left as-is.
Review by Kai

Looking at the diff against the prior review thread: all three prior warnings/suggestions were addressed.

PR #387 Follow-up Review

Two new findings; everything from the prior review is resolved. Warning:
One real bug, one verification.

1. duration_ms in stage-2 logs was including semaphore-wait time. `start = time.monotonic()` was set before `async with sem`, so a second episode-worthy turn for the same user, arriving while the first was still in flight, reported its queue wait as generation latency. Operators monitoring the documented 5-30s expected range would see queue contention as slow generations. The fix mirrors stage 1's pattern: pre-acquire init for the exception-fallback path, in-acquire reassignment for the actual value used on every successful call. A new regression test holds the semaphore for 300ms, then asserts the logged duration_ms is well under that wait.
2. Verified --system-prompt is a real claude CLI flag (already in use by stage 1; `claude --help | grep system` confirms). No code change needed; the suggestion was a defensive verification ask.
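The timing fix can be sketched as follows. Every name here is hypothetical; only the `time.monotonic()` placement mirrors the described fix: the pre-acquire init covers the exception-fallback path, and the in-acquire reassignment is the value used on every successful call.

```python
import asyncio
import time

async def timed_generation(sem: asyncio.Semaphore) -> float:
    start = time.monotonic()          # fallback if we never acquire
    async with sem:
        start = time.monotonic()      # real start: queue wait excluded
        await asyncio.sleep(0.01)     # stand-in for episode generation
    return (time.monotonic() - start) * 1000

async def demo() -> float:
    sem = asyncio.Semaphore(1)

    async def hold_semaphore() -> None:
        async with sem:
            await asyncio.sleep(0.3)  # a prior episode still in flight

    holder = asyncio.ensure_future(hold_semaphore())
    await asyncio.sleep(0)            # let the holder acquire first
    duration_ms = await timed_generation(sem)
    await holder
    return duration_ms

# Despite ~300ms of queue wait, the reported duration covers only the
# ~10ms of actual work.
duration_ms = asyncio.run(demo())
```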
Review by Kai

Both prior review warnings are confirmed resolved in this diff. Suggestion:
Single finding: memory_id and reason had asymmetric presence in the memory.episode log JSON. memory_id was always emitted (even as null on every failure path); reason was conditional (omitted on success). A log query like `outcome != "stored" AND memory_id IS NOT NULL` silently matches nothing because the failure path emits literal null, not absent. Fix: align to "both conditional, presence-symmetric." memory_id is emitted only on outcome=stored; reason is emitted only on non-stored outcomes. The two are now mutually exclusive partitions of the log stream rather than one-always-null plus one-conditional. Two existing tests updated:

- success-path test: kept the `memory_id == "fake-id"` assertion and added a comment pinning the symmetric contract; the existing `"reason" not in payload` assertion still holds.
- store-failed-path test: replaced `memory_id is None` with `"memory_id" not in payload` to assert the new omission contract.

Low severity; no behavior change beyond log shape. The reviewer explicitly called this non-blocking.
Review by Kai

PR #387 Follow-up Review (pass 4)

All three findings from the prior review are resolved. Warning:
Reviewer flagged a `TypeError` risk in test_episode_classifier_eval.py calling `_build_extraction_payload(user, assistant)` with two args. Verification: the function signature has `candidates: list[MemoryResult] | None = None`, so the 2-arg call does not raise. Confirmed via direct invocation that `_build_extraction_payload(u, a)`, `_build_extraction_payload(u, a, None)`, and `_build_extraction_payload(u, a, [])` all produce identical output; the function checks `if candidates:` which is falsy for all three. The reviewer's headline claim was therefore incorrect, but the underlying cosmetic suggestion is valid: passing `[]` explicitly mirrors the production call shape in extract_and_store under the memory_consolidation_candidates_n=0 kill-switch branch, removing one inferential step for a future reader. Comment added so the "cosmetic, not behavioral" intent is explicit. No production code change.
Review by Kai

This is pass 5. All five prior findings are confirmed resolved in the visible diff.

PR #387 Follow-up Review (pass 5)

Clean. No new findings. Prior-review resolutions confirmed.
Spot checks on new code
The diff is truncated before the end.
Per operator decision: shipping with Haiku as the stage-2 default risked giving the wrong impression of feature efficacy, since the narrative-quality regime is exactly where Sonnet pulls ahead. The prior cost-asymmetry framing does not apply to Kai's deployment shape: on a Max-plan OAuth subscription, headless `claude --print` calls do not bill per-token.

Changes:

- Wizard recommends `claude-sonnet-4-6` as the stage-2 model default; an operator who hits Enter at the prompt now writes MEMORY_EPISODE_MODEL=claude-sonnet-4-6 to the env file.
- Wizard budget default raised from 0.05 to 0.15 to fit Sonnet on the full uncapped (user, assistant) pair. Prompt label updated.
- Dataclass default for memory_episode_budget_usd raised to 0.15 in lockstep, so an operator who skips the wizard or removes the env line gets a Sonnet-sized safety rail rather than a Haiku-sized one (which would burn the ceiling on every Sonnet call). The updated comment captures the new rationale and the Max-plan billing context.
- load_config() default string updated to "0.15".
- Wizard emission gate updated from `!= 0.05` to `!= 0.15` so an operator who accepts the wizard default suppresses the redundant env entry.
- .env.example updated.

Tests:

- test_config.TestMemoryEpisode.test_defaults asserts the new 0.15 budget default; the docstring is rewritten to capture the inheritance + wizard-recommendation split.
- test_budget_override switched its override value from 0.15 to 0.30 so it tests the override path rather than coincidentally matching the new default.
- test_install.test_memory_enabled_writes_tunables: the input list previously used "" for the episode model to test the inheritance fallback; that path is no longer reachable via the wizard since the prompt now has a non-blank default. Switched to a literal model override.
- New test test_memory_episode_wizard_default_writes_sonnet pins the v1 default flip: an operator who accepts every wizard default in the episode block ends up with Sonnet written.
- _cfg in test_memory_episodes.py and the is_error envelope cost value updated to 0.15 to reflect the Sonnet-sized scenario. Stage 2 still inherits from MEMORY_EXTRACTION_MODEL when the env var is unset; this is the safety floor for tests and for operators who deliberately want both stages on the same model.
Review by Kai

This is a sixth pass following five previous reviews; the prior passes' findings are all confirmed fixed.

PR #387 Review (pass 6)

Suggestion:
Single cosmetic finding: install.py's pre-wizard init for memory_episode_budget_usd was still "0.05", but the wizard prompt default, the emission gate, the dataclass default, and the load_config() env default all moved to 0.15 in the Sonnet-default flip. Dead code (the wizard always overwrites the variable before the gate runs and the non-extraction path pops the key entirely), but a reader tracing the data flow saw three different sentinel values for the same logical default. Fix: aligns the pre-init to "0.15". Comment clarifies the "wizard default = pre-init = gate sentinel" contract, and explicitly documents the model field as the deliberate exception (empty pre-init for inheritance, non-empty wizard default for the Sonnet recommendation).
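The "wizard default = pre-init = gate sentinel" contract can be sketched as a delta-from-default emission gate. Names and values mirror the description above; the helper itself is illustrative, not install.py's actual code.

```python
# Dataclass default, wizard prompt default, pre-wizard init, and the
# emission gate sentinel must all agree on this value.
EPISODE_BUDGET_DEFAULT = 0.15

def emit_budget_line(value: float) -> list[str]:
    """Emit the env line only when the value differs from the default,
    so an operator who accepts the wizard default writes nothing."""
    lines = []
    if value != EPISODE_BUDGET_DEFAULT:  # gate sentinel == pre-init value
        lines.append(f"MEMORY_EPISODE_BUDGET_USD={value}")
    return lines
```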
Review by Kai

PR #387 Review (pass 7)

All six prior findings are confirmed resolved in the current diff. The visible portion of the diff is clean. One new observation below. Suggestion:
Single suggestion: name the stage-2 asyncio task so it shows up as `episode-<user_id>` in `asyncio.all_tasks()` dumps instead of an auto-generated `Task-N` placeholder. An operator triaging a stuck event loop can then immediately identify which user's stage-2 work is in flight without cross-referencing `_pending_episode_tasks`. Comment on the spawn site documents the secondary-affordance intent (the `_pending_episode_tasks` set is still the primary operational tool; the name is for ad-hoc debugging). New test `test_stage2_task_is_named_for_incident_triage` pins the `episode-<user_id>` name format so a future change that strips the user_id would surface.
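A minimal sketch of the named-task suggestion. The `name=` parameter of `asyncio.create_task` is real stdlib API (Python 3.8+); the coroutine body and function names here are placeholders.

```python
import asyncio

async def _episode_stub(user_id: str) -> None:
    await asyncio.sleep(0)  # placeholder for the stage-2 work

async def spawn_named(user_id: str) -> str:
    # The name surfaces as "episode-<user_id>" in asyncio.all_tasks()
    # dumps instead of an auto-generated "Task-N" placeholder.
    task = asyncio.create_task(_episode_stub(user_id),
                               name=f"episode-{user_id}")
    await task
    return task.get_name()

name = asyncio.run(spawn_named("u42"))
```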
Review by Kai

PR #387 Review (pass 8)

All seven prior findings are confirmed resolved in the current diff. One new observation below. Suggestion: a misleading inline comment in the new wizard-default test.
The test_memory_episode_wizard_default_writes_sonnet test had a misleading inline comment: "0.05" was annotated as "matches stage-1 dataclass default; suppressed" but the actual stage-1 dataclass default is 0.01, so 0.05 would have been emitted to the env file (the gate is `!= 0.01`). The test passed only because it never asserted on MEMORY_EXTRACTION_BUDGET_USD; a future test author copying the pattern would have inherited the wrong mental model of the dataclass defaults. Fix: change the input from "0.05" to "0.01" so it actually exercises the suppression path the comment claimed. Same for the broader input list - every input now matches its corresponding dataclass default. Added negative assertions on MEMORY_EXTRACTION_BUDGET_USD, MEMORY_EXTRACTION_TIMEOUT_S, and MEMORY_CONSOLIDATION_CANDIDATES_N so the suppression contract is now self-checking; a future change to the gate semantics for any of those keys surfaces here.
Review by Kai

PR #387 Review (pass 9)

Pass 8 finding resolved. Clean, no new findings. All eight prior findings are confirmed resolved in this diff.
Spot-checked in this pass. The PR is ready to merge.
Summary
Adds a second memory type alongside facts: episodes, which capture "what happened, what we learned" via a Sophia-shaped record. Two-stage flow piggybacks on the existing Haiku extraction subprocess for classification, then spawns a separate fire-and-forget subprocess on positives to generate the episode.
Fixes #385.
Implementation
Stage 1 (modified)
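A hedged sketch of the stage-1 return contract (parsing details and the parser name are illustrative; `ExtractionResult` and `has_episode` follow the PR text): every early-exit path yields `has_episode=False`, so a flaky extraction can never falsely trigger stage 2.

```python
from dataclasses import dataclass, field

@dataclass
class ExtractionResult:
    facts: list = field(default_factory=list)
    has_episode: bool = False  # safe default on every early exit

def parse_extractor_output(raw: object) -> ExtractionResult:
    """Hypothetical parser: any non-dict payload (timeout sentinel,
    invalid JSON, error envelope) falls back to the safe default."""
    if not isinstance(raw, dict):
        return ExtractionResult()
    return ExtractionResult(
        facts=raw.get("facts", []),
        has_episode=bool(raw.get("has_episode", False)),
    )
```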
- `_FACT_SCHEMA` gains a `has_episode: boolean` root field. Required, so the field is always present; one extra output token per call, no new subprocess.
- `_run_extractor` now returns an `ExtractionResult(facts, has_episode)` dataclass. Every early-exit path (timeout, non-zero exit, invalid JSON, is_error envelope, non-dict parsed) defaults `has_episode=False`, so a flaky extraction can never falsely fire stage 2.
- `_EXTRACTION_SYSTEM_PROMPT` adds an EPISODE CLASSIFICATION section between FORMAT and CONSOLIDATION; `_EXTRACTION_PROMPT_VERSION` bumped to `"3"`.

Stage 2 (new)
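A hedged sketch of the stage-2 spawn pattern (the coroutine body and demo are placeholders; the strong-ref set, done-callback, and per-user `Semaphore(1)` cache follow the PR text). asyncio keeps only weak references to Tasks, so the module-level set is what keeps an in-flight episode alive.

```python
import asyncio

_pending_episode_tasks: set[asyncio.Task] = set()
_per_user_episode_semaphores: dict[str, asyncio.Semaphore] = {}

async def _generate_episode(user_id: str) -> None:
    sem = _per_user_episode_semaphores.setdefault(user_id, asyncio.Semaphore(1))
    async with sem:  # at most one stage-2 generation per user in flight
        await asyncio.sleep(0)  # placeholder for the subprocess call

def spawn_episode_task(user_id: str) -> asyncio.Task:
    task = asyncio.create_task(_generate_episode(user_id))
    _pending_episode_tasks.add(task)                        # strong ref
    task.add_done_callback(_pending_episode_tasks.discard)  # self-clean
    return task

async def demo() -> int:
    await spawn_episode_task("u1")
    await asyncio.sleep(0)  # let the done-callback run
    return len(_pending_episode_tasks)

remaining = asyncio.run(demo())
```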
- `_generate_episode` task body, spawned via `asyncio.create_task` AFTER `_store_facts` returns, so stage-1 facts are durable before stage 2 hits the loop.
- `_pending_episode_tasks` set with `task.add_done_callback(set.discard)` to defend against asyncio's documented weak-ref-only Task tracking. The pattern matches `webhook.py`'s `_background_tasks`.
- `Semaphore(1)` cache (`_per_user_episode_semaphores`) is independent of the stage-1 semaphore: a concurrent stage-1 call for the same user while stage 2 is in flight is desirable (stage 1 is the user's next turn).
- Single `memory.episode` log line per call. Payload: `user_id`, `outcome` (stored/timeout/subprocess_error/parse_error/store_failed), `memory_id`, `cost_usd`, `duration_ms`, `reason` (omitted on success).
- `_run_episode_extractor` reuses the stage-1 subprocess flag set verbatim except for model / budget / timeout / schema / system prompt. Same `--tools ""`, `--permission-mode bypassPermissions`, `--no-session-persistence`, allow-listed env. The security review of stage 1 transfers cleanly.

Retrieval
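A hedged sketch of the episode line shape rendered into context (the record/metadata shape and helper name are assumptions; the rendered format and the first-line fallback follow the PR text):

```python
from datetime import date

def render_episode(created: date, meta: dict, text: str) -> str:
    """Render one episode in the Sophia moderate-relevance form.
    Hypothetical helper; the real rendering lives in format_context."""
    # Defensive fallback to the first line of the raw text when the
    # goal metadata is missing.
    goal = meta.get("goal") or text.split("\n")[0]
    quality = meta.get("quality", "unknown")
    outcome = meta.get("outcome", "unknown")
    return f"- ({created.isoformat()}, episode, {quality}) {goal}. Outcome: {outcome}"
```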
- `_SOURCE_WEIGHTS` and `_SOURCE_SHORT` gain `"episode"` entries. Equal weight to extracted facts in v1 (both are high-signal curated content).
- `format_context` renders episodes in the Sophia "moderate relevance" form: `- (YYYY-MM-DD, episode, <quality>) <goal>. Outcome: <outcome>`. The remaining Sophia fields (`context`, `approach`, `lessons`, `tags`, `actors`) are stored but not rendered inline in v1. Defensive fallback to `r.text.split("\n")[0]` when `goal` metadata is missing.

Configuration
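A hedged sketch of the model-inheritance contract (the helper name and env-dict parameter are illustrative; the resolution expression mirrors the one quoted in the review thread): an explicit MEMORY_EPISODE_MODEL wins, otherwise stage 2 inherits whatever `memory_extraction_model` resolved to.

```python
def resolve_episode_model(env: dict, memory_extraction_model: str) -> str:
    # Blank or whitespace-only values fall through to inheritance,
    # so only a deliberately set env var takes precedence.
    return env.get("MEMORY_EPISODE_MODEL", "").strip() or memory_extraction_model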
- New tunables: `memory_episode_model`, `memory_episode_budget_usd` (default `0.05`, `> 0`), `memory_episode_timeout_s` (default `120`, `>= 10`).
- `memory_episode_model` defaults to whatever `memory_extraction_model` is set to, so an operator who only flips MEMORY_EXTRACTION_MODEL also moves stage 2 onto the new model. An explicit MEMORY_EPISODE_MODEL takes precedence.
- `_cmd_config` gated on `memory_extraction_enabled`. Episode env keys are popped on non-claude backends, mirroring the extraction-key cleanup.

Tradeoffs
- A restart between `create_task` and `add_structured` drops the in-flight episode. Cheaper than persistent task durability for a feature whose value is cumulative.
- Episodes are generated from a single `(user_msg, assistant_msg)` pair, uncapped on either side. Multi-turn arc situations surface only as adjacent single-turn episodes; cross-turn aggregation is a follow-up evaluation question.
- Episode consolidation (the analogue of `intent=update_of` for facts) is a follow-up.

Tests
- New tests in `tests/test_memory_episodes.py` across schema/classifier, stage-2 trigger, stage-2 isolation, storage/telemetry, retrieval surfacing, and subprocess command assembly.
- `tests/test_memory_extraction.py` updated for the `{facts, has_episode}` required set.
- New tests in `tests/test_config.py` for the 3 episode config fields, including the model-inheritance contract and the strict-positive budget bound.
- `tests/test_eval_retrieval.py` fixture updated to include `"episode"` in its `_SOURCE_WEIGHTS` snapshot so cross-suite ordering does not strip the new entry.
- `tests/data/episode_classifier_labeled.jsonl` (16 episode-worthy / 16 routine, scrubbed for personal names + local paths).
- `tests/test_episode_classifier_eval.py`: skipped by default; run with `RUN_CLASSIFIER_EVAL=1 pytest tests/test_episode_classifier_eval.py -s` to produce `tests/artifacts/episode_classifier_eval.json` for reviewer threshold-setting. v1 does not assert a hard threshold; the working target is precision >= 0.7, recall >= 0.6.

Test plan
- `make check` (ruff check + format) — local: passes
- `make test` (full suite, 2480 tests + 1 skipped) — local: passes
- Run `RUN_CLASSIFIER_EVAL=1 pytest tests/test_episode_classifier_eval.py -s`; attach the JSON artifact to this PR for reviewer threshold-setting
- Watch `memory.episode` log lines in production: confirm `outcome=stored` is the dominant outcome, `cost_usd` aligns with the $0.05 ceiling, and `duration_ms` clusters in the expected 5-30s range
- Inspect `format_context` output in a fresh session that has at least one stored episode: confirm the `(YYYY-MM-DD, episode, <quality>)` line shape renders inline among facts

Out of scope
- `/memory` command UI (`get_by_id`, `get_all`, `get_stats`, `get_by_tag` keep their `source == "extracted"` filter and continue to return facts only). Follow-up.
- Inline rendering of the remaining Sophia fields (`context`, `approach`, and `lessons`). v1 renders the moderate-relevance form only; tiered rendering is a follow-up.
- Cross-turn aggregation beyond the single `(user_msg, assistant_msg)` pair. Revisit only after the single-turn baseline has been evaluated in production.