Skip to content

feat(memory): add stage-2 episode generation (#385)#387

Merged
dcellison merged 9 commits intomainfrom
feature/episode-memory-385
Apr 25, 2026
Merged

feat(memory): add stage-2 episode generation (#385)#387
dcellison merged 9 commits intomainfrom
feature/episode-memory-385

Conversation

@dcellison
Copy link
Copy Markdown
Owner

Summary

Adds a second memory type alongside facts: episodes, which capture "what happened, what we learned" via a Sophia-shaped record. Two-stage flow piggybacks on the existing Haiku extraction subprocess for classification, then spawns a separate fire-and-forget subprocess on positives to generate the episode.

Fixes #385.

Implementation

Stage 1 (modified)

  • _FACT_SCHEMA gains a has_episode: boolean root field. Required so the field is always present; one extra output token per call, no new subprocess.
  • _run_extractor now returns an ExtractionResult(facts, has_episode) dataclass. Every early-exit path (timeout, non-zero exit, invalid JSON, is_error envelope, non-dict parsed) defaults has_episode=False so a flaky extraction can never falsely fire stage 2.
  • _EXTRACTION_SYSTEM_PROMPT adds an EPISODE CLASSIFICATION section between FORMAT and CONSOLIDATION; _EXTRACTION_PROMPT_VERSION bumped to "3".

Stage 2 (new)

  • _generate_episode task body, spawned via asyncio.create_task AFTER _store_facts returns so stage-1 facts are durable before stage 2 hits the loop.
  • Strong refs held in a module-level _pending_episode_tasks set with task.add_done_callback(set.discard) to defend against asyncio's documented weak-ref-only Task tracking. Pattern matches webhook.py's _background_tasks.
  • Per-user Semaphore(1) cache (_per_user_episode_semaphores) is independent of the stage-1 semaphore: a concurrent stage-1 call for the same user during a stage-2 in-flight is desirable (stage 1 is the user's next turn).
  • Single memory.episode log line per call. Payload: user_id, outcome (stored/timeout/subprocess_error/parse_error/store_failed), memory_id, cost_usd, duration_ms, reason (omitted on success).
  • _run_episode_extractor reuses the stage-1 subprocess flag set verbatim except for model / budget / timeout / schema / system prompt. Same --tools "", --permission-mode bypassPermissions, --no-session-persistence, allow-listed env. Security review of stage 1 transfers cleanly.

Retrieval

  • _SOURCE_WEIGHTS and _SOURCE_SHORT gain "episode" entries. Equal weight to extracted facts in v1 (both high-signal curated content).
  • format_context renders episodes in the Sophia "moderate relevance" form: - (YYYY-MM-DD, episode, <quality>) <goal>. Outcome: <outcome>. Remaining Sophia fields (context, approach, lessons, tags, actors) stored but not rendered inline in v1. Defensive fallback to r.text.split("\n")[0] when goal metadata is missing.

Configuration

  • 3 new fields: memory_episode_model, memory_episode_budget_usd (default 0.05, > 0), memory_episode_timeout_s (default 120, >= 10).
  • memory_episode_model defaults to whatever memory_extraction_model is set to, so an operator who only flips MEMORY_EXTRACTION_MODEL also moves stage 2 onto the new model. Explicit MEMORY_EPISODE_MODEL takes precedence.
  • 3 new wizard prompts in _cmd_config gated on memory_extraction_enabled. Episode env keys popped on non-claude backends, mirroring the extraction-key cleanup.

Tradeoffs

  • Eventual consistency: a user message arriving within 1-30s of an episode-worthy turn may not see the new episode in retrieval. Acceptable because episodes are about cumulative pattern recall over many sessions, not per-turn recall.
  • Process restart loses in-flight episodes: stage 2 lives only as long as the parent process. A restart between create_task and add_structured drops the in-flight episode. Cheaper than persistent task durability for a feature whose value is cumulative.
  • Single-turn input scope: stage 2 is given exactly one (user_msg, assistant_msg) pair, uncapped on either side. Multi-turn arc situations surface only as adjacent single-turn episodes; cross-turn aggregation is a follow-up evaluation question.
  • No dedup in v1: episode consolidation (analogue of intent=update_of for facts) is a follow-up.

Tests

  • 31 new tests in tests/test_memory_episodes.py across schema/classifier, stage-2 trigger, stage-2 isolation, storage/telemetry, retrieval surfacing, and subprocess command assembly.
  • Stage-1 schema test in tests/test_memory_extraction.py updated for the {facts, has_episode} required set.
  • 11 new tests in tests/test_config.py for the 3 episode config fields, including the model-inheritance contract and the strict-positive budget bound.
  • 2 install tests updated to provide answers for the 3 new wizard prompts.
  • tests/test_eval_retrieval.py fixture updated to include "episode" in its _SOURCE_WEIGHTS snapshot so cross-suite ordering does not strip the new entry.
  • 32 hand-labeled (user, assistant) exchanges in tests/data/episode_classifier_labeled.jsonl (16 episode-worthy / 16 routine, scrubbed for personal names + local paths).
  • Opt-in evaluation harness tests/test_episode_classifier_eval.py: skipped by default, run with RUN_CLASSIFIER_EVAL=1 pytest tests/test_episode_classifier_eval.py -s to produce tests/artifacts/episode_classifier_eval.json for reviewer threshold-setting. v1 does not assert a hard threshold; working target is precision >= 0.7, recall >= 0.6.

Test plan

  • make check (ruff check + format) — local: passes
  • make test (full suite, 2480 tests + 1 skipped) — local: passes
  • Run RUN_CLASSIFIER_EVAL=1 pytest tests/test_episode_classifier_eval.py -s; attach the JSON artifact to this PR for reviewer threshold-setting
  • After deploy, observe a handful of memory.episode log lines in production: confirm outcome=stored is the dominant outcome, cost_usd aligns with the $0.05 ceiling, and duration_ms clusters in the expected 5-30s range
  • After deploy, check format_context output in a fresh session that has at least one stored episode: confirm the (YYYY-MM-DD, episode, <quality>) line shape renders inline among facts

Out of scope

  • Surfacing episodes in the /memory command UI (get_by_id, get_all, get_stats, get_by_tag keep their source == "extracted" filter and continue to return facts only). Follow-up.
  • Manual episode add/edit/delete via slash command. Follow-up.
  • Episode consolidation / deduplication. Follow-up.
  • Tiered retrieval (high-relevance full-trace rendering that expands to include context, approach, and lessons). v1 renders the moderate-relevance form only; tiered is a follow-up.
  • Tag-based filtering in retrieval. Episodes store domain tags; the filter is a follow-up.
  • Expanding the stage-2 input window beyond a single (user_msg, assistant_msg) pair. Revisit only after the single-turn baseline has been evaluated in production.
  • Persistent task durability across restarts.

Adds a second memory type alongside facts: episodes, which capture
"what happened, what we learned" via a Sophia-shaped record. Two-stage
flow piggybacks on the existing Haiku extraction subprocess for
classification, then spawns a separate fire-and-forget subprocess on
positives to generate the episode.

- Stage 1: _FACT_SCHEMA gains a has_episode boolean. Single extra
  output token per call; no new subprocess. _run_extractor returns an
  ExtractionResult dataclass so every early-exit path defaults
  has_episode=False (a flaky extraction can never falsely fire stage 2).

- Stage 2: new _generate_episode task spawned via asyncio.create_task
  AFTER _store_facts returns. Strong refs held in _pending_episode_tasks
  + done-callback to defend against asyncio's weak-ref-only Task
  tracking. Per-user Semaphore(1) cache, independent of stage 1 (a
  concurrent stage-1 call for the same user during stage-2 in-flight
  is desirable). Single memory.episode log line per call with
  outcome/cost_usd/duration_ms/memory_id.

- Retrieval: _SOURCE_WEIGHTS and _SOURCE_SHORT gain "episode" entries.
  format_context renders episodes as
  "- (YYYY-MM-DD, episode, <quality>) <goal>. Outcome: <outcome>"
  using the Sophia moderate-relevance form; remaining fields stored
  but not rendered inline in v1.

- Config: 3 new tunables (model, budget, timeout). Episode model
  defaults to memory_extraction_model so an operator who only set
  MEMORY_EXTRACTION_MODEL also moves stage 2 onto the new model.

- Wizard: 3 new prompts gated on memory_extraction_enabled. Episode
  keys popped on non-claude backends, mirroring the extraction-key
  cleanup.

- Tests: 31 new tests across schema/classifier (8.1), trigger (8.2),
  isolation (8.3), storage/telemetry (8.4), retrieval (8.4 cont'd,
  and subprocess assembly. Plus an opt-in classifier eval against 32
  hand-labeled exchanges (16/16 split), gated by RUN_CLASSIFIER_EVAL=1
  so it does not run in CI or burn API tokens during make test.

- Stage-2 failures cannot affect stage-1 facts: stage 2 is scheduled
  AFTER _store_facts returns, runs as a fire-and-forget task wrapped
  in a broad try/except, and emits a single log line on every path.

- Eventual consistency between episode storage and retrieval: a user
  message arriving within 1-30s of an episode-worthy turn may not
  see the new episode. Acceptable because episodes are about
  cumulative pattern recall over many sessions, not per-turn recall.

Fixes #385
EOF
)
@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

The file on disk is pre-merge. The diff is the authority here. Here is the review:


PR #387 Review: feat(memory): add stage-2 episode generation

Overall: well-engineered. The code is clearly structured, the failure-isolation strategy is sound, the strong-ref pattern for background tasks is correct, and the test coverage is thorough. Three real issues below.


Warning: is_error budget exhaustion maps to "parse_error" outcome

File: memory_extraction.py, _generate_episode (the if episode is None branch added in the diff)

_run_episode_extractor returns a reason string like "is_error subtype=error_max_budget_usd" when the CLI exits 0 but reports is_error=true. In _generate_episode:

if run_reason == "timeout":
    outcome = "timeout"
elif run_reason and run_reason.startswith("exit_"):
    outcome = "subprocess_error"
elif run_reason and run_reason.startswith("invalid_json"):
    outcome = "parse_error"
else:
    outcome = "parse_error"   # <-- catches is_error, non_object_envelope, missing_episode_field

"is_error subtype=error_max_budget_usd" falls to the else branch and becomes "parse_error". Budget exhaustion is an API/subprocess-level fault, not a content-parsing fault. The PR's own test plan calls for operators to watch cost_usd alignment post-deploy; if they also filter logs by outcome=parse_error to triage failures, they will see budget burns mixed with genuine JSON problems. The reason field preserves the full detail, but the outcome label is misleading.

Fix: add an elif run_reason and run_reason.startswith("is_error"): branch mapping to "subprocess_error", which is semantically correct (the subprocess reported an error condition).


Warning: return stored after try block - silent NameError path under one new branch

File: memory_extraction.py, extract_and_store (diff around line 1744)

Before the PR, the code returned early when no facts were found (return 0 inside the if not facts block), so stored was always assigned before the post-try log line. After the PR, stored = replaced = skipped = 0 is set in the else branch, which is fine for the no-facts path. However, the code now reads:

if result.facts:
    stored, replaced, skipped = await loop.run_in_executor(...)
else:
    stored = replaced = skipped = 0
if result.has_episode:
    ep_task = asyncio.create_task(...)
    ...
# (end of try block)

If result.facts is truthy and run_in_executor raises (e.g., _store_facts raises unexpectedly), execution jumps to except Exception: return 0. stored is never defined but we never reach return stored either - so no actual bug today. The concern is that if create_task itself raised (extremely unlikely), it would also hit except Exception: return 0, which is fine. This is currently safe, but worth a note: the variable lifecycle relies on all exception paths short-circuiting before the post-try log.

This is a warning rather than critical because the current exception handlers do all return before reaching return stored.


Suggestion: memory_episode_model dataclass default is never used at runtime

File: config.py, line ~510

memory_episode_model: str = "claude-haiku-4-5-20251001"

load_config() unconditionally sets this field via:

memory_episode_model = os.environ.get("MEMORY_EPISODE_MODEL", "").strip() or memory_extraction_model

The dataclass literal "claude-haiku-4-5-20251001" is never the effective default for any running process - memory_extraction_model is. The comment above the field documents this, but the literal still misleads a reader who grep-matches to find "what is the default model." A more honest default is "" with the inheritance documented only in load_config() where it actually happens. Minor, but consistent with how the rest of the config fields work (they all have the actual default as the literal).


Suggestion: float equality for delta-from-default in install.py

File: install.py, diff around line 1018

if float(memory_episode_budget_usd) != 0.05:
    env["MEMORY_EPISODE_BUDGET_USD"] = memory_episode_budget_usd

Floating-point equality with a literal like 0.05 is fragile: float("0.05") == 0.05 is True in CPython for this specific value, but the pattern is hazardous. If a future value like "0.1" is involved in a similar delta check elsewhere, float("0.1") != 0.1 can be True due to IEEE 754 rounding. The worst-case consequence here is writing an unnecessary key to the env file, which is low severity. The stage-1 equivalent check for memory_extraction_budget_usd uses the same pattern, so this is consistent with existing code - just flagging it.


Clean items (explicitly verified)

  • Strong-ref task pattern: _pending_episode_tasks.add(ep_task) + ep_task.add_done_callback(_pending_episode_tasks.discard) is correct. set.discard takes one argument matching the done-callback signature callback(task).
  • ExtractionResult default-False on all failure paths: all five early-exit paths in _run_extractor return ExtractionResult(facts=[], has_episode=False). bool(payload_root.get("has_episode")) defensively coerces missing/non-bool to False.
  • Semaphore independence: stage-2 semaphore dict is separate from stage-1's, correctly allowing a concurrent stage-1 call while stage-2 is in-flight.
  • CancelledError re-raise: both _generate_episode and extract_and_store re-raise CancelledError after logging, and skip the episode log line on cancellation (intentional, documented).
  • Subprocess security: list-based create_subprocess_exec (no shell), env allowlist, --permission-mode bypassPermissions, --tools "", --no-session-persistence. Role-label stripping carried over from stage 1.
  • Convention compliance: all six steps from the "Adding a New Env Var" section are covered. Step 3 (pool.py/claude.py passthrough) is correctly skipped because these fields affect the extraction subprocess, not the inner Claude.

Three findings from the review:

1. is_error envelope (e.g. budget exhaustion) was being mislabeled as
   outcome=parse_error in _generate_episode. The branch table only
   matched timeout/exit_*/invalid_json explicitly; everything else
   collapsed to parse_error. Now subprocess-level faults (exit_* AND
   is_error) collapse to subprocess_error, so operators triaging by
   outcome do not see budget burns mixed with genuine JSON-parse
   failures. New test covers the is_error mapping.

2. Pre-initialize stored=replaced=skipped=0 before the try block in
   extract_and_store. Pre-PR the no-facts path returned early, so the
   variables were always assigned before the post-try summary log; the
   stage-2 restructure removed that early-return guarantee. The
   pre-acquire init is a defensive fallback - the in-try assignment
   still wins on every successful path.

3. memory_episode_model dataclass default changed from a model name
   literal to "" so the literal accurately reflects "this gets
   overridden by load_config()". Inheritance happens in load_config
   via `os.environ.get("MEMORY_EPISODE_MODEL", "").strip() or
   memory_extraction_model`. Test fixtures that construct Config
   directly already set this explicitly.

The float-equality flag for delta-from-default emission in install.py
is consistent with the existing stage-1 pattern; left as-is.
@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

Looking at the diff against the prior review thread: all three prior warnings/suggestions were addressed - is_error now correctly maps to subprocess_error, stored = replaced = skipped = 0 is pre-initialized before the try block, and memory_episode_model defaults to "". I'll focus on what's new.


PR #387 Follow-up Review

Two new findings; everything from the prior review is resolved.


Warning: duration_ms in stage-2 logs includes semaphore wait time

File: memory_extraction.py, _generate_episode

start = time.monotonic()          # clock starts here
sem = _get_episode_semaphore(user_id)
...
try:
    async with sem:               # semaphore acquired here

Stage 1 explicitly restarts the clock after acquiring the semaphore ("Restart the clock AFTER acquiring the per-user semaphore" is literally in the new comment). Stage 2 does not - start is set before async with sem:. Under the common case (one in-flight episode per user) there is no contention and the delta is negligible. But if two episode-worthy turns arrive in quick succession, the second call's duration_ms will include queue wait time, conflating wait with generation latency. Operators watching duration_ms in production to confirm "clusters in the expected 5-30s range" (from the test plan) will see outliers that look like slow generations but are actually queue waits.

Fix: move start = time.monotonic() to immediately after async with sem:, mirroring stage 1.


Suggestion: --system-prompt flag not present in stage-1 flag set - verify against claude --help

File: memory_extraction.py, _run_episode_extractor

The PR description lists system prompt as one of the five things stage 2 changes from stage 1's flag set. The conventions rule is explicit: "Inner Claude CLI flags: verify with claude --help before use." If --system-prompt is not a real CLI flag (e.g., the actual flag is --system or it must be passed via a settings file), every stage-2 subprocess call will exit non-zero and log outcome=subprocess_error. Since stage 2 is fire-and-forget, this failure is silent to the user and would only surface in post-deploy log monitoring.

If the flag is already used in stage 1 (not visible in the truncated diff), this transfers cleanly and can be ignored. Otherwise, it warrants a one-line claude --help | grep system confirmation before merge.


Clean items confirmed in this pass

  • Prior review's is_errorsubprocess_error fix is present and correct in _generate_episode's outcome dispatch.
  • stored = replaced = skipped = 0 pre-initialization is in place before the try block.
  • memory_episode_model: str = "" with inheritance documented only in load_config() is correct.
  • format_context episode rendering handles the no-date path correctly (date_str = "" initialized before the if r.created_at: block).
  • _generate_episode never raises: CancelledError is re-raised before the log line; all other exceptions are caught and produce outcome=store_failed.
  • _EPISODE_SCHEMA lessons field is correctly optional (absent from required), preventing fabricated lessons.
  • Test fixture _reset_episode_state clears both semaphore caches and cancels in-flight tasks, preventing cross-test bleed.

One real bug, one verification.

1. duration_ms in stage-2 logs was including semaphore-wait time.
   `start = time.monotonic()` was set before `async with sem`, so a
   second episode-worthy turn for the same user arriving while the
   first was still in flight reported its queue-wait as generation
   latency. Operators monitoring the documented 5-30s expected range
   would see queue contention as slow generations.

   Fix mirrors stage 1's pattern: pre-acquire init for the
   exception-fallback path, in-acquire reassignment for the actual
   value used on every successful call. New regression test holds
   the semaphore for 300ms then asserts logged duration_ms is well
   under that wait.

2. Verified --system-prompt is a real claude CLI flag (already in
   use by stage 1; `claude --help | grep system` confirms). No code
   change needed; suggestion was a defensive verification ask.
@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

Both prior review warnings are confirmed resolved in this diff: is_error correctly routes to subprocess_error, stored = replaced = skipped = 0 is pre-initialized before the try block, memory_episode_model defaults to "", and duration_ms clock restart after semaphore acquisition is in place. One new minor finding below; everything else is clean.


Suggestion: memory_id: null always emitted on failure paths

File: memory_extraction.py, _emit_episode_log

payload: dict = {
    "user_id": user_id,
    "outcome": outcome,
    "memory_id": memory_id,      # always present, even when None
    "cost_usd": cost_usd,
    "duration_ms": duration_ms,
}
if reason is not None:
    payload["reason"] = reason   # conditionally present

reason is omitted from the JSON on success (the PR description explicitly calls this out: "reason (omitted on success)"). memory_id follows the opposite convention - always present even as null on every non-stored path. This asymmetry means an operator writing a log query like outcome != "stored" AND memory_id IS NOT NULL would silently match nothing. Low severity since outcome is the correct filter key, but the two fields use inconsistent presence semantics with no stated reason for the difference.


Clean items confirmed in this pass

  • duration_ms semaphore fix: start = time.monotonic() is correctly placed inside async with sem: in _generate_episode, mirroring stage 1. The prior review warning is resolved.
  • ExtractionResult defaults: all five early-exit paths return ExtractionResult(facts=[], has_episode=False). bool(payload_root.get("has_episode")) coerces missing/non-bool defensively. Confirmed correct.
  • Strong-ref task pattern: _pending_episode_tasks.add(ep_task) + ep_task.add_done_callback(_pending_episode_tasks.discard) - set.discard signature matches the done-callback contract.
  • CancelledError path: re-raised before _emit_episode_log, semaphore released correctly via async with exit. _emit_episode_log unreachable on cancellation as intended.
  • content and extra key access: direct episode["goal"] etc. on required schema fields - any KeyError is caught by except Exception and produces outcome=store_failed. Safe.
  • Episode spawn ordering: create_task is after run_in_executor(_store_facts) returns, so stage-1 facts are durable before stage 2 is even scheduled. The test_stage2_scheduled_after_store_facts test enforces this invariant.
  • Convention compliance: all six "Adding a New Env Var" steps covered; .env.example, config.py, load_config(), install.py, and test_config.py are all updated consistently.
  • Labeled dataset quality: the 32 episode_classifier_labeled.jsonl examples are well-constructed - the episode/non-episode split covers the boundary cases (partial outcome, mid-arc turns, lookup-only exchanges) and the notes field on entry 1 confirms the canonical positive case.

The PR is in good shape. The one new finding above is a log ergonomics suggestion and does not block merge.

Single finding: memory_id and reason had asymmetric presence in the
memory.episode log JSON. memory_id was always emitted (even as null
on every failure path); reason was conditional (omitted on success).
A log query like `outcome != "stored" AND memory_id IS NOT NULL`
silently matches nothing because the failure path emits literal
null, not absent.

Fix: aligns to "both conditional, presence-symmetric." memory_id is
emitted only on outcome=stored; reason is emitted only on non-stored
outcomes. The two are now mutually exclusive partitions of the log
stream rather than one-always-null + one-conditional.

Two existing tests updated:
- success-path test: replaced `memory_id == "fake-id"` (kept) with a
  comment pinning the symmetric contract; the existing
  `"reason" not in payload` assertion still holds.
- store-failed-path test: replaced `memory_id is None` with
  `"memory_id" not in payload` to assert the new omission contract.

Low severity; no behavior change beyond log shape. The reviewer
explicitly called this non-blocking.
@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

PR #387 Follow-up Review (pass 4)

All three findings from the prior review are resolved: memory_id and reason are now both conditionally present in _emit_episode_log, and the docstring explicitly documents the symmetric presence semantics. One new finding below.


Warning: test_episode_classifier_eval.py calls _build_extraction_payload with two arguments

File: tests/test_episode_classifier_eval.py, line ~99

payload = _build_extraction_payload(user, assistant)

Everywhere else in the codebase the function is called with three arguments, including candidates:

# extract_and_store (memory_extraction.py)
payload = _build_extraction_payload(user_text, assistant_capped, candidates)

If candidates has no default, this call fails at runtime with TypeError: _build_extraction_payload() missing 1 required positional argument: 'candidates'. The test is skipped by make test, so this only surfaces when a reviewer runs RUN_CLASSIFIER_EVAL=1 pytest tests/test_episode_classifier_eval.py -s - which the PR test plan explicitly asks reviewers to do. The fix is _build_extraction_payload(user, assistant, []) (empty candidates matches the memory_consolidation_candidates_n=0 already set in _eval_config()).


Clean items confirmed this pass

  • _emit_episode_log symmetry: both memory_id and reason are now conditionally emitted (if memory_id is not None, if reason is not None). The docstring accurately describes the mutually-exclusive partition semantics.
  • store_failed + reason=None gap: traced all paths - the except Exception block always sets reason before _emit_episode_log is reached, so there is no silent store_failed without a reason.
  • _generate_episode never-raises contract: CancelledError is re-raised before the log line; all other paths through the outcome dispatch reach _emit_episode_log with a non-None reason or memory_id.
  • All three prior-review warnings/suggestions confirmed resolved.

The eval test arity issue is the only new finding and is straightforward to fix.

Reviewer flagged a `TypeError` risk in test_episode_classifier_eval.py
calling `_build_extraction_payload(user, assistant)` with two args.

Verification: the function signature has `candidates: list[MemoryResult]
| None = None`, so the 2-arg call does not raise. Confirmed via direct
invocation that `_build_extraction_payload(u, a)`,
`_build_extraction_payload(u, a, None)`, and
`_build_extraction_payload(u, a, [])` all produce identical output;
the function checks `if candidates:` which is falsy for all three.

The reviewer's headline claim was therefore incorrect, but the
underlying cosmetic suggestion is valid: passing `[]` explicitly
mirrors the production call shape in extract_and_store under the
memory_consolidation_candidates_n=0 kill-switch branch, removing one
inferential step for a future reader. Comment added so the
"cosmetic, not behavioral" intent is explicit.

No production code change.
@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

This is pass 5. All five prior findings are confirmed resolved in the visible diff.


PR #387 Follow-up Review (pass 5)

Clean. No new findings.

Prior-review resolutions confirmed

  • is_errorsubprocess_error routing in _generate_episode: present and correct.
  • stored = replaced = skipped = 0 pre-initialization before the try block: present.
  • duration_ms clock restart after async with sem: in _generate_episode: present (start = time.monotonic() is inside the async with sem: body).
  • _emit_episode_log symmetric presence (memory_id and reason both conditional): present.
  • _build_extraction_payload(user, assistant, []) in test_episode_classifier_eval.py: present on the visible line.

Spot checks on new code

  • ExtractionResult dataclass: all five early-exit paths in _run_extractor return ExtractionResult(facts=[], has_episode=False). Defensive bool(payload_root.get("has_episode")) coerces missing/non-bool to False. Correct.
  • Stage-2 spawn ordering in extract_and_store: create_task is after run_in_executor(_store_facts) returns. test_stage2_scheduled_after_store_facts enforces the invariant.
  • Strong-ref pattern: _pending_episode_tasks.add(ep_task) + ep_task.add_done_callback(_pending_episode_tasks.discard). set.discard signature matches the done-callback contract.
  • CancelledError re-raise: precedes _emit_episode_log, so the log is never emitted on shutdown. except Exception does not catch BaseException subclasses, so no interference.
  • format_context episode rendering: date_str = "" initialized before the if r.created_at: block; fallback to r.text.split("\n")[0] when goal is missing is safe (split on empty string yields [""], giving an empty goal rather than an IndexError).
  • Config fields: all six "Adding a New Env Var" steps covered in config.py, load_config(), install.py, .env.example, and test_config.py.

The diff is truncated before the end of tests/test_memory_episodes.py, but the visible fixture and test infrastructure (_reset_episode_state, _make_proc, _stage1_envelope, _stage2_envelope) are correctly shaped.

Per operator decision: shipping with Haiku as the stage-2 default
risked giving the wrong impression of feature efficacy, since the
narrative-quality regime is exactly where Sonnet pulls ahead. The
prior cost-asymmetry framing does not apply to Kai's deployment
shape: on a Max-plan OAuth subscription headless `claude --print`
calls do not bill per-token.

Changes:

- Wizard recommends `claude-sonnet-4-6` as the stage-2 model
  default; an operator who hits Enter at the prompt now writes
  MEMORY_EPISODE_MODEL=claude-sonnet-4-6 to the env file.
- Wizard budget default raised from 0.05 to 0.15 to fit Sonnet on
  the full uncapped (user, assistant) pair. Prompt label updated.
- Dataclass default for memory_episode_budget_usd raised to 0.15
  in lockstep so an operator who skips the wizard or removes the
  env line gets a Sonnet-sized safety rail rather than a
  Haiku-sized one (which would burn the ceiling on every Sonnet
  call). Updated comment captures the new rationale and the
  Max-plan billing context.
- load_config() default string updated to "0.15".
- Wizard emission gate updated from `!= 0.05` to `!= 0.15` so
  an operator who accepts the wizard default suppresses the
  redundant env entry.
- .env.example updated.

Tests:

- test_config.TestMemoryEpisode.test_defaults asserts the new
  0.15 budget default; docstring rewritten to capture the
  inheritance + wizard-recommendation split.
- test_budget_override switched its override value from 0.15 to
  0.30 so it tests the override path rather than coincidentally
  matching the new default.
- test_install.test_memory_enabled_writes_tunables: the input
  list previously used "" for the episode model to test the
  inheritance fallback - that path is no longer reachable via
  the wizard since the prompt now has a non-blank default.
  Switched to a literal model override.
- New test test_memory_episode_wizard_default_writes_sonnet
  pins the v1 default-flip: operator who accepts every wizard
  default in the episode block ends up with Sonnet written.
- _cfg in test_memory_episodes.py and the is_error envelope
  cost value updated to 0.15 to reflect the Sonnet-sized
  scenario.

Stage 2 still inherits from MEMORY_EXTRACTION_MODEL when the
env var is unset; this is the safety floor for tests and for
operators who deliberately want both stages on the same model.
@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

This is a sixth pass following five previous reviews. The prior five passes identified and confirmed fixes for: is_error outcome routing, stored pre-initialization, duration_ms semaphore clock placement, _emit_episode_log symmetric presence semantics, and _build_extraction_payload arity in the eval test. All are confirmed resolved in the current diff. One new finding below.


PR #387 Review (pass 6)

Suggestion: install.py pre-init memory_episode_budget_usd = "0.05" breaks the stage-1 pattern

File: src/kai/install.py, diff hunk near memory_episode_budget_usd = "0.05"

Every other budget variable in _cmd_config follows a three-way consistent pattern: pre-wizard init = wizard prompt default = emission gate sentinel. The stage-1 extractor confirms this - memory_extraction_budget_usd is initialized to "0.01", the wizard defaults to "0.01", and the emission gate checks != 0.01.

The episode budget breaks this pattern:

memory_episode_budget_usd = "0.05"          # pre-init

# wizard prompt default:
existing_env.get("MEMORY_EPISODE_BUDGET_USD", "0.15")   # "0.15"

# emission gate:
if float(memory_episode_budget_usd) != 0.15:            # 0.15

Pre-init is "0.05", but wizard default and gate are both 0.15. The discrepancy is harmless at runtime - the wizard always overwrites the variable before the gate runs, and the non-extraction path pops the key entirely. But a reader tracing the data flow from the init sees 0.05, checks the gate, sees 0.15, and has to reconstruct the "wizard always overwrites it" argument to confirm the init is dead. Every other init in this function is self-documenting because it equals the default.

Fix: change the pre-init to "0.15" to match the wizard default, the emission gate, the config.py dataclass default, the load_config() env default, and the .env.example comment. Then the pattern is consistent across the file.


Confirmed clean (this pass)

  • All five prior-review findings resolved and present in the current diff.
  • Stage-1/stage-2 semaphore independence: separate dicts, correct LRU cap.
  • ExtractionResult defaults: all five early-exit paths return has_episode=False; defensive bool() coercion on the parsed value.
  • _emit_episode_log symmetric presence: memory_id and reason both conditional, mutually-exclusive on the stored vs non-stored partition.
  • _build_extraction_payload(user, assistant, []) arity in test_episode_classifier_eval.py: present on the visible call site.
  • format_context episode rendering: date_str = "" pre-init, split("\n")[0] fallback safe on empty string.
  • Episode spawn ordering: create_task after run_in_executor(_store_facts) returns.
  • Strong-ref task pattern: set.discard done-callback signature matches.
  • New env-var convention checklist (all six steps): fully covered across config.py, load_config(), install.py, .env.example, and test_config.py.

The pre-init inconsistency is the only new finding and is purely cosmetic - it does not block merge.

Single cosmetic finding: install.py's pre-wizard init for
memory_episode_budget_usd was still "0.05", but the wizard prompt
default, the emission gate, the dataclass default, and the
load_config() env default all moved to 0.15 in the Sonnet-default
flip. Dead code (the wizard always overwrites the variable before
the gate runs and the non-extraction path pops the key entirely),
but a reader tracing the data flow saw three different sentinel
values for the same logical default.

Fix: aligns the pre-init to "0.15". Comment clarifies the
"wizard default = pre-init = gate sentinel" contract, and
explicitly documents the model field as the deliberate exception
(empty pre-init for inheritance, non-empty wizard default for the
Sonnet recommendation).
@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

PR #387 Review (pass 7)

All six prior findings are confirmed resolved in the current diff. The visible portion of the diff is clean. One new observation below.


Suggestion: _generate_episode task spawned without a name

File: src/kai/memory_extraction.py, extract_and_store (stage-2 spawn block)

ep_task = asyncio.create_task(
    _generate_episode(
        user_text=user_text,
        assistant_text=assistant_text,
        user_id=user_id,
        session_id=session_id,
        config=config,
    )
)

asyncio.create_task accepts a name parameter (Python 3.8+). Without it, tasks appear in asyncio.all_tasks() as auto-generated names like Task-42, making it harder to correlate a stuck task with the episode generator during an incident. A name like f"episode-{user_id}" would let an operator reading a task dump immediately identify in-flight stage-2 work. Very minor; the _pending_episode_tasks set is the primary operational tool here, not all_tasks().


Confirmed clean (this pass)

  • Pass 6 memory_episode_budget_usd pre-init fixed: now "0.15" matching wizard default, emission gate, dataclass default, and .env.example. Comment explains the model field's intentional asymmetry (empty = inherit) separately.
  • _emit_episode_log symmetric presence: memory_id and reason both conditional, mutually exclusive on the stored/non-stored partition, docstring documents the partition contract explicitly.
  • _generate_episode outcome dispatch: is_error prefix correctly routes to subprocess_error (not parse_error); non_object_envelope and missing_episode_field correctly route to parse_error via the else branch.
  • stored = replaced = skipped = 0 pre-initialized before the try block in extract_and_store.
  • duration_ms clock restart inside async with sem: in _generate_episode, mirroring stage 1.
  • _build_extraction_payload(user, assistant, []) in test_episode_classifier_eval.py has the three-arg form.
  • format_context episode rendering: date_str = "" pre-init, split("\n")[0] fallback safe on empty string; outcome_text empty-string guard correctly suppresses the "Outcome:" suffix.
  • Strong-ref task pattern: set.discard done-callback matches the callback(task) signature.
  • Episode spawn ordering: create_task is after run_in_executor(_store_facts) returns, inside the stage-1 async with sem: block, which is correct (the task acquires the independent episode semaphore and runs outside the stage-1 lock).
  • test_memory_episode_wizard_default_writes_sonnet pins the operator-default-to-Sonnet contract and the budget/timeout suppression behavior.
  • Convention checklist: all six "Adding a New Env Var" steps covered across config.py, load_config(), install.py, .env.example, and test_config.py.

The diff is truncated before the end of tests/test_memory_episodes.py (~900 lines not visible). The visible fixture and helper infrastructure is correctly shaped. No new blocking issues.

Single suggestion: name the stage-2 asyncio task so it shows up as
`episode-<user_id>` in `asyncio.all_tasks()` dumps instead of an
auto-generated `Task-N` placeholder. An operator triaging a stuck
event loop can then immediately identify which user's stage-2 work
is in flight without cross-referencing `_pending_episode_tasks`.

Comment on the spawn site documents the secondary-affordance
intent (the `_pending_episode_tasks` set is still the primary
operational tool; the name is for ad-hoc debugging).

New test `test_stage2_task_is_named_for_incident_triage` pins the
`episode-<user_id>` name format so a future change that strips the
user_id would surface.
@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

PR #387 Review (pass 8)

All seven prior findings are confirmed resolved in the current diff. One new observation below.


Suggestion: misleading inline comment in test_memory_episode_wizard_default_writes_sonnet

File: tests/test_install.py, new test test_memory_episode_wizard_default_writes_sonnet

"0.05",  # extraction budget (matches stage-1 dataclass default; suppressed)

The stage-1 wizard pre-init visible in install.py is memory_extraction_budget_usd = "0.01". Following the delta-from-default pattern used consistently throughout _cmd_config (and mirrored exactly for stage-2: pre-init = "0.15", gate != 0.15), the emission gate for extraction budget checks != 0.01. That means "0.05" is NOT suppressed - MEMORY_EXTRACTION_BUDGET_USD="0.05" would be written to the env file. The comment says it matches the default and is suppressed; it does not.

The test still passes because it only asserts episode-related keys, never extraction budget. But a future developer reading this test to understand suppression semantics would be misled - they would assume 0.05 is the stage-1 dataclass default and write tests against that assumption.

Fix: change the comment to reflect that "0.05" is a non-default value and will be emitted; or change the input to "0.01" if the intent is to exercise the suppression path for stage-1 extraction budget.


Confirmed clean (this pass)

  • asyncio.create_task(name=f"episode-{user_id}") from pass 7 is present and correct.
  • All six prior warnings/suggestions confirmed resolved.
  • _emit_episode_log symmetric presence: memory_id and reason both conditional and mutually exclusive, docstring documents the partition contract.
  • duration_ms clock restart inside async with sem: in _generate_episode, mirroring stage 1.
  • _build_extraction_payload(user, assistant, []) three-arg form in test_episode_classifier_eval.py.
  • memory_episode_budget_usd = "0.15" pre-init matching wizard default, emission gate, and dataclass default.
  • _generate_episode never-raises contract: CancelledError re-raised before log line; broad except Exception covers unexpected classes.
  • Strong-ref task pattern: set.discard done-callback signature matches callback(task).
  • Episode spawn after run_in_executor(_store_facts) returns - stage-1 facts are durable before stage-2 is scheduled.
  • format_context episode rendering: date_str = "" pre-init, split("\n")[0] fallback safe, outcome_text empty-string guard suppresses "Outcome:" suffix correctly.
  • _run_episode_extractor uses list-based create_subprocess_exec (no shell), env allowlist, sandboxing flags carry over from stage 1.
  • _reset_episode_state fixture clears both semaphore caches and cancels in-flight tasks with autouse=True.
  • Convention checklist: all six "Adding a New Env Var" steps covered across config.py, load_config(), install.py, .env.example, and test_config.py.

The comment inaccuracy is the only new finding. It does not block merge.

The test_memory_episode_wizard_default_writes_sonnet test had a
misleading inline comment: "0.05" was annotated as "matches stage-1
dataclass default; suppressed" but the actual stage-1 dataclass
default is 0.01, so 0.05 would have been emitted to the env file
(the gate is `!= 0.01`). The test passed only because it never
asserted on MEMORY_EXTRACTION_BUDGET_USD; a future test author
copying the pattern would have inherited the wrong mental model
of the dataclass defaults.

Fix: change the input from "0.05" to "0.01" so it actually
exercises the suppression path the comment claimed. Same for
the broader input list - every input now matches its
corresponding dataclass default. Added negative assertions
on MEMORY_EXTRACTION_BUDGET_USD, MEMORY_EXTRACTION_TIMEOUT_S,
and MEMORY_CONSOLIDATION_CANDIDATES_N so the suppression
contract is now self-checking; a future change to the gate
semantics for any of those keys surfaces here.
@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

PR #387 Review (pass 9)

Pass 8 finding resolved. The test_memory_episode_wizard_default_writes_sonnet input is now "0.01" with the correct comment ("stage-1 dataclass default; suppressed"). float("0.01") != 0.01 is False, so MEMORY_EXTRACTION_BUDGET_USD is correctly suppressed, and the assertion "MEMORY_EXTRACTION_BUDGET_USD" not in env now holds.


Clean - no new findings

All eight prior findings are confirmed resolved in this diff:

  • is_errorsubprocess_error outcome routing: present.
  • stored = replaced = skipped = 0 pre-initialization before try: present.
  • duration_ms clock restart inside async with sem: in _generate_episode: present.
  • memory_episode_model: str = "" with inheritance only in load_config(): present.
  • _emit_episode_log symmetric memory_id/reason presence: present.
  • _build_extraction_payload(user, assistant, []) three-arg form in eval test: present.
  • asyncio.create_task(name=f"episode-{user_id}"): present.
  • "0.01" pre-init in test_memory_episode_wizard_default_writes_sonnet with correct comment: present.

Spot-checked in this pass: CancelledError re-raise before _emit_episode_log with semaphore correctly released via async with exit; strong-ref task pattern with set.discard done-callback; episode spawn after run_in_executor(_store_facts) returns; format_context episode rendering with date_str = "" pre-init and split("\n")[0] fallback safe on empty string; all six "Adding a New Env Var" convention steps covered.

The PR is ready to merge.

@dcellison dcellison merged commit 5c9f821 into main Apr 25, 2026
1 check passed
@dcellison dcellison deleted the feature/episode-memory-385 branch April 25, 2026 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add episode memory via conditional two-stage extraction

2 participants