feat(memory): tighten episode classifier against workflow-event closures#429
Conversation
…res (#428) Three coordinated changes mirroring PR #427's pattern for fact extraction: 1. Stage-1 prompt v6 -> v7: refine the EPISODE CLASSIFICATION block in `_EXTRACTION_SYSTEM_PROMPT` with a new EPISODE IGNORE list (workflow-loop iterations, routine workflow transactions, process meta-lessons) plus an EPISODE DURABILITY TEST gate. Positive criterion 1 narrowed to "durable situation". Bump `_EXTRACTION_PROMPT_VERSION` 6 -> 7 with the v7 history-block prose appended; existing v3/v4/v5/v6 entries preserved. 2. `_validate_episode` regex backstop hooked into `_generate_episode`: `_EPISODE_GOAL_NOISE_RE` rejects stage-2 outputs whose `goal` matches workflow-shape verbs (Evaluate/Review/Audit/Approve/File/Push/Draft/ Schedule/Post). Per-user, per-arm rejection counter `_EPISODE_VALIDATE_REJECTIONS` exposed via `get_extractor_stats()`. New `validate_rejected` outcome on the memory.episode log line. `_validate_episode` added to `__all__` paralleling `_validate_facts`; regex constant and counter stay module-level (mirrors `_WORKFLOW_EVENT_RE` / `_RULE_6_REJECTIONS` precedent). 3. Eval harness extended for v6-vs-v7 head-to-head: `_PROMPT_V6_PINNED` captured verbatim before this PR's edits, `--baseline {v5,v6}` CLI flag threaded through `_run_one_probe`. Existing v5-vs-active path retained for cross-revision sanity. Report carries `baseline_choice` so a v5 run is distinguishable from a v6 run; `v5_/v6_prompt_hash` keys retain their historical names but report the actual pair-arm hashes. `_EPISODE_PROMPT_VERSION` deliberately stays at "1": this PR does not change `_EPISODE_SYSTEM_PROMPT` or `_EPISODE_SCHEMA`, so the stage-2 version constant is unchanged. The hygiene sweep targets episodes by hand-curated ID list, not by prompt-version filter. 13 new tests (`TestEpisodeClassificationIgnoreList`, `TestValidateEpisodeWorkflowRegex`, `TestValidateEpisodeIntegration`, plus `test_v6_pinned_drift` and `test_episode_classification_fixture_loads`), plus updates to two existing methods in `TestExtractionPromptSoftVocab`. Full suite: 2764 passed, 1 skipped. A 12-probe example fixture lands at `home/evals/episode-classification-probes.example.jsonl` (5 workflow-event, 4 durable, 3 borderline; sanitized identifiers throughout) with the `.gitignore` exception. The operator-local production fixture and the `scripts/forget-episode-noise.py` hygiene-sweep script (29 hand-curated mundane episode IDs) stay untracked per the precedent set by `scripts/forget-review-noise.py` and its pass-2 counterpart.
Review by KaiThe explore agent found the pre-PR state. The changes are in the diff itself, which I've already fully analyzed. Here's the review. Review: feat(memory): tighten episode classifier against workflow-event closuresThe PR is well-structured with clear defense-in-depth thinking and good test coverage. Three concrete findings, no critical issues. Warning 1 -
|
- W1: _run_one_probe baseline_prompt default flips from
_PROMPT_V5_PINNED to _PROMPT_V6_PINNED so direct programmatic
callers and ad-hoc test fixtures get the same baseline as the
CLI default.
- W2: report keys renamed v5_prompt_hash -> baseline_prompt_hash
and v6_prompt_hash -> active_prompt_hash; output schema bumps
1 -> 2. Historical names lied at the new --baseline v6 default
(the field "v5" was carrying a v6 hash).
- Suggestion: _validate_episode now returns (episode | None,
reason | None). Workflow-regex reject path returns
("workflow-event regex match"); non-string-goal defensive path
returns ("non-string goal") with a DEBUG log and no counter
increment. _generate_episode unpacks the tuple and uses the
reason directly so the memory.episode log line distinguishes
the two reject modes.
- Suggestion: _ARM_FOR_VERB lookup switched from .get(verb,
"unknown") to direct subscript so a future _EPISODE_GOAL_NOISE_RE
edit that adds a verb without updating the map fails loud
instead of silently miscounting under "unknown".
New test test_non_string_goal_rejected_with_distinct_reason
pins the defensive-path reason and verifies the per-arm counter
stays at zero. Existing TestValidateEpisodeWorkflowRegex tests
updated to unpack the tuple and pin the reason string. Full
suite: 2765 passed, 1 skipped.
Review by KaiThe prior review's four findings (two warnings, two suggestions) have all been addressed cleanly. I'll verify each resolution, then assess anything new. Prior findings - resolvedWarning 1 ( Warning 2 (misleading output field names) - Fixed. Suggestion 1 (non-string goal reason) - Fixed. Suggestion 2 ( New findingsSuggestion - Test count in PR description is off by one. Remaining code review - cleanRegex correctness.
Fixture probe counts.
The implementation is correct. No critical issues, no new warnings. |
- src/kai/memory_extraction.py:1104-1105: rephrased the _EPISODE_GOAL_NOISE_RE rationale comment to drop "the spec's per-arm noun lists" - same review-round-vocabulary class as the four violations scrubbed in PR #427's final commit. - tests/test_memory_extraction.py:2367-2369: same-class catch in test_extraction_prompt_version_history_extended docstring, rephrased "appended in this spec's history block" to "appended for issue #428". TG Kai flagged the source comment; same-class sweep caught the test docstring. No mechanism changes; comment-only diff. Suite still 2765 passed.
Review by KaiThe two prior rounds resolved all four findings cleanly. I reviewed the diff independently against those resolved items and the full codebase conventions. ReviewOverall: clean. No critical issues or new warnings. Two minor suggestions below. Suggestion 1 - Ambiguous "admits" in regex commentFile: # Catches real production goals like
# "Push a prepared Memory wiki page", "Approve v3 of the
# memory-md-to-Qdrant migration spec", ...
# Bounded explicitly so a long fact text cannot
# drive quadratic backtracking. The token class is alphanumerics
# + `#` + `-` so things like "#412", "v3", and "tag-dedup" pass
# through.The phrase "admits real production goals like 'Push a prepared Memory wiki page'" (earlier in that same block) reads as "allows through" to a future reader, but the intent is "correctly intercepts." All three examples there are positive cases that the regex rejects - they appear in Suggestion 2 -
|
- S1: rephrase test_arm2_approve docstring. "intervening-tokens cap admits multi-word artifact qualifiers" reads ambiguously — "admits" could mean "lets through" (i.e. fails to reject) when the actual meaning is "still finds the artifact noun in". New wording is unambiguous: "the cap lets the regex still find the artifact noun in goals with multi-word qualifiers". - S2: add positive coverage for the `version` and `pull request` artifact nouns. Both appear in `_EPISODE_GOAL_NOISE_RE`'s alternation but neither was exercised by any positive test; `pull request` is the only multi-word entry and a future regex edit could silently break it. Two new probes added to test_arm2_approve: "Approve the new version" and "Approve the pull request". Counter / arm-mapping unchanged. Suite: 2765 passed (count unchanged; positives count inside one test went from 3 to 5).
Review by KaiThree prior rounds resolved all previously raised findings. Reviewing fresh against this diff. ReviewOverall: clean. No critical issues or new warnings. One observation below. Suggestion - Missing
|
* feat: positive-criterion extraction prompt and replay scaffolding (closes #464) The extraction prompt accumulated exclusion lists across #426, #427, #429 to suppress noise classes observed in production. The pattern is bounded by the author's enumeration of failure modes: a new variant of an already-seen noise class slips through, someone adds another stanza, the prompt grows. The growth curve has no natural cap and each new entry dilutes the model's attention to the positive task. This change replaces the IGNORE block and the fact-side DURABILITY TEST with a single positive criterion: "Would this fact help a future conversation that does not include the current turn?" Applied per candidate, with six worked examples (three emit, three do not emit) anchoring the counterfactual reasoning. The positive criterion is bounded by the property we actually care about, not the author's enumeration; phrasings the prompt has never seen are evaluated by the same test as the ones it has. `_EXTRACTION_PROMPT_VERSION` bumps from "8" to "9" so future cleanup can target old-prompt facts if the class coverage differs materially. The episode-side blocks (EPISODE CLASSIFICATION, EPISODE IGNORE, EPISODE DURABILITY TEST) are unchanged per the parent epic's scope; FORMAT, STORE, CONFIDENCE, and CONSOLIDATION blocks all stay. New `kai.eval.replay` module supports comparing pre-PR and post-PR sandbox fact sets without disturbing production Qdrant. It walks chat-history JSONL pair-by-pair, maintains a rolling PRIOR CONTEXT buffer of `episode_classifier_context_turns` prior pairs, and calls `extract_and_store` with a sandbox `user_id` (enforced to carry a `sandbox-` prefix as defense in depth against accidental writes to real users). The replay reuses `_build_extraction_payload` for prior- context truncation so the sandbox sees production-equivalent input shape, and reuses `history._pair_records_chronologically` so the pair stream is byte-equivalent to what production fed the extractor. Two existing prompt-pinning tests (`test_ignore_bullets_added`, `test_durability_test_section_present`) are removed because the v9 swap retires the structures they pinned; their replacements live in the new `tests/test_extraction_prompt.py`. `test_extraction_prompt_ version_bumped` is updated to assert "9"; the version-history-fragment test gains assertions for the v9 entry. The acceptance evidence (sandbox-replay smoke, canary recall, classification retention report) lands in the PR description after the operator's sandbox phase runs. * fix: initialize memory store in replay before extract loop The replay module called `extract_and_store` per pair but never invoked `memory.init_memory(config)`, so the module-level `_memory` handle in `kai.memory` stayed None for the entire run. Every storage primitive (`search`, `add_structured`, `delete_all`) short-circuits on that None check, so the replay walked the chat history, spawned a `claude --print` extractor per pair, and silently discarded every extracted fact. The symptom surfaced only via `outcome=dropped_backend` in the per-pair `consolidate.intent` log lines; the summary print reported `facts_stored=0` with no exception raised. Fix: one-liner call to `init_memory(config)` immediately after `load_config()` in `_async_main`. Test: TestInitMemory in tests/test_eval_replay.py mocks every post-init dependency and asserts `init_memory` was invoked with the loaded config object. The class docstring documents the original failure mode so a future regression has the diagnostic trail attached. * fix: extraction prompt coherence and replay summary breakdowns Addresses #467 review findings: - STORE block's "Apply the DURABILITY TEST below" rewrite. The v9 swap deleted the fact-side DURABILITY TEST and inserted the QUALITY TEST in its place; the STORE block's directive was outside the edit range and survived stale, pointing the model at a section that no longer exists. Renamed to "QUALITY TEST below"; new snapshot test pins the fact-side region against any further DURABILITY TEST references slipping in. - Worked examples block extended from 3 negatives to 4. The new example ("I'm writing the spec now." -> do not emit) covers the fourth class the spec prose named (ephemeral / in-progress task state) but which the original block omitted. The negative-examples snapshot test now pins all four. - Comments at three sites in memory_extraction.py renamed from "the prompt's IGNORE rules" to "the prompt's QUALITY TEST" so future readers don't hunt for an IGNORE block that v9 removed. - `_run_replay` returns `(counters, dry_run_samples)`. Live runs pull `get_all_facts` for the sandbox user after the loop and emit by-tag, by-speaker, by-prompt-version breakdowns; dry-runs emit a structural payload-shape sample (prior depth + char counts) for the first three pairs. Verbatim text is intentionally not surfaced so a sandbox dry-run cannot leak personal history into stdout artifacts. - `init_memory(config)` moved after the `--context-turns` negative- value validation so an argument-rejection path does not trigger the first-run Mem0 embedding-model download. Eight new unit tests cover the new format helpers (grouping rules, empty-input handling, sort order, missing-metadata defaults, verbatim-text absence), and the existing TestInitMemory + dry-run tests are updated for the new return shape.
Closes #428.
Summary
Three coordinated changes mirroring PR #427's pattern, but for the episode classifier:
Stage-1 prompt v6 -> v7. Refines the
EPISODE CLASSIFICATIONblock in_EXTRACTION_SYSTEM_PROMPTwith a new EPISODE IGNORE list (workflow-loop iterations, routine workflow transactions, process meta-lessons) plus an episode-scoped DURABILITY TEST. Positive criterion 1 narrowed to "durable situation" with a forward reference to the IGNORE list._EXTRACTION_PROMPT_VERSIONbumps 6 -> 7; existing v3/v4/v5/v6 history-block prose is preserved alongside the new v7 paragraph._validate_episoderegex backstop. A Python-side guard hooked into_generate_episodebetween the JSON-decoded stage-2 output andadd_structured._EPISODE_GOAL_NOISE_RErejects goals starting with workflow-shape verbs (Evaluate,Review,Audit,Approve,File,Push,Draft,Schedule,Post) followed by an artifact noun, with a bounded intervening-tokens clause that admits real production goals like "Push a prepared Memory wiki page". Per-user, per-arm rejection counts via the new_EPISODE_VALIDATE_REJECTIONScounter; exposed throughget_extractor_stats()paralleling_RULE_6_REJECTIONS. Newvalidate_rejectedoutcome on thememory.episodelog line.Eval harness extension.
_PROMPT_V6_PINNEDcaptured verbatim before this PR's edits (SHA-2564f1e03dd37d9d52bb39724f949f7c382343837608497ebcd2ab3ae4990d3030c). New--baseline {v5,v6}CLI flag threads through_run_one_probeto select which pinned constant the baseline arm runs against. Default flips tov6so each new prompt revision compares against the immediately prior one;v5retained for cross-revision sanity. Report carries a newbaseline_choicefield._EPISODE_PROMPT_VERSIONintentionally stays at"1". This PR does not change_EPISODE_SYSTEM_PROMPTor_EPISODE_SCHEMA, so the stage-2 version constant is unchanged. The hygiene sweep targets episodes by hand-curated ID list, not by prompt-version filter. Issue body's acceptance criterion 3 (bump to "2") is intentionally superseded; this is documented in the spec evaluation and the Acceptance section there.Tests
TestEpisodeClassificationIgnoreList(4): pin the four prompt-side wording fragments (three IGNORE bullets + DURABILITY TEST gate).TestValidateEpisodeWorkflowRegex(5): one positive-cases test per arm (review/approve/transaction), one per-user-per-arm counter integration test, one negatives test against durable-shape goals from the snapshot.TestValidateEpisodeIntegration(2): async tests driving_generate_episodeend-to-end with mocked stage-2 runner andadd_structured. Covers both reject (outcome="validate_rejected", counter increments,add_structurednot called) and accept (outcome="stored", memory_id flows through, counter unchanged).TestExtractionPromptSoftVocabupdated in place per PR feat(memory): tighten extractor against workflow-event noise (#426) #427's precedent: version pin moves 6 -> 7, history-block fragment list extends with"v7 (2026-04-30, this issue)"and"EPISODE CLASSIFICATION block".test_v6_pinned_driftintests/test_eval_extraction.pymirroringtest_v5_pinned_drift. Newtest_episode_classification_fixture_loadsvalidates the 12-probe example fixture.Operator follow-up after merge
Two operator actions land separately and are not part of this PR's CI:
Hygiene sweep.
scripts/forget-episode-noise.py(untracked, operator-local) deletes 29 hand-curated mundane episode IDs from production: 24 review-loop iterations + 5 routine workflow transactions. Run sequence:Expected drop: episode bucket 42 -> 13 (all durable). Mirrors the PR feat(memory): tighten extractor against workflow-event noise (#426) #427 fact-side sweep precedent.
v6-vs-v7 harness rerun. Run during the PR-review cooldown window:
Expected ~$0.30, ~10 minutes. The per-probe
has_episodeflips will be reported back here when complete; the spec's expected v7 result is workflow-event false-positive rate 0/5 (down from ~5/5 under v6) with durable-episode true-positive rate holding at 4/4. Borderline (3 probes) is the dimension to watch.Risks
^, the artifact-noun list does not includearchitecture, and the_EPISODE_VALIDATE_REJECTIONScounter is exposed in production for real-world false-positive tracking.Spec history
Three rounds of independent review converged with descending severity:
__all__membership).The implementation tracks v3 directly.