feat(memory): tighten episode classifier against workflow-event closures by dcellison · Pull Request #429 · dcellison/kai

dcellison · 2026-05-01T11:13:30Z

Closes #428.

Summary

Three coordinated changes mirroring PR #427's pattern, but for the episode classifier:

Stage-1 prompt v6 -> v7. Refines the EPISODE CLASSIFICATION block in _EXTRACTION_SYSTEM_PROMPT with a new EPISODE IGNORE list (workflow-loop iterations, routine workflow transactions, process meta-lessons) plus an episode-scoped DURABILITY TEST. Positive criterion 1 narrowed to "durable situation" with a forward reference to the IGNORE list. _EXTRACTION_PROMPT_VERSION bumps 6 -> 7; existing v3/v4/v5/v6 history-block prose is preserved alongside the new v7 paragraph.
_validate_episode regex backstop. A Python-side guard hooked into _generate_episode between the JSON-decoded stage-2 output and add_structured. _EPISODE_GOAL_NOISE_RE rejects goals starting with workflow-shape verbs (Evaluate, Review, Audit, Approve, File, Push, Draft, Schedule, Post) followed by an artifact noun, with a bounded intervening-tokens clause so the regex still catches real production goals like "Push a prepared Memory wiki page". Per-user, per-arm rejection counts via the new _EPISODE_VALIDATE_REJECTIONS counter; exposed through get_extractor_stats() paralleling _RULE_6_REJECTIONS. New validate_rejected outcome on the memory.episode log line.
Eval harness extension. _PROMPT_V6_PINNED captured verbatim before this PR's edits (SHA-256 4f1e03dd37d9d52bb39724f949f7c382343837608497ebcd2ab3ae4990d3030c). New --baseline {v5,v6} CLI flag threads through _run_one_probe to select which pinned constant the baseline arm runs against. Default flips to v6 so each new prompt revision compares against the immediately prior one; v5 retained for cross-revision sanity. Report carries a new baseline_choice field.
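
As a rough illustration of the regex backstop described above, the sketch below builds a pattern of the same general shape: an anchored verb, a bounded run of intervening tokens, then an artifact noun. The verb list comes from this description and the six-token cap from the review discussion; the noun list, the names `GOAL_NOISE_RE` and `is_workflow_goal`, and the exact pattern are assumptions for illustration, not the real `_EPISODE_GOAL_NOISE_RE`.

```python
import re

# Verbs come from the PR description; the noun list is an assumed subset.
_VERBS = "evaluate|review|audit|approve|file|push|draft|schedule|post"
_NOUNS = (
    "specification|spec|pull request|PR|issue|revision|version|"
    "wiki|epic|reminder|comment"
)

# Anchored verb, then up to six intervening tokens, then an artifact noun.
GOAL_NOISE_RE = re.compile(
    rf"^({_VERBS})\s+(?:[\w#-]+\s+){{0,6}}?(?:{_NOUNS})\b",
    re.IGNORECASE,
)


def is_workflow_goal(goal: str) -> bool:
    """Return True when the goal looks like a routine workflow event."""
    return GOAL_NOISE_RE.match(goal) is not None


# Workflow-shaped goal: verb, a few qualifier tokens, then "wiki".
print(is_workflow_goal("Push a prepared Memory wiki page"))
# Review-shaped goal with no artifact noun from the list.
print(is_workflow_goal("Review the v3 architecture proposal"))
```

Under these assumptions the first goal is flagged and the second is not, which is the over-rejection case the Risks section calls out.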

_EPISODE_PROMPT_VERSION intentionally stays at "1". This PR does not change _EPISODE_SYSTEM_PROMPT or _EPISODE_SCHEMA, so the stage-2 version constant is unchanged. The hygiene sweep targets episodes by hand-curated ID list, not by prompt-version filter. Issue body's acceptance criterion 3 (bump to "2") is intentionally superseded; this is documented in the spec evaluation and the Acceptance section there.

Tests

13 new tests added; existing test count rises from 2750 to 2764. Suite: 2764 passed, 1 skipped.
TestEpisodeClassificationIgnoreList (4): pin the four prompt-side wording fragments (three IGNORE bullets + DURABILITY TEST gate).
TestValidateEpisodeWorkflowRegex (5): one positive-cases test per arm (review/approve/transaction), one per-user-per-arm counter integration test, one negatives test against durable-shape goals from the snapshot.
TestValidateEpisodeIntegration (2): async tests driving _generate_episode end-to-end with mocked stage-2 runner and add_structured. Covers both reject (outcome="validate_rejected", counter increments, add_structured not called) and accept (outcome="stored", memory_id flows through, counter unchanged).
Two existing methods in TestExtractionPromptSoftVocab updated in place, following the precedent of PR #427 ("feat(memory): tighten extractor against workflow-event noise (#426)"): version pin moves 6 -> 7, history-block fragment list extends with "v7 (2026-04-30, this issue)" and "EPISODE CLASSIFICATION block".
New test_v6_pinned_drift in tests/test_eval_extraction.py mirroring test_v5_pinned_drift. New test_episode_classification_fixture_loads validates the 12-probe example fixture.
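
For readers unfamiliar with the pinned-drift pattern, a minimal sketch follows. It assumes the drift test compares a SHA-256 digest of the pinned prompt text against a recorded value. The stand-in prompt text and the names `PINNED_PROMPT`, `PINNED_SHA256`, and `has_drifted` are hypothetical; the real `test_v6_pinned_drift` may be structured differently.

```python
import hashlib

# Stand-in text; the real pinned constant holds the full v6 prompt verbatim.
PINNED_PROMPT = "EPISODE CLASSIFICATION (stand-in text for illustration)"

# A real drift test would hard-code this digest as a literal captured at pin
# time; it is computed here only so the sketch stays self-contained.
PINNED_SHA256 = hashlib.sha256(PINNED_PROMPT.encode("utf-8")).hexdigest()


def has_drifted(prompt: str, expected_sha256: str) -> bool:
    """Return True if the prompt text no longer matches its recorded digest."""
    actual = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return actual != expected_sha256


def test_pinned_prompt_has_not_drifted() -> None:
    # Any edit to the pinned text, even whitespace, changes the digest.
    assert not has_drifted(PINNED_PROMPT, PINNED_SHA256)
```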

Operator follow-up after merge

Two operator actions land separately and are not part of this PR's CI:

Hygiene sweep. scripts/forget-episode-noise.py (untracked, operator-local) deletes 29 hand-curated mundane episode IDs from production: 24 review-loop iterations + 5 routine workflow transactions. Run sequence:
```
sudo launchctl bootout system/com.syrinx.kai
.venv/bin/python scripts/forget-episode-noise.py --apply
sudo launchctl bootstrap system /Library/LaunchDaemons/com.syrinx.kai.plist
```
Expected drop: episode bucket 42 -> 13 (all durable). Mirrors the fact-side sweep precedent from PR #427 ("feat(memory): tighten extractor against workflow-event noise (#426)").
v6-vs-v7 harness rerun. Run during the PR-review cooldown window:
```
.venv/bin/python -m kai.eval.extraction \
  --probes home/evals/episode-classification-probes.example.jsonl \
  --baseline v6 \
  --output /tmp/v6-v7-report.json
```
Expected ~$0.30, ~10 minutes. The per-probe has_episode flips will be reported back here when complete; the spec's expected v7 result is workflow-event false-positive rate 0/5 (down from ~5/5 under v6) with durable-episode true-positive rate holding at 4/4. Borderline (3 probes) is the dimension to watch.

Risks

Regex over-rejection on legitimate review-shape goals (e.g. "Review the v3 architecture proposal"). Mitigation: arm anchors at ^, the artifact-noun list does not include architecture, and the _EPISODE_VALIDATE_REJECTIONS counter is exposed in production for real-world false-positive tracking.
Borderline-probe drift between v6 and v7. Mitigation: report drift probe-by-probe in this PR's harness rerun; goal is no regression on the 4 durable probes, not perfection on the 3 borderline.
The example probe fixture is illustrative; the production probe fixture is operator-built post-merge.

Spec history

Three rounds of independent review converged with descending severity:

v1: 2 mechanism + 2 wording-class sub-blockers + 3 wording nits.
v2: 1 wording-class sub-blocker (factual claim about __all__ membership).
v3: clean approval.

The implementation tracks v3 directly.

…res (#428)

Three coordinated changes mirroring PR #427's pattern for fact extraction:

1. Stage-1 prompt v6 -> v7: refine the EPISODE CLASSIFICATION block in `_EXTRACTION_SYSTEM_PROMPT` with a new EPISODE IGNORE list (workflow-loop iterations, routine workflow transactions, process meta-lessons) plus an EPISODE DURABILITY TEST gate. Positive criterion 1 narrowed to "durable situation". Bump `_EXTRACTION_PROMPT_VERSION` 6 -> 7 with the v7 history-block prose appended; existing v3/v4/v5/v6 entries preserved.
2. `_validate_episode` regex backstop hooked into `_generate_episode`: `_EPISODE_GOAL_NOISE_RE` rejects stage-2 outputs whose `goal` matches workflow-shape verbs (Evaluate/Review/Audit/Approve/File/Push/Draft/Schedule/Post). Per-user, per-arm rejection counter `_EPISODE_VALIDATE_REJECTIONS` exposed via `get_extractor_stats()`. New `validate_rejected` outcome on the memory.episode log line. `_validate_episode` added to `__all__` paralleling `_validate_facts`; regex constant and counter stay module-level (mirrors `_WORKFLOW_EVENT_RE` / `_RULE_6_REJECTIONS` precedent).
3. Eval harness extended for v6-vs-v7 head-to-head: `_PROMPT_V6_PINNED` captured verbatim before this PR's edits, `--baseline {v5,v6}` CLI flag threaded through `_run_one_probe`. Existing v5-vs-active path retained for cross-revision sanity. Report carries `baseline_choice` so a v5 run is distinguishable from a v6 run; `v5_/v6_prompt_hash` keys retain their historical names but report the actual pair-arm hashes.

`_EPISODE_PROMPT_VERSION` deliberately stays at "1": this PR does not change `_EPISODE_SYSTEM_PROMPT` or `_EPISODE_SCHEMA`, so the stage-2 version constant is unchanged. The hygiene sweep targets episodes by hand-curated ID list, not by prompt-version filter.

13 new tests (`TestEpisodeClassificationIgnoreList`, `TestValidateEpisodeWorkflowRegex`, `TestValidateEpisodeIntegration`, plus `test_v6_pinned_drift` and `test_episode_classification_fixture_loads`), plus updates to two existing methods in `TestExtractionPromptSoftVocab`. Full suite: 2764 passed, 1 skipped. A 12-probe example fixture lands at `home/evals/episode-classification-probes.example.jsonl` (5 workflow-event, 4 durable, 3 borderline; sanitized identifiers throughout) with the `.gitignore` exception. The operator-local production fixture and the `scripts/forget-episode-noise.py` hygiene-sweep script (29 hand-curated mundane episode IDs) stay untracked per the precedent set by `scripts/forget-review-noise.py` and its pass-2 counterpart.

dcellison · 2026-05-01T11:17:07Z

Review by Kai

The explore agent found the pre-PR state. The changes are in the diff itself, which I've already fully analyzed. Here's the review.

Review: feat(memory): tighten episode classifier against workflow-event closures

The PR is well-structured with clear defense-in-depth thinking and good test coverage. Three concrete findings, no critical issues.

Warning 1 - `_run_one_probe` default contradicts stated behavior

File: src/kai/eval/extraction.py, new signature in _run_one_probe

baseline_prompt: str = _PROMPT_V5_PINNED,  # added default

The PR description says "Default flips to v6" and the --baseline argparse default is "v6". But the function-level keyword default is _PROMPT_V5_PINNED. Any caller that invokes _run_one_probe(...) without the baseline_prompt kwarg - including future tests or programmatic callers - silently gets v5. The CLI path is fine because it always passes baseline_prompt explicitly, but the function signature is a trap. The default should be _PROMPT_V6_PINNED.
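
One way to keep the two defaults from diverging, sketched under assumptions: derive both the CLI default and the function-level default from a single mapping. The names `BASELINE_PROMPTS`, `DEFAULT_BASELINE`, `run_one_probe`, and `build_parser` are hypothetical stand-ins, and the prompt strings are placeholders rather than the real pinned constants.

```python
import argparse

# Stand-ins for the pinned prompt constants named in the review.
PROMPT_V5_PINNED = "v5 prompt text"
PROMPT_V6_PINNED = "v6 prompt text"

# One mapping drives both the CLI choices and the function default.
BASELINE_PROMPTS = {"v5": PROMPT_V5_PINNED, "v6": PROMPT_V6_PINNED}
DEFAULT_BASELINE = "v6"


def run_one_probe(
    probe: str,
    baseline_prompt: str = BASELINE_PROMPTS[DEFAULT_BASELINE],
) -> dict:
    """Hypothetical probe runner; reports which baseline text it was given."""
    return {"probe": probe, "baseline_prompt": baseline_prompt}


def build_parser() -> argparse.ArgumentParser:
    """CLI parser whose --baseline choices mirror the mapping's keys."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--baseline",
        choices=sorted(BASELINE_PROMPTS),
        default=DEFAULT_BASELINE,
    )
    return parser
```

With this shape, a caller that omits `baseline_prompt` gets the same baseline as a CLI run with no flags.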

Warning 2 - Output JSON field names are actively misleading at the new default

File: src/kai/eval/extraction.py, report dict in _run_async

"v5_prompt_hash": _hash(baseline_prompt),         # is v6 hash when --baseline v6
"v6_prompt_hash": _hash(memory_extraction._EXTRACTION_SYSTEM_PROMPT),  # is v7 hash

With --baseline v6 (the new default), v5_prompt_hash carries the v6 digest and v6_prompt_hash carries the v7 digest. baseline_choice partially documents this, but any downstream consumer reading those keys by name gets wrong information - including future log grep queries. Renaming to baseline_prompt_hash / active_prompt_hash would fix this cleanly. If the field names must stay stable for existing consumers, the inline comment should at minimum warn that the field names do not reflect the actual versions when baseline != v5.
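
A minimal sketch of the role-based naming suggested above, assuming `_hash` is a SHA-256 hex digest of the prompt text. The function `build_report_header` and the placeholder prompt strings are hypothetical; only the key names `baseline_choice`, `baseline_prompt_hash`, and `active_prompt_hash` come from this thread.

```python
import hashlib


def _hash(text: str) -> str:
    """SHA-256 hex digest of a prompt string (assumed implementation)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_report_header(
    baseline_choice: str, baseline_prompt: str, active_prompt: str
) -> dict:
    """Name hash fields by role so they stay accurate for any baseline."""
    return {
        "baseline_choice": baseline_choice,
        "baseline_prompt_hash": _hash(baseline_prompt),
        "active_prompt_hash": _hash(active_prompt),
    }


header = build_report_header("v6", "pinned v6 text", "current v7 text")
```

Because the keys describe roles rather than versions, a consumer reading `baseline_prompt_hash` never has to know which `--baseline` value produced the report.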

Suggestion - Inaccurate `reason` string on non-string goal rejection

File: src/kai/memory_extraction.py, _validate_episode and its caller in _generate_episode

_validate_episode returns None for two distinct cases: regex match, and a non-string goal. The caller unconditionally sets:

reason = "goal matches workflow-event regex"

For the non-string path, that reason is factually wrong, and the non-string path emits no log line and does not increment the counter - making it invisible in production. The fix is a one-liner: log at DEBUG in the non-string branch and either return a sentinel or accept a separate reason parameter. Low probability in practice since the stage-2 schema enforces string goal, but the misleading reason will surface in incidents if the schema is ever relaxed.

Suggestion - Unreachable `"unknown"` arm fallback

File: src/kai/memory_extraction.py, _validate_episode

arm = _ARM_FOR_VERB.get(verb, "unknown")

match.group(1) can only capture the nine verbs enumerated in _EPISODE_GOAL_NOISE_RE, all of which are keys in _ARM_FOR_VERB. The "unknown" fallback is dead code. An assert verb in _ARM_FOR_VERB would catch a future _EPISODE_GOAL_NOISE_RE edit that adds a verb without updating the map; the "unknown" arm silently misfires counters instead.
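
A related option, sketched here as an alternative to an assert and not something this PR does: build the verb alternation from the map itself, so the regex and the map cannot drift apart, and use a direct subscript so any mismatch fails loudly. The verb-to-arm grouping below is an assumption (the thread names the arms review, approve, and transaction but does not list the mapping), and `ARM_FOR_VERB`, `VERB_RE`, and `arm_for_goal` are hypothetical names.

```python
import re
from typing import Optional

# Assumed verb-to-arm grouping; the real _ARM_FOR_VERB may differ.
ARM_FOR_VERB = {
    "evaluate": "review", "review": "review", "audit": "review",
    "approve": "approve",
    "file": "transaction", "push": "transaction", "draft": "transaction",
    "schedule": "transaction", "post": "transaction",
}

# Build the verb alternation from the map so the two cannot drift apart.
VERB_RE = re.compile(rf"^({'|'.join(ARM_FOR_VERB)})\b", re.IGNORECASE)


def arm_for_goal(goal: str) -> Optional[str]:
    """Return the arm for a goal's leading verb, or None if no verb matches."""
    match = VERB_RE.match(goal)
    if match is None:
        return None
    # Direct subscript: an unmapped verb raises KeyError instead of miscounting.
    return ARM_FOR_VERB[match.group(1).lower()]
```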

Prompt changes, fixture, and test coverage all look correct. The _PerArmCounter snapshot logic is sound. The regex bounded-gap approach ({0,6}?) is appropriate and carries no ReDoS risk.

- W1: _run_one_probe baseline_prompt default flips from _PROMPT_V5_PINNED to _PROMPT_V6_PINNED so direct programmatic callers and ad-hoc test fixtures get the same baseline as the CLI default.
- W2: report keys renamed v5_prompt_hash -> baseline_prompt_hash and v6_prompt_hash -> active_prompt_hash; output schema bumps 1 -> 2. Historical names lied at the new --baseline v6 default (the field "v5" was carrying a v6 hash).
- Suggestion: _validate_episode now returns (episode | None, reason | None). Workflow-regex reject path returns ("workflow-event regex match"); non-string-goal defensive path returns ("non-string goal") with a DEBUG log and no counter increment. _generate_episode unpacks the tuple and uses the reason directly so the memory.episode log line distinguishes the two reject modes.
- Suggestion: _ARM_FOR_VERB lookup switched from .get(verb, "unknown") to direct subscript so a future _EPISODE_GOAL_NOISE_RE edit that adds a verb without updating the map fails loud instead of silently miscounting under "unknown".

New test test_non_string_goal_rejected_with_distinct_reason pins the defensive-path reason and verifies the per-arm counter stays at zero. Existing TestValidateEpisodeWorkflowRegex tests updated to unpack the tuple and pin the reason string. Full suite: 2765 passed, 1 skipped.

dcellison · 2026-05-01T11:25:09Z

Review by Kai

The prior review's four findings (two warnings, two suggestions) have all been addressed cleanly. I'll verify each resolution, then assess anything new.

Prior findings - resolved

Warning 1 (_run_one_probe default) - Fixed. baseline_prompt: str = _PROMPT_V6_PINNED is now the function-level default, matching the CLI default and the stated intent.

Warning 2 (misleading output field names) - Fixed. v5_prompt_hash/v6_prompt_hash renamed to baseline_prompt_hash/active_prompt_hash; _OUTPUT_SCHEMA_VERSION bumped to "2" with a migration comment; the baseline_choice field disambiguates for any downstream consumer.

Suggestion 1 (non-string goal reason) - Fixed. _validate_episode now returns "non-string goal" vs "workflow-event regex match" as distinct strings, and the non-string path emits a DEBUG log and withholds the counter increment. The caller propagates whichever reason was returned.

Suggestion 2 ("unknown" sentinel arm) - Fixed. Direct _ARM_FOR_VERB[verb] dict access replaces .get(verb, "unknown"); the comment explains that a future regex edit adding a verb without updating the map will now KeyError loudly rather than silently miscount.

New findings

Suggestion - Test count in PR description is off by one. TestValidateEpisodeWorkflowRegex contains 6 methods (the 5 described plus test_non_string_goal_rejected_with_distinct_reason), making the total 14 new tests, not 13. Not a code issue; just a documentation artifact.

Remaining code review - clean

Regex correctness. _EPISODE_GOAL_NOISE_RE anchors at ^, uses a lazy bounded gap {0,6}?, and applies re.IGNORECASE. The .lower() call on match.group(1) before the _ARM_FOR_VERB lookup is correct and handles the case-insensitive flag. The 9 captured verbs are all present in _ARM_FOR_VERB. No ReDoS risk; the token class [\w#-]+ is non-overlapping and the cap is explicit.

_PerArmCounter lifecycle. _reset() uses clear() in-place, consistent with _Counter._reset() and the stated rationale (external importers hold a reference to the module-level instance). The snapshot's shallow per-user copy is sufficient since inner values are ints.

_generate_episode hookup. The episode = validated reassignment when the episode passes validation is a no-op (same dict object), but it's harmless and makes the branching symmetric. The lambda inside run_in_executor correctly closes over episode after the reassignment.

__all__ extension. _validate_episode follows the project's existing pattern of exporting underscore-prefixed names for test access (alongside _validate_facts, _store_facts, etc.).

Fixture probe counts. test_episode_classification_fixture_loads asserts workflow-noise == 8 (5 workflow-event + 3 borderline) and durable-content == 4. Verified against the JSONL file: both counts are correct.

_PROMPT_V6_PINNED drift test. The _V6_PROMPT_HASH in test_eval_extraction.py matches the hash cited in the PR description.
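
The counter lifecycle points above (in-place reset, shallow per-user snapshot) can be illustrated with a small stand-in class. `PerArmCounter` and its `increment` method are hypothetical names; the real `_PerArmCounter` may expose a different interface.

```python
from collections import defaultdict
from typing import Dict


class PerArmCounter:
    """Per-user, per-arm rejection counts (illustrative stand-in)."""

    def __init__(self) -> None:
        self._counts: Dict[str, Dict[str, int]] = defaultdict(
            lambda: defaultdict(int)
        )

    def increment(self, user_id: str, arm: str) -> None:
        self._counts[user_id][arm] += 1

    def snapshot(self) -> Dict[str, Dict[str, int]]:
        # Copy each user's inner dict; values are ints, so no deeper copy
        # is needed to isolate the snapshot from later increments.
        return {user: dict(arms) for user, arms in self._counts.items()}

    def _reset(self) -> None:
        # Clear in place so importers holding this instance see the reset.
        self._counts.clear()
```

Rebinding `self._counts` to a new dict in `_reset` would work for this class in isolation; clearing in place matters when other code holds a reference to the underlying container or to the module-level instance.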

The implementation is correct. No critical issues, no new warnings.

- src/kai/memory_extraction.py:1104-1105: rephrased the _EPISODE_GOAL_NOISE_RE rationale comment to drop "the spec's per-arm noun lists" - same review-round-vocabulary class as the four violations scrubbed in PR #427's final commit.
- tests/test_memory_extraction.py:2367-2369: same-class catch in test_extraction_prompt_version_history_extended docstring, rephrased "appended in this spec's history block" to "appended for issue #428".

TG Kai flagged the source comment; same-class sweep caught the test docstring. No mechanism changes; comment-only diff. Suite still 2765 passed.

dcellison · 2026-05-01T11:38:21Z

Review by Kai

The two prior rounds resolved all four findings cleanly. I reviewed the diff independently against those resolved items and the full codebase conventions.

Review

Overall: clean. No critical issues or new warnings. Two minor suggestions below.

Suggestion 1 - Ambiguous "admits" in regex comment

File: src/kai/memory_extraction.py, _EPISODE_GOAL_NOISE_RE comment block

# Catches real production goals like
# "Push a prepared Memory wiki page", "Approve v3 of the
# memory-md-to-Qdrant migration spec", ...
# Bounded explicitly so a long fact text cannot
# drive quadratic backtracking. The token class is alphanumerics
# + `#` + `-` so things like "#412", "v3", and "tag-dedup" pass
# through.

The phrase "admits real production goals like 'Push a prepared Memory wiki page'" (earlier in that same block) reads as "allows through" to a future reader, but the intent is "correctly intercepts." All three examples there are positive cases that the regex rejects - they appear in test_arm3_routine_transactions as expected rejections. The word "admits" means the opposite of the intended meaning. "correctly catches" or "intercepts" would be unambiguous.

Suggestion 2 - `pull request` and `version` artifact nouns are untested

File: tests/test_memory_extraction.py, TestValidateEpisodeWorkflowRegex

The _EPISODE_GOAL_NOISE_RE alternation includes pull request (two-word) and version, but neither appears in any positive-case test. The three arm tests cover spec, specification, PR, issue, revision, wiki, epic, reminder, comment. The regex engine handles these correctly - a space inside an alternation literal is legal Python regex - but the two-word pull request form in particular benefits from a test, since it's the only multi-word entry and a future regex edit could silently break it. Low priority, but worth one test line in the arm 2 or arm 3 block.

Everything else - clean

Regex correctness: ^(verb)\s+(?:[\w#-]+\s+){0,6}?(?:nouns)\b with re.IGNORECASE is sound. Lazy {0,6}? carries no ReDoS risk (explicit upper bound, non-overlapping token class). The pull request two-word literal in the alternation works correctly in Python's re engine.
_ARM_FOR_VERB direct lookup: All 9 verbs in the regex are keys in the map. KeyError-on-miss is the correct failure mode.
_PerArmCounter.snapshot(): Shallow per-user copy is sufficient; inner values are ints. _reset() uses clear() in-place, consistent with _Counter._reset().
_validate_episode return contract: Both reject paths ("non-string goal", "workflow-event regex match") return distinct reason strings; counter increments only on the regex path; the non-string path emits a DEBUG log. Caller propagates reason correctly.
_generate_episode hookup: The episode = validated no-op rebind is harmless and makes the branching symmetric. Lambda closure is over the post-reassignment episode, which is the same object.
Eval harness: _PROMPT_V6_PINNED default on _run_one_probe, baseline_prompts dict as single source of truth, field rename to baseline_prompt_hash/active_prompt_hash, schema version bump to "2" - all correct.
Fixture probe counts: 5 wf + 3 borderline = 8 workflow-noise, 4 durable-content = 12 total. Test assertions match.
__all__ extension: _validate_episode follows the existing underscore-export pattern.
Security: No user input reaches the regex or any exec path. No new injection surface.
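
For reference, a fixture-count check of the kind mentioned above might look like the sketch below. The JSONL record shape is an assumption: the field name `category` and the `probe-N` ids are hypothetical, and the sample data is synthesized only to reproduce the 8 + 4 split discussed in this review.

```python
import io
import json
from collections import Counter

# Hypothetical probe shape; the real fixture's field names may differ.
SAMPLE_JSONL = "\n".join(
    json.dumps({"id": f"probe-{i}", "category": category})
    for i, category in enumerate(
        ["workflow-noise"] * 8 + ["durable-content"] * 4
    )
)


def count_categories(stream) -> Counter:
    """Count probes per category in a JSONL stream, skipping blank lines."""
    counts: Counter = Counter()
    for line in stream:
        if line.strip():
            counts[json.loads(line)["category"]] += 1
    return counts


counts = count_categories(io.StringIO(SAMPLE_JSONL))
```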

- S1: rephrase test_arm2_approve docstring. "intervening-tokens cap admits multi-word artifact qualifiers" reads ambiguously: "admits" could mean "lets through" (i.e. fails to reject) when the actual meaning is "still finds the artifact noun in". New wording is unambiguous: "the cap lets the regex still find the artifact noun in goals with multi-word qualifiers".
- S2: add positive coverage for the `version` and `pull request` artifact nouns. Both appear in `_EPISODE_GOAL_NOISE_RE`'s alternation but neither was exercised by any positive test; `pull request` is the only multi-word entry and a future regex edit could silently break it. Two new probes added to test_arm2_approve: "Approve the new version" and "Approve the pull request".

Counter / arm-mapping unchanged. Suite: 2765 passed (count unchanged; positives count inside one test went from 3 to 5).

dcellison · 2026-05-01T11:43:33Z

Review by Kai

Three prior rounds resolved all previously raised findings. Reviewing fresh against this diff.

Review

Overall: clean. No critical issues or new warnings. One observation below.

Suggestion - Missing `goal` key passes silently

File: src/kai/memory_extraction.py, _validate_episode

goal = episode.get("goal", "")

If the "goal" key is absent from the stage-2 payload, goal defaults to "", which passes the string check and returns (episode, None) - storing an episode with an empty goal. The stage-2 schema enforces goal as a required string, so this is low probability in practice. But the default of "" silently papers over a schema contract violation rather than exposing it. A goal = episode.get("goal") followed by treating goal is None the same as the non-string path (same DEBUG log, same "non-string goal" reason) would keep the validator's two rejection paths consistent. The current behavior is not wrong given the schema guarantee; just worth noting.
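
The suggested change is small enough to sketch in isolation. The helper name `goal_rejection_reason` is hypothetical; in the real code this check would sit inside `_validate_episode`.

```python
from typing import Optional


def goal_rejection_reason(episode: dict) -> Optional[str]:
    """Return a reject reason for a missing or non-string goal, else None."""
    goal = episode.get("goal")  # no "" default, so a missing key yields None
    if not isinstance(goal, str):
        # None (missing key) and any other non-string share one reject path.
        return "non-string goal"
    return None
```

Note that an explicitly empty string still passes this check; rejecting empty goals would be a separate decision.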

Everything else - clean

Regex correctness. ^(verb)\s+(?:[\w#-]+\s+){0,6}?(?:nouns)\b with re.IGNORECASE. The lazy bounded gap is sound, no ReDoS risk. The pull request two-word literal in the alternation works correctly in Python's re engine. version and pull request both now have positive-case coverage in test_arm2_approve. The 6-token cap accommodates the longest production goal in the snapshot with one token of headroom.
_ARM_FOR_VERB direct lookup. All 9 verbs captured by group 1 are keys in the map. KeyError-on-miss is the correct failure mode.
_validate_episode return contract. Both reject paths return distinct reason strings; counter increments only on the regex path; non-string path emits DEBUG log and withholds increment.
_generate_episode hookup. episode = validated is a harmless no-op rebind (same object). Lambda closes over episode after the rebind; no race. validate_rejected outcome correctly short-circuits before add_structured.
Eval harness. _PROMPT_V6_PINNED default on _run_one_probe, baseline_prompts dict as single source of truth, baseline_prompt_hash/active_prompt_hash rename, schema version bump to "2" - all correct. argparse choices and baseline_prompts keys are in sync.
Fixture. 5 workflow-event + 3 borderline = 8 workflow-noise, 4 durable-content = 12 total. Test assertions match the fixture content.
Integration tests. Monkeypatch targets memory_module.add_structured correctly (attribute lookup on the module object, not a name-bound import). Both reject and accept paths are covered.
Security. Regex is applied to LLM output, not raw user input. No injection surface.
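
The integration-test point about patch targets can be demonstrated without the real codebase. In the sketch below, two throwaway module objects stand in for the memory module and its caller; all names other than `add_structured` are hypothetical. It shows why patching an attribute on the module object is only seen by code that looks the function up through the module at call time.

```python
import types

# Two throwaway modules standing in for the memory module and its caller.
memory_module = types.ModuleType("memory_module")
memory_module.add_structured = lambda episode: "real-id"

caller = types.ModuleType("caller")
caller.memory_module = memory_module  # like "import memory_module"
caller.add_structured = memory_module.add_structured  # like "from ... import"


def store_via_attribute(episode):
    # Looks the function up on the module at call time, so a patch is seen.
    return caller.memory_module.add_structured(episode)


def store_via_bound_name(episode):
    # Uses the reference captured at import time, so a patch is missed.
    return caller.add_structured(episode)


# Equivalent in effect to monkeypatch.setattr(memory_module, "add_structured", fake).
memory_module.add_structured = lambda episode: "fake-id"
```

After the patch, `store_via_attribute` reaches the fake while `store_via_bound_name` still calls the original, which is why the patch target has to match how the code under test resolves the name.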

* feat: positive-criterion extraction prompt and replay scaffolding (closes #464)

The extraction prompt accumulated exclusion lists across #426, #427, #429 to suppress noise classes observed in production. The pattern is bounded by the author's enumeration of failure modes: a new variant of an already-seen noise class slips through, someone adds another stanza, the prompt grows. The growth curve has no natural cap and each new entry dilutes the model's attention to the positive task.

This change replaces the IGNORE block and the fact-side DURABILITY TEST with a single positive criterion: "Would this fact help a future conversation that does not include the current turn?" Applied per candidate, with six worked examples (three emit, three do not emit) anchoring the counterfactual reasoning. The positive criterion is bounded by the property we actually care about, not the author's enumeration; phrasings the prompt has never seen are evaluated by the same test as the ones it has.

`_EXTRACTION_PROMPT_VERSION` bumps from "8" to "9" so future cleanup can target old-prompt facts if the class coverage differs materially. The episode-side blocks (EPISODE CLASSIFICATION, EPISODE IGNORE, EPISODE DURABILITY TEST) are unchanged per the parent epic's scope; FORMAT, STORE, CONFIDENCE, and CONSOLIDATION blocks all stay.

New `kai.eval.replay` module supports comparing pre-PR and post-PR sandbox fact sets without disturbing production Qdrant. It walks chat-history JSONL pair-by-pair, maintains a rolling PRIOR CONTEXT buffer of `episode_classifier_context_turns` prior pairs, and calls `extract_and_store` with a sandbox `user_id` (enforced to carry a `sandbox-` prefix as defense in depth against accidental writes to real users). The replay reuses `_build_extraction_payload` for prior-context truncation so the sandbox sees production-equivalent input shape, and reuses `history._pair_records_chronologically` so the pair stream is byte-equivalent to what production fed the extractor.

Two existing prompt-pinning tests (`test_ignore_bullets_added`, `test_durability_test_section_present`) are removed because the v9 swap retires the structures they pinned; their replacements live in the new `tests/test_extraction_prompt.py`. `test_extraction_prompt_version_bumped` is updated to assert "9"; the version-history-fragment test gains assertions for the v9 entry.

The acceptance evidence (sandbox-replay smoke, canary recall, classification retention report) lands in the PR description after the operator's sandbox phase runs.

* fix: initialize memory store in replay before extract loop

The replay module called `extract_and_store` per pair but never invoked `memory.init_memory(config)`, so the module-level `_memory` handle in `kai.memory` stayed None for the entire run. Every storage primitive (`search`, `add_structured`, `delete_all`) short-circuits on that None check, so the replay walked the chat history, spawned a `claude --print` extractor per pair, and silently discarded every extracted fact. The symptom surfaced only via `outcome=dropped_backend` in the per-pair `consolidate.intent` log lines; the summary print reported `facts_stored=0` with no exception raised.

Fix: one-liner call to `init_memory(config)` immediately after `load_config()` in `_async_main`. Test: TestInitMemory in tests/test_eval_replay.py mocks every post-init dependency and asserts `init_memory` was invoked with the loaded config object. The class docstring documents the original failure mode so a future regression has the diagnostic trail attached.

* fix: extraction prompt coherence and replay summary breakdowns

Addresses #467 review findings:

- STORE block's "Apply the DURABILITY TEST below" rewrite. The v9 swap deleted the fact-side DURABILITY TEST and inserted the QUALITY TEST in its place; the STORE block's directive was outside the edit range and survived stale, pointing the model at a section that no longer exists. Renamed to "QUALITY TEST below"; new snapshot test pins the fact-side region against any further DURABILITY TEST references slipping in.
- Worked examples block extended from 3 negatives to 4. The new example ("I'm writing the spec now." -> do not emit) covers the fourth class the spec prose named (ephemeral / in-progress task state) but which the original block omitted. The negative-examples snapshot test now pins all four.
- Comments at three sites in memory_extraction.py renamed from "the prompt's IGNORE rules" to "the prompt's QUALITY TEST" so future readers don't hunt for an IGNORE block that v9 removed.
- `_run_replay` returns `(counters, dry_run_samples)`. Live runs pull `get_all_facts` for the sandbox user after the loop and emit by-tag, by-speaker, by-prompt-version breakdowns; dry-runs emit a structural payload-shape sample (prior depth + char counts) for the first three pairs. Verbatim text is intentionally not surfaced so a sandbox dry-run cannot leak personal history into stdout artifacts.
- `init_memory(config)` moved after the `--context-turns` negative-value validation so an argument-rejection path does not trigger the first-run Mem0 embedding-model download.

Eight new unit tests cover the new format helpers (grouping rules, empty-input handling, sort order, missing-metadata defaults, verbatim-text absence), and the existing TestInitMemory + dry-run tests are updated for the new return shape.
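
Two of the replay mechanics described in these commit messages, the `sandbox-` prefix guard and the rolling prior-context buffer, can be sketched as below. This is an illustration under assumptions, not the `kai.eval.replay` implementation: `require_sandbox_user`, `walk_pairs`, and the `Pair` tuple shape are hypothetical, and the real module reuses production helpers for pairing and truncation.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]  # (user message, assistant reply); assumed shape


def require_sandbox_user(user_id: str) -> str:
    """Refuse any user_id that lacks the sandbox- prefix."""
    if not user_id.startswith("sandbox-"):
        raise ValueError(f"replay user_id must start with 'sandbox-': {user_id!r}")
    return user_id


def walk_pairs(
    pairs: Iterable[Pair], context_turns: int
) -> Iterator[Tuple[List[Pair], Pair]]:
    """Yield (prior_context, current_pair) with a bounded rolling buffer."""
    if context_turns < 0:
        raise ValueError("context_turns must be non-negative")
    # deque(maxlen=n) drops the oldest pair automatically once full.
    buffer: Deque[Pair] = deque(maxlen=context_turns)
    for pair in pairs:
        yield list(buffer), pair
        buffer.append(pair)
```

Validating `context_turns` before any expensive setup matches the ordering concern raised in the last commit, where argument rejection should not trigger a model download.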

dcellison merged commit 7960135 into main May 1, 2026
1 check passed

dcellison deleted the feature/episode-extractor-tightening branch May 1, 2026 11:45

dcellison mentioned this pull request May 1, 2026

Memory quality: replace write-time string filters with structural controls #436

Open

4 tasks

dcellison mentioned this pull request May 12, 2026

Memory extraction: replace exclusion stanzas with positive-criterion test #464

Closed

dcellison merged 4 commits into main from feature/episode-extractor-tightening


3 participants
