
feat(memory): tighten episode classifier against workflow-event closures #429

Merged
dcellison merged 4 commits into
main from
feature/episode-extractor-tightening
May 1, 2026

Conversation

@dcellison
Owner

Closes #428.

Summary

Three coordinated changes mirroring PR #427's pattern, but for the episode classifier:

  1. Stage-1 prompt v6 -> v7. Refines the EPISODE CLASSIFICATION block in _EXTRACTION_SYSTEM_PROMPT with a new EPISODE IGNORE list (workflow-loop iterations, routine workflow transactions, process meta-lessons) plus an episode-scoped DURABILITY TEST. Positive criterion 1 narrowed to "durable situation" with a forward reference to the IGNORE list. _EXTRACTION_PROMPT_VERSION bumps 6 -> 7; existing v3/v4/v5/v6 history-block prose is preserved alongside the new v7 paragraph.

  2. _validate_episode regex backstop. A Python-side guard hooked into _generate_episode between the JSON-decoded stage-2 output and add_structured. _EPISODE_GOAL_NOISE_RE rejects goals starting with workflow-shape verbs (Evaluate, Review, Audit, Approve, File, Push, Draft, Schedule, Post) followed by an artifact noun, with a bounded intervening-tokens clause that admits real production goals like "Push a prepared Memory wiki page". Per-user, per-arm rejection counts via the new _EPISODE_VALIDATE_REJECTIONS counter; exposed through get_extractor_stats() paralleling _RULE_6_REJECTIONS. New validate_rejected outcome on the memory.episode log line.

  3. Eval harness extension. _PROMPT_V6_PINNED captured verbatim before this PR's edits (SHA-256 4f1e03dd37d9d52bb39724f949f7c382343837608497ebcd2ab3ae4990d3030c). New --baseline {v5,v6} CLI flag threads through _run_one_probe to select which pinned constant the baseline arm runs against. Default flips to v6 so each new prompt revision compares against the immediately prior one; v5 retained for cross-revision sanity. Report carries a new baseline_choice field.

_EPISODE_PROMPT_VERSION intentionally stays at "1". This PR does not change _EPISODE_SYSTEM_PROMPT or _EPISODE_SCHEMA, so the stage-2 version constant is unchanged. The hygiene sweep targets episodes by hand-curated ID list, not by prompt-version filter. Issue body's acceptance criterion 3 (bump to "2") is intentionally superseded; this is documented in the spec evaluation and the Acceptance section there.
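
The regex backstop in change 2 can be sketched roughly as follows. This is a minimal illustration, not the code in the diff: the nine verbs and the anchored, bounded-gap shape come from this PR's discussion, while the exact artifact-noun list, the verb-to-arm mapping, and the names `GOAL_NOISE_RE`, `ARM_FOR_VERB`, and `noise_arm` are assumptions made for the example.

```python
import re
from typing import Optional

# The nine workflow-shape verbs named in the PR description.
_VERBS = "evaluate|review|audit|approve|file|push|draft|schedule|post"
# Assumed artifact-noun list, assembled for illustration only.
_NOUNS = (
    "specification|spec|pull request|PR|issue|revision|version|"
    "wiki|epic|reminder|comment"
)

# Anchored at ^; at most six intervening tokens (lazy) before the noun.
GOAL_NOISE_RE = re.compile(
    rf"^({_VERBS})\s+(?:[\w#-]+\s+){{0,6}}?(?:{_NOUNS})\b",
    re.IGNORECASE,
)

# Hypothetical verb -> arm mapping (review / approve / transaction).
ARM_FOR_VERB = {
    "evaluate": "review", "review": "review", "audit": "review",
    "approve": "approve",
    "file": "transaction", "push": "transaction", "draft": "transaction",
    "schedule": "transaction", "post": "transaction",
}


def noise_arm(goal: str) -> Optional[str]:
    """Return the rejection arm for a workflow-shape goal, else None."""
    match = GOAL_NOISE_RE.match(goal)
    if match is None:
        return None
    # Lower-case the captured verb because the pattern is case-insensitive.
    return ARM_FOR_VERB[match.group(1).lower()]
```

Under these assumptions, "Push a prepared Memory wiki page" is rejected under the transaction arm, while "Review the v3 architecture proposal" passes because no listed artifact noun follows the verb.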

Tests

  • 13 new tests added; existing test count rises from 2751 to 2764. Suite: 2764 passed, 1 skipped.
  • TestEpisodeClassificationIgnoreList (4): pin the four prompt-side wording fragments (three IGNORE bullets + DURABILITY TEST gate).
  • TestValidateEpisodeWorkflowRegex (5): one positive-cases test per arm (review/approve/transaction), one per-user-per-arm counter integration test, one negatives test against durable-shape goals from the snapshot.
  • TestValidateEpisodeIntegration (2): async tests driving _generate_episode end-to-end with mocked stage-2 runner and add_structured. Covers both reject (outcome="validate_rejected", counter increments, add_structured not called) and accept (outcome="stored", memory_id flows through, counter unchanged).
  • Two existing methods in TestExtractionPromptSoftVocab updated in place per PR #427's precedent: version pin moves 6 -> 7, history-block fragment list extends with "v7 (2026-04-30, this issue)" and "EPISODE CLASSIFICATION block".
  • New test_v6_pinned_drift in tests/test_eval_extraction.py mirroring test_v5_pinned_drift. New test_episode_classification_fixture_loads validates the 12-probe example fixture.
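
A pinned-prompt drift test of the kind described above might look like the sketch below. The helper names and the placeholder prompt are hypothetical; in a real test the expected digest would be a hard-coded hex literal recorded when the prompt was pinned, rather than computed at run time as it is here for self-containment.

```python
import hashlib


def prompt_digest(prompt: str) -> str:
    """SHA-256 hex digest of a prompt's UTF-8 bytes."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def check_no_drift(pinned_prompt: str, expected_digest: str) -> None:
    """Raise AssertionError if the pinned text no longer matches its digest."""
    actual = prompt_digest(pinned_prompt)
    if actual != expected_digest:
        raise AssertionError(
            f"pinned prompt drifted: expected {expected_digest}, got {actual}"
        )


# Placeholder standing in for a pinned prompt constant (assumption).
PINNED_PROMPT = "EPISODE CLASSIFICATION: placeholder text\n"
# In a real test this would be a literal captured at pin time.
RECORDED_DIGEST = prompt_digest(PINNED_PROMPT)

check_no_drift(PINNED_PROMPT, RECORDED_DIGEST)
```

Any later edit to the pinned text, even a whitespace change, alters the digest and fails the check.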

Operator follow-up after merge

Two operator actions land separately and are not part of this PR's CI:

  • Hygiene sweep. scripts/forget-episode-noise.py (untracked, operator-local) deletes 29 hand-curated mundane episode IDs from production: 24 review-loop iterations + 5 routine workflow transactions. Run sequence:

    sudo launchctl bootout system/com.syrinx.kai
    .venv/bin/python scripts/forget-episode-noise.py --apply
    sudo launchctl bootstrap system /Library/LaunchDaemons/com.syrinx.kai.plist
    

    Expected drop: episode bucket 42 -> 13 (all durable). Mirrors PR #427's fact-side sweep precedent.

  • v6-vs-v7 harness rerun. Run during the PR-review cooldown window:

    .venv/bin/python -m kai.eval.extraction \
      --probes home/evals/episode-classification-probes.example.jsonl \
      --baseline v6 \
      --output /tmp/v6-v7-report.json
    

    Expected cost is ~$0.30 and runtime ~10 minutes. The per-probe has_episode flips will be reported back here when complete; the spec's expected v7 result is a workflow-event false-positive rate of 0/5 (down from ~5/5 under v6) with the durable-episode true-positive rate holding at 4/4. The borderline set (3 probes) is the dimension to watch.

Risks

  • Regex over-rejection on legitimate review-shape goals (e.g. "Review the v3 architecture proposal"). Mitigation: arm anchors at ^, the artifact-noun list does not include architecture, and the _EPISODE_VALIDATE_REJECTIONS counter is exposed in production for real-world false-positive tracking.
  • Borderline-probe drift between v6 and v7. Mitigation: report drift probe-by-probe in this PR's harness rerun; goal is no regression on the 4 durable probes, not perfection on the 3 borderline.
  • The example probe fixture is illustrative; the production probe fixture is operator-built post-merge.

Spec history

Three rounds of independent review converged with descending severity:

  • v1: 2 mechanism + 2 wording-class sub-blockers + 3 wording nits.
  • v2: 1 wording-class sub-blocker (factual claim about __all__ membership).
  • v3: clean approval.

The implementation tracks v3 directly.

…res (#428)

Three coordinated changes mirroring PR #427's pattern for fact extraction:

1. Stage-1 prompt v6 -> v7: refine the EPISODE CLASSIFICATION block in
   `_EXTRACTION_SYSTEM_PROMPT` with a new EPISODE IGNORE list (workflow-loop
   iterations, routine workflow transactions, process meta-lessons) plus
   an EPISODE DURABILITY TEST gate. Positive criterion 1 narrowed to
   "durable situation". Bump `_EXTRACTION_PROMPT_VERSION` 6 -> 7 with the
   v7 history-block prose appended; existing v3/v4/v5/v6 entries preserved.

2. `_validate_episode` regex backstop hooked into `_generate_episode`:
   `_EPISODE_GOAL_NOISE_RE` rejects stage-2 outputs whose `goal` matches
   workflow-shape verbs (Evaluate/Review/Audit/Approve/File/Push/Draft/
   Schedule/Post). Per-user, per-arm rejection counter
   `_EPISODE_VALIDATE_REJECTIONS` exposed via `get_extractor_stats()`.
   New `validate_rejected` outcome on the memory.episode log line.
   `_validate_episode` added to `__all__` paralleling `_validate_facts`;
   regex constant and counter stay module-level (mirrors `_WORKFLOW_EVENT_RE`
   / `_RULE_6_REJECTIONS` precedent).

3. Eval harness extended for v6-vs-v7 head-to-head: `_PROMPT_V6_PINNED`
   captured verbatim before this PR's edits, `--baseline {v5,v6}` CLI flag
   threaded through `_run_one_probe`. Existing v5-vs-active path retained
   for cross-revision sanity. Report carries `baseline_choice` so a v5
   run is distinguishable from a v6 run; `v5_/v6_prompt_hash` keys retain
   their historical names but report the actual pair-arm hashes.

`_EPISODE_PROMPT_VERSION` deliberately stays at "1": this PR does not
change `_EPISODE_SYSTEM_PROMPT` or `_EPISODE_SCHEMA`, so the stage-2
version constant is unchanged. The hygiene sweep targets episodes by
hand-curated ID list, not by prompt-version filter.

13 new tests (`TestEpisodeClassificationIgnoreList`,
`TestValidateEpisodeWorkflowRegex`, `TestValidateEpisodeIntegration`,
plus `test_v6_pinned_drift` and `test_episode_classification_fixture_loads`),
plus updates to two existing methods in `TestExtractionPromptSoftVocab`.
Full suite: 2764 passed, 1 skipped.

A 12-probe example fixture lands at
`home/evals/episode-classification-probes.example.jsonl` (5 workflow-event,
4 durable, 3 borderline; sanitized identifiers throughout) with the
`.gitignore` exception. The operator-local production fixture and the
`scripts/forget-episode-noise.py` hygiene-sweep script (29 hand-curated
mundane episode IDs) stay untracked per the precedent set by
`scripts/forget-review-noise.py` and its pass-2 counterpart.
@dcellison
Owner Author

Review by Kai

The explore agent found the pre-PR state. The changes are in the diff itself, which I've already fully analyzed. Here's the review.


Review: feat(memory): tighten episode classifier against workflow-event closures

The PR is well-structured with clear defense-in-depth thinking and good test coverage. Four concrete findings (two warnings, two suggestions), no critical issues.


Warning 1 - _run_one_probe default contradicts stated behavior

File: src/kai/eval/extraction.py, new signature in _run_one_probe

baseline_prompt: str = _PROMPT_V5_PINNED,  # added default

The PR description says "Default flips to v6" and the --baseline argparse default is "v6". But the function-level keyword default is _PROMPT_V5_PINNED. Any caller that invokes _run_one_probe(...) without the baseline_prompt kwarg - including future tests or programmatic callers - silently gets v5. The CLI path is fine because it always passes baseline_prompt explicitly, but the function signature is a trap. The default should be _PROMPT_V6_PINNED.
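
One way to avoid this class of mismatch is to derive both defaults from a single table, sketched below. The names `BASELINE_PROMPTS`, `DEFAULT_BASELINE`, `run_one_probe`, and `build_parser` and the placeholder prompt strings are assumptions for illustration, not the harness's actual code.

```python
import argparse

# Placeholder prompt bodies; the real pinned constants are much longer.
PROMPT_V5_PINNED = "v5 prompt text (placeholder)"
PROMPT_V6_PINNED = "v6 prompt text (placeholder)"

# Single source of truth for baseline selection.
BASELINE_PROMPTS = {"v5": PROMPT_V5_PINNED, "v6": PROMPT_V6_PINNED}
DEFAULT_BASELINE = "v6"


def run_one_probe(
    probe: str,
    baseline_prompt: str = BASELINE_PROMPTS[DEFAULT_BASELINE],
) -> dict:
    """Stub probe runner; only records which baseline it was given."""
    return {"probe": probe, "baseline_prompt": baseline_prompt}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--baseline",
        choices=sorted(BASELINE_PROMPTS),
        default=DEFAULT_BASELINE,
    )
    return parser
```

With this layout, a caller that omits `baseline_prompt` and a CLI run that omits `--baseline` resolve to the same prompt, so the two defaults cannot drift apart.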


Warning 2 - Output JSON field names are actively misleading at the new default

File: src/kai/eval/extraction.py, report dict in _run_async

"v5_prompt_hash": _hash(baseline_prompt),         # is v6 hash when --baseline v6
"v6_prompt_hash": _hash(memory_extraction._EXTRACTION_SYSTEM_PROMPT),  # is v7 hash

With --baseline v6 (the new default), v5_prompt_hash carries the v6 digest and v6_prompt_hash carries the v7 digest. baseline_choice partially documents this, but any downstream consumer reading those keys by name gets wrong information - including future log grep queries. Renaming to baseline_prompt_hash / active_prompt_hash would fix this cleanly. If the field names must stay stable for existing consumers, the inline comment should at minimum warn that the field names do not reflect the actual versions when baseline != v5.
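
A report shape with version-neutral keys could be sketched like this. The `build_report` helper and the `schema_version` key name are hypothetical; only the `baseline_choice`, `baseline_prompt_hash`, and `active_prompt_hash` names come from this discussion.

```python
import hashlib


def _hash(text: str) -> str:
    """SHA-256 hex digest of the prompt text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_report(baseline_choice: str, baseline_prompt: str, active_prompt: str) -> dict:
    """Assemble report metadata whose key names stay true for any baseline."""
    return {
        "schema_version": "2",               # assumed key name
        "baseline_choice": baseline_choice,  # e.g. "v5" or "v6"
        "baseline_prompt_hash": _hash(baseline_prompt),
        "active_prompt_hash": _hash(active_prompt),
    }
```

Because the keys describe roles (baseline, active) rather than versions, a consumer reading them by name gets correct information regardless of which `--baseline` was selected.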


Suggestion - Inaccurate reason string on non-string goal rejection

File: src/kai/memory_extraction.py, _validate_episode and its caller in _generate_episode

_validate_episode returns None for two distinct cases: regex match, and a non-string goal. The caller unconditionally sets:

reason = "goal matches workflow-event regex"

For the non-string path, that reason is factually wrong, and the non-string path emits no log line and does not increment the counter - making it invisible in production. The fix is a one-liner: log at DEBUG in the non-string branch and either return a sentinel or accept a separate reason parameter. Low probability in practice since the stage-2 schema enforces string goal, but the misleading reason will surface in incidents if the schema is ever relaxed.
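
A tuple-returning validator that keeps the two reject modes distinct might look like the following sketch. The regex here is a deliberately simplified stand-in, the counter increment is omitted, and the function name is illustrative; the two reason strings match the ones adopted later in this thread.

```python
import logging
import re
from typing import Optional, Tuple

log = logging.getLogger(__name__)

# Simplified stand-in for the real goal-noise pattern (assumption).
_NOISE_RE = re.compile(r"^(?:review|approve)\b", re.IGNORECASE)


def validate_episode(episode: dict) -> Tuple[Optional[dict], Optional[str]]:
    """Return (episode, None) on accept or (None, reason) on reject."""
    goal = episode.get("goal", "")
    if not isinstance(goal, str):
        # Defensive path: visible at DEBUG, reported under its own reason.
        log.debug("episode rejected: goal is %s, not str", type(goal).__name__)
        return None, "non-string goal"
    if _NOISE_RE.match(goal):
        return None, "workflow-event regex match"
    return episode, None
```

The caller can then log whichever reason came back instead of hard-coding one string for both paths.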


Suggestion - Unreachable "unknown" arm fallback

File: src/kai/memory_extraction.py, _validate_episode

arm = _ARM_FOR_VERB.get(verb, "unknown")

match.group(1) can only capture the nine verbs enumerated in _EPISODE_GOAL_NOISE_RE, all of which are keys in _ARM_FOR_VERB. The "unknown" fallback is dead code. An assert verb in _ARM_FOR_VERB would catch a future _EPISODE_GOAL_NOISE_RE edit that adds a verb without updating the map; the "unknown" fallback would instead silently miscount under a catch-all arm.
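
The difference between the two lookup styles can be shown with a toy map. The map contents and function names below are illustrative only.

```python
# Toy map that is deliberately missing "push", simulating a regex edit
# that added a verb without updating the mapping.
ARM_FOR_VERB = {"review": "review", "approve": "approve"}


def arm_silent(verb: str) -> str:
    # Falls back to a catch-all arm; a stale map goes unnoticed.
    return ARM_FOR_VERB.get(verb, "unknown")


def arm_loud(verb: str) -> str:
    # Direct subscript raises KeyError for a verb missing from the map.
    return ARM_FOR_VERB[verb]
```

Here `arm_silent("push")` quietly returns `"unknown"`, whereas `arm_loud("push")` raises `KeyError`, surfacing the stale map at the first rejection.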


Prompt changes, fixture, and test coverage all look correct. The _PerArmCounter snapshot logic is sound. The regex bounded-gap approach ({0,6}?) is appropriate and carries no ReDoS risk.

- W1: _run_one_probe baseline_prompt default flips from
  _PROMPT_V5_PINNED to _PROMPT_V6_PINNED so direct programmatic
  callers and ad-hoc test fixtures get the same baseline as the
  CLI default.
- W2: report keys renamed v5_prompt_hash -> baseline_prompt_hash
  and v6_prompt_hash -> active_prompt_hash; output schema bumps
  1 -> 2. Historical names lied at the new --baseline v6 default
  (the field "v5" was carrying a v6 hash).
- Suggestion: _validate_episode now returns a tuple (episode | None,
  reason | None). The workflow-regex reject path returns (None,
  "workflow-event regex match"); the non-string-goal defensive path
  returns (None, "non-string goal") with a DEBUG log and no counter
  increment. _generate_episode unpacks the tuple and uses the
  reason directly so the memory.episode log line distinguishes
  the two reject modes.
- Suggestion: _ARM_FOR_VERB lookup switched from .get(verb,
  "unknown") to direct subscript so a future _EPISODE_GOAL_NOISE_RE
  edit that adds a verb without updating the map fails loud
  instead of silently miscounting under "unknown".

New test test_non_string_goal_rejected_with_distinct_reason
pins the defensive-path reason and verifies the per-arm counter
stays at zero. Existing TestValidateEpisodeWorkflowRegex tests
updated to unpack the tuple and pin the reason string. Full
suite: 2765 passed, 1 skipped.
@dcellison
Owner Author

Review by Kai

The prior review's four findings (two warnings, two suggestions) have all been addressed cleanly. I'll verify each resolution, then assess anything new.

Prior findings - resolved

Warning 1 (_run_one_probe default) - Fixed. baseline_prompt: str = _PROMPT_V6_PINNED is now the function-level default, matching the CLI default and the stated intent.

Warning 2 (misleading output field names) - Fixed. v5_prompt_hash/v6_prompt_hash renamed to baseline_prompt_hash/active_prompt_hash; _OUTPUT_SCHEMA_VERSION bumped to "2" with a migration comment; the baseline_choice field disambiguates for any downstream consumer.

Suggestion 1 (non-string goal reason) - Fixed. _validate_episode now returns "non-string goal" vs "workflow-event regex match" as distinct strings, and the non-string path emits a DEBUG log and withholds the counter increment. The caller propagates whichever reason was returned.

Suggestion 2 ("unknown" sentinel arm) - Fixed. Direct _ARM_FOR_VERB[verb] dict access replaces .get(verb, "unknown"); the comment explains that a future regex edit adding a verb without updating the map will now KeyError loudly rather than silently miscount.

New findings

Suggestion - Test count in PR description is off by one. TestValidateEpisodeWorkflowRegex contains 6 methods (the 5 described plus test_non_string_goal_rejected_with_distinct_reason), making the total 14 new tests, not 13. Not a code issue; just a documentation artifact.

Remaining code review - clean

Regex correctness. _EPISODE_GOAL_NOISE_RE anchors at ^, uses a lazy bounded gap {0,6}?, and applies re.IGNORECASE. The .lower() call on match.group(1) before the _ARM_FOR_VERB lookup is correct and handles the case-insensitive flag. The 9 captured verbs are all present in _ARM_FOR_VERB. No ReDoS risk; the token class [\w#-]+ is non-overlapping and the cap is explicit.

_PerArmCounter lifecycle. _reset() uses clear() in-place, consistent with _Counter._reset() and the stated rationale (external importers hold a reference to the module-level instance). The snapshot's shallow per-user copy is sufficient since inner values are ints.
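
The lifecycle properties described above can be illustrated with a minimal counter. This is a sketch under stated assumptions (no locking, hypothetical method names), not the class in the diff.

```python
from typing import Dict


class PerArmCounter:
    """Per-user, per-arm rejection counts (illustrative, not thread-safe)."""

    def __init__(self) -> None:
        self._counts: Dict[str, Dict[str, int]] = {}

    def increment(self, user_id: str, arm: str) -> None:
        per_user = self._counts.setdefault(user_id, {})
        per_user[arm] = per_user.get(arm, 0) + 1

    def snapshot(self) -> Dict[str, Dict[str, int]]:
        # Copy each user's inner dict; the values are ints, so a shallow
        # per-user copy fully isolates the snapshot from later increments.
        return {user: dict(arms) for user, arms in self._counts.items()}

    def reset(self) -> None:
        # clear() mutates in place, so importers holding a reference to
        # the same instance observe the reset.
        self._counts.clear()
```

Rebinding `self._counts = {}` in `reset` would also work for this class; the in-place `clear()` matters most when outside code holds the underlying mapping or the module-level instance itself.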

_generate_episode hookup. The episode = validated reassignment when the episode passes validation is a no-op (same dict object), but it's harmless and makes the branching symmetric. The lambda inside run_in_executor correctly closes over episode after the reassignment.
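
The hookup described here, with validation sitting between stage-2 output and the blocking store call, can be sketched as below. The function signature and the injected `validate` / `add_structured` callables are assumptions for illustration; the real `_generate_episode` takes different arguments.

```python
import asyncio
from typing import Callable, Optional, Tuple

Validator = Callable[[dict], Tuple[Optional[dict], Optional[str]]]


async def generate_episode(
    episode: dict,
    validate: Validator,
    add_structured: Callable[[dict], str],
) -> dict:
    """Validate a stage-2 episode, then store it off the event loop."""
    validated, reason = validate(episode)
    if validated is None:
        # Short-circuit before the store call is ever scheduled.
        return {"outcome": "validate_rejected", "reason": reason}
    episode = validated  # same object on accept; keeps branches symmetric
    loop = asyncio.get_running_loop()
    # The lambda closes over `episode` after the rebind above.
    memory_id = await loop.run_in_executor(None, lambda: add_structured(episode))
    return {"outcome": "stored", "memory_id": memory_id}
```

A test can drive both branches by passing a stub validator and a recording `add_structured`, then asserting that the reject path never calls the store.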

__all__ extension. _validate_episode follows the project's existing pattern of exporting underscore-prefixed names for test access (alongside _validate_facts, _store_facts, etc.).

Fixture probe counts. test_episode_classification_fixture_loads asserts workflow-noise == 8 (5 workflow-event + 3 borderline) and durable-content == 4. Verified against the JSONL file: both counts are correct.

_PROMPT_V6_PINNED drift test. The _V6_PROMPT_HASH in test_eval_extraction.py matches the hash cited in the PR description.

The implementation is correct. No critical issues, no new warnings.

- src/kai/memory_extraction.py:1104-1105: rephrased the
  _EPISODE_GOAL_NOISE_RE rationale comment to drop "the spec's
  per-arm noun lists" - same review-round-vocabulary class as
  the four violations scrubbed in PR #427's final commit.
- tests/test_memory_extraction.py:2367-2369: same-class catch in
  test_extraction_prompt_version_history_extended docstring,
  rephrased "appended in this spec's history block" to "appended
  for issue #428". TG Kai flagged the source comment; same-class
  sweep caught the test docstring.

No mechanism changes; comment-only diff. Suite still 2765 passed.
@dcellison
Owner Author

Review by Kai

The two prior rounds resolved all four findings cleanly. I reviewed the diff independently against those resolved items and the full codebase conventions.


Review

Overall: clean. No critical issues or new warnings. Two minor suggestions below.


Suggestion 1 - Ambiguous "admits" in regex comment

File: src/kai/memory_extraction.py, _EPISODE_GOAL_NOISE_RE comment block

# Catches real production goals like
# "Push a prepared Memory wiki page", "Approve v3 of the
# memory-md-to-Qdrant migration spec", ...
# Bounded explicitly so a long fact text cannot
# drive quadratic backtracking. The token class is alphanumerics
# + `#` + `-` so things like "#412", "v3", and "tag-dedup" pass
# through.

The phrase "admits real production goals like 'Push a prepared Memory wiki page'" (earlier in that same block) reads as "allows through" to a future reader, but the intent is "correctly intercepts." All three examples there are positive cases that the regex rejects - they appear in test_arm3_routine_transactions as expected rejections. The word "admits" means the opposite of the intended meaning. "correctly catches" or "intercepts" would be unambiguous.


Suggestion 2 - pull request and version artifact nouns are untested

File: tests/test_memory_extraction.py, TestValidateEpisodeWorkflowRegex

The _EPISODE_GOAL_NOISE_RE alternation includes pull request (two-word) and version, but neither appears in any positive-case test. The three arm tests cover spec, specification, PR, issue, revision, wiki, epic, reminder, comment. The regex engine handles these correctly - a space inside an alternation literal is legal Python regex - but the two-word pull request form in particular benefits from a test, since it's the only multi-word entry and a future regex edit could silently break it. Low priority, but worth one test line in the arm 2 or arm 3 block.


Everything else - clean

  • Regex correctness: ^(verb)\s+(?:[\w#-]+\s+){0,6}?(?:nouns)\b with re.IGNORECASE is sound. Lazy {0,6}? carries no ReDoS risk (explicit upper bound, non-overlapping token class). The pull request two-word literal in the alternation works correctly in Python's re engine.
  • _ARM_FOR_VERB direct lookup: All 9 verbs in the regex are keys in the map. KeyError-on-miss is the correct failure mode.
  • _PerArmCounter.snapshot(): Shallow per-user copy is sufficient; inner values are ints. _reset() uses clear() in-place, consistent with _Counter._reset().
  • _validate_episode return contract: Both reject paths ("non-string goal", "workflow-event regex match") return distinct reason strings; counter increments only on the regex path; the non-string path emits a DEBUG log. Caller propagates reason correctly.
  • _generate_episode hookup: The episode = validated no-op rebind is harmless and makes the branching symmetric. Lambda closure is over the post-reassignment episode, which is the same object.
  • Eval harness: _PROMPT_V6_PINNED default on _run_one_probe, baseline_prompts dict as single source of truth, field rename to baseline_prompt_hash/active_prompt_hash, schema version bump to "2" - all correct.
  • Fixture probe counts: 5 wf + 3 borderline = 8 workflow-noise, 4 durable-content = 12 total. Test assertions match.
  • __all__ extension: _validate_episode follows the existing underscore-export pattern.
  • Security: No user input reaches the regex or any exec path. No new injection surface.

- S1: rephrase test_arm2_approve docstring. "intervening-tokens cap
  admits multi-word artifact qualifiers" reads ambiguously — "admits"
  could mean "lets through" (i.e. fails to reject) when the actual
  meaning is "still finds the artifact noun in". New wording is
  unambiguous: "the cap lets the regex still find the artifact noun
  in goals with multi-word qualifiers".
- S2: add positive coverage for the `version` and `pull request`
  artifact nouns. Both appear in `_EPISODE_GOAL_NOISE_RE`'s
  alternation but neither was exercised by any positive test;
  `pull request` is the only multi-word entry and a future regex
  edit could silently break it. Two new probes added to
  test_arm2_approve: "Approve the new version" and "Approve the
  pull request". Counter / arm-mapping unchanged.

Suite: 2765 passed (count unchanged; positives count inside one test
went from 3 to 5).
@dcellison
Owner Author

Review by Kai

Three prior rounds resolved all previously raised findings. Reviewing fresh against this diff.

Review

Overall: clean. No critical issues or new warnings. One observation below.


Suggestion - Missing goal key passes silently

File: src/kai/memory_extraction.py, _validate_episode

goal = episode.get("goal", "")

If the "goal" key is absent from the stage-2 payload, goal defaults to "", which passes the string check and returns (episode, None) - storing an episode with an empty goal. The stage-2 schema enforces goal as a required string, so this is low probability in practice. But the default of "" silently papers over a schema contract violation rather than exposing it. A goal = episode.get("goal") followed by treating goal is None the same as the non-string path (same DEBUG log, same "non-string goal" reason) would keep the validator's two rejection paths consistent. The current behavior is not wrong given the schema guarantee; just worth noting.
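
The suggested change amounts to dropping the `""` default so an absent key falls into the same branch as a non-string value. A minimal sketch, with a hypothetical helper name:

```python
from typing import Optional, Tuple


def goal_or_reject(episode: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (goal, None) if goal is a string, else (None, reason)."""
    goal = episode.get("goal")  # no "" default: a missing key yields None
    if not isinstance(goal, str):
        # None (missing key) and other non-strings share one reject path.
        return None, "non-string goal"
    return goal, None
```

Note that an explicitly empty string still passes this check; rejecting empty goals would be a separate policy decision.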


Everything else - clean

  • Regex correctness. ^(verb)\s+(?:[\w#-]+\s+){0,6}?(?:nouns)\b with re.IGNORECASE. The lazy bounded gap is sound, no ReDoS risk. The pull request two-word literal in the alternation works correctly in Python's re engine. version and pull request both now have positive-case coverage in test_arm2_approve. The 6-token cap accommodates the longest production goal in the snapshot with one token of headroom.
  • _ARM_FOR_VERB direct lookup. All 9 verbs captured by group 1 are keys in the map. KeyError-on-miss is the correct failure mode.
  • _validate_episode return contract. Both reject paths return distinct reason strings; counter increments only on the regex path; non-string path emits DEBUG log and withholds increment.
  • _generate_episode hookup. episode = validated is a harmless no-op rebind (same object). Lambda closes over episode after the rebind; no race. validate_rejected outcome correctly short-circuits before add_structured.
  • Eval harness. _PROMPT_V6_PINNED default on _run_one_probe, baseline_prompts dict as single source of truth, baseline_prompt_hash/active_prompt_hash rename, schema version bump to "2" - all correct. argparse choices and baseline_prompts keys are in sync.
  • Fixture. 5 workflow-event + 3 borderline = 8 workflow-noise, 4 durable-content = 12 total. Test assertions match the fixture content.
  • Integration tests. Monkeypatch targets memory_module.add_structured correctly (attribute lookup on the module object, not a name-bound import). Both reject and accept paths are covered.
  • Security. Regex is applied to LLM output, not raw user input. No injection surface.

@dcellison dcellison merged commit 7960135 into main May 1, 2026
1 check passed
@dcellison dcellison deleted the feature/episode-extractor-tightening branch May 1, 2026 11:45
dcellison added a commit that referenced this pull request May 13, 2026
* feat: positive-criterion extraction prompt and replay scaffolding (closes #464)

The extraction prompt accumulated exclusion lists across #426, #427,
#429 to suppress noise classes observed in production. The pattern is
bounded by the author's enumeration of failure modes: a new variant of
an already-seen noise class slips through, someone adds another
stanza, the prompt grows. The growth curve has no natural cap and
each new entry dilutes the model's attention to the positive task.

This change replaces the IGNORE block and the fact-side DURABILITY
TEST with a single positive criterion: "Would this fact help a future
conversation that does not include the current turn?" Applied per
candidate, with six worked examples (three emit, three do not emit)
anchoring the counterfactual reasoning. The positive criterion is
bounded by the property we actually care about, not the author's
enumeration; phrasings the prompt has never seen are evaluated by
the same test as the ones it has.

`_EXTRACTION_PROMPT_VERSION` bumps from "8" to "9" so future cleanup
can target old-prompt facts if the class coverage differs materially.
The episode-side blocks (EPISODE CLASSIFICATION, EPISODE IGNORE,
EPISODE DURABILITY TEST) are unchanged per the parent epic's scope;
FORMAT, STORE, CONFIDENCE, and CONSOLIDATION blocks all stay.

New `kai.eval.replay` module supports comparing pre-PR and post-PR
sandbox fact sets without disturbing production Qdrant. It walks
chat-history JSONL pair-by-pair, maintains a rolling PRIOR CONTEXT
buffer of `episode_classifier_context_turns` prior pairs, and calls
`extract_and_store` with a sandbox `user_id` (enforced to carry a
`sandbox-` prefix as defense in depth against accidental writes to
real users). The replay reuses `_build_extraction_payload` for prior-
context truncation so the sandbox sees production-equivalent input
shape, and reuses `history._pair_records_chronologically` so the
pair stream is byte-equivalent to what production fed the extractor.
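
The rolling prior-context walk and the sandbox-prefix guard described above could be sketched along these lines. Both helpers are hypothetical stand-ins; the real module reuses production helpers for pairing and truncation, which are not reproduced here.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]  # (user turn, assistant turn)


def require_sandbox(user_id: str) -> str:
    """Refuse any user_id that lacks the sandbox- prefix."""
    if not user_id.startswith("sandbox-"):
        raise ValueError(f"replay user_id must start with 'sandbox-': {user_id!r}")
    return user_id


def walk_pairs(
    pairs: Iterable[Pair], context_turns: int
) -> Iterator[Tuple[List[Pair], Pair]]:
    """Yield (prior_context, current_pair) with a bounded rolling buffer."""
    prior: Deque[Pair] = deque(maxlen=context_turns)
    for pair in pairs:
        # Snapshot the buffer before adding the current pair, so a pair
        # never appears in its own prior context.
        yield list(prior), pair
        prior.append(pair)
```

With `context_turns=2`, the third pair sees the first two as context and the fourth sees only the second and third.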

Two existing prompt-pinning tests (`test_ignore_bullets_added`,
`test_durability_test_section_present`) are removed because the v9
swap retires the structures they pinned; their replacements live in
the new `tests/test_extraction_prompt.py`.
`test_extraction_prompt_version_bumped` is updated to assert "9"; the
version-history-fragment test gains assertions for the v9 entry.

The acceptance evidence (sandbox-replay smoke, canary recall,
classification retention report) lands in the PR description after
the operator's sandbox phase runs.

* fix: initialize memory store in replay before extract loop

The replay module called `extract_and_store` per pair but never
invoked `memory.init_memory(config)`, so the module-level `_memory`
handle in `kai.memory` stayed None for the entire run. Every storage
primitive (`search`, `add_structured`, `delete_all`) short-circuits on
that None check, so the replay walked the chat history, spawned a
`claude --print` extractor per pair, and silently discarded every
extracted fact. The symptom surfaced only via `outcome=dropped_backend`
in the per-pair `consolidate.intent` log lines; the summary print
reported `facts_stored=0` with no exception raised.

Fix: one-liner call to `init_memory(config)` immediately after
`load_config()` in `_async_main`.

Test: TestInitMemory in tests/test_eval_replay.py mocks every
post-init dependency and asserts `init_memory` was invoked with the
loaded config object. The class docstring documents the original
failure mode so a future regression has the diagnostic trail attached.

* fix: extraction prompt coherence and replay summary breakdowns

Addresses #467 review findings:

- STORE block's "Apply the DURABILITY TEST below" rewrite. The v9
  swap deleted the fact-side DURABILITY TEST and inserted the QUALITY
  TEST in its place; the STORE block's directive was outside the
  edit range and survived stale, pointing the model at a section
  that no longer exists. Renamed to "QUALITY TEST below"; new
  snapshot test pins the fact-side region against any further
  DURABILITY TEST references slipping in.
- Worked examples block extended from 3 negatives to 4. The new
  example ("I'm writing the spec now." -> do not emit) covers the
  fourth class the spec prose named (ephemeral / in-progress task
  state) but which the original block omitted. The negative-examples
  snapshot test now pins all four.
- Comments at three sites in memory_extraction.py renamed from "the
  prompt's IGNORE rules" to "the prompt's QUALITY TEST" so future
  readers don't hunt for an IGNORE block that v9 removed.
- `_run_replay` returns `(counters, dry_run_samples)`. Live runs
  pull `get_all_facts` for the sandbox user after the loop and emit
  by-tag, by-speaker, by-prompt-version breakdowns; dry-runs emit
  a structural payload-shape sample (prior depth + char counts)
  for the first three pairs. Verbatim text is intentionally not
  surfaced so a sandbox dry-run cannot leak personal history into
  stdout artifacts.
- `init_memory(config)` moved after the `--context-turns` negative-
  value validation so an argument-rejection path does not trigger
  the first-run Mem0 embedding-model download.

Eight new unit tests cover the new format helpers (grouping rules,
empty-input handling, sort order, missing-metadata defaults,
verbatim-text absence), and the existing TestInitMemory + dry-run
tests are updated for the new return shape.

Development

Successfully merging this pull request may close these issues.

memory: tighten episode classifier against workflow-event closures
