Skip to content

Seed existing MEMORY.md knowledge into Mem0#317

Merged
dcellison merged 5 commits into
mainfrom
feature/seed-existing-memory
Apr 16, 2026
Merged

Seed existing MEMORY.md knowledge into Mem0#317
dcellison merged 5 commits into
mainfrom
feature/seed-existing-memory

Conversation

@dcellison
Copy link
Copy Markdown
Owner

Summary

Seeds existing MEMORY.md topic files into the Mem0 semantic memory store at startup, bridging the gap between Kai's structured markdown memory and the new vector-based search layer.

  • Adds add_structured() for explicit Track 2 ingestion (infer=False) with type/tag metadata
  • Adds seed_from_memory_md() which parses topic files, classifies them by type, deduplicates via cosine similarity, and stores entries per-user
  • Runs once per user at startup (guarded by memory_seeded:<user_id> flag in settings table)
  • 38 new tests covering parsing, classification, deduplication, multi-user isolation, and startup integration

Implementation details

memory.py - 5 new functions:

  • add_structured() - stores content with infer=False, builds metadata with type/tags/source
  • _classify_source_file() - maps known files to memory types (fact/preference), skips index/api-reference
  • _parse_topic_file() - markdown parser handling bullets, paragraphs, headings, code blocks
  • _is_duplicate() - top-1 cosine similarity search with 0.9 threshold
  • seed_from_memory_md() - orchestrator iterating users x topic files with per-entry dedup

main.py - startup migration block:

  • Checks memory_seeded:<user_id> flag per allowed user
  • Runs seed in thread executor (blocking Mem0 calls off the event loop)
  • Only sets flag when zero failures for that user

Test plan

  • TestAddStructured (10 tests) - fact/preference/custom types, metadata, tags, edge cases
  • TestParseTopicFile (6 tests) - bullets, headings, paragraphs, code blocks, empty files
  • TestClassifySourceFile (3 tests) - known files, skip files, unknown files
  • TestSeedFromMemoryMd (15 tests) - full seeding flow, idempotency, multi-user, partial failure
  • TestMemorySeedIntegration (4 tests) - startup flag logic, skip-when-present, failure handling
  • make check clean (ruff check + format)
  • All 38 new tests pass; pre-existing failures unchanged

Closes #311

@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

PR Review: Seed MEMORY.md into Mem0

Overall the implementation is clean and well-reasoned. A few real issues below.


Warning

1. Test isolation - _memory global state not cleaned up
tests/test_memory.py - throughout TestAddStructured and TestSeedFromMemoryMd

Every test that sets mem_mod._memory = mock_mem does so without teardown. Two tests assert the disabled path:

  • TestAddStructured.test_disabled_returns_none
  • TestSeedFromMemoryMd.test_disabled_returns_zero_counts

Both pass only if _memory is None. If pytest runs them after any other test that sets _memory, they fail. There's no autouse fixture to restore _memory = None. This is a real fragility - add a fixture or explicit mem_mod._memory = None teardown.

2. Missing memory directory not explicitly handled
memory.py, seed_from_memory_md()

for path in sorted(target_dir.glob("*.md")):

On Python < 3.12, Path.glob() on a non-existent directory raises OSError. The outer except Exception in main.py catches it, but logs a generic "Memory seed migration failed" instead of something clear like "memory directory does not exist". On first install before any memory files are written, this would spam warnings on every startup. A if not target_dir.exists(): return {uid: {...} for uid in user_ids} guard would make this self-healing.


Suggestion

3. Redundant skip list in _classify_source_file()
memory.py:495-499

if filename in ("MEMORY.md", "api-reference.md"):
    return None
return mapping.get(filename)

Neither MEMORY.md nor api-reference.md is in mapping, so mapping.get() already returns None for both. The explicit check is dead code. It documents intent, but it misleads the reader into thinking these files would otherwise match - they wouldn't. Either add them to mapping with None values (making the intent explicit in one place) or drop the guard and rely on the comment.

4. Bare # without space treated as heading
memory.py, _parse_topic_file()

if stripped.startswith("#"):
    current_heading = stripped.lstrip("#").strip()

A line like #hashtag or #311 (e.g. in notes) is swallowed as a heading rather than seeded as content. Standard Markdown heading syntax requires # (hash + space). Changing to stripped.startswith("# ") or stripped == "#" would be safer, though the risk is low given the controlled input files.

5. Integration tests copy-paste production logic
tests/test_main.py, TestMemorySeedIntegration

The four tests inline a verbatim copy of the seed block from _init_and_run() rather than invoking the function itself. If the production code changes (e.g., the log call or flag-key format), these tests will silently pass while the real path regresses. Acceptable given the difficulty of testing _init_and_run() end-to-end, but worth a comment acknowledging the limitation so a future author doesn't trust these tests as regression guards for the actual production flow.


No security issues found. The flag-key format (memory_seeded:{user_id_int} vs memory_seeded:{user_id_str}) looks inconsistent but is semantically identical since f"{123}" == f"{'123'}". The thread-executor offload, dedup threshold, and partial-failure retry logic are all sound.

@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

PR Review: Seed MEMORY.md into Mem0

All three warnings/suggestions from the prior review have been addressed: the missing directory guard is present, the redundant skip-list in _classify_source_file is removed (with explanatory comment), and the bare-# heading fix is implemented and tested.

One new finding below.


Suggestion

Heading detection is asymmetric for ATX headings without a space
memory.py, _parse_topic_file(), heading condition

if stripped.startswith("# ") or (len(stripped) > 1 and stripped[0] == "#" and stripped[1] == "#"):

The fix correctly rejects #311 and #hashtag (single hash, no space) as paragraph text. But the second branch catches any line beginning with ## regardless of whether a space follows: ##nospace is treated as a heading while #nospace is not. CommonMark requires ## (space after hashes) for all ATX heading levels. In practice this matters very little given the controlled memory files, but the asymmetry could silently swallow a line like ##311 (a GitHub cross-repo reference) as a heading. A uniform stripped.startswith("# ") or re.match(r"#{1,6} ", stripped) would be consistent, or at minimum a comment acknowledging the intentional divergence.


No other new issues. The implementation is solid: the thread-executor offload is correct, partial-failure retry logic is sound, per-user flag isolation is correct, and the 38-test suite covers the meaningful paths including the new missing-directory guard and the bare-hash edge case. No security issues found.

@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

PR Review: Seed MEMORY.md into Mem0

The heading detection fix (heading_hashes + stripped[heading_hashes : heading_hashes + 1] == " ") correctly and uniformly handles all ATX levels - this closes the asymmetry raised in the last review.

Two new findings below.


Warning

UnicodeDecodeError escapes the except OSError guard in seed_from_memory_md()
memory.py, per-file loop (~line 715 in the new block)

try:
    entries = _parse_topic_file(path)
except OSError:
    counts["failed"] += 1
    continue

_parse_topic_file calls path.read_text(encoding="utf-8"). A file with non-UTF-8 bytes raises UnicodeDecodeError, which is a subclass of ValueError, not OSError. That exception is not caught here; it propagates out of seed_from_memory_md entirely, through the run_in_executor lambda in main.py, and into the outer except Exception block - logging a generic "Memory seed migration failed" instead of continuing with the remaining files. The per-file isolation the code clearly intends breaks on a single malformed file. Fix: except (OSError, UnicodeDecodeError) or except Exception.


Suggestion

_is_duplicate() docstring claims search exceptions return False, but there is no guard
memory.py, _is_duplicate()

def _is_duplicate(content: str, *, user_id: str, threshold: float = 0.9) -> bool:
    results = search(content, user_id=user_id, limit=1)   # no try-except
    if not results:
        return False
    return results[0].score >= threshold

The docstring says failures return False ("including if the store is empty, memory is disabled, or the search itself failed"), but the code has no try-except. This is only correct if search() itself is exception-safe. That function is not in this diff. If it propagates (e.g., during a vector store outage), the exception escapes to seed_from_memory_md, where there is no per-entry guard around the _is_duplicate call - aborting that user's entire migration run. Either add a try-except here to match the docstring, or add a comment noting the exception-safety assumption on search().


No security issues. Logic is sound - the partial-failure retry design, per-user flag isolation, thread-executor offload, and dedup threshold are all correct.

@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

PR Review: Seed MEMORY.md into Mem0 (Round 4)

All three issues from the prior round are correctly addressed:

  • UnicodeDecodeError guard -- except (OSError, UnicodeDecodeError) in the per-file loop restores the intended per-file isolation.
  • _is_duplicate docstring/code mismatch -- the try/except with a log.warning and return False fallback now matches the documented contract.
  • Docstring accuracy -- the updated docstring correctly describes the exception-safe behavior.

No new critical issues or warnings found.


Suggestion

test_seed_runs_when_flag_absent and test_seed_flag_not_set_on_failure patch kai.sessions.get_setting globally, not per-method

tests/test_main.py, TestMemorySeedIntegration

patch("kai.sessions.get_setting", mock_get_setting),

mock_get_setting = AsyncMock(return_value=None) is applied for the whole with block, which means if sessions.get_setting is called for any reason other than the flag check inside the simulated block, the mock still intercepts it. This is harmless for the current tests since the block is small, but it's an implicit assumption. A comment noting that get_setting is only called once (for the flag key) would make the test intent clearer to a future author.


No security issues. The thread-executor offload, per-user flag isolation, partial-failure retry design, dedup threshold, and missing-directory guard are all correct. The 38-test suite is thorough and the make check green state is consistent with the diff. Ready to merge.

@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

PR Review: Seed MEMORY.md into Mem0 (Round 5)

All findings from prior rounds have been addressed. One new finding below.


Suggestion

test_is_idempotent_on_rerun is coupled to search() internals not visible in this diff

tests/test_memory.py, TestSeedFromMemoryMd.test_is_idempotent_on_rerun

def mock_search(query, **kwargs):
    for s in stored:
        if s["memory"] == query:
            return {"results": [{"id": s["id"], "memory": s["memory"], "score": 0.95, ...}]}
    return {"results": []}

_is_duplicate calls the module-level search(), which sits between _is_duplicate and _memory.search(). search() likely transforms Mem0's raw dict response into MemoryResult objects before _is_duplicate accesses results[0].score. The mock returns the raw Mem0 dict format; if search() expects a different wire format from _memory.search() than what mock_search returns, the transformation layer would fail to produce .score >= 0.9 objects and the idempotency test would silently stop testing duplicate detection. The other TestSeedFromMemoryMd tests avoid this by using mock_mem.search.return_value = {"results": []} (empty, so dedup always returns False), which is safe. This test is the only one that depends on a truthy score path through search(). A comment noting that the mock format must match what the real Mem0 client returns - so that search() can parse it - would prevent a future author from silently breaking this with a format change.


No new bugs, logic errors, or security issues. The UnicodeDecodeError guard, _is_duplicate exception safety, missing-directory guard, and heading detection are all correct. The thread-executor offload, per-user flag isolation, and partial-failure retry logic remain sound. Ready to merge.

@dcellison dcellison merged commit 59a481b into main Apr 16, 2026
1 check passed
@dcellison dcellison deleted the feature/seed-existing-memory branch April 16, 2026 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Seed existing MEMORY.md knowledge into Mem0

2 participants