fix: ingestion pipeline - ghost recursion, double-ingest, backlog races #315
Merged
claude -p extraction calls created full Claude Code sessions that triggered all hooks recursively, spawning infinite MCP servers and drainers. Each extraction call burned API quota. A single session close could spawn 18+ ghost sessions in 9 minutes. Fix: models.py sets TRUEMEMORY_EXTRACTION=1 in the claude -p env. All 4 hooks (stop, session_start, user_prompt_submit, compact) bail immediately on this var. MCP server drainer skips starting when the var is set.
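A minimal sketch of the recursion guard described above, assuming the flag is compared against the literal value "1". The helper names `extraction_env`, `in_extraction_session`, and `hook_main` are hypothetical and are not taken from the PR; only the variable name `TRUEMEMORY_EXTRACTION` comes from the commit message.

```python
import os

EXTRACTION_ENV_VAR = "TRUEMEMORY_EXTRACTION"  # name taken from the commit message


def extraction_env(base_env=None):
    """Return a copy of the environment with the extraction flag set.

    The result can be passed as env= to subprocess.run / subprocess.Popen
    when launching the extraction call.
    """
    env = dict(os.environ if base_env is None else base_env)
    env[EXTRACTION_ENV_VAR] = "1"
    return env


def in_extraction_session(env=None):
    """True when this process was started by an extraction call."""
    env = os.environ if env is None else env
    # Assumption: the flag counts as set only when it equals "1".
    return env.get(EXTRACTION_ENV_VAR) == "1"


def hook_main(env=None):
    """Hypothetical hook entry point: bail before doing any work."""
    if in_extraction_session(env):
        return "skipped"
    return "ran"
```

Because child processes inherit the environment, any hook or server started from inside the extraction session sees the same flag and stops at the first check, which is what breaks the loop.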
Stop, Compact, and UserPromptSubmit could all independently extract the same transcript. A long session could be extracted 2-3 times, each time processing the ENTIRE transcript from message 1 (no delta mode), wasting LLM API calls. Fix: per-session markers in ~/.truememory/extracted/<session_id> track the transcript file size at last extraction. Triggers skip extraction if the file hasn't grown by >1KB. The ingest CLI writes the authoritative marker on successful completion.
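The size-based idempotency check can be sketched roughly as follows. The function names match those mentioned elsewhere in this PR, but the signatures, the marker format (a plain integer byte count), and the `MIN_GROWTH_BYTES` constant are assumptions made for illustration.

```python
from pathlib import Path

MIN_GROWTH_BYTES = 1024  # assumed encoding of the ">1KB" threshold


def should_extract_session(extracted_dir: Path, session_id: str, transcript: Path) -> bool:
    """Return True if the transcript grew enough since the last extraction."""
    marker = extracted_dir / session_id
    try:
        last_size = int(marker.read_text().strip())
    except (FileNotFoundError, ValueError):
        # No marker yet, or an unreadable one: treat as never extracted.
        return True
    return transcript.stat().st_size - last_size > MIN_GROWTH_BYTES


def mark_session_extracted(extracted_dir: Path, session_id: str, transcript: Path) -> None:
    """Record the transcript size at the time of a successful extraction."""
    extracted_dir.mkdir(parents=True, exist_ok=True)
    (extracted_dir / session_id).write_text(str(transcript.stat().st_size))
```

This sketch is growth-only, matching the description above; the follow-up listed later in this thread also treats a shrinking transcript as changed, which the sketch does not do.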
Mid-session extraction violated the one-session-one-ingest rule and had 3 bugs: a global marker that blocked cross-session extraction, a missing marker that caused a fire-on-every-prompt loop, and a TOCTOU race. Extraction now happens only on Stop (session close) and Compact (context compression), both gated by per-session transcript-size idempotency.
marker_path.unlink() was outside the with spawn_gate() block in all 3 drain paths (session_start.py, mcp_server.py, cli.py). Two concurrent drainers could both read the same marker, both spawn ingest, then both delete it. Moving unlink inside the flock makes the read→spawn→delete sequence atomic.
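To illustrate why the unlink has to happen under the lock, here is a sketch of the read, spawn, delete sequence. The `spawn_gate` shown here is a stand-in built on `fcntl.flock` (Unix only); its signature, the `drain_one` helper, and the `spawn` callback are hypothetical and may differ from the project's code.

```python
import fcntl
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def spawn_gate(lock_path: Path):
    """Hold an exclusive flock on lock_path for the duration of the block."""
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def drain_one(marker_path: Path, lock_path: Path, spawn) -> bool:
    """Claim one backlog marker; return False if another drainer got it first."""
    with spawn_gate(lock_path):
        if not marker_path.exists():
            return False  # already claimed and deleted by a concurrent drainer
        payload = marker_path.read_text()
        spawn(payload)
        # The delete stays inside the lock, so no other drainer can read
        # the marker between the spawn and the unlink.
        marker_path.unlink()
    return True
```

With the unlink outside the `with` block, a second drainer could acquire the lock after the first released it but before the marker was deleted, and spawn a duplicate ingest.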
SessionStart and MCP drainer sent ingest stderr to /dev/null. If the ingest process crashed, there was no log, no trace. Now stderr goes to ~/.truememory/logs/<session_id>.log for diagnosability.
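A sketch of redirecting a child's stderr to a per-session log file instead of discarding it. The helper name `spawn_with_log` and the choice to append rather than overwrite are assumptions; the PR only states that stderr now goes to a log under ~/.truememory/logs/.

```python
import subprocess
from pathlib import Path


def spawn_with_log(cmd, log_path: Path) -> subprocess.Popen:
    """Start cmd in the background with stderr appended to log_path."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # The child receives its own copy of the file descriptor, so closing
    # the parent's handle after Popen returns does not cut off its output.
    with open(log_path, "ab") as log_file:
        return subprocess.Popen(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=log_file,
        )
```

Appending keeps earlier failures visible if the same session is retried, at the cost of the log growing across attempts.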
Log warnings and backlog reason strings reported the static SPAWN_CAP=2 even when the dynamic cap was 1 or 5. They now read the persisted cap state from _load_cap_state(), which gives accurate diagnostics without triggering subprocess calls in test environments.
should_extract_session and mark_session_extracted used an unsanitized session_id as a filesystem path component. Added _safe_session_id() (alphanumeric + dash/underscore, max 64 chars) for defense-in-depth. Hooks already sanitize, but the shared layer should too.
This was referenced May 14, 2026
1. Sanitize session_id in log file paths (session_start.py, mcp_server.py) using _safe_session_id(), the same as for EXTRACTED_DIR markers.
2. Return True in should_extract_session() when the transcript shrinks (file truncation/rotation), not just when it grows.
3. Remove the dead should_extract()/mark_extracted() functions and their constants from _shared.py; no callers remain after PR #315.
Summary
Fixes critical ingestion pipeline bugs that caused:
- claude -p extraction calls created infinite feedback loops, burning 56% of a Claude 20x Max subscription in <2 hours

Changes
- TRUEMEMORY_EXTRACTION=1 set in models.py for claude -p calls; all hooks + MCP drainer bail on it
- Per-session markers in ~/.truememory/extracted/ track file size at last extraction; skip if unchanged
- unlink() moved inside spawn_gate() in 3 drain paths
- Ingest stderr goes to ~/.truememory/logs/ instead of DEVNULL
- Diagnostics read the cap from _load_cap_state() instead of static SPAWN_CAP=2
- session_id sanitization (_safe_session_id()) in _shared.py

Test plan
- TRUEMEMORY_EXTRACTION checked in all hooks + MCP drainer
- should_extract_session checked in Stop + Compact
- interval=0 in compact.py
- _run_background_ingestion in user_prompt_submit.py
- marker_path.unlink at same indent as register_spawned_pid in 3 drain files