fix: ingestion pipeline — ghost recursion, double-ingest, backlog races #315

Merged: buildingjoshbetter merged 8 commits into main from fix/ingestion-pipeline-v3 on May 15, 2026

@buildingjoshbetter
Owner

Summary

Fixes critical ingestion pipeline bugs that caused:

  • Ghost session recursion: claude -p extraction calls created infinite feedback loops, burning 56% of a Claude 20x Max subscription in <2 hours
  • Double/triple ingestion: Same transcript extracted 2-3x by Stop + Compact + UserPromptSubmit triggers
  • Backlog race conditions: Three drainers could process the same marker simultaneously
  • Silent crash data loss: Stderr went to /dev/null in drain paths

Changes

  1. TRUEMEMORY_EXTRACTION env var — Set in models.py for claude -p calls; all hooks + MCP drainer bail on it
  2. Transcript-size idempotency — Per-session markers in ~/.truememory/extracted/ track file size at last extraction; skip if unchanged
  3. Disable UserPromptSubmit extraction — Violated one-session-one-ingest rule; had 3 bugs (global marker, missing marker loop, TOCTOU race)
  4. Marker deletion inside flock — Moved unlink() inside spawn_gate() in 3 drain paths
  5. Stderr to log files — Drain paths now log to ~/.truememory/logs/ instead of DEVNULL
  6. Dynamic cap in log messages — Use _load_cap_state() instead of static SPAWN_CAP=2
  7. Path traversal defense — Sanitize session_id + guard empty transcript_path in _shared.py

Test plan

  • Full pytest suite passes (611 passed, 0 failed)
  • Ruff lint passes (0 errors)
  • Spawn gate intact in all 5 Popen files
  • Ghost recursion: TRUEMEMORY_EXTRACTION checked in all hooks + MCP drainer
  • Idempotency: should_extract_session checked in Stop + Compact
  • No interval=0 in compact.py
  • No _run_background_ingestion in user_prompt_submit.py
  • marker_path.unlink at same indent as register_spawned_pid in 3 drain files
  • 4-model OpenRouter consensus: 3/4 APPROVE (Codex non-responsive)

claude -p extraction calls created full Claude Code sessions that triggered
all hooks recursively, spawning infinite MCP servers and drainers. Each
extraction call burned API quota. A single session close could spawn 18+
ghost sessions in 9 minutes.

Fix: models.py sets TRUEMEMORY_EXTRACTION=1 in the claude -p env. All 4
hooks (stop, session_start, user_prompt_submit, compact) bail immediately
on this var. MCP server drainer skips starting when the var is set.
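
The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code from models.py or the hooks: only the variable name TRUEMEMORY_EXTRACTION comes from the description, while `build_extraction_env`, `is_extraction_session`, and `run_hook` are hypothetical names.

```python
import os

# Variable name taken from the PR description.
EXTRACTION_ENV_VAR = "TRUEMEMORY_EXTRACTION"


def build_extraction_env(base_env=None):
    """Return a copy of the environment with the extraction flag set.

    Hypothetical helper: the caller would pass the result as `env=` when
    spawning the extraction subprocess.
    """
    env = dict(os.environ if base_env is None else base_env)
    env[EXTRACTION_ENV_VAR] = "1"
    return env


def is_extraction_session(env=None):
    """True when the current process is a child of an extraction call."""
    env = os.environ if env is None else env
    return env.get(EXTRACTION_ENV_VAR) == "1"


def run_hook(env=None):
    """Hypothetical hook entry point: bail before doing any work."""
    if is_extraction_session(env):
        return "skipped"
    return "ran"
```

A caller would presumably pass `env=build_extraction_env()` to the `subprocess` call that launches `claude -p`, so every hook fired inside that child session sees the flag and exits early.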

Stop, Compact, and UserPromptSubmit could all independently extract the
same transcript. A long session could be extracted 2-3 times, each time
processing the ENTIRE transcript from message 1 (no delta mode), wasting
LLM API calls.

Fix: per-session markers in ~/.truememory/extracted/<session_id> track
the transcript file size at last extraction. Triggers skip extraction if
the file hasn't grown by >1KB. The ingest CLI writes the authoritative
marker on successful completion.
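
A rough sketch of the size-based check, assuming the marker file stores the transcript size in bytes as plain text. The function names come from the description, but the signatures, the marker format, and the `GROWTH_THRESHOLD_BYTES` constant are assumptions. The sketch also treats a shrunken transcript as changed and expects `session_id` to be sanitized by the caller.

```python
from pathlib import Path

# ">1KB" per the description; the exact constant is an assumption.
GROWTH_THRESHOLD_BYTES = 1024


def should_extract_session(extracted_dir, session_id, transcript_path) -> bool:
    """Decide whether a transcript changed enough to extract again."""
    transcript = Path(transcript_path)
    if not transcript.is_file():
        return False  # also covers an empty transcript_path
    current = transcript.stat().st_size
    marker = Path(extracted_dir) / session_id
    try:
        last = int(marker.read_text().strip())
    except (OSError, ValueError):
        return True  # never extracted, or the marker is unreadable
    if current < last:
        return True  # file was truncated or rotated
    return current - last > GROWTH_THRESHOLD_BYTES


def mark_session_extracted(extracted_dir, session_id, transcript_path) -> None:
    """Record the transcript size after a successful extraction."""
    extracted = Path(extracted_dir)
    extracted.mkdir(parents=True, exist_ok=True)
    size = Path(transcript_path).stat().st_size
    (extracted / session_id).write_text(str(size))
```

Writing the marker only after a successful ingest means a crashed extraction is retried on the next trigger instead of being silently skipped.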

Mid-session extraction violated one-session-one-ingest rule and had 3
bugs: global marker blocking cross-session extraction, missing marker
causing fire-on-every-prompt loop, and TOCTOU race. Extraction now
happens only on Stop (session close) and Compact (context compression),
both gated by per-session transcript-size idempotency.

marker_path.unlink() was outside the with spawn_gate() block in all 3
drain paths (session_start.py, mcp_server.py, cli.py). Two concurrent
drainers could both read the same marker, both spawn ingest, then both
delete it. Moving unlink inside the flock makes the read→spawn→delete
sequence atomic.
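
To illustrate why the unlink has to sit inside the lock, here is a minimal sketch using `fcntl.flock` (Unix only). The name `spawn_gate` comes from the description, but this implementation, the `drain_marker` helper, and the `spawn` callback are assumptions made for illustration.

```python
import fcntl
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def spawn_gate(lock_path):
    """Hold an exclusive advisory lock for the duration of the block."""
    lock_file = Path(lock_path)
    lock_file.parent.mkdir(parents=True, exist_ok=True)
    with open(lock_file, "w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks until no other holder
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


def drain_marker(marker_path, lock_path, spawn):
    """Claim one backlog marker; return True if this caller spawned ingest."""
    marker = Path(marker_path)
    with spawn_gate(lock_path):
        try:
            payload = marker.read_text()
        except FileNotFoundError:
            return False  # another drainer already claimed this marker
        spawn(payload)
        # Deleting while still holding the lock keeps read, spawn, and
        # delete in one critical section.
        marker.unlink()
    return True
```

With the unlink outside the `with` block, a second drainer could acquire the lock and read the marker before the first one deleted it, which matches the double-spawn described above.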

SessionStart and MCP drainer sent ingest stderr to /dev/null. If the
ingest process crashed, there was no log, no trace. Now stderr goes to
~/.truememory/logs/<session_id>.log for diagnosability.
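
One way to redirect a child's stderr to a per-session log is sketched below. The helper name `spawn_with_log` and its parameters are hypothetical, and `session_id` is assumed to be sanitized before it is used in the file name.

```python
import subprocess
from pathlib import Path


def spawn_with_log(cmd, log_dir, session_id):
    """Start `cmd` with stderr appended to <log_dir>/<session_id>.log."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    log_path = log_dir / f"{session_id}.log"
    # The child inherits its own copy of the file descriptor, so the
    # parent can close its handle as soon as Popen returns.
    with open(log_path, "ab") as log_file:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=log_file,
        )
    return proc, log_path
```

Opening the log in append mode keeps earlier crash output when the same session is retried.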

Log warnings and backlog reason strings reported static SPAWN_CAP=2
even when the dynamic cap was 1 or 5. Now reads the persisted cap state
from _load_cap_state() for accurate diagnostics without triggering
subprocess calls in test environments.

should_extract_session and mark_session_extracted used unsanitized
session_id as filesystem path component. Added _safe_session_id()
(alphanumeric + dash/underscore, max 64 chars) for defense-in-depth.
Hooks already sanitize, but the shared layer should too.

1. Sanitize session_id in log file paths (session_start.py, mcp_server.py)
   using _safe_session_id() — same as EXTRACTED_DIR markers.
2. Return True in should_extract_session() when transcript shrinks
   (file truncation/rotation), not just when it grows.
3. Remove dead should_extract()/mark_extracted() functions and their
   constants from _shared.py — no callers remain after PR #315.

@buildingjoshbetter buildingjoshbetter merged commit d9a932f into main May 15, 2026
14 checks passed
@buildingjoshbetter buildingjoshbetter deleted the fix/ingestion-pipeline-v3 branch May 21, 2026 21:05