
Ingest process loads 1.5GB model for prediction error when it could be skipped or remote #337

@buildingjoshbetter

Problem

The ingest pipeline runs in a separate subprocess, spawned from stop.py:343:

cmd = [sys.executable, "-m", "truememory.ingest.cli", "ingest", transcript_path]
subprocess.Popen(cmd, start_new_session=True, ...)

Inside this subprocess, the EncodingGate._compute_prediction_error() method loads the full Qwen3-Embedding-0.6B model (1.5GB) just to compute 4 embeddings per fact:

# truememory/ingest/encoding_gate.py:417-420
if self._embed_model is None:
    from truememory.vector_search import get_model
    self._embed_model = get_model()  # Loads 1.5GB Qwen3 into this subprocess

With SPAWN_CAP=2, up to two ingest processes can run concurrently, and each loads its own 1.5GB copy of the model. That is up to 3GB of weights duplicating what the running MCP server processes already hold in memory.

What prediction error does

Prediction error (PE) is Signal 3 in the encoding gate (weight 0.30). It embeds (fact, nearest_memory) as a pair and compares the result to a (memory, memory) self-pair, to detect when a new fact contradicts or updates existing knowledge.

  • Gate AUC WITH PE: 0.816
  • Gate AUC WITHOUT PE (novelty + salience only): ~0.75
  • PE uses the embedding model for 4 model.encode() calls per fact
  • Novelty (Signal 1) uses gzip compression — no model needed
  • Salience (Signal 2) uses rule-based scoring — no model needed
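
To illustrate why Signal 1 needs no model, here is a minimal sketch of a compression-based novelty score. The issue does not give the exact formula, so this uses normalized compression distance as an assumed stand-in; `gzip_novelty` and `_clen` are hypothetical names, not functions from truememory.

```python
import gzip


def _clen(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 text."""
    return len(gzip.compress(text.encode("utf-8")))


def gzip_novelty(fact: str, memory: str) -> float:
    """Normalized compression distance between fact and memory, clamped to [0, 1].

    Hypothetical helper (assumed formula): a low value means the fact
    compresses well alongside the memory, so it adds little new information.
    """
    c_fact, c_mem = _clen(fact), _clen(memory)
    c_joint = _clen(memory + "\n" + fact)
    ncd = (c_joint - min(c_fact, c_mem)) / max(c_fact, c_mem)
    return max(0.0, min(1.0, ncd))
```

A fact that repeats an existing memory should score lower than an unrelated one, using only the standard library and no model weights.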

Proposed solutions (pick one)

Option A: Skip PE in ingest (simplest, 5 lines)

Add env var TRUEMEMORY_GATE_NO_PE=1 that the stop hook sets when spawning ingest:

# encoding_gate.py:_compute_prediction_error
import os

def _compute_prediction_error(self, fact: str) -> float:
    # Set by the stop hook for ingest subprocesses; the model is never loaded.
    if os.environ.get("TRUEMEMORY_GATE_NO_PE", "") == "1":
        return 0.0
    # ... existing PE logic ...

# stop.py:_run_background_ingestion (add to env)
env = os.environ.copy()  # copy so the parent's environment is untouched
env["TRUEMEMORY_GATE_NO_PE"] = "1"
subprocess.Popen(cmd, env=env, ...)

Savings: 1.5 GB per ingest process (model never loads)
Cost: Gate AUC drops from 0.816 to ~0.75 (still effective for cold-path filtering)
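
One detail to check when implementing Option A: if the gate combines its signals as a weighted sum (an assumption; the issue only states PE's weight of 0.30), returning 0.0 for PE caps the total score at 0.70 unless the remaining weights are renormalized or the threshold is lowered. The sketch below shows the renormalization. The novelty and salience weights are hypothetical placeholders, and `gate_score` and `pe_enabled_from_env` are illustrative names, not truememory APIs.

```python
import os

# Only the PE weight (0.30) comes from the issue. The other two weights are
# hypothetical placeholders chosen so that the weights sum to 1.
WEIGHTS = {"novelty": 0.40, "salience": 0.30, "prediction_error": 0.30}


def pe_enabled_from_env(env=os.environ) -> bool:
    """PE stays enabled unless TRUEMEMORY_GATE_NO_PE is exactly "1"."""
    return env.get("TRUEMEMORY_GATE_NO_PE", "") != "1"


def gate_score(signals: dict, pe_enabled: bool) -> float:
    """Weighted mean over the active signals only."""
    active = {
        name: weight
        for name, weight in WEIGHTS.items()
        if pe_enabled or name != "prediction_error"
    }
    total = sum(active.values())
    return sum(weight * signals[name] for name, weight in active.items()) / total
```

With PE disabled, a fact that maxes out novelty and salience still reaches a score of 1.0 instead of stalling at 0.70, so an existing threshold keeps its meaning.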

Option B: Ingest calls running MCP server for embeddings

The MCP server is already alive with the model loaded. Instead of loading a second copy, the ingest process could call back to the MCP server for embeddings.

Challenge: MCP uses stdio transport (JSON-RPC over stdin/stdout). The ingest process can't easily call it. Would need:

  1. A sidecar UDS endpoint on the MCP server (part of the #335 model server proposal, "Shared embedding server: single process holds models, clients connect via UDS")
  2. OR a lightweight HTTP endpoint on the MCP server

This is the "right" solution but depends on #335 (shared model server).
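
As a rough illustration of the sidecar idea, the sketch below serves embeddings over a Unix domain socket using one JSON object per line. Everything here is an assumption for illustration: the wire format, the names `EmbedServer`, `EmbedHandler`, and `remote_embed`, and the `fake_embed` stand-in for the real `model.encode()`. The actual protocol would be defined by #335.

```python
import json
import socket
import socketserver


def fake_embed(texts):
    """Stand-in for the real model.encode(); returns one tiny vector per text."""
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]


class EmbedHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One JSON request per line: {"texts": [...]}
        for line in self.rfile:
            request = json.loads(line)
            response = {"embeddings": self.server.embed_fn(request["texts"])}
            self.wfile.write(json.dumps(response).encode("utf-8") + b"\n")
            self.wfile.flush()


class EmbedServer(socketserver.ThreadingMixIn, socketserver.UnixStreamServer):
    """Holds the (already loaded) embedding function and serves it over UDS."""

    daemon_threads = True

    def __init__(self, path, embed_fn):
        super().__init__(path, EmbedHandler)
        self.embed_fn = embed_fn


def remote_embed(path, texts, timeout=5.0):
    """Client side: send texts to the server and return the embeddings."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(path)
        sock.sendall(json.dumps({"texts": texts}).encode("utf-8") + b"\n")
        with sock.makefile("rb") as reader:
            return json.loads(reader.readline())["embeddings"]
```

In this shape the ingest process would call something like `remote_embed` in place of `self._embed_model.encode(...)`, and could fall back to skipping PE if the socket is missing or the call times out.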

Option C: Use a lighter model for PE in ingest only

Load Model2Vec (30MB) instead of Qwen3 (1.5GB) for PE scoring in ingest. PE accuracy degrades slightly but the signal is still useful.

# encoding_gate.py — ingest-specific lighter model
if os.environ.get("TRUEMEMORY_INGEST_LIGHT_PE", ""):
    from model2vec import StaticModel
    self._embed_model = StaticModel.from_pretrained("minishlab/potion-base-8M")

Savings: 1.47 GB per ingest process (30MB vs 1500MB)
Cost: PE signal slightly less accurate with simpler embeddings
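
A slightly fuller sketch of the selection logic follows, under two assumptions: the flag should be an explicit opt-in (the truthiness check above would also treat `TRUEMEMORY_INGEST_LIGHT_PE=0` as enabled), and both backends expose an `encode()` method that accepts a list of strings. `env_flag` and `load_pe_model` are hypothetical helper names.

```python
import os


def env_flag(name: str, env=os.environ) -> bool:
    """True only for explicit opt-in values, so NAME=0 does not enable the flag."""
    return env.get(name, "").strip().lower() in {"1", "true", "yes"}


def load_pe_model(env=os.environ):
    """Return an object with .encode(list_of_str) for PE scoring.

    Imports are deferred so the heavy dependency is only loaded on the
    path that actually needs it.
    """
    if env_flag("TRUEMEMORY_INGEST_LIGHT_PE", env):
        from model2vec import StaticModel  # small static-embedding model

        return StaticModel.from_pretrained("minishlab/potion-base-8M")
    from truememory.vector_search import get_model  # full Qwen3 model

    return get_model()
```

Because the two models produce embeddings in different vector spaces, PE computed this way should compare only vectors produced by the same model within the ingest process, never against vectors stored by the MCP server.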

Recommendation

Option A for now (zero-model ingest). The AUC drop of roughly 0.066 (0.816 to ~0.75) is acceptable for a cold-path filter, because facts that slip through will still be deduplicated. When #335 (model server) lands, switch to Option B.

Relates to
