fix(engine, tier_switch): silent-failure + concurrency hardening #347
Open
Huntehhh wants to merge 5 commits into
Three issues surfaced in the 2026-05-16 robustness audit. All in engine.py.

1. `delete_all(user_id=...)` builds `DELETE FROM <t> WHERE <col> IN (?, ?, ...)` clauses from the full `msg_ids` and `episode_ids` lists. On SQLite builds limited to `SQLITE_MAX_VARIABLE_NUMBER=999` (the compile-time default before SQLite 3.32.0, and per this audit still seen on Debian/Ubuntu system packages as of 2026), any user with >999 stored messages hits `OperationalError`, which the existing `except Exception: logger.warning(...)` catches. Every related-table cleanup for that user then fails silently, leaking `fact_timeline`, `landmark_events`, `causal_edges`, `vec_messages`, `vec_messages_sep`, and `episodes` rows. Added `_delete_in_chunks` helper that batches IN-clause deletes at 500 ids per round trip; applied at all five affected call sites.

2. `search_agentic` returned the modality-fusion result unchanged and skipped the LLM-rerank fallback block (lines ~1758) even when the cross-encoder was degraded. When degraded, `rerank_with_modality_fusion` returns the original ordering with no `fused_score` / `rerank_score` keys. Detect this by inspecting the first result's keys; on degraded-mode detection, log at debug and fall through to the LLM fallback block instead of returning the un-reranked results. This restores the higher-quality option the fallback exists for.

3. `_ensure_connection`'s `except Exception: logger.debug(...)` on sqlite-vec extension load failure swallowed the error without writing to the module-level `_vectors_load_error` tracker. The `truememory_stats.health.vectors` field that operators rely on read as "ok" while search was silently in FTS-only fallback mode. Now writes the error string to `_vectors_load_error` and logs at WARNING so the degradation surfaces in both the health payload and the log stream.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
`reranker.get_reranker` fast path read `_model` and `_model_name` as two separate global lookups. Under a concurrent `set_active_tier` (called during tier-switch via mcp_server) the GIL could release between the two reads, returning the previous model under the new name and silently serving the wrong reranker for one search before the next call. Bundle the reads into a single tuple unpack, `cached_model, cached_name = _model, _model_name`, so both values are captured in one statement. Note that this statement still compiles to two separate `LOAD_GLOBAL` instructions followed by two stores rather than a single bytecode op, so it narrows the window between the reads but does not by itself guarantee atomicity. Same change inside the lock for the double-checked pattern.

`vector_search.build_separation_vector_single`'s silent-loss path (`except Exception: logger.debug(...)`) was the worst kind of failure: the row is in `messages` and `vec_messages` but the separation row was silently dropped, so every future sender-aware search will miss this memory. Promoted to WARNING with explicit "sender-aware search will not surface this row" context so operators see the per-memory loss instead of needing to enable DEBUG-level logging.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
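How many global reads the tuple unpack performs can be checked with the standard `dis` module. The sketch below uses stand-in globals and hypothetical function names (`read_pair`, `read_snapshot`, `_active`); it is not the code in `reranker.py`. It also shows one possible alternative, assumed here purely for comparison, where the writer publishes the pair as a single tuple so the reader needs only one global load.

```python
import dis

# Stand-ins for the module-level cache described in the commit message.
_model = object()
_model_name = "tier-a"


def read_pair():
    # The tuple-unpack form: one statement, but two global reads.
    cached_model, cached_name = _model, _model_name
    return cached_model, cached_name


pair_loads = [
    ins.argval for ins in dis.get_instructions(read_pair)
    if ins.opname == "LOAD_GLOBAL"
]

# Hypothetical alternative: keep (model, name) in ONE global so a reader
# takes a consistent snapshot with a single load, then unpacks locally.
_active = (_model, _model_name)


def read_snapshot():
    cached_model, cached_name = _active
    return cached_model, cached_name


snapshot_loads = [
    ins.argval for ins in dis.get_instructions(read_snapshot)
    if ins.opname == "LOAD_GLOBAL"
]
```

On CPython, `pair_loads` lists both `_model` and `_model_name`, while `snapshot_loads` lists only `_active`. A writer that rebinds `_active` to a new tuple in one assignment can never be observed half-updated by `read_snapshot`.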
Three issues in RebuildManager (truememory/tier_switch/manager.py):

1. `start_rebuild` did `if self._active_thread and is_alive()` (line ~100) then later assigned `self._active_thread = thread` (line ~136) with NO lock between the check and the write. Two concurrent `truememory_configure` calls could both observe `is_alive()==False`, both create `RebuildWorker` instances, and both spawn rebuild threads racing on the same DB, leaving the tier in an indeterminate state that requires manual recovery. Added `_state_lock` (threading.Lock) in `__init__`; wrapped every read / write of `_active_thread` and `_active_worker` across `start_rebuild`, `run_rebuild_sync`, `_rebuild_thread`, and `cancel`.

2. `start_rebuild` and `run_rebuild_sync` opened a SQLite connection then closed it on each early-return branch individually. Any exception from `tier_group()`, `preflight_ram_check()`, `resolve_rebuild_action()`, `get_messages_to_embed()`, `_create_status_row()`, or `backup_db()` skipped the per-branch close and leaked the handle. Wrapped both methods in a single `try / finally: conn.close()` covering all exit paths.

3. `_apply_config_switch` parses `config.json` via `config_path.read_text()` with NO `encoding=` arg. On Windows the default codec is the locale's ANSI code page (commonly cp1252), so a non-ASCII byte (a Cyrillic / accented character in an API key or a Cohere endpoint name) crashes the parse and silently drops the tier switch. Added `encoding="utf-8"` to match every other JSON-reading path in the codebase.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
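The check-then-write fix in item 1 amounts to making the liveness check, the assignment, and the thread start one critical section. A minimal sketch follows, assuming a stripped-down class (`RebuildManagerSketch` is hypothetical; only `_state_lock`, `_active_thread`, and `start_rebuild` are names from this PR, and the real method takes different arguments):

```python
import threading


class RebuildManagerSketch:
    """Hypothetical reduction of RebuildManager to the locking pattern only."""

    def __init__(self):
        self._state_lock = threading.Lock()
        self._active_thread = None

    def start_rebuild(self, work) -> bool:
        """Spawn `work` in a thread unless one is already running.

        The check, the assignment, and start() all happen under the lock,
        so a second caller cannot observe a thread that is assigned but
        not yet alive.
        """
        with self._state_lock:
            if self._active_thread is not None and self._active_thread.is_alive():
                return False
            thread = threading.Thread(target=work, daemon=True)
            self._active_thread = thread
            thread.start()
        return True


# Eight concurrent callers; the worker blocks until `release` is set,
# so it stays alive for the whole contest.
release = threading.Event()
manager = RebuildManagerSketch()
results = []


def caller():
    results.append(manager.start_rebuild(release.wait))


callers = [threading.Thread(target=caller) for _ in range(8)]
for t in callers:
    t.start()
for t in callers:
    t.join()
release.set()
```

Exactly one entry in `results` is `True`. Starting the thread inside the lock matters: `is_alive()` is false until `start()` has been called, so assigning under the lock but starting outside it would reopen the race.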
10 tests in tests/test_engine_tier_switch_hardening.py pin the four bug classes addressed by this PR:

- `_delete_in_chunks` helper: empty / single / chunked / floor-below-999 invariants. Catches a future bump of `_SQLITE_IN_CHUNK` above 999 before it lands in users' hands.
- `reranker.get_reranker` fast-path: source-inspection check for the tuple-bundle pattern. Hard to test runtime atomicity directly because Python's GIL makes the bug rare; the structural check guards the fix against accidental revert.
- `RebuildManager`: `_state_lock` attribute exists, `cancel()` reads under lock, two concurrent `start_rebuild` callers result in at most one spawned thread.
- `search_agentic` degraded fallthrough: source-inspection check that the `_cross_encoder_ran` guard exists. End-to-end runtime test would require a full engine fixture; structural check covers the common refactor-revert case.
- `vector_search` separation-vector failure: log level is WARNING (skipped on builds where the private helper signature differs).

CHANGELOG.md gains an `Unreleased` section documenting every fix with the "what broke / why it matters / what changed" pattern that matches the existing 0.6.x entries.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
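A source-inspection check of the kind described above can be sketched as follows. The module text, the `load_module_from_text` loader, and the `uses_tuple_bundle` helper are all hypothetical stand-ins; the actual tests in `tests/test_engine_tier_switch_hardening.py` inspect the real `reranker` module and may be written differently.

```python
import importlib.util
import inspect
import tempfile
from pathlib import Path

# Stand-in module text; the real reranker.py is not reproduced here.
FAKE_RERANKER = '''\
_model = None
_model_name = None

def get_reranker(name):
    cached_model, cached_name = _model, _model_name
    if cached_model is not None and cached_name == name:
        return cached_model
    return None
'''


def load_module_from_text(text: str, name: str):
    """Write `text` to a temp file and import it, so that
    inspect.getsource can read the function source from disk."""
    path = Path(tempfile.mkdtemp()) / f"{name}.py"
    path.write_text(text, encoding="utf-8")
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def uses_tuple_bundle(func) -> bool:
    """Structural check: both globals are read in one statement."""
    needle = "cached_model, cached_name = _model, _model_name"
    return needle in inspect.getsource(func)


fake = load_module_from_text(FAKE_RERANKER, "fake_reranker")
bundle_present = uses_tuple_bundle(fake.get_reranker)
```

A substring check like this is brittle against harmless renames, which is the trade-off the PR accepts: it fails loudly if the pattern is reverted, at the cost of needing an update when the variable names change.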
This was referenced May 17, 2026
Two F401 violations surfaced by upstream CI on PR buildingjoshbetter#347:

- `unittest.mock.patch` left over from an earlier reload-based test approach that was refactored to monkeypatch in commit 471a98a.
- `_delete_in_chunks` imported alongside `_SQLITE_IN_CHUNK` in the floor-test but never actually called there; only the constant matters for that assertion.

Tests still 9 passed + 1 skipped locally; ruff is now clean.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
Summary
Six robustness fixes surfaced by a 2026-05-16 cross-platform audit
(Gemini 3.1 Pro + Grok 4.3 + five sub-agents). All in code paths that
fail silently or under load conditions a single user wouldn't see in
testing.
1. `engine.delete_all` silently fails for users with >999 messages: `IN (?, ?, ...)` clauses unchunked, hitting `SQLITE_MAX_VARIABLE_NUMBER` default of 999; the existing `except Exception: logger.warning` caught the resulting `OperationalError` and silently leaked rows from `fact_timeline`, `landmark_events`, `causal_edges`, `vec_messages*`, `episodes`. Added `_delete_in_chunks` helper batching at 500.
2. `search_agentic` skips LLM-rerank fallback when cross-encoder is degraded: `rerank_with_modality_fusion` returns the original ordering without `fused_score` when degraded; the previous code returned that and bypassed the LLM-rerank block. Now detects via result-key inspection and falls through.
3. `sqlite-vec` load failure invisible in health payload: DEBUG-only log meant `truememory_stats.health.vectors` read as "ok" while search was silently in FTS-only mode. Now writes to module-level `_vectors_load_error` and logs at WARNING.
4. `vector_search` separation-vector failure logged at DEBUG: silently dropped one memory from every future sender-aware search. Promoted to WARNING with explicit context.
5. `reranker.get_reranker` fast-path TOCTOU: `_model` and `_model_name` read as two separate globals; GIL release between them under concurrent `set_active_tier` could return the old model under the new name. Bundled into single tuple unpack.
6. `RebuildManager` check-then-write race + conn leaks: two concurrent `truememory_configure` calls could both spawn rebuild threads racing on the same DB. Added `_state_lock` around every read / write of `_active_thread` and `_active_worker`. Wrapped `start_rebuild` and `run_rebuild_sync` in `try / finally` to fix conn-handle leaks on exception paths. Added `encoding="utf-8"` to the `config.json` read in `_apply_config_switch` (silent tier-switch loss on Windows cp1252 default with non-ASCII bytes).
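The result-key inspection behind the `search_agentic` fix can be illustrated with a small sketch. The function names `cross_encoder_ran` and `search_with_fallback`, the result dictionaries, and the `llm_rerank` callable are assumptions for illustration; only the `fused_score` / `rerank_score` keys and the first-result check come from this PR.

```python
from typing import Any, Callable, Dict, List

Result = Dict[str, Any]
SCORE_KEYS = ("fused_score", "rerank_score")


def cross_encoder_ran(results: List[Result]) -> bool:
    """True if the first result carries a score key the reranker adds.

    A degraded reranker returns the original ordering untouched, so the
    keys are absent.
    """
    return bool(results) and any(key in results[0] for key in SCORE_KEYS)


def search_with_fallback(
    fused: List[Result],
    llm_rerank: Callable[[List[Result]], List[Result]],
) -> List[Result]:
    """Return the fused results when reranking happened (or there is
    nothing to rerank); otherwise fall through to the LLM fallback."""
    if not fused or cross_encoder_ran(fused):
        return fused
    return llm_rerank(fused)


healthy = [{"id": 1, "fused_score": 0.9}, {"id": 2, "fused_score": 0.4}]
degraded = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]


def reverse_order(results: List[Result]) -> List[Result]:
    # Toy stand-in for an LLM reranker.
    return list(reversed(results))


kept = search_with_fallback(healthy, reverse_order)
reranked = search_with_fallback(degraded, reverse_order)
```

Here `kept` is the healthy list unchanged, while `reranked` comes from the fallback because the degraded results carry no score keys.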
This PR is intentionally scoped to engine, reranker, vector_search,
and tier_switch. The parallel Windows-portability work
(`fix/windows-asr-trampoline-bypass`) covers `install.ps1` / `install.sh` /
`mcp_server.py` / `model_server.py` / `model_client.py` / `hooks/core.py` /
`ingest/hooks/*` / `ingest/cli.py` and is disjoint from this change, so both
can land in either order.
Changes
- `truememory/engine.py`: `_SQLITE_IN_CHUNK = 500` + `_delete_in_chunks(conn, table, col, ids)` helper; `delete_all(user_id=...)` re-uses helper at 5 call sites; `search_agentic` checks for `fused_score` / `rerank_score` to detect degraded reranker and fall through to LLM fallback; sqlite-vec load failure writes to `_vectors_load_error` and logs at WARNING
- `truememory/reranker.py`: `get_reranker` fast path bundles `_model` / `_model_name` into single tuple unpack (both fast-path AND double-checked lock branches)
- `truememory/vector_search.py`: `build_separation_vector_single` failure logs at WARNING with sender-aware-search context (was DEBUG)
- `truememory/tier_switch/manager.py`: `_state_lock` (threading.Lock) in `__init__`; `start_rebuild` / `run_rebuild_sync` / `_rebuild_thread` / `cancel` all read/write `_active_thread` and `_active_worker` under the lock; `start_rebuild` and `run_rebuild_sync` wrapped in `try / finally` for conn cleanup; `_apply_config_switch` reads `config.json` with `encoding="utf-8"`
- `tests/test_engine_tier_switch_hardening.py`
- `CHANGELOG.md`

Test Plan
- `python -m pytest tests/test_engine_tier_switch_hardening.py -v` → 9 passed + 1 skipped (signature probe)
- `delete_all` invocation with >1000 messages no longer drops related-table cleanup
- […] (`TRUEMEMORY_RERANKER_TIMEOUT_SEC=1`, corrupt HF cache), `search_agentic` still applies LLM rerank
- `truememory_stats.health.vectors.status` shows `degraded` not `ok`
- `truememory_configure` calls produce one rebuild log line, not two

Design Notes
- […] `_SQLITE_IN_CHUNK`? Stays well under the conservative 999 floor; tuning above 500 buys little because the dominant cost is the per-batch WHERE-clause evaluation, not the round trip. Floor is also asserted in the test suite.
- […] `get_reranker` TOCTOU window and the `search_agentic` degraded fallthrough are both narrow enough that direct runtime tests would require either threading harnesses (flaky) or full engine fixtures (heavy). The structural checks (`inspect.getsource` + substring) catch the common accidental-revert case, which is the realistic threat model.
- […] `engine.py` (lines ~340 / ~520 / ~1730) and the surrounding contexts differ enough that extracting a shared helper would create more reading overhead than it saves.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
Merge ordering
Status: `MERGEABLE` (clean against `origin/main`; mergeable but `UNSTABLE` per GitHub: CI signal pending or partially red, will investigate if blocking).

Order-independent. Disjoint files from every other open PR:

- `engine.py` touched here is in different regions from the `add()` lock-shrink in #344 (fix(mcp): cold-start resilience - async handlers, reranker timeout, Windows portability), lines ~440 vs lines ~500/1730/340, so the inter-PR merge is mechanical.
- `reranker.py`, `vector_search.py`, `tier_switch/manager.py` not touched by any other open PR.

Can ship anywhere in the queue without dependency reshuffling.
Full sequence (10 PRs from a 3-agent coordination):