fix(core): self-heal corrupt FastEmbed model cache by phernandez · Pull Request #900 · basicmachines-co/basic-memory

phernandez · 2026-06-07T19:04:13Z

Summary

A new user's first semantic search can fail with [ONNXRuntimeError] : 3 : NO_SUCHFILE : Load model from .../fastembed_cache/models--qdrant--bge-small-en-v1.5-onnx-q/snapshots/<rev>/model_optimized.onnx failed. File doesn't exist.

Root cause: an interrupted/incomplete FastEmbed model download leaves the HuggingFace snapshot dir present but missing model_optimized.onnx. The raw ONNXRuntimeError then propagates, the user sees a generic "Search Failed" message with no hint to clear the cache, and the broken snapshot is self-perpetuating until the cache is cleared by hand. This is upstream FastEmbed partial-download behavior; basic-memory previously had zero handling.

What changed

src/basic_memory/repository/fastembed_provider.py
- Extracted model construction into _create_model().
- _load_model() now wraps construction: on a load failure that looks like a missing/corrupt artifact (NO_SUCHFILE / model_optimized.onnx / "file doesn't exist" / "no such file"), it deletes only this model's own models--<org>--<repo> cache subtree and retries the load exactly once to force a fresh download. A second failure fails fast with the original error. A non-cache error (e.g. a config error) is never treated as corruption.
- Deletion is scoped by resolving the model's HuggingFace source from FastEmbed's own model description (TextEmbedding._list_supported_models()), so it never touches the whole cache or unrelated models. If no model subdir exists or no cache_dir is configured, it fails fast instead of looping.
src/basic_memory/mcp/tools/search.py
- _format_search_error_response gains a dedicated branch for ONNX/model-load errors. It explains the model is missing/corrupt, names the resolved fastembed cache dir (default ~/.basic-memory/fastembed_cache) to delete so it re-downloads, and offers search_type="text" as an immediate workaround.

Testing

All commands run from the repo root via uv.

uv run ruff check <changed files> → All checks passed
uv run ruff format --check <changed files> → 4 files already formatted
uv run ty check src tests test-int → All checks passed
uv run pytest tests/repository/test_fastembed_provider.py tests/mcp/test_tool_search.py -q → 63 passed

New tests:

tests/repository/test_fastembed_provider.py
- test_fastembed_provider_self_heals_corrupt_model_cache — first construction raises NO_SUCHFILE, provider purges exactly the model's subdir (leaves an unrelated models--other--keep-me dir untouched) and retries to success.
- test_fastembed_provider_fails_fast_on_persistent_corrupt_cache — both constructions fail; asserts exactly one retry (two attempts) then fail fast, no infinite loop.
- test_fastembed_provider_does_not_purge_on_unrelated_error — a ValueError config error propagates with no deletion and no retry.
- test_fastembed_provider_fails_fast_when_no_cache_subdir_to_purge / ..._without_cache_dir — nothing to purge means no retry.
tests/mcp/test_tool_search.py::TestSearchErrorFormatting::test_format_search_error_corrupt_embedding_model — asserts the message names the resolved cache dir and suggests search_type="text".

Risk / validation

Deletion is tightly scoped to the model's own models--<org>--<repo> subtree, resolved from FastEmbed's model description, and gated behind a missing-artifact error match — never the whole cache, never unrelated data.
Retry is capped at exactly once; a persistent failure fails fast with the original (now self-explanatory) error.
These tests use a stub TextEmbedding (the existing sys.modules['fastembed'] monkeypatch pattern) because a real interrupted download is non-deterministic and offline-unfriendly; this is the justified-mock case for the clean-and-retry path.

Closes #895

🤖 Generated with Claude Code

An interrupted FastEmbed model download leaves the HuggingFace snapshot dir present but missing model_optimized.onnx. The ONNX runtime then raises NO_SUCHFILE on every load, and the failure is self-perpetuating until the cache is cleared by hand. Search surfaced only the generic 'Search Failed' message with no hint. FastEmbedEmbeddingProvider now detects a missing/corrupt-artifact load failure, deletes only this model's own models--<org>--<repo> cache subtree (resolved from FastEmbed's model description), and retries the load exactly once to force a fresh download. A second failure fails fast with the original error. The search error formatter gains an ONNX/model-load branch that names the resolved fastembed cache dir to delete and offers search_type="text" as an immediate workaround. Closes #895 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>

github-actions · 2026-06-07T19:04:31Z

Claude finished @phernandez's task in 1m 38s —— View job

Code Review: `fix(core): self-heal corrupt FastEmbed model cache`

Summary

This is a well-motivated fix with clean architecture and solid test coverage. The self-heal logic correctly scopes deletion to only the affected model subdir, the retry is capped at once, and the error classifier guards against treating config/network errors as cache corruption. A few items worth addressing before merge:

Issues

1. `shutil.rmtree(ignore_errors=True)` sets `removed = True` even when deletion silently fails

fastembed_provider.py:139

if subdir.exists():
    shutil.rmtree(subdir, ignore_errors=True)
    removed = True  # ← set unconditionally, even if rmtree silently failed

ignore_errors=True suppresses all OS errors (permission denied, locked files on Windows, etc.). If the deletion silently fails, the subdir still exists and the retry will hit the same broken ONNX artifact — but the code returns True, triggering the retry anyway, so the second attempt fails with the same error. The original exception is then silently discarded.

Safer to verify deletion actually happened:

shutil.rmtree(subdir, ignore_errors=True)
if not subdir.exists():
    removed = True

Fix this →

2. `"load model"` error marker in `search.py` is too broad

search.py:95

or "load model" in error_lower

This phrase will match generic errors from any ML framework, database layer, or even log messages. A user error like "could not load model configuration" from an unrelated component would incorrectly route to the ONNX cache-corruption response. The other three markers ("onnxruntime", "no_suchfile", "model_optimized.onnx") are already specific enough to cover the ONNX scenario without this one.

Recommend removing "load model" from the condition or tightening it to "load model from" (which is the exact phrasing from the ONNX error).

Fix this →

Minor Notes

3. Private FastEmbed API: `_list_supported_models()`

fastembed_provider.py:110

for description in TextEmbedding._list_supported_models():

This uses a single-underscore (module-private) method on FastEmbed's TextEmbedding. If FastEmbed ever renames or removes it, the entire self-heal path silently returns no subdirs and the recovery fails with a confusing error. The existing # pragma: no cover on the exception handler at line 115 acknowledges this is defensive code. Consider adding a comment that this is an intentional use of an undocumented API with a known failure mode (already covered by the except Exception fallback).

4. `_resolve_cache_dir` is a private symbol imported across layers

search.py:99

from basic_memory.repository.embedding_provider_factory import _resolve_cache_dir

The leading underscore signals this is an internal detail of the factory module. The comment explains the deferral is intentional, but importing a private symbol from the repository layer into the MCP tools layer creates an unexpected coupling. If this function gets moved or renamed during future refactors, the breakage will be silent. A public re-export from a shared config helper would be more robust long-term, though this is low priority since _resolve_cache_dir itself is trivially simple and unlikely to move.

Team Checklist

Area	Result
Python 3.12+ type annotations / async patterns	✅
Line length ≤ 100 chars	✅
No commented-out code	✅
Unit tests for all new functions	✅
Edge cases covered (no cache dir, no subdir found, persistent failure, unrelated error)	✅
100% coverage maintained (new code fully exercised)	✅
No hardcoded secrets / no sensitive data in logs	✅
No emojis/unicode in console/log output	✅
Windows-compatible path handling (uses `pathlib.Path`)	✅
`shutil.rmtree` correctness with `ignore_errors=True`	⚠️ see issue #1
Error marker specificity in `search.py`	⚠️ see issue #2
Dependency on private FastEmbed API `_list_supported_models`	ℹ️ noted, defended by fallback
Cross-layer private import `_resolve_cache_dir`	ℹ️ low priority, noted

The fix as a whole is sound — the scoped deletion, single-retry cap, and error classifier are all correct. Issues #1 and #2 are the ones worth fixing before merge.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 648172439e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-07T19:08:55Z

+            from fastembed import TextEmbedding
+
+            for description in TextEmbedding._list_supported_models():
+                if description.model == resolved_model_name:


Match FastEmbed model names case-insensitively

FastEmbed accepts supported model names case-insensitively, so a valid config such as semantic_embedding_model="baai/bge-small-en-v1.5" still loads, but this exact-case comparison fails to resolve the HF source and _purge_corrupt_model_cache() returns false. In that scenario a corrupt cache for an otherwise valid model is never deleted or retried, so the new self-heal path remains disabled for lower-/mixed-case model names; compare normalized names here as FastEmbed does.

Useful? React with 👍 / 👎.

… path The self-heal introduced in this PR misclassified cold/transient model load failures as cache corruption, purged the in-progress download, and left the model unloadable for the whole session — breaking ~20 unrelated search integration tests in CI (cold model cache). Root cause: purge was driven purely by exception-text matching plus a bare subdir.exists() check, so a normal not-yet-downloaded model (or a transient 'from any source' download error) could trigger a destructive purge. Fixes: - Gate the purge on positive filesystem confirmation: the HF snapshot dir exists but the model artifact file (model_optimized.onnx) is missing from every snapshot. A cold load (no snapshot dir) is never treated as corrupt and never purges; the original error re-raises unchanged. - After a purge+retry that still fails, re-raise the ORIGINAL exception. - (Codex P2) Resolve the HF cache subdir case-insensitively, mirroring FastEmbed, so model='baai/bge-small-en-v1.5' still resolves correctly. - (claude-review #1) _purge_model_subdirs sets removed=True only when the subdir actually disappears after rmtree (Windows locked-file no-op safety). - (claude-review #2) Tighten the search error formatter ONNX marker from the broad 'load model' to the exact 'load model from' phrasing. - (minor) Document that TextEmbedding._list_supported_models() is an intentional undocumented-API use with a defensive fallback. Verified locally with the real model: a cold load downloads cleanly with no purge; a genuinely corrupt cache (snapshot present, artifact deleted) is detected, purged, and re-downloaded. Added regression tests for cold load, artifact-present non-purge, case-insensitive resolution, and locked-file rmtree no-op. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1d90cccf92

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-07T20:25:42Z

+            # The artifact lives at snapshots/<rev>/<model_file>; an interrupted download
+            # leaves the snapshot tree but no artifact. rglob covers any revision dir.
+            artifact_present = any(snapshot_dir.rglob(model_file))
+            if not artifact_present:


Check the failed snapshot, not any cached artifact

When HuggingFace keeps multiple revisions for the same FastEmbed repo, a corrupt current snapshot can coexist with an older complete snapshot. Because this searches the entire models--... tree, any older model_optimized.onnx makes artifact_present true, so the self-heal path skips purging even though ONNX just failed to load snapshots/<current-rev>/model_optimized.onnx. In that mixed-cache state the broken snapshot remains self-perpetuating until the user manually deletes the cache; use the failed path from the exception or inspect each snapshot individually instead of treating any matching file under the model root as proof the cache is healthy.

Useful? React with 👍 / 👎.

github-actions · 2026-06-07T20:26:08Z

Claude finished @phernandez's task in 1m 55s —— View job

Code Review: `fix(core): self-heal corrupt FastEmbed model cache`

Gather context and read CLAUDE.md
Read changed files (fastembed_provider.py, search.py, tests)
Review against team checklist
Post final review

Summary

The second commit (1d90ccc) directly addresses all critical issues raised in the prior Claude and Codex reviews. This is ready to merge.

Issues from previous review — all resolved

Issue #1: shutil.rmtree(ignore_errors=True) with unconditional removed = True ✅ Fixed

fastembed_provider.py:189–193 now correctly gates the success flag on a post-deletion existence check:

shutil.rmtree(subdir, ignore_errors=True)
# Set removed only when the subdir is truly gone — a silent rmtree no-op
# (e.g. a locked file on Windows) must not be reported as a successful purge.
if not subdir.exists():
    removed_any = True

The new test test_fastembed_provider_fails_fast_when_purge_silently_noops validates the Windows locked-file scenario with a monkeypatched no-op rmtree.

Issue #2: "load model" error marker too broad ✅ Fixed

Changed to "load model from" — the exact ONNX error phrasing — with an explanatory comment at search.py:91–93. The new test test_format_search_error_load_model_phrase_does_not_overmatch verifies that "Failed to load model configuration for this project" no longer routes to the embedding-corrupt branch.

Codex P2: Case-insensitive model name matching ✅ Fixed

_model_cache_candidates() at line 123 now lowercases both sides before comparing:

resolved_model_name = self._resolved_model_name().lower()
...
if description.model.lower() != resolved_model_name:

New test test_fastembed_provider_self_heals_with_case_insensitive_model_name covers the model="baai/bge-small-en-v1.5" config case.

Codex P2: Multiple snapshot revisions / check specific snapshot ✅ Intentionally addressed with conservative approach

The code uses rglob across the entire model subtree (any(snapshot_dir.rglob(model_file))). If any revision contains the artifact the cache is considered healthy and no purge is attempted. The Codex concern (a corrupt new snapshot coexisting with an older valid one) is real, but the PR consciously chooses the conservative path: false negatives are preferred over false positives — the title of the second commit says exactly this. The self-heal simply won't fire in that edge case; the user falls back to the manual fix message in search.py. The new test test_fastembed_provider_does_not_purge_when_artifact_present validates this behavior explicitly. This is a defensible trade-off.

Team Checklist

Area	Result
Python 3.12+ type annotations and async patterns	✅
100-character line length	✅
No commented-out code	✅
Literate Programming: section headers, decision-point comments, constraint comments	✅ (`_corrupt_model_subdirs` has exemplary trigger/why/outcome comments)
Unit tests for all new functions	✅
Edge cases covered (cold load, no cache dir, locked file on Windows, unrelated error, case-insensitive name, artifact present)	✅
100% coverage maintained (new code fully exercised)	✅
No hardcoded secrets / no sensitive data in logs	✅
No emojis/unicode in console/log output	✅
Windows-compatible path handling (uses `pathlib.Path`)	✅
`shutil.rmtree` correctness	✅ fixed
Error marker specificity in `search.py`	✅ fixed
Case-insensitive model name resolution	✅ fixed
Cold-load not misread as corruption	✅ `_corrupt_model_subdirs` skips if subdir absent
Private FastEmbed API `_list_supported_models()`	ℹ️ intentional, documented, defended by `except Exception` fallback
Cross-layer private import `_resolve_cache_dir`	ℹ️ low priority, unchanged from prior review, comment explains rationale

Minor Note

search.py:102 still imports _resolve_cache_dir from the repository layer (private symbol, noted in the prior review as low priority). No action needed now, but worth making public if embedding_provider_factory is ever refactored.

This PR is in excellent shape. All critical issues from both the Claude and Codex reviews are addressed, the conservative self-heal design is sound, and the test coverage is thorough. ✔️ Approved.

_corrupt_model_subdirs scanned the whole models--<repo> tree with rglob, so an older complete snapshot masked a corrupt current revision and self-heal skipped the purge — leaving the broken snapshot self-perpetuating. Inspect each snapshot revision individually and flag the model when any revision is missing the artifact (Codex review on #900). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>

github-actions · 2026-06-07T20:59:45Z

Claude finished @phernandez's task in 1m 49s —— View job

Code Review: `fix(core): self-heal corrupt FastEmbed model cache`

Gather context and read CLAUDE.md
Read changed files (all three commits)
Review against team checklist
Post final review

Summary

The third commit (8bde497) directly addresses the Codex P2 concern about per-revision snapshot detection. All prior issues from the previous two Claude and Codex review passes are resolved. This PR is ready to merge.

Third Commit Analysis: `8bde497 fix(core): detect corrupt FastEmbed snapshot per-revision`

The Codex P2 concern was: a corrupt current snapshot can coexist with an older complete one, and the prior rglob across the whole model tree would find the old artifact and wrongly skip the purge.

The fix is correct. _corrupt_model_subdirs now inspects each revision directory individually:

if any(not any(rev.rglob(model_file)) for rev in revision_dirs):
    corrupt.append(model_dir)

This flags the model when any revision is missing the artifact. The trade-off (purging the whole models--<repo> tree including good revisions when the current one is corrupt) is correct and intentional — the docstring explains it:

"flag the model dir so the whole tree re-downloads cleanly"

The new test test_fastembed_provider_self_heals_when_current_revision_corrupt validates this exactly — a good rev_old with the artifact, a corrupt rev_current without it, and the provider correctly detects and purges the whole tree.

Minor Observation (no action needed)

_model_cache_candidates (lines 102–150) has a small internal naming inconsistency: the local variable is called snapshot_dir but the returned Path is actually the model root (models--<org>--<repo>), not the snapshot subdirectory. The callers use the correct name model_dir when unpacking. No behavioral issue, and the docstring heading (Resolve (snapshot_dir, model_file) pairs) is slightly stale from this rename. Low priority — worth a one-word cleanup if the file is touched again.

# _model_cache_candidates line 138 — variable name is misleading
snapshot_dir = cache_root / f"models--{hf_source.replace('/', '--')}"
#  ^^^^^^^^^^^^ this is the model root, not a snapshot dir

Team Checklist

Area	Result
Python 3.12+ type annotations / async patterns	✅
Line length ≤ 100 chars	✅
No commented-out code	✅
Literate Programming: section headers, decision-point comments	✅ (`_corrupt_model_subdirs` trigger/why/outcome comments are exemplary)
Unit tests for all new functions	✅
Edge cases covered (cold load, no cache dir, locked file on Windows, unrelated error, case-insensitive name, artifact present, mixed old-good/new-corrupt revisions)	✅
100% coverage maintained	✅
No hardcoded secrets / no sensitive data in logs	✅
No emojis/unicode in console/log output	✅
Windows-compatible path handling (`pathlib.Path` throughout)	✅
`shutil.rmtree` gated on post-deletion existence check	✅
Error marker specificity (`"load model from"`, not `"load model"`)	✅
Case-insensitive model name resolution	✅
Cold-load not misread as corruption	✅
Per-revision snapshot corruption detection	✅ fixed in commit 3
Private FastEmbed API `_list_supported_models()`	ℹ️ intentional, documented, defended by `except Exception` fallback
Cross-layer private import `_resolve_cache_dir` in `search.py`	ℹ️ low priority, unchanged — comment explains rationale
`snapshot_dir` variable naming in `_model_cache_candidates`	ℹ️ cosmetic, callers use correct name `model_dir`

All critical issues from both review passes are resolved. ✔️ Approved.

After rebasing onto #900 (merged), _create_model passes enable_cpu_mem_arena=False; the self-heal test stub from #900 must accept it. Resolves the rebase integration gap between #900 and #903. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>

chatgpt-codex-connector Bot reviewed Jun 7, 2026

View reviewed changes

phernandez merged commit 5b034f0 into main Jun 7, 2026
39 of 41 checks passed

phernandez deleted the fix/fastembed-cache-selfheal branch June 7, 2026 22:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(core): self-heal corrupt FastEmbed model cache#900

fix(core): self-heal corrupt FastEmbed model cache#900
phernandez merged 3 commits into
mainfrom
fix/fastembed-cache-selfheal

phernandez commented Jun 7, 2026

Uh oh!

github-actions Bot commented Jun 7, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 7, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 7, 2026

Uh oh!

github-actions Bot commented Jun 7, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 7, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

phernandez commented Jun 7, 2026

Summary

What changed

Testing

Risk / validation

Uh oh!

github-actions Bot commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: fix(core): self-heal corrupt FastEmbed model cache

Summary

Issues

1. shutil.rmtree(ignore_errors=True) sets removed = True even when deletion silently fails

2. "load model" error marker in search.py is too broad

Minor Notes

3. Private FastEmbed API: _list_supported_models()

4. _resolve_cache_dir is a private symbol imported across layers

Team Checklist

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: fix(core): self-heal corrupt FastEmbed model cache

Summary

Issues from previous review — all resolved

Team Checklist

Minor Note

Uh oh!

github-actions Bot commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: fix(core): self-heal corrupt FastEmbed model cache

Summary

Third Commit Analysis: 8bde497 fix(core): detect corrupt FastEmbed snapshot per-revision

Minor Observation (no action needed)

Team Checklist

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented Jun 7, 2026 •

edited

Loading

Code Review: `fix(core): self-heal corrupt FastEmbed model cache`

1. `shutil.rmtree(ignore_errors=True)` sets `removed = True` even when deletion silently fails

2. `"load model"` error marker in `search.py` is too broad

3. Private FastEmbed API: `_list_supported_models()`

4. `_resolve_cache_dir` is a private symbol imported across layers

github-actions Bot commented Jun 7, 2026 •

edited

Loading

Code Review: `fix(core): self-heal corrupt FastEmbed model cache`

github-actions Bot commented Jun 7, 2026 •

edited

Loading

Code Review: `fix(core): self-heal corrupt FastEmbed model cache`

Third Commit Analysis: `8bde497 fix(core): detect corrupt FastEmbed snapshot per-revision`