Skip to content

fix(core): self-heal corrupt FastEmbed model cache#900

Merged
phernandez merged 3 commits into
mainfrom
fix/fastembed-cache-selfheal
Jun 7, 2026
Merged

fix(core): self-heal corrupt FastEmbed model cache#900
phernandez merged 3 commits into
mainfrom
fix/fastembed-cache-selfheal

Conversation

@phernandez
Copy link
Copy Markdown
Member

Summary

A new user's first semantic search can fail with [ONNXRuntimeError] : 3 : NO_SUCHFILE : Load model from .../fastembed_cache/models--qdrant--bge-small-en-v1.5-onnx-q/snapshots/<rev>/model_optimized.onnx failed. File doesn't exist.

Root cause: an interrupted/incomplete FastEmbed model download leaves the HuggingFace snapshot dir present but missing model_optimized.onnx. The raw ONNXRuntimeError then propagates, the user sees a generic "Search Failed" message with no hint to clear the cache, and the broken snapshot is self-perpetuating until the cache is cleared by hand. This is upstream FastEmbed partial-download behavior; basic-memory previously had zero handling.

What changed

  • src/basic_memory/repository/fastembed_provider.py
    • Extracted model construction into _create_model().
    • _load_model() now wraps construction: on a load failure that looks like a missing/corrupt artifact (NO_SUCHFILE / model_optimized.onnx / "file doesn't exist" / "no such file"), it deletes only this model's own models--<org>--<repo> cache subtree and retries the load exactly once to force a fresh download. A second failure fails fast with the original error. A non-cache error (e.g. a config error) is never treated as corruption.
    • Deletion is scoped by resolving the model's HuggingFace source from FastEmbed's own model description (TextEmbedding._list_supported_models()), so it never touches the whole cache or unrelated models. If no model subdir exists or no cache_dir is configured, it fails fast instead of looping.
  • src/basic_memory/mcp/tools/search.py
    • _format_search_error_response gains a dedicated branch for ONNX/model-load errors. It explains the model is missing/corrupt, names the resolved fastembed cache dir (default ~/.basic-memory/fastembed_cache) to delete so it re-downloads, and offers search_type="text" as an immediate workaround.

Testing

All commands run from the repo root via uv.

  • uv run ruff check <changed files> → All checks passed
  • uv run ruff format --check <changed files> → 4 files already formatted
  • uv run ty check src tests test-int → All checks passed
  • uv run pytest tests/repository/test_fastembed_provider.py tests/mcp/test_tool_search.py -q63 passed

New tests:

  • tests/repository/test_fastembed_provider.py
    • test_fastembed_provider_self_heals_corrupt_model_cache — first construction raises NO_SUCHFILE, provider purges exactly the model's subdir (leaves an unrelated models--other--keep-me dir untouched) and retries to success.
    • test_fastembed_provider_fails_fast_on_persistent_corrupt_cache — both constructions fail; asserts exactly one retry (two attempts) then fail fast, no infinite loop.
    • test_fastembed_provider_does_not_purge_on_unrelated_error — a ValueError config error propagates with no deletion and no retry.
    • test_fastembed_provider_fails_fast_when_no_cache_subdir_to_purge / ..._without_cache_dir — nothing to purge means no retry.
  • tests/mcp/test_tool_search.py::TestSearchErrorFormatting::test_format_search_error_corrupt_embedding_model — asserts the message names the resolved cache dir and suggests search_type="text".

Risk / validation

  • Deletion is tightly scoped to the model's own models--<org>--<repo> subtree, resolved from FastEmbed's model description, and gated behind a missing-artifact error match — never the whole cache, never unrelated data.
  • Retry is capped at exactly once; a persistent failure fails fast with the original (now self-explanatory) error.
  • These tests use a stub TextEmbedding (the existing sys.modules['fastembed'] monkeypatch pattern) because a real interrupted download is non-deterministic and offline-unfriendly; this is the justified-mock case for the clean-and-retry path.

Closes #895

🤖 Generated with Claude Code

An interrupted FastEmbed model download leaves the HuggingFace snapshot dir present but missing model_optimized.onnx. The ONNX runtime then raises NO_SUCHFILE on every load, and the failure is self-perpetuating until the cache is cleared by hand. Search surfaced only the generic 'Search Failed' message with no hint.

FastEmbedEmbeddingProvider now detects a missing/corrupt-artifact load failure, deletes only this model's own models--<org>--<repo> cache subtree (resolved from FastEmbed's model description), and retries the load exactly once to force a fresh download. A second failure fails fast with the original error. The search error formatter gains an ONNX/model-load branch that names the resolved fastembed cache dir to delete and offers search_type="text" as an immediate workaround.

Closes #895

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 7, 2026

Claude finished @phernandez's task in 1m 38s —— View job


Code Review: fix(core): self-heal corrupt FastEmbed model cache

  • Read CLAUDE.md and gather context
  • Review fastembed_provider.py changes
  • Review search.py changes
  • Review test files
  • Run checklist against PR
  • Post final review

Summary

This is a well-motivated fix with clean architecture and solid test coverage. The self-heal logic correctly scopes deletion to only the affected model subdir, the retry is capped at once, and the error classifier guards against treating config/network errors as cache corruption. A few items worth addressing before merge:


Issues

1. shutil.rmtree(ignore_errors=True) sets removed = True even when deletion silently fails

fastembed_provider.py:139

if subdir.exists():
    shutil.rmtree(subdir, ignore_errors=True)
    removed = True  # ← set unconditionally, even if rmtree silently failed

ignore_errors=True suppresses all OS errors (permission denied, locked files on Windows, etc.). If the deletion silently fails, the subdir still exists and the retry will hit the same broken ONNX artifact — but the code returns True, triggering the retry anyway, so the second attempt fails with the same error. The original exception is then silently discarded.

Safer to verify deletion actually happened:

shutil.rmtree(subdir, ignore_errors=True)
if not subdir.exists():
    removed = True

Fix this →


2. "load model" error marker in search.py is too broad

search.py:95

or "load model" in error_lower

This phrase will match generic errors from any ML framework, database layer, or even log messages. A user error like "could not load model configuration" from an unrelated component would incorrectly route to the ONNX cache-corruption response. The other three markers ("onnxruntime", "no_suchfile", "model_optimized.onnx") are already specific enough to cover the ONNX scenario without this one.

Recommend removing "load model" from the condition or tightening it to "load model from" (which is the exact phrasing from the ONNX error).

Fix this →


Minor Notes

3. Private FastEmbed API: _list_supported_models()

fastembed_provider.py:110

for description in TextEmbedding._list_supported_models():

This uses a single-underscore (module-private) method on FastEmbed's TextEmbedding. If FastEmbed ever renames or removes it, the entire self-heal path silently returns no subdirs and the recovery fails with a confusing error. The existing # pragma: no cover on the exception handler at line 115 acknowledges this is defensive code. Consider adding a comment that this is an intentional use of an undocumented API with a known failure mode (already covered by the except Exception fallback).


4. _resolve_cache_dir is a private symbol imported across layers

search.py:99

from basic_memory.repository.embedding_provider_factory import _resolve_cache_dir

The leading underscore signals this is an internal detail of the factory module. The comment explains the deferral is intentional, but importing a private symbol from the repository layer into the MCP tools layer creates an unexpected coupling. If this function gets moved or renamed during future refactors, the breakage will be silent. A public re-export from a shared config helper would be more robust long-term, though this is low priority since _resolve_cache_dir itself is trivially simple and unlikely to move.


Team Checklist

Area Result
Python 3.12+ type annotations / async patterns
Line length ≤ 100 chars
No commented-out code
Unit tests for all new functions
Edge cases covered (no cache dir, no subdir found, persistent failure, unrelated error)
100% coverage maintained (new code fully exercised)
No hardcoded secrets / no sensitive data in logs
No emojis/unicode in console/log output
Windows-compatible path handling (uses pathlib.Path)
shutil.rmtree correctness with ignore_errors=True ⚠️ see issue #1
Error marker specificity in search.py ⚠️ see issue #2
Dependency on private FastEmbed API _list_supported_models ℹ️ noted, defended by fallback
Cross-layer private import _resolve_cache_dir ℹ️ low priority, noted

The fix as a whole is sound — the scoped deletion, single-retry cap, and error classifier are all correct. Issues #1 and #2 are the ones worth fixing before merge.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 648172439e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

from fastembed import TextEmbedding

for description in TextEmbedding._list_supported_models():
if description.model == resolved_model_name:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match FastEmbed model names case-insensitively

FastEmbed accepts supported model names case-insensitively, so a valid config such as semantic_embedding_model="baai/bge-small-en-v1.5" still loads, but this exact-case comparison fails to resolve the HF source and _purge_corrupt_model_cache() returns false. In that scenario a corrupt cache for an otherwise valid model is never deleted or retried, so the new self-heal path remains disabled for lower-/mixed-case model names; compare normalized names here as FastEmbed does.

Useful? React with 👍 / 👎.

… path

The self-heal introduced in this PR misclassified cold/transient model load
failures as cache corruption, purged the in-progress download, and left the
model unloadable for the whole session — breaking ~20 unrelated search
integration tests in CI (cold model cache).

Root cause: purge was driven purely by exception-text matching plus a bare
subdir.exists() check, so a normal not-yet-downloaded model (or a transient
'from any source' download error) could trigger a destructive purge.

Fixes:
- Gate the purge on positive filesystem confirmation: the HF snapshot dir
  exists but the model artifact file (model_optimized.onnx) is missing from
  every snapshot. A cold load (no snapshot dir) is never treated as corrupt
  and never purges; the original error re-raises unchanged.
- After a purge+retry that still fails, re-raise the ORIGINAL exception.
- (Codex P2) Resolve the HF cache subdir case-insensitively, mirroring
  FastEmbed, so model='baai/bge-small-en-v1.5' still resolves correctly.
- (claude-review #1) _purge_model_subdirs sets removed=True only when the
  subdir actually disappears after rmtree (Windows locked-file no-op safety).
- (claude-review #2) Tighten the search error formatter ONNX marker from the
  broad 'load model' to the exact 'load model from' phrasing.
- (minor) Document that TextEmbedding._list_supported_models() is an
  intentional undocumented-API use with a defensive fallback.

Verified locally with the real model: a cold load downloads cleanly with no
purge; a genuinely corrupt cache (snapshot present, artifact deleted) is
detected, purged, and re-downloaded. Added regression tests for cold load,
artifact-present non-purge, case-insensitive resolution, and locked-file
rmtree no-op.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1d90cccf92

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +168 to +171
# The artifact lives at snapshots/<rev>/<model_file>; an interrupted download
# leaves the snapshot tree but no artifact. rglob covers any revision dir.
artifact_present = any(snapshot_dir.rglob(model_file))
if not artifact_present:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check the failed snapshot, not any cached artifact

When HuggingFace keeps multiple revisions for the same FastEmbed repo, a corrupt current snapshot can coexist with an older complete snapshot. Because this searches the entire models--... tree, any older model_optimized.onnx makes artifact_present true, so the self-heal path skips purging even though ONNX just failed to load snapshots/<current-rev>/model_optimized.onnx. In that mixed-cache state the broken snapshot remains self-perpetuating until the user manually deletes the cache; use the failed path from the exception or inspect each snapshot individually instead of treating any matching file under the model root as proof the cache is healthy.

Useful? React with 👍 / 👎.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 7, 2026

Claude finished @phernandez's task in 1m 55s —— View job


Code Review: fix(core): self-heal corrupt FastEmbed model cache

  • Gather context and read CLAUDE.md
  • Read changed files (fastembed_provider.py, search.py, tests)
  • Review against team checklist
  • Post final review

Summary

The second commit (1d90ccc) directly addresses all critical issues raised in the prior Claude and Codex reviews. This is ready to merge.


Issues from previous review — all resolved

Issue #1: shutil.rmtree(ignore_errors=True) with unconditional removed = True ✅ Fixed

fastembed_provider.py:189–193 now correctly gates the success flag on a post-deletion existence check:

shutil.rmtree(subdir, ignore_errors=True)
# Set removed only when the subdir is truly gone — a silent rmtree no-op
# (e.g. a locked file on Windows) must not be reported as a successful purge.
if not subdir.exists():
    removed_any = True

The new test test_fastembed_provider_fails_fast_when_purge_silently_noops validates the Windows locked-file scenario with a monkeypatched no-op rmtree.

Issue #2: "load model" error marker too broad ✅ Fixed

Changed to "load model from" — the exact ONNX error phrasing — with an explanatory comment at search.py:91–93. The new test test_format_search_error_load_model_phrase_does_not_overmatch verifies that "Failed to load model configuration for this project" no longer routes to the embedding-corrupt branch.

Codex P2: Case-insensitive model name matching ✅ Fixed

_model_cache_candidates() at line 123 now lowercases both sides before comparing:

resolved_model_name = self._resolved_model_name().lower()
...
if description.model.lower() != resolved_model_name:

New test test_fastembed_provider_self_heals_with_case_insensitive_model_name covers the model="baai/bge-small-en-v1.5" config case.

Codex P2: Multiple snapshot revisions / check specific snapshot ✅ Intentionally addressed with conservative approach

The code uses rglob across the entire model subtree (any(snapshot_dir.rglob(model_file))). If any revision contains the artifact the cache is considered healthy and no purge is attempted. The Codex concern (a corrupt new snapshot coexisting with an older valid one) is real, but the PR consciously chooses the conservative path: false negatives are preferred over false positives — the title of the second commit says exactly this. The self-heal simply won't fire in that edge case; the user falls back to the manual fix message in search.py. The new test test_fastembed_provider_does_not_purge_when_artifact_present validates this behavior explicitly. This is a defensible trade-off.


Team Checklist

Area Result
Python 3.12+ type annotations and async patterns
100-character line length
No commented-out code
Literate Programming: section headers, decision-point comments, constraint comments ✅ (_corrupt_model_subdirs has exemplary trigger/why/outcome comments)
Unit tests for all new functions
Edge cases covered (cold load, no cache dir, locked file on Windows, unrelated error, case-insensitive name, artifact present)
100% coverage maintained (new code fully exercised)
No hardcoded secrets / no sensitive data in logs
No emojis/unicode in console/log output
Windows-compatible path handling (uses pathlib.Path)
shutil.rmtree correctness ✅ fixed
Error marker specificity in search.py ✅ fixed
Case-insensitive model name resolution ✅ fixed
Cold-load not misread as corruption _corrupt_model_subdirs skips if subdir absent
Private FastEmbed API _list_supported_models() ℹ️ intentional, documented, defended by except Exception fallback
Cross-layer private import _resolve_cache_dir ℹ️ low priority, unchanged from prior review, comment explains rationale

Minor Note

search.py:102 still imports _resolve_cache_dir from the repository layer (private symbol, noted in the prior review as low priority). No action needed now, but worth making public if embedding_provider_factory is ever refactored.


This PR is in excellent shape. All critical issues from both the Claude and Codex reviews are addressed, the conservative self-heal design is sound, and the test coverage is thorough. ✔️ Approved.

_corrupt_model_subdirs scanned the whole models--<repo> tree with rglob, so an older complete snapshot masked a corrupt current revision and self-heal skipped the purge — leaving the broken snapshot self-perpetuating. Inspect each snapshot revision individually and flag the model when any revision is missing the artifact (Codex review on #900).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 7, 2026

Claude finished @phernandez's task in 1m 49s —— View job


Code Review: fix(core): self-heal corrupt FastEmbed model cache

  • Gather context and read CLAUDE.md
  • Read changed files (all three commits)
  • Review against team checklist
  • Post final review

Summary

The third commit (8bde497) directly addresses the Codex P2 concern about per-revision snapshot detection. All prior issues from the previous two Claude and Codex review passes are resolved. This PR is ready to merge.


Third Commit Analysis: 8bde497 fix(core): detect corrupt FastEmbed snapshot per-revision

The Codex P2 concern was: a corrupt current snapshot can coexist with an older complete one, and the prior rglob across the whole model tree would find the old artifact and wrongly skip the purge.

The fix is correct. _corrupt_model_subdirs now inspects each revision directory individually:

if any(not any(rev.rglob(model_file)) for rev in revision_dirs):
    corrupt.append(model_dir)

This flags the model when any revision is missing the artifact. The trade-off (purging the whole models--<repo> tree including good revisions when the current one is corrupt) is correct and intentional — the docstring explains it:

"flag the model dir so the whole tree re-downloads cleanly"

The new test test_fastembed_provider_self_heals_when_current_revision_corrupt validates this exactly — a good rev_old with the artifact, a corrupt rev_current without it, and the provider correctly detects and purges the whole tree.


Minor Observation (no action needed)

_model_cache_candidates (lines 102–150) has a small internal naming inconsistency: the local variable is called snapshot_dir but the returned Path is actually the model root (models--<org>--<repo>), not the snapshot subdirectory. The callers use the correct name model_dir when unpacking. No behavioral issue, and the docstring heading (Resolve (snapshot_dir, model_file) pairs) is slightly stale from this rename. Low priority — worth a one-word cleanup if the file is touched again.

# _model_cache_candidates line 138 — variable name is misleading
snapshot_dir = cache_root / f"models--{hf_source.replace('/', '--')}"
#  ^^^^^^^^^^^^ this is the model root, not a snapshot dir

Team Checklist

Area Result
Python 3.12+ type annotations / async patterns
Line length ≤ 100 chars
No commented-out code
Literate Programming: section headers, decision-point comments ✅ (_corrupt_model_subdirs trigger/why/outcome comments are exemplary)
Unit tests for all new functions
Edge cases covered (cold load, no cache dir, locked file on Windows, unrelated error, case-insensitive name, artifact present, mixed old-good/new-corrupt revisions)
100% coverage maintained
No hardcoded secrets / no sensitive data in logs
No emojis/unicode in console/log output
Windows-compatible path handling (pathlib.Path throughout)
shutil.rmtree gated on post-deletion existence check
Error marker specificity ("load model from", not "load model")
Case-insensitive model name resolution
Cold-load not misread as corruption
Per-revision snapshot corruption detection ✅ fixed in commit 3
Private FastEmbed API _list_supported_models() ℹ️ intentional, documented, defended by except Exception fallback
Cross-layer private import _resolve_cache_dir in search.py ℹ️ low priority, unchanged — comment explains rationale
snapshot_dir variable naming in _model_cache_candidates ℹ️ cosmetic, callers use correct name model_dir

All critical issues from both review passes are resolved. ✔️ Approved.

@phernandez phernandez merged commit 5b034f0 into main Jun 7, 2026
39 of 41 checks passed
@phernandez phernandez deleted the fix/fastembed-cache-selfheal branch June 7, 2026 22:37
phernandez added a commit that referenced this pull request Jun 7, 2026
After rebasing onto #900 (merged), _create_model passes enable_cpu_mem_arena=False; the self-heal test stub from #900 must accept it. Resolves the rebase integration gap between #900 and #903.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] model_optimized.onnx not found

1 participant