fix(core): self-heal corrupt FastEmbed model cache #900
An interrupted FastEmbed model download leaves the HuggingFace snapshot dir present but missing model_optimized.onnx. The ONNX runtime then raises NO_SUCHFILE on every load, and the failure is self-perpetuating until the cache is cleared by hand. Search surfaced only the generic 'Search Failed' message with no hint.

FastEmbedEmbeddingProvider now detects a missing/corrupt-artifact load failure, deletes only this model's own models--<org>--<repo> cache subtree (resolved from FastEmbed's model description), and retries the load exactly once to force a fresh download. A second failure fails fast with the original error.

The search error formatter gains an ONNX/model-load branch that names the resolved fastembed cache dir to delete and offers search_type="text" as an immediate workaround.

Closes #895

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
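The detect, purge, retry-once flow in this commit message can be sketched roughly as below. This is an illustration under stated assumptions, not the PR's code: load_with_self_heal, looks_like_missing_artifact, and CORRUPT_MARKERS are hypothetical names, and the loader and purge steps are passed in as plain callables.

```python
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")

# Hypothetical marker list, modelled on the substrings named in the PR description.
CORRUPT_MARKERS = ("no_suchfile", "file doesn't exist", "no such file")


def looks_like_missing_artifact(error: Exception) -> bool:
    """Return True when the error text suggests a missing model file."""
    text = str(error).lower()
    return any(marker in text for marker in CORRUPT_MARKERS)


def load_with_self_heal(load: Callable[[], T], purge: Callable[[], bool]) -> T:
    """Call load(); on a missing-artifact failure, purge and retry exactly once."""
    try:
        return load()
    except Exception as original:
        # Unrelated errors (for example a config error) propagate untouched.
        if not looks_like_missing_artifact(original):
            raise
        # Nothing was purged, so a retry would fail the same way.
        if not purge():
            raise
        try:
            return load()
        except Exception:
            # Single retry only: fail fast with the original error.
            raise original from None
```

Passing purge as a callable that returns a bool keeps the retry decision in one place: a retry happens only when something was actually deleted.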
Claude finished @phernandez's task in 1m 38s. Code Review:

| Area | Result |
|---|---|
| Python 3.12+ type annotations / async patterns | ✅ |
| Line length ≤ 100 chars | ✅ |
| No commented-out code | ✅ |
| Unit tests for all new functions | ✅ |
| Edge cases covered (no cache dir, no subdir found, persistent failure, unrelated error) | ✅ |
| 100% coverage maintained (new code fully exercised) | ✅ |
| No hardcoded secrets / no sensitive data in logs | ✅ |
| No emojis/unicode in console/log output | ✅ |
| Windows-compatible path handling (uses pathlib.Path) | ✅ |
| shutil.rmtree correctness with ignore_errors=True | |
| Error marker specificity in search.py | |
| Dependency on private FastEmbed API _list_supported_models | ℹ️ noted, defended by fallback |
| Cross-layer private import _resolve_cache_dir | ℹ️ low priority, noted |

The fix as a whole is sound: the scoped deletion, single-retry cap, and error classifier are all correct. Issues #1 and #2 are the ones worth fixing before merge.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 648172439e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

    from fastembed import TextEmbedding

    for description in TextEmbedding._list_supported_models():
        if description.model == resolved_model_name:

Match FastEmbed model names case-insensitively
FastEmbed accepts supported model names case-insensitively, so a valid config such as semantic_embedding_model="baai/bge-small-en-v1.5" still loads, but this exact-case comparison fails to resolve the HF source and _purge_corrupt_model_cache() returns false. In that scenario a corrupt cache for an otherwise valid model is never deleted or retried, so the new self-heal path remains disabled for lower-/mixed-case model names; compare normalized names here as FastEmbed does.
… path

The self-heal introduced in this PR misclassified cold/transient model load failures as cache corruption, purged the in-progress download, and left the model unloadable for the whole session, breaking ~20 unrelated search integration tests in CI (cold model cache).

Root cause: purge was driven purely by exception-text matching plus a bare subdir.exists() check, so a normal not-yet-downloaded model (or a transient 'from any source' download error) could trigger a destructive purge.

Fixes:
- Gate the purge on positive filesystem confirmation: the HF snapshot dir exists but the model artifact file (model_optimized.onnx) is missing from every snapshot. A cold load (no snapshot dir) is never treated as corrupt and never purges; the original error re-raises unchanged.
- After a purge+retry that still fails, re-raise the ORIGINAL exception.
- (Codex P2) Resolve the HF cache subdir case-insensitively, mirroring FastEmbed, so model='baai/bge-small-en-v1.5' still resolves correctly.
- (claude-review #1) _purge_model_subdirs sets removed=True only when the subdir actually disappears after rmtree (Windows locked-file no-op safety).
- (claude-review #2) Tighten the search error formatter ONNX marker from the broad 'load model' to the exact 'load model from' phrasing.
- (minor) Document that TextEmbedding._list_supported_models() is an intentional undocumented-API use with a defensive fallback.

Verified locally with the real model: a cold load downloads cleanly with no purge; a genuinely corrupt cache (snapshot present, artifact deleted) is detected, purged, and re-downloaded. Added regression tests for cold load, artifact-present non-purge, case-insensitive resolution, and locked-file rmtree no-op.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
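The locked-file safety in the claude-review #1 item can be illustrated with a minimal sketch. The helper names hf_cache_subdir and purge_model_subdir are hypothetical, not the PR's functions; the models--<org>--<repo> directory naming follows the HuggingFace cache layout quoted in this PR.

```python
import shutil
from pathlib import Path


def hf_cache_subdir(cache_dir: Path, hf_source: str) -> Path:
    """Map 'org/repo' to the cache layout 'models--org--repo'."""
    return cache_dir / ("models--" + hf_source.replace("/", "--"))


def purge_model_subdir(subdir: Path) -> bool:
    """Delete one model's cache subtree; report success only if it is gone."""
    if not subdir.exists():
        # Cold cache: nothing to purge, so the caller should not retry.
        return False
    shutil.rmtree(subdir, ignore_errors=True)
    # ignore_errors=True can silently leave files behind (for example when a
    # file is locked on Windows), so confirm the directory really disappeared.
    return not subdir.exists()
```

Returning False for a cold cache and for a failed deletion means the caller never retries a load that would hit the same broken state.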
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1d90cccf92

    # The artifact lives at snapshots/<rev>/<model_file>; an interrupted download
    # leaves the snapshot tree but no artifact. rglob covers any revision dir.
    artifact_present = any(snapshot_dir.rglob(model_file))
    if not artifact_present:

Check the failed snapshot, not any cached artifact
When HuggingFace keeps multiple revisions for the same FastEmbed repo, a corrupt current snapshot can coexist with an older complete snapshot. Because this searches the entire models--... tree, any older model_optimized.onnx makes artifact_present true, so the self-heal path skips purging even though ONNX just failed to load snapshots/<current-rev>/model_optimized.onnx. In that mixed-cache state the broken snapshot remains self-perpetuating until the user manually deletes the cache; use the failed path from the exception or inspect each snapshot individually instead of treating any matching file under the model root as proof the cache is healthy.
Claude finished @phernandez's task in 1m 55s. Code Review:

| Area | Result |
|---|---|
| Python 3.12+ type annotations and async patterns | ✅ |
| 100-character line length | ✅ |
| No commented-out code | ✅ |
| Literate Programming: section headers, decision-point comments, constraint comments | ✅ (_corrupt_model_subdirs has exemplary trigger/why/outcome comments) |
| Unit tests for all new functions | ✅ |
| Edge cases covered (cold load, no cache dir, locked file on Windows, unrelated error, case-insensitive name, artifact present) | ✅ |
| 100% coverage maintained (new code fully exercised) | ✅ |
| No hardcoded secrets / no sensitive data in logs | ✅ |
| No emojis/unicode in console/log output | ✅ |
| Windows-compatible path handling (uses pathlib.Path) | ✅ |
| shutil.rmtree correctness | ✅ fixed |
| Error marker specificity in search.py | ✅ fixed |
| Case-insensitive model name resolution | ✅ fixed |
| Cold-load not misread as corruption | ✅ _corrupt_model_subdirs skips if subdir absent |
| Private FastEmbed API _list_supported_models() | ℹ️ intentional, documented, defended by except Exception fallback |
| Cross-layer private import _resolve_cache_dir | ℹ️ low priority, unchanged from prior review, comment explains rationale |

Minor Note
search.py:102 still imports _resolve_cache_dir from the repository layer (private symbol, noted in the prior review as low priority). No action needed now, but worth making public if embedding_provider_factory is ever refactored.
This PR is in excellent shape. All critical issues from both the Claude and Codex reviews are addressed, the conservative self-heal design is sound, and the test coverage is thorough. ✔️ Approved.
_corrupt_model_subdirs scanned the whole models--<repo> tree with rglob, so an older complete snapshot masked a corrupt current revision and self-heal skipped the purge, leaving the broken snapshot self-perpetuating. Inspect each snapshot revision individually and flag the model when any revision is missing the artifact (Codex review on #900).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
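A per-revision check of the kind this commit describes could look roughly like the following. has_corrupt_revision is a hypothetical helper rather than the PR's _corrupt_model_subdirs, and treating an empty snapshots directory as "not corrupt" is an assumption of this sketch.

```python
from pathlib import Path


def has_corrupt_revision(model_dir: Path, model_file: str = "model_optimized.onnx") -> bool:
    """Return True when any snapshot revision lacks the model artifact."""
    snapshots = model_dir / "snapshots"
    if not snapshots.is_dir():
        # Cold load: no snapshot tree yet, so this is not corruption.
        return False
    revisions = [path for path in snapshots.iterdir() if path.is_dir()]
    # Check every revision on its own, so an older complete snapshot cannot
    # mask a newer revision whose artifact is missing.
    return any(not (revision / model_file).is_file() for revision in revisions)
```

Unlike a tree-wide rglob, this flags the mixed case where an old revision is complete and the current one is not.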
Claude finished @phernandez's task in 1m 49s. Code Review:

| Area | Result |
|---|---|
| Python 3.12+ type annotations / async patterns | ✅ |
| Line length ≤ 100 chars | ✅ |
| No commented-out code | ✅ |
| Literate Programming: section headers, decision-point comments | ✅ (_corrupt_model_subdirs trigger/why/outcome comments are exemplary) |
| Unit tests for all new functions | ✅ |
| Edge cases covered (cold load, no cache dir, locked file on Windows, unrelated error, case-insensitive name, artifact present, mixed old-good/new-corrupt revisions) | ✅ |
| 100% coverage maintained | ✅ |
| No hardcoded secrets / no sensitive data in logs | ✅ |
| No emojis/unicode in console/log output | ✅ |
| Windows-compatible path handling (pathlib.Path throughout) | ✅ |
| shutil.rmtree gated on post-deletion existence check | ✅ |
| Error marker specificity ("load model from", not "load model") | ✅ |
| Case-insensitive model name resolution | ✅ |
| Cold-load not misread as corruption | ✅ |
| Per-revision snapshot corruption detection | ✅ fixed in commit 3 |
| Private FastEmbed API _list_supported_models() | ℹ️ intentional, documented, defended by except Exception fallback |
| Cross-layer private import _resolve_cache_dir in search.py | ℹ️ low priority, unchanged, comment explains rationale |
| snapshot_dir variable naming in _model_cache_candidates | ℹ️ cosmetic, callers use correct name model_dir |

All critical issues from both review passes are resolved. ✔️ Approved.
After rebasing onto #900 (merged), _create_model passes enable_cpu_mem_arena=False; the self-heal test stub from #900 must accept it. Resolves the rebase integration gap between #900 and #903.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
Summary
A new user's first semantic search can fail with

    [ONNXRuntimeError] : 3 : NO_SUCHFILE : Load model from .../fastembed_cache/models--qdrant--bge-small-en-v1.5-onnx-q/snapshots/<rev>/model_optimized.onnx failed. File doesn't exist.

Root cause: an interrupted/incomplete FastEmbed model download leaves the HuggingFace snapshot dir present but missing model_optimized.onnx. The raw ONNXRuntimeError then propagates, the user sees a generic "Search Failed" message with no hint to clear the cache, and the broken snapshot is self-perpetuating until the cache is cleared by hand. This is upstream FastEmbed partial-download behavior; basic-memory previously had zero handling.

What changed
- src/basic_memory/repository/fastembed_provider.py
  - … _create_model(). _load_model() now wraps construction: on a load failure that looks like a missing/corrupt artifact (NO_SUCHFILE / model_optimized.onnx / "file doesn't exist" / "no such file"), it deletes only this model's own models--<org>--<repo> cache subtree and retries the load exactly once to force a fresh download. A second failure fails fast with the original error. A non-cache error (e.g. a config error) is never treated as corruption.
  - The subtree is resolved from FastEmbed's model description (TextEmbedding._list_supported_models()), so it never touches the whole cache or unrelated models. If no model subdir exists or no cache_dir is configured, it fails fast instead of looping.
- src/basic_memory/mcp/tools/search.py
  - _format_search_error_response gains a dedicated branch for ONNX/model-load errors. It explains the model is missing/corrupt, names the resolved fastembed cache dir (default ~/.basic-memory/fastembed_cache) to delete so it re-downloads, and offers search_type="text" as an immediate workaround.

Testing
All commands run from the repo root via uv.
- uv run ruff check <changed files> → All checks passed
- uv run ruff format --check <changed files> → 4 files already formatted
- uv run ty check src tests test-int → All checks passed
- uv run pytest tests/repository/test_fastembed_provider.py tests/mcp/test_tool_search.py -q → 63 passed

New tests:
- tests/repository/test_fastembed_provider.py
  - test_fastembed_provider_self_heals_corrupt_model_cache: first construction raises NO_SUCHFILE, provider purges exactly the model's subdir (leaves an unrelated models--other--keep-me dir untouched) and retries to success.
  - test_fastembed_provider_fails_fast_on_persistent_corrupt_cache: both constructions fail; asserts exactly one retry (two attempts) then fail fast, no infinite loop.
  - test_fastembed_provider_does_not_purge_on_unrelated_error: a ValueError config error propagates with no deletion and no retry.
  - test_fastembed_provider_fails_fast_when_no_cache_subdir_to_purge / ..._without_cache_dir: nothing to purge means no retry.
- tests/mcp/test_tool_search.py::TestSearchErrorFormatting::test_format_search_error_corrupt_embedding_model: asserts the message names the resolved cache dir and suggests search_type="text".

Risk / validation
- Deletion is limited to the model's own models--<org>--<repo> subtree, resolved from FastEmbed's model description, and gated behind a missing-artifact error match: never the whole cache, never unrelated data.
- Tests mock TextEmbedding (the existing sys.modules['fastembed'] monkeypatch pattern) because a real interrupted download is non-deterministic and offline-unfriendly; this is the justified-mock case for the clean-and-retry path.

Closes #895
🤖 Generated with Claude Code