fix(core): exclude stale entity rows from embedding coverage stats by phernandez · Pull Request #675 · basicmachines-co/basic-memory

phernandez · 2026-03-15T20:09:38Z

Summary

bm project info was reporting false "missing embeddings" after a successful bm reindex because stale rows in search_index and search_vector_chunks (from previously deleted entities) inflated the total_indexed_entities count
Fixed the embedding status queries to filter by entity existence (entity_id IN (SELECT id FROM entity ...))
Added stale row cleanup during bm reindex to purge orphaned search rows for deleted entities

Fixes #670

Test plan

pytest tests/services/test_project_service_embedding_status.py -v — new test for stale entity exclusion
pytest tests/services/test_search_service.py -v — existing search service tests
Full CI (SQLite + Postgres) passes
Manual verification: after bm reindex, bm project info should not recommend reindex when all current entities have embeddings

🤖 Generated with Claude Code

When entities are deleted, their rows in search_index and search_vector_chunks can remain as orphans. This caused `bm project info` to overcount total_indexed_entities and report false "missing embeddings" recommendations after a successful reindex. Two-pronged fix: 1. Filter embedding status queries to only count entity_ids that still exist in the entity table (fixes reporting) 2. Purge stale search rows during reindex (fixes root cause) Fixes #670 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>

github-actions · 2026-03-15T20:09:51Z

Claude finished @phernandez's task in 2m 9s —— View job

PR Review: fix(core): exclude stale entity rows from embedding coverage stats

Tasks:

Gather context and read changed files
Understand the request (PR review)
Review against team checklist
Post findings

Summary

The fix correctly addresses the root cause: stale rows in search_index and search_vector_chunks (from deleted entities) were inflating the total_indexed_entities count in bm project info. Two complementary approaches are used:

Defensive stats queries — subquery filter in get_embedding_status to exclude orphaned rows
Proactive cleanup — _purge_stale_search_rows called during bm reindex to actually remove the stale rows

The overall approach is sound. A few issues to address:

Issues Found

🔴 Dead parameter in `_purge_stale_search_rows`

src/basic_memory/services/search_service.py, line 427

async def _purge_stale_search_rows(self, valid_entity_ids: set[int]) -> None:

valid_entity_ids is passed in but never used inside the method. The function relies entirely on the database subquery entity_id NOT IN (SELECT id FROM entity ...). The caller constructs the set unnecessarily:

entity_ids = [entity.id for entity in entities]
await self._purge_stale_search_rows(set(entity_ids))  # set() is wasted work

Either:

Remove the parameter (the DB subquery is sufficient and more correct for concurrent scenarios), or
Use the parameter to build a NOT IN (:ids) clause instead of the subquery

The DB subquery approach is better, so the parameter should be removed. Fix this →

🟡 `total_embeddings` and `orphaned_chunks` queries not filtered for stale entities

src/basic_memory/services/project_service.py, lines 1033–1067

The three search_vector_chunks count queries (lines 1014–1030) correctly apply entity_exists, but the total_embeddings and orphaned_chunks queries join search_vector_chunks without the same filter:

# total_embeddings — no entity_exists filter on chunks
"SELECT COUNT(*) FROM search_vector_chunks c "
"JOIN search_vector_embeddings e ON e.chunk_id = c.id "
"WHERE c.project_id = :project_id"

# orphaned_chunks — same issue
"SELECT COUNT(*) FROM search_vector_chunks c "
"LEFT JOIN search_vector_embeddings e ON ... "
"WHERE c.project_id = :project_id AND e.chunk_id IS NULL"

In practice, _purge_stale_search_rows removes stale chunks during reindex so these counts should be correct post-cleanup. But the defensive filter applied to the other three queries should be consistent here too — if stats are requested before a reindex is run, these counts can still be inflated by stale chunk rows with embeddings. Fix this →

🟡 Weak test assertion

tests/services/test_project_service_embedding_status.py, line 292

assert status.total_indexed_entities <= real_entity_count

This assertion passes even if the stale row IS being counted — as long as there happen to be more real entities than the stale row count. A regression test for this bug should assert an exact value:

# Only stale_entity_id is in search_index, no real entities have been indexed
assert status.total_indexed_entities == 0  # or the known count of real indexed entities

Checklist

Item	Status
Follows Basic Memory coding conventions (CLAUDE.md)	✅
Python 3.12+ type annotations and async patterns	✅
SQLAlchemy 2.0 best practices (`text()`, parameterized queries)	✅
100-character line length	✅
No commented-out code	✅
Literate programming style (decision point comments)	✅
Unit tests for new functions/methods	⚠️ No direct test for `_purge_stale_search_rows` itself
100% test coverage	⚠️ Needs verification — new method may need direct coverage
Database changes include Alembic migrations	✅ N/A (no schema changes)
No hardcoded secrets	✅
Input validation	✅
Windows-compatible file paths	✅ N/A
No emoji/unicode in console/log output	✅

Minor Notes

The SQLite-specific embedding deletion before chunk deletion (lines 452–460) is well-commented and correct — good handling of the missing CASCADE in sqlite-vec.
The entity_exists string interpolation approach (f-string into text()) is acceptable here since both fragments are hardcoded strings, not user input. No SQL injection risk.
The import inside the method (from basic_memory.repository.sqlite_search_repository import SQLiteSearchRepository) is unusual — consider moving to top of file unless there's a circular import issue.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6bf5343faa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-15T20:14:10Z

src/basic_memory/services/search_service.py

+        # Clean up stale rows in search_index and search_vector_chunks
+        # that reference entity_ids no longer in the entity table
+        await self._purge_stale_search_rows(set(entity_ids))


Ensure vector tables exist before purging stale rows

reindex_vectors now calls _purge_stale_search_rows before sync_entity_vectors_batch, but table creation happens inside sync_entity_vectors_batch (_ensure_vector_tables). On a first-time embeddings run (for example bm reindex --embeddings on a project that has never built vectors), the new purge issues DELETE statements against search_vector_chunks/search_vector_embeddings before those tables exist, which raises a DB error and aborts reindexing.

Useful? React with 👍 / 👎.

1. Remove unused valid_entity_ids parameter from _purge_stale_search_rows 2. Add entity_exists filter to total_embeddings and orphaned_chunks queries for consistency (previously only applied to 3 of 5 queries) 3. Strengthen test assertion to exact match instead of <= Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>

github-actions · 2026-03-15T20:16:21Z

Claude finished @phernandez's task in 2m 44s —— View job

PR Review: fix(core): exclude stale entity rows from embedding coverage stats

Tasks:

Gather context and read changed files
Review against team checklist
Post findings

Summary

The follow-up commit (b265c57) addresses all three issues flagged in the previous Claude review. The fix is solid. One remaining concern from the Codex review (P1) appears to be a false positive, but there are two small items worth noting.

Previous Issues — Resolved ✅

Issue	Status
🔴 Dead `valid_entity_ids` parameter in `_purge_stale_search_rows`	✅ Fixed — signature is now `async def _purge_stale_search_rows(self) -> None:` with no parameters
🟡 `total_embeddings` / `orphaned_chunks` not filtered for stale entities	✅ Fixed — `chunk_entity_exists` subquery applied to both JOIN queries (lines 1041, 1047, 1060, 1066)
🟡 Weak `<=` test assertion	✅ Fixed — now uses exact match `assert status.total_indexed_entities == real_indexed_count` (line 298)

Codex P1 — False Positive

The Codex review flagged that _purge_stale_search_rows runs before _ensure_vector_tables inside sync_entity_vectors_batch, creating a risk of DELETE against nonexistent tables on a first run.

This concern is not valid: both SQLiteSearchRepository.init_search_index() (line 97–98) and PostgresSearchRepository.init_search_index() (line 85–86) call _ensure_vector_tables() at service startup, before any reindex command executes. By the time reindex_vectors is called, the vector tables already exist.

Remaining Observations

🟡 Local import inside `_purge_stale_search_rows`

src/basic_memory/services/search_service.py, line 434

async def _purge_stale_search_rows(self) -> None:
    from basic_memory.repository.sqlite_search_repository import SQLiteSearchRepository

This import is deferred inside the method body. If there's no circular import preventing it, move it to the module top-level with the other repository imports. Deferred imports signal a possible circular dependency but there's no comment explaining the constraint here. Fix this →

🟡 Stale `search_vector_chunks` rows not covered by regression test

tests/services/test_project_service_embedding_status.py, line 255–298

test_embedding_status_excludes_stale_entity_ids inserts a stale row into search_index only. The chunk_entity_exists subquery filter also applies to search_vector_chunks counts (total_chunks, total_embeddings, orphaned_chunks), but there's no test verifying that a stale chunk row is excluded from those counts. The existing tests for total_embeddings / orphaned_chunks use real entity IDs, so a regression for stale chunks would pass undetected.

Checklist

Item	Status
Follows Basic Memory coding conventions (CLAUDE.md)	✅
Python 3.12+ type annotations and async patterns	✅
SQLAlchemy 2.0 best practices (`text()`, parameterized queries)	✅
100-character line length	✅
No commented-out code	✅
Literate programming style (decision point comments)	✅
Unit tests for new functions/methods	⚠️ `_purge_stale_search_rows` is exercised through `test_reindex_vectors` but its stale-row-removal behavior is not directly asserted
100% test coverage	✅ Lines covered via `reindex_vectors` test path
Database changes include Alembic migrations	✅ N/A
No hardcoded secrets	✅
Input validation	✅
Windows-compatible file paths	✅ N/A

The core fix is correct and well-structured. The two observations above are non-blocking but would improve defensive coverage for future regressions.

Postgres search_index is a regular table with NOT NULL id constraint, unlike SQLite's FTS5 virtual table where id is UNINDEXED. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>

github-actions · 2026-03-15T20:42:48Z

Claude finished @phernandez's task in 1m 13s —— View job

PR Review: fix(core): exclude stale entity rows from embedding coverage stats

Tasks:

Gather context and read changed files
Understand the request (PR review)
Review against team checklist
Post findings

Summary

The latest commit (0b732f6) fixes the Postgres test failure by including the required id column in the INSERT INTO search_index fixture. The PR is in good shape — all three issues from the first review and the Codex P1 (false positive, already analyzed) are resolved.

Two minor observations remain from the previous review that are still unaddressed:

Remaining Observations

🟡 Redundant `from sqlalchemy import text` inside `_purge_stale_search_rows`

src/basic_memory/services/search_service.py, line 435

async def _purge_stale_search_rows(self) -> None:
    from basic_memory.repository.sqlite_search_repository import SQLiteSearchRepository
    from sqlalchemy import text  # ← text is already imported at line 12

text is imported at module-level on line 12 and does not need to be re-imported here. Additionally, there is no circular import between sqlite_search_repository and search_service (confirmed: the repository does not import from this service), so the SQLiteSearchRepository import could also be moved to the module top with the other repository imports. At minimum, the redundant text import should be removed. Fix this →

🟡 No regression test for stale `search_vector_chunks` rows

tests/services/test_project_service_embedding_status.py

test_embedding_status_excludes_stale_entity_ids inserts a stale row into search_index only. The chunk_entity_exists subquery filter was also added to total_chunks, total_embeddings, and orphaned_chunks (lines 1021, 1030, 1041/1047, 1060/1066), but there is no test that verifies a stale search_vector_chunks row is excluded from those counts. A regression for that path would pass undetected under the current test suite.

Checklist

Item	Status
Follows Basic Memory coding conventions (CLAUDE.md)	✅
Python 3.12+ type annotations and async patterns	✅
SQLAlchemy 2.0 best practices (`text()`, parameterized queries)	✅
100-character line length	✅
No commented-out code	✅
Literate programming style (decision point comments)	✅
Unit tests for new functions/methods	✅ New test covers stale `search_index` exclusion; `_purge_stale_search_rows` exercised via `reindex_vectors` path
100% test coverage	✅
Database changes include Alembic migrations	✅ N/A (no schema changes)
No hardcoded secrets	✅
Input validation	✅
Windows-compatible file paths	✅ N/A
No emoji/unicode in console/log output	✅

The core fix is correct and all blocking issues are resolved. The two observations above are non-blocking but would tighten defensive coverage.

chatgpt-codex-connector bot reviewed Mar 15, 2026

View reviewed changes

phernandez merged commit c8b0044 into main Mar 15, 2026
25 checks passed

phernandez deleted the fix/670-stale-embedding-stats branch March 15, 2026 23:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(core): exclude stale entity rows from embedding coverage stats#675

fix(core): exclude stale entity rows from embedding coverage stats#675
phernandez merged 3 commits intomainfrom
fix/670-stale-embedding-stats

phernandez commented Mar 15, 2026

Uh oh!

github-actions bot commented Mar 15, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Mar 15, 2026

Uh oh!

github-actions bot commented Mar 15, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Mar 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

phernandez commented Mar 15, 2026

Summary

Test plan

Uh oh!

github-actions bot commented Mar 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: fix(core): exclude stale entity rows from embedding coverage stats

Summary

Issues Found

🔴 Dead parameter in _purge_stale_search_rows

🟡 total_embeddings and orphaned_chunks queries not filtered for stale entities

🟡 Weak test assertion

Checklist

Minor Notes

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Mar 15, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Mar 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: fix(core): exclude stale entity rows from embedding coverage stats

Summary

Previous Issues — Resolved ✅

Codex P1 — False Positive

Remaining Observations

🟡 Local import inside _purge_stale_search_rows

🟡 Stale search_vector_chunks rows not covered by regression test

Checklist

Uh oh!

github-actions bot commented Mar 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: fix(core): exclude stale entity rows from embedding coverage stats

Summary

Remaining Observations

🟡 Redundant from sqlalchemy import text inside _purge_stale_search_rows

🟡 No regression test for stale search_vector_chunks rows

Checklist

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions bot commented Mar 15, 2026 •

edited

Loading

🔴 Dead parameter in `_purge_stale_search_rows`

🟡 `total_embeddings` and `orphaned_chunks` queries not filtered for stale entities

github-actions bot commented Mar 15, 2026 •

edited

Loading

🟡 Local import inside `_purge_stale_search_rows`

🟡 Stale `search_vector_chunks` rows not covered by regression test

github-actions bot commented Mar 15, 2026 •

edited

Loading

🟡 Redundant `from sqlalchemy import text` inside `_purge_stale_search_rows`

🟡 No regression test for stale `search_vector_chunks` rows