Skip to content

fix(core): exclude stale entity rows from embedding coverage stats#675

Merged
phernandez merged 3 commits intomainfrom
fix/670-stale-embedding-stats
Mar 15, 2026
Merged

fix(core): exclude stale entity rows from embedding coverage stats#675
phernandez merged 3 commits intomainfrom
fix/670-stale-embedding-stats

Conversation

@phernandez
Copy link
Member

Summary

  • bm project info was reporting false "missing embeddings" after a successful bm reindex because stale rows in search_index and search_vector_chunks (from previously deleted entities) inflated the total_indexed_entities count
  • Fixed the embedding status queries to filter by entity existence (entity_id IN (SELECT id FROM entity ...))
  • Added stale row cleanup during bm reindex to purge orphaned search rows for deleted entities

Fixes #670

Test plan

  • pytest tests/services/test_project_service_embedding_status.py -v — new test for stale entity exclusion
  • pytest tests/services/test_search_service.py -v — existing search service tests
  • Full CI (SQLite + Postgres) passes
  • Manual verification: after bm reindex, bm project info should not recommend reindex when all current entities have embeddings

🤖 Generated with Claude Code

When entities are deleted, their rows in search_index and
search_vector_chunks can remain as orphans. This caused
`bm project info` to overcount total_indexed_entities and report
false "missing embeddings" recommendations after a successful reindex.

Two-pronged fix:
1. Filter embedding status queries to only count entity_ids that
   still exist in the entity table (fixes reporting)
2. Purge stale search rows during reindex (fixes root cause)

Fixes #670

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Copy link
Contributor

github-actions bot commented Mar 15, 2026

Claude finished @phernandez's task in 2m 9s —— View job


PR Review: fix(core): exclude stale entity rows from embedding coverage stats

Tasks:

  • Gather context and read changed files
  • Understand the request (PR review)
  • Review against team checklist
  • Post findings

Summary

The fix correctly addresses the root cause: stale rows in search_index and search_vector_chunks (from deleted entities) were inflating the total_indexed_entities count in bm project info. Two complementary approaches are used:

  1. Defensive stats queries — subquery filter in get_embedding_status to exclude orphaned rows
  2. Proactive cleanup_purge_stale_search_rows called during bm reindex to actually remove the stale rows

The overall approach is sound. A few issues to address:


Issues Found

🔴 Dead parameter in _purge_stale_search_rows

src/basic_memory/services/search_service.py, line 427

async def _purge_stale_search_rows(self, valid_entity_ids: set[int]) -> None:

valid_entity_ids is passed in but never used inside the method. The function relies entirely on the database subquery entity_id NOT IN (SELECT id FROM entity ...). The caller constructs the set unnecessarily:

entity_ids = [entity.id for entity in entities]
await self._purge_stale_search_rows(set(entity_ids))  # set() is wasted work

Either:

  • Remove the parameter (the DB subquery is sufficient and more correct for concurrent scenarios), or
  • Use the parameter to build a NOT IN (:ids) clause instead of the subquery

The DB subquery approach is better, so the parameter should be removed. Fix this →


🟡 total_embeddings and orphaned_chunks queries not filtered for stale entities

src/basic_memory/services/project_service.py, lines 1033–1067

The three search_vector_chunks count queries (lines 1014–1030) correctly apply entity_exists, but the total_embeddings and orphaned_chunks queries join search_vector_chunks without the same filter:

# total_embeddings — no entity_exists filter on chunks
"SELECT COUNT(*) FROM search_vector_chunks c "
"JOIN search_vector_embeddings e ON e.chunk_id = c.id "
"WHERE c.project_id = :project_id"

# orphaned_chunks — same issue
"SELECT COUNT(*) FROM search_vector_chunks c "
"LEFT JOIN search_vector_embeddings e ON ... "
"WHERE c.project_id = :project_id AND e.chunk_id IS NULL"

In practice, _purge_stale_search_rows removes stale chunks during reindex so these counts should be correct post-cleanup. But the defensive filter applied to the other three queries should be consistent here too — if stats are requested before a reindex is run, these counts can still be inflated by stale chunk rows with embeddings. Fix this →


🟡 Weak test assertion

tests/services/test_project_service_embedding_status.py, line 292

assert status.total_indexed_entities <= real_entity_count

This assertion passes even if the stale row IS being counted — as long as there happen to be more real entities than the stale row count. A regression test for this bug should assert an exact value:

# Only stale_entity_id is in search_index, no real entities have been indexed
assert status.total_indexed_entities == 0  # or the known count of real indexed entities

Checklist

Item Status
Follows Basic Memory coding conventions (CLAUDE.md)
Python 3.12+ type annotations and async patterns
SQLAlchemy 2.0 best practices (text(), parameterized queries)
100-character line length
No commented-out code
Literate programming style (decision point comments)
Unit tests for new functions/methods ⚠️ No direct test for _purge_stale_search_rows itself
100% test coverage ⚠️ Needs verification — new method may need direct coverage
Database changes include Alembic migrations ✅ N/A (no schema changes)
No hardcoded secrets
Input validation
Windows-compatible file paths ✅ N/A
No emoji/unicode in console/log output

Minor Notes

  • The SQLite-specific embedding deletion before chunk deletion (lines 452–460) is well-commented and correct — good handling of the missing CASCADE in sqlite-vec.
  • The entity_exists string interpolation approach (f-string into text()) is acceptable here since both fragments are hardcoded strings, not user input. No SQL injection risk.
  • The import inside the method (from basic_memory.repository.sqlite_search_repository import SQLiteSearchRepository) is unusual — consider moving to top of file unless there's a circular import issue.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6bf5343faa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +407 to +409
# Clean up stale rows in search_index and search_vector_chunks
# that reference entity_ids no longer in the entity table
await self._purge_stale_search_rows(set(entity_ids))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Ensure vector tables exist before purging stale rows

reindex_vectors now calls _purge_stale_search_rows before sync_entity_vectors_batch, but table creation happens inside sync_entity_vectors_batch (_ensure_vector_tables). On a first-time embeddings run (for example bm reindex --embeddings on a project that has never built vectors), the new purge issues DELETE statements against search_vector_chunks/search_vector_embeddings before those tables exist, which raises a DB error and aborts reindexing.

Useful? React with 👍 / 👎.

1. Remove unused valid_entity_ids parameter from _purge_stale_search_rows
2. Add entity_exists filter to total_embeddings and orphaned_chunks
   queries for consistency (previously only applied to 3 of 5 queries)
3. Strengthen test assertion to exact match instead of <=

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Copy link
Contributor

github-actions bot commented Mar 15, 2026

Claude finished @phernandez's task in 2m 44s —— View job


PR Review: fix(core): exclude stale entity rows from embedding coverage stats

Tasks:

  • Gather context and read changed files
  • Review against team checklist
  • Post findings

Summary

The follow-up commit (b265c57) addresses all three issues flagged in the previous Claude review. The fix is solid. One remaining concern from the Codex review (P1) appears to be a false positive, but there are two small items worth noting.


Previous Issues — Resolved ✅

Issue Status
🔴 Dead valid_entity_ids parameter in _purge_stale_search_rows ✅ Fixed — signature is now async def _purge_stale_search_rows(self) -> None: with no parameters
🟡 total_embeddings / orphaned_chunks not filtered for stale entities ✅ Fixed — chunk_entity_exists subquery applied to both JOIN queries (lines 1041, 1047, 1060, 1066)
🟡 Weak <= test assertion ✅ Fixed — now uses exact match assert status.total_indexed_entities == real_indexed_count (line 298)

Codex P1 — False Positive

The Codex review flagged that _purge_stale_search_rows runs before _ensure_vector_tables inside sync_entity_vectors_batch, creating a risk of DELETE against nonexistent tables on a first run.

This concern is not valid: both SQLiteSearchRepository.init_search_index() (line 97–98) and PostgresSearchRepository.init_search_index() (line 85–86) call _ensure_vector_tables() at service startup, before any reindex command executes. By the time reindex_vectors is called, the vector tables already exist.


Remaining Observations

🟡 Local import inside _purge_stale_search_rows

src/basic_memory/services/search_service.py, line 434

async def _purge_stale_search_rows(self) -> None:
    from basic_memory.repository.sqlite_search_repository import SQLiteSearchRepository

This import is deferred inside the method body. If there's no circular import preventing it, move it to the module top-level with the other repository imports. Deferred imports signal a possible circular dependency but there's no comment explaining the constraint here. Fix this →

🟡 Stale search_vector_chunks rows not covered by regression test

tests/services/test_project_service_embedding_status.py, line 255–298

test_embedding_status_excludes_stale_entity_ids inserts a stale row into search_index only. The chunk_entity_exists subquery filter also applies to search_vector_chunks counts (total_chunks, total_embeddings, orphaned_chunks), but there's no test verifying that a stale chunk row is excluded from those counts. The existing tests for total_embeddings / orphaned_chunks use real entity IDs, so a regression for stale chunks would pass undetected.


Checklist

Item Status
Follows Basic Memory coding conventions (CLAUDE.md)
Python 3.12+ type annotations and async patterns
SQLAlchemy 2.0 best practices (text(), parameterized queries)
100-character line length
No commented-out code
Literate programming style (decision point comments)
Unit tests for new functions/methods ⚠️ _purge_stale_search_rows is exercised through test_reindex_vectors but its stale-row-removal behavior is not directly asserted
100% test coverage ✅ Lines covered via reindex_vectors test path
Database changes include Alembic migrations ✅ N/A
No hardcoded secrets
Input validation
Windows-compatible file paths ✅ N/A

The core fix is correct and well-structured. The two observations above are non-blocking but would improve defensive coverage for future regressions.

Postgres search_index is a regular table with NOT NULL id constraint,
unlike SQLite's FTS5 virtual table where id is UNINDEXED.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Copy link
Contributor

github-actions bot commented Mar 15, 2026

Claude finished @phernandez's task in 1m 13s —— View job


PR Review: fix(core): exclude stale entity rows from embedding coverage stats

Tasks:

  • Gather context and read changed files
  • Understand the request (PR review)
  • Review against team checklist
  • Post findings

Summary

The latest commit (0b732f6) fixes the Postgres test failure by including the required id column in the INSERT INTO search_index fixture. The PR is in good shape — all three issues from the first review and the Codex P1 (false positive, already analyzed) are resolved.

Two minor observations remain from the previous review that are still unaddressed:


Remaining Observations

🟡 Redundant from sqlalchemy import text inside _purge_stale_search_rows

src/basic_memory/services/search_service.py, line 435

async def _purge_stale_search_rows(self) -> None:
    from basic_memory.repository.sqlite_search_repository import SQLiteSearchRepository
    from sqlalchemy import text  # ← text is already imported at line 12

text is imported at module-level on line 12 and does not need to be re-imported here. Additionally, there is no circular import between sqlite_search_repository and search_service (confirmed: the repository does not import from this service), so the SQLiteSearchRepository import could also be moved to the module top with the other repository imports. At minimum, the redundant text import should be removed. Fix this →


🟡 No regression test for stale search_vector_chunks rows

tests/services/test_project_service_embedding_status.py

test_embedding_status_excludes_stale_entity_ids inserts a stale row into search_index only. The chunk_entity_exists subquery filter was also added to total_chunks, total_embeddings, and orphaned_chunks (lines 1021, 1030, 1041/1047, 1060/1066), but there is no test that verifies a stale search_vector_chunks row is excluded from those counts. A regression for that path would pass undetected under the current test suite.


Checklist

Item Status
Follows Basic Memory coding conventions (CLAUDE.md)
Python 3.12+ type annotations and async patterns
SQLAlchemy 2.0 best practices (text(), parameterized queries)
100-character line length
No commented-out code
Literate programming style (decision point comments)
Unit tests for new functions/methods ✅ New test covers stale search_index exclusion; _purge_stale_search_rows exercised via reindex_vectors path
100% test coverage
Database changes include Alembic migrations ✅ N/A (no schema changes)
No hardcoded secrets
Input validation
Windows-compatible file paths ✅ N/A
No emoji/unicode in console/log output

The core fix is correct and all blocking issues are resolved. The two observations above are non-blocking but would tighten defensive coverage.

@phernandez phernandez merged commit c8b0044 into main Mar 15, 2026
25 checks passed
@phernandez phernandez deleted the fix/670-stale-embedding-stats branch March 15, 2026 23:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] bm project info reports missing embeddings after successful reindex

1 participant