fix: Return matched chunk text in search results by phernandez · Pull Request #601 · basicmachines-co/basic-memory

phernandez · 2026-02-23T01:24:17Z

Summary

Thread matching chunk text from vector search through to API response as new matched_chunk field
Previously search results always returned the beginning of the note (truncated to 250 chars) regardless of which section matched
New optional field — no breaking changes, FTS-only results have matched_chunk: null

Changes

SearchIndexRow: new matched_chunk_text field
Vector queries (SQLite + Postgres): add c.chunk_text to SELECT
_search_vector_only: track best_chunk_by_si_id, set on ranked results
SearchResult schema: new optional matched_chunk field
Hybrid merge: vector rows overwrite FTS rows, preserving chunk text naturally

Test plan

New test: test_matched_chunk_text_populated_on_vector_results (unit)
New test: test_search_result_includes_matched_chunk (API router)
New test: test_search_result_omits_matched_chunk_when_none (API router)
Updated mock vector rows in threshold/pagination/hybrid tests
All 28 targeted tests pass
Lint, format, typecheck clean

Closes #599.

🤖 Generated with Claude Code

The `Entity.entity_type` column stores the frontmatter `type` value (note, spec, schema, person) but its name collides with `SearchItemType` (entity/observation/relation). This caused real bugs where `search_by_metadata({"entity_type": "spec"})` would fail because the metadata filter looked in the wrong JSON column. Changes: - Alembic migration renames column + index on entity table, updates search_index JSON metadata (both SQLite and Postgres) - ORM model, Pydantic schemas, services, repositories, API routers, MCP tools/clients, CLI commands, and schema inference engine all updated to use `note_type` - `SearchQuery.types` renamed to `SearchQuery.note_types` for clarity - Type alias `EntityType` renamed to `NoteType` - ~52 test files updated Unchanged: `SearchItemType` enum, `entity_types` params that filter by entity/observation/relation, frontmatter YAML `type:` key, `entity_metadata` column. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>

Signed-off-by: phernandez <paul@basicmachines.co>

Thread the actual matching chunk text from vector search through to the API response as a new `matched_chunk` field. Previously, search results always returned the beginning of the note (truncated to 250 chars) regardless of which section matched the query. The vector search pipeline already matches against specific chunks, but the chunk text was discarded — only chunk_key and distance were returned. This fix adds c.chunk_text to the vector query SELECT (both SQLite and Postgres), tracks the best-matching chunk per search_index row, and surfaces it as `matched_chunk` on the SearchResult schema. - SearchIndexRow: new `matched_chunk_text` field - Vector queries: SELECT c.chunk_text alongside existing columns - _search_vector_only: track best_chunk_by_si_id, set on ranked results - SearchResult schema: new optional `matched_chunk` field - Hybrid merge: vector rows overwrite FTS rows, preserving chunk text - FTS-only results: matched_chunk remains None (no vector match) Closes #599. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>

github-actions · 2026-02-23T01:24:29Z

Claude finished @phernandez's task in 3m 0s —— View job

PR Review: Return matched chunk text in search results

Reviewing PR #601 against the team checklist...

Read changed files and gather context
Analyze code quality and standards
Review testing and documentation
Check architecture and design patterns
Assess security and performance
Check compatibility concerns
Post final review summary

Summary

This PR cleanly threads the matched chunk text from vector search through to the API response. The implementation is minimal, backwards-compatible, and well-tested. A few observations below.

Code Quality & Standards

Follows Basic Memory's coding conventions in CLAUDE.md
Python 3.12+ type annotations and async patterns
SQLAlchemy 2.0 best practices (raw SQL via text(), consistent with existing code)
FastAPI and Typer conventions followed
100-character line length limit maintained
No commented-out code blocks

Testing & Documentation

Unit tests for new logic (test_matched_chunk_text_populated_on_vector_results)
API router tests added (positive + null cases)
FakeRow updated in all three test files that define it (threshold, pagination, hybrid)
100% test coverage maintained — no uncovered branches introduced
Minor observation: The new test_search_result_includes_matched_chunk and test_search_result_omits_matched_chunk_when_none tests mock at the service boundary (get_search_service_v2_external) rather than testing the actual to_search_results utility directly. This is fine for API-level tests but to_search_results in api/v2/utils.py (line 180: matched_chunk=r.matched_chunk_text) doesn't have a direct unit test — it's only covered by the mocked API tests. Not a blocker, but worth noting.

Basic Memory Architecture

No database schema changes — data already exists in search_vector_chunks.chunk_text, just not selected
No Alembic migration needed (correct)
Backwards compatible: matched_chunk: Optional[str] = None is a new optional field
FTS-only results correctly return matched_chunk: null
Hybrid merge correctly preserves chunk text: vector rows overwrite FTS rows in rows_by_id (line 1136 of search_repository_base.py), so results present in both FTS and vector carry the chunk text naturally

Specific Code Observations

1. Empty string vs. None for missing chunk text (search_repository_base.py:881)

chunk_text = row.get("chunk_text", "")

If chunk_text comes back as an empty string from the DB, it would be stored in best_chunk_by_si_id and surface as matched_chunk: "" in the API response rather than null. In practice this can't happen because the chunking code filters empty chunks before storage ([chunk for chunk in chunked_sections if chunk.strip()]) and the column has a NOT NULL constraint. But the default of "" is slightly misleading — None would be a safer signal. Fix this →

2. best_chunk_by_si_id not pruned after threshold filtering (search_repository_base.py:895–905)

When effective_min_similarity > 0.0, similarity_by_si_id is filtered but best_chunk_by_si_id is not. The dict retains entries for below-threshold results. This is functionally harmless since .get(si_id) is only called for IDs in the filtered similarity_by_si_id, but it's a minor memory holding. Not a blocker.

3. Alembic migration file included in diff

The migration j3d4e5f6g7h8_rename_entity_type_to_note_type.py shows +1/-4 changes. This appears to be a carry-over fix from the base branch (note_type_entity_type_refactor) and looks correct. Not related to the chunk text feature itself.

Security & Performance

No hardcoded secrets or credentials
No new input validation surface (chunk text is read from DB, not user input)
chunk_text is selected alongside existing columns — no additional round-trip to DB
No sensitive data logged

Compatibility

Windows path handling unchanged
No emoji/unicode in console or log output introduced
Optional field — no breaking changes to existing consumers

CI Status

The "Pull Request Title" check failed due to the semantic PR title linter (amannn/action-semantic-pull-request) — the title needs a conventional commit prefix (e.g., feat:, fix:). This is unrelated to code quality. Consider renaming the PR title to feat: return matched chunk text in search results to pass the check.

Verdict

LGTM. The implementation is clean, minimal, and well-tested. The only actionable item is the chunk_text default ("" → None), which is minor. The PR title needs updating for CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>

github-actions · 2026-02-23T02:31:21Z

Claude finished @phernandez's task in 1m 18s —— View job

PR Review: Return matched chunk text in search results

Reviewing PR #601 against the team checklist...

Read changed files and gather context
Analyze code quality and standards
Review testing and documentation
Check architecture and design patterns
Assess security and performance
Check compatibility concerns
Post final review summary

Summary

This PR cleanly threads the matched chunk text from vector search through to the API response. The implementation is minimal, well-scoped, and backwards-compatible. A few observations below.

Code Quality & Standards

Follows Basic Memory's coding conventions in CLAUDE.md
Python 3.12+ type annotations and async patterns
SQLAlchemy 2.0 best practices (text() raw SQL, consistent with existing patterns)
FastAPI conventions followed
100-character line length limit maintained
No commented-out code blocks

Testing & Documentation

Unit test for new logic: test_matched_chunk_text_populated_on_vector_results in test_vector_threshold.py
API router tests added for both positive case (test_search_result_includes_matched_chunk) and null case (test_search_result_omits_matched_chunk_when_none)
FakeRow dataclass updated with matched_chunk_text in all three test files that define it (test_vector_threshold.py, test_hybrid_rrf.py, test_vector_pagination.py)
100% test coverage maintained — no uncovered branches introduced

Minor observation: to_search_results in api/v2/utils.py:180 (the line matched_chunk=r.matched_chunk_text) is exercised only through mocked API tests, not a direct unit test of the utility function. The existing mock-based API tests provide adequate coverage, so this is not a blocker.

Basic Memory Architecture

No database schema changes — chunk_text already exists in search_vector_chunks, just wasn't selected before
No Alembic migration needed (correct — no schema change)
Backwards compatible: matched_chunk: Optional[str] = None is a new optional field with a null default
FTS-only results correctly return matched_chunk: null (field defaults to None)
Hybrid merge preserves chunk text correctly: vector rows overwrite FTS rows in rows_by_id (search_repository_base.py:1136), so results present in both carry the chunk text naturally

Specific Code Observations

1. Empty string default for missing chunk text (search_repository_base.py:881)

chunk_text = row.get("chunk_text", "")

If chunk_text comes back empty from the DB, it would be stored and surface as matched_chunk: "" in the API response rather than null. In practice this can't happen because the chunking code filters empty strings before storage ([chunk for chunk in chunked_sections if chunk.strip()]) and the column has a NOT NULL constraint. However, None would be a semantically safer sentinel value here, and consistent with matched_chunk: Optional[str] = None in the schema. Fix this →

2. best_chunk_by_si_id not pruned after threshold filtering (search_repository_base.py:895–905)

After the threshold filter removes entries from similarity_by_si_id, best_chunk_by_si_id still holds entries for those removed IDs. This is functionally harmless — .get(si_id) is only called for keys present in the filtered similarity_by_si_id — but the dict retains memory for filtered-out results. Not a blocker.

3. Alembic migration file included in diff

j3d4e5f6g7h8_rename_entity_type_to_note_type.py shows +1/-4 changes. This is a carry-over from the base branch rename refactor and looks correct. Not related to the chunk text feature.

Security & Performance

No hardcoded secrets or credentials
No new input validation surface — chunk_text is read from DB, not user input
chunk_text is selected alongside existing columns with no additional round-trip to DB
No sensitive data logged

Compatibility

Windows path handling unchanged
No emoji or non-ASCII characters in console or log output
Optional field — no breaking changes to existing consumers

CI Status

The "Pull Request Title" check is failing because the title needs a conventional commit prefix. Consider renaming the PR title to feat: return matched chunk text in search results to pass the amannn/action-semantic-pull-request linter. This is unrelated to code correctness.

Verdict

LGTM. The implementation is clean, minimal, and well-tested. The only actionable item is the chunk_text = row.get("chunk_text", "") default (observation #1 above), which is a minor semantic concern. The PR title needs a feat: prefix for CI to pass.

phernandez and others added 3 commits February 22, 2026 16:23

add cstlsp to dev deps

2f5ca83

Signed-off-by: phernandez <paul@basicmachines.co>

Base automatically changed from note_type_entity_type_refactor to main February 23, 2026 02:28

merge main into matched-chunk-text

bf1c565

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>

phernandez changed the title ~~Return matched chunk text in search results~~ fix: Return matched chunk text in search results Feb 23, 2026

phernandez merged commit 0f3889f into main Feb 23, 2026
26 of 27 checks passed

phernandez deleted the matched-chunk-text branch February 23, 2026 03:01

phernandez added this to the v0.19.0 milestone Feb 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Return matched chunk text in search results#601

fix: Return matched chunk text in search results#601
phernandez merged 4 commits intomainfrom
matched-chunk-text

phernandez commented Feb 23, 2026

Uh oh!

github-actions bot commented Feb 23, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Feb 23, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

phernandez commented Feb 23, 2026

Summary

Changes

Test plan

Uh oh!

github-actions bot commented Feb 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Return matched chunk text in search results

Summary

Code Quality & Standards

Testing & Documentation

Basic Memory Architecture

Specific Code Observations

Security & Performance

Compatibility

CI Status

Verdict

Uh oh!

github-actions bot commented Feb 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Return matched chunk text in search results

Summary

Code Quality & Standards

Testing & Documentation

Basic Memory Architecture

Specific Code Observations

Security & Performance

Compatibility

CI Status

Verdict

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions bot commented Feb 23, 2026 •

edited

Loading

github-actions bot commented Feb 23, 2026 •

edited

Loading