Skip to content

fix: Return matched chunk text in search results#601

Merged
phernandez merged 4 commits intomainfrom
matched-chunk-text
Feb 23, 2026
Merged

fix: Return matched chunk text in search results#601
phernandez merged 4 commits intomainfrom
matched-chunk-text

Conversation

@phernandez
Copy link
Member

Summary

  • Thread matching chunk text from vector search through to API response as new matched_chunk field
  • Previously search results always returned the beginning of the note (truncated to 250 chars) regardless of which section matched
  • New optional field — no breaking changes, FTS-only results have matched_chunk: null

Changes

  • SearchIndexRow: new matched_chunk_text field
  • Vector queries (SQLite + Postgres): add c.chunk_text to SELECT
  • _search_vector_only: track best_chunk_by_si_id, set on ranked results
  • SearchResult schema: new optional matched_chunk field
  • Hybrid merge: vector rows overwrite FTS rows, preserving chunk text naturally

Test plan

  • New test: test_matched_chunk_text_populated_on_vector_results (unit)
  • New test: test_search_result_includes_matched_chunk (API router)
  • New test: test_search_result_omits_matched_chunk_when_none (API router)
  • Updated mock vector rows in threshold/pagination/hybrid tests
  • All 28 targeted tests pass
  • Lint, format, typecheck clean

Closes #599.

🤖 Generated with Claude Code

phernandez and others added 3 commits February 22, 2026 16:23
The `Entity.entity_type` column stores the frontmatter `type` value
(note, spec, schema, person) but its name collides with `SearchItemType`
(entity/observation/relation). This caused real bugs where
`search_by_metadata({"entity_type": "spec"})` would fail because the
metadata filter looked in the wrong JSON column.

Changes:
- Alembic migration renames column + index on entity table, updates
  search_index JSON metadata (both SQLite and Postgres)
- ORM model, Pydantic schemas, services, repositories, API routers,
  MCP tools/clients, CLI commands, and schema inference engine all
  updated to use `note_type`
- `SearchQuery.types` renamed to `SearchQuery.note_types` for clarity
- Type alias `EntityType` renamed to `NoteType`
- ~52 test files updated

Unchanged: `SearchItemType` enum, `entity_types` params that filter by
entity/observation/relation, frontmatter YAML `type:` key,
`entity_metadata` column.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
Signed-off-by: phernandez <paul@basicmachines.co>
Thread the actual matching chunk text from vector search through to the API
response as a new `matched_chunk` field. Previously, search results always
returned the beginning of the note (truncated to 250 chars) regardless of
which section matched the query.

The vector search pipeline already matches against specific chunks, but the
chunk text was discarded — only chunk_key and distance were returned. This
fix adds c.chunk_text to the vector query SELECT (both SQLite and Postgres),
tracks the best-matching chunk per search_index row, and surfaces it as
`matched_chunk` on the SearchResult schema.

- SearchIndexRow: new `matched_chunk_text` field
- Vector queries: SELECT c.chunk_text alongside existing columns
- _search_vector_only: track best_chunk_by_si_id, set on ranked results
- SearchResult schema: new optional `matched_chunk` field
- Hybrid merge: vector rows overwrite FTS rows, preserving chunk text
- FTS-only results: matched_chunk remains None (no vector match)

Closes #599.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Copy link
Contributor

github-actions bot commented Feb 23, 2026

Claude finished @phernandez's task in 3m 0s —— View job


PR Review: Return matched chunk text in search results

Reviewing PR #601 against the team checklist...

  • Read changed files and gather context
  • Analyze code quality and standards
  • Review testing and documentation
  • Check architecture and design patterns
  • Assess security and performance
  • Check compatibility concerns
  • Post final review summary

Summary

This PR cleanly threads the matched chunk text from vector search through to the API response. The implementation is minimal, backwards-compatible, and well-tested. A few observations below.


Code Quality & Standards

  • Follows Basic Memory's coding conventions in CLAUDE.md
  • Python 3.12+ type annotations and async patterns
  • SQLAlchemy 2.0 best practices (raw SQL via text(), consistent with existing code)
  • FastAPI and Typer conventions followed
  • 100-character line length limit maintained
  • No commented-out code blocks

Testing & Documentation

  • Unit tests for new logic (test_matched_chunk_text_populated_on_vector_results)
  • API router tests added (positive + null cases)
  • FakeRow updated in all three test files that define it (threshold, pagination, hybrid)
  • 100% test coverage maintained — no uncovered branches introduced
  • Minor observation: The new test_search_result_includes_matched_chunk and test_search_result_omits_matched_chunk_when_none tests mock at the service boundary (get_search_service_v2_external) rather than testing the actual to_search_results utility directly. This is fine for API-level tests but to_search_results in api/v2/utils.py (line 180: matched_chunk=r.matched_chunk_text) doesn't have a direct unit test — it's only covered by the mocked API tests. Not a blocker, but worth noting.

Basic Memory Architecture

  • No database schema changes — data already exists in search_vector_chunks.chunk_text, just not selected
  • No Alembic migration needed (correct)
  • Backwards compatible: matched_chunk: Optional[str] = None is a new optional field
  • FTS-only results correctly return matched_chunk: null
  • Hybrid merge correctly preserves chunk text: vector rows overwrite FTS rows in rows_by_id (line 1136 of search_repository_base.py), so results present in both FTS and vector carry the chunk text naturally

Specific Code Observations

1. Empty string vs. None for missing chunk text (search_repository_base.py:881)

chunk_text = row.get("chunk_text", "")

If chunk_text comes back as an empty string from the DB, it would be stored in best_chunk_by_si_id and surface as matched_chunk: "" in the API response rather than null. In practice this can't happen because the chunking code filters empty chunks before storage ([chunk for chunk in chunked_sections if chunk.strip()]) and the column has a NOT NULL constraint. But the default of "" is slightly misleading — None would be a safer signal. Fix this →

2. best_chunk_by_si_id not pruned after threshold filtering (search_repository_base.py:895–905)

When effective_min_similarity > 0.0, similarity_by_si_id is filtered but best_chunk_by_si_id is not. The dict retains entries for below-threshold results. This is functionally harmless since .get(si_id) is only called for IDs in the filtered similarity_by_si_id, but it's a minor memory holding. Not a blocker.

3. Alembic migration file included in diff

The migration j3d4e5f6g7h8_rename_entity_type_to_note_type.py shows +1/-4 changes. This appears to be a carry-over fix from the base branch (note_type_entity_type_refactor) and looks correct. Not related to the chunk text feature itself.


Security & Performance

  • No hardcoded secrets or credentials
  • No new input validation surface (chunk text is read from DB, not user input)
  • chunk_text is selected alongside existing columns — no additional round-trip to DB
  • No sensitive data logged

Compatibility

  • Windows path handling unchanged
  • No emoji/unicode in console or log output introduced
  • Optional field — no breaking changes to existing consumers

CI Status

The "Pull Request Title" check failed due to the semantic PR title linter (amannn/action-semantic-pull-request) — the title needs a conventional commit prefix (e.g., feat:, fix:). This is unrelated to code quality. Consider renaming the PR title to feat: return matched chunk text in search results to pass the check.


Verdict

LGTM. The implementation is clean, minimal, and well-tested. The only actionable item is the chunk_text default (""None), which is minor. The PR title needs updating for CI.

Base automatically changed from note_type_entity_type_refactor to main February 23, 2026 02:28
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Copy link
Contributor

github-actions bot commented Feb 23, 2026

Claude finished @phernandez's task in 1m 18s —— View job


PR Review: Return matched chunk text in search results

Reviewing PR #601 against the team checklist...

  • Read changed files and gather context
  • Analyze code quality and standards
  • Review testing and documentation
  • Check architecture and design patterns
  • Assess security and performance
  • Check compatibility concerns
  • Post final review summary

Summary

This PR cleanly threads the matched chunk text from vector search through to the API response. The implementation is minimal, well-scoped, and backwards-compatible. A few observations below.


Code Quality & Standards

  • Follows Basic Memory's coding conventions in CLAUDE.md
  • Python 3.12+ type annotations and async patterns
  • SQLAlchemy 2.0 best practices (text() raw SQL, consistent with existing patterns)
  • FastAPI conventions followed
  • 100-character line length limit maintained
  • No commented-out code blocks

Testing & Documentation

  • Unit test for new logic: test_matched_chunk_text_populated_on_vector_results in test_vector_threshold.py
  • API router tests added for both positive case (test_search_result_includes_matched_chunk) and null case (test_search_result_omits_matched_chunk_when_none)
  • FakeRow dataclass updated with matched_chunk_text in all three test files that define it (test_vector_threshold.py, test_hybrid_rrf.py, test_vector_pagination.py)
  • 100% test coverage maintained — no uncovered branches introduced

Minor observation: to_search_results in api/v2/utils.py:180 (the line matched_chunk=r.matched_chunk_text) is exercised only through mocked API tests, not a direct unit test of the utility function. The existing mock-based API tests provide adequate coverage, so this is not a blocker.


Basic Memory Architecture

  • No database schema changes — chunk_text already exists in search_vector_chunks, just wasn't selected before
  • No Alembic migration needed (correct — no schema change)
  • Backwards compatible: matched_chunk: Optional[str] = None is a new optional field with a null default
  • FTS-only results correctly return matched_chunk: null (field defaults to None)
  • Hybrid merge preserves chunk text correctly: vector rows overwrite FTS rows in rows_by_id (search_repository_base.py:1136), so results present in both carry the chunk text naturally

Specific Code Observations

1. Empty string default for missing chunk text (search_repository_base.py:881)

chunk_text = row.get("chunk_text", "")

If chunk_text comes back empty from the DB, it would be stored and surface as matched_chunk: "" in the API response rather than null. In practice this can't happen because the chunking code filters empty strings before storage ([chunk for chunk in chunked_sections if chunk.strip()]) and the column has a NOT NULL constraint. However, None would be a semantically safer sentinel value here, and consistent with matched_chunk: Optional[str] = None in the schema. Fix this →

2. best_chunk_by_si_id not pruned after threshold filtering (search_repository_base.py:895–905)

After the threshold filter removes entries from similarity_by_si_id, best_chunk_by_si_id still holds entries for those removed IDs. This is functionally harmless — .get(si_id) is only called for keys present in the filtered similarity_by_si_id — but the dict retains memory for filtered-out results. Not a blocker.

3. Alembic migration file included in diff

j3d4e5f6g7h8_rename_entity_type_to_note_type.py shows +1/-4 changes. This is a carry-over from the base branch rename refactor and looks correct. Not related to the chunk text feature.


Security & Performance

  • No hardcoded secrets or credentials
  • No new input validation surface — chunk_text is read from DB, not user input
  • chunk_text is selected alongside existing columns with no additional round-trip to DB
  • No sensitive data logged

Compatibility

  • Windows path handling unchanged
  • No emoji or non-ASCII characters in console or log output
  • Optional field — no breaking changes to existing consumers

CI Status

The "Pull Request Title" check is failing because the title needs a conventional commit prefix. Consider renaming the PR title to feat: return matched chunk text in search results to pass the amannn/action-semantic-pull-request linter. This is unrelated to code correctness.


Verdict

LGTM. The implementation is clean, minimal, and well-tested. The only actionable item is the chunk_text = row.get("chunk_text", "") default (observation #1 above), which is a minor semantic concern. The PR title needs a feat: prefix for CI to pass.

@phernandez phernandez changed the title Return matched chunk text in search results fix: Return matched chunk text in search results Feb 23, 2026
@phernandez phernandez merged commit 0f3889f into main Feb 23, 2026
26 of 27 checks passed
@phernandez phernandez deleted the matched-chunk-text branch February 23, 2026 03:01
@phernandez phernandez added this to the v0.19.0 milestone Feb 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Return matched chunk text in search results instead of truncated note header

1 participant