feat(search): Implement semantic chunk merging and strict search modes #34

Merged
doITmagic merged 3 commits into dev from feat/smart-search-merge on Mar 7, 2026

@doITmagic
Owner

Description

This PR addresses the issue of search result "pollution" when querying large documents. Previously, large Markdown, HTML, YAML, or JSON files indexed with TreeSitter would dominate the top search results with multiple fragmented chunks, pushing out relevant backend code.

Changes included:

  1. Tree-based Chunk Merging (Deduplication): Adds a post-retrieval processing step, groupDocsByTree, that merges adjacent or overlapping chunks from the same file and AST signature back into a single cohesive block. Any gap lines missing between chunks are loaded directly from disk.
  2. Strict Search Modes: Added a new Mode field to the rag_search tool (strict_code, strict_docs, all), allowing AI agents to explicitly filter out documentation or code to avoid context pollution.
  3. Unit Tests: Added tests covering gap retrieval, graceful fallback when gap lines cannot be read from disk, and prevention of merging across files.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Checklist:

  • I have performed a self-review of my own code
  • I have formatted my code with go fmt ./...
  • I have run tests go test ./... and they pass
  • I have verified integration with Ollama/Qdrant (if applicable)
  • I have updated the documentation accordingly

Copilot AI review requested due to automatic review settings March 7, 2026 17:38
@doITmagic doITmagic self-assigned this Mar 7, 2026

Copilot AI left a comment

Pull request overview

This PR addresses search result "pollution" from large documentation files by introducing: (1) a post-retrieval chunk merging step that consolidates adjacent or overlapping chunks from the same file and AST signature into unified blocks, reading gap lines directly from disk; and (2) a new Mode field (strict_code, strict_docs, all) on the rag_search tool that explicitly filters out documentation or code results.

Changes:

  • Added mode-based post-retrieval filtering (strict_code, strict_docs, all) in Execute.
  • Added groupDocsByTree to merge doc chunks by (filePath, signature) key, with disk-based gap fill and fallback to [...] concatenation.
  • Added TestGroupDocsByTree covering gap retrieval, multi-file isolation, and fallback behavior.
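
As a rough illustration of the mode-based filtering described above, the sketch below filters results by file extension. The result type, the filterByMode function, and the extension list are assumptions for this example; isDocFile here is a stand-in and does not reproduce the PR's helper. Treating an empty or unrecognized mode the same as all is also an assumption.

```go
package main

import (
	"path/filepath"
	"strings"
)

// result is a hypothetical stand-in for a search hit.
type result struct {
	FilePath string
	Content  string
}

// docExtensions is an assumed list; the PR's actual list is not shown here.
var docExtensions = map[string]bool{
	".md": true, ".html": true, ".yaml": true, ".yml": true, ".json": true,
}

// isDocFile reports whether path looks like documentation or config,
// judged only by its (case-insensitive) extension.
func isDocFile(path string) bool {
	return docExtensions[strings.ToLower(filepath.Ext(path))]
}

// filterByMode keeps only code results for "strict_code", only doc results
// for "strict_docs", and everything for any other value (including "all").
func filterByMode(results []result, mode string) []result {
	switch mode {
	case "strict_code", "strict_docs":
		// handled below
	default:
		return results
	}

	wantDocs := mode == "strict_docs"
	kept := make([]result, 0, len(results))
	for _, r := range results {
		if isDocFile(r.FilePath) == wantDocs {
			kept = append(kept, r)
		}
	}
	return kept
}
```

Filtering after retrieval keeps the ranking logic untouched: the same candidate list is produced for every mode, and the mode only decides which entries survive.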

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| internal/service/tools/smart_search.go | Adds Mode field to SmartSearchInput, mode-based result filtering in Execute, readLines helper, and groupDocsByTree merging function. |
| internal/service/tools/smart_search_test.go | Unit tests for groupDocsByTree covering code pass-through, doc grouping with gap fill, multi-file isolation, and disk-fallback behavior. |

Comment thread internal/service/tools/smart_search.go
Comment thread internal/service/tools/smart_search.go Outdated
Comment thread internal/service/tools/smart_search.go Outdated
Comment thread internal/service/tools/smart_search_test.go Outdated
Comment thread internal/service/tools/smart_search_test.go Outdated
Comment thread internal/service/tools/smart_search.go Outdated
Comment thread internal/service/tools/smart_search.go
- Auto-enable IncludeDocs when mode=strict_docs
- Use bufio.Scanner in readLines for memory efficiency
- Use struct groupKey instead of string separator for collision safety
- Extract shared isDocSymbolType/isDocExtension helpers for consistency
- Remove non-doc extensions (.sh, .sql, .css, .scss, .svelte) from isDocFile
- Fix misleading comment about code_block in test
- Add error check for os.WriteFile in test
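
The switch from a string separator to a struct groupKey can be motivated with a small sketch. The "|" separator and the helper names below are assumptions for illustration, since the separator originally used is not shown in this conversation. Two different (file path, signature) pairs can concatenate to the same string, while a struct key compares field by field and keeps them apart.

```go
package main

// groupKey is a comparable struct, so it can be used directly as a map key.
type groupKey struct {
	FilePath  string
	Signature string
}

// stringKey is the collision-prone alternative: if either field contains
// the separator, distinct pairs can produce the same key.
func stringKey(filePath, signature string) string {
	return filePath + "|" + signature
}

// distinctStringKeys counts unique keys when pairs are joined into strings.
func distinctStringKeys(pairs [][2]string) int {
	seen := make(map[string]struct{})
	for _, p := range pairs {
		seen[stringKey(p[0], p[1])] = struct{}{}
	}
	return len(seen)
}

// distinctStructKeys counts unique keys when pairs are stored as groupKey.
func distinctStructKeys(pairs [][2]string) int {
	seen := make(map[groupKey]struct{})
	for _, p := range pairs {
		seen[groupKey{FilePath: p[0], Signature: p[1]}] = struct{}{}
	}
	return len(seen)
}
```

For the pairs ("docs/a|b.md", "c") and ("docs/a", "b.md|c"), the string form yields a single key, so the two groups would be merged by mistake, whereas the struct form yields two.
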
@doITmagic doITmagic merged commit 9a1c628 into dev Mar 7, 2026
5 checks passed
@doITmagic doITmagic deleted the feat/smart-search-merge branch March 7, 2026 20:02
doITmagic pushed a commit that referenced this pull request Mar 8, 2026