feat(search): Implement semantic chunk merging and strict search modes #34
Merged
Pull request overview
This PR addresses search result "pollution" from large documentation files by introducing: (1) a post-retrieval chunk merging step that consolidates adjacent or overlapping chunks from the same file and AST signature into unified blocks, reading gap lines directly from disk; and (2) a new `Mode` field (`strict_code`, `strict_docs`, `all`) on the `rag_search` tool that explicitly filters out documentation or code results.
Changes:
- Added mode-based post-retrieval filtering (`strict_code`, `strict_docs`, `all`) in `Execute`.
- Added `groupDocsByTree` to merge doc chunks by `(filePath, signature)` key, with disk-based gap fill and fallback to `[...]` concatenation.
- Added `TestGroupDocsByTree` covering gap retrieval, multi-file isolation, and fallback behavior.
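The mode-based filtering could look roughly like the sketch below. The diff itself is not shown here, so the `Result` type, the `filterByMode` name, and the extension set are assumptions for illustration only; `isDocExtension` borrows the helper name from the follow-up commit but not its real body.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Result is a hypothetical, simplified search hit; the PR's real type is not shown.
type Result struct {
	FilePath string
}

// isDocExtension reports whether a path looks like documentation.
// The extension set is an assumption, not the list used in the PR.
func isDocExtension(path string) bool {
	switch strings.ToLower(filepath.Ext(path)) {
	case ".md", ".html", ".yaml", ".yml", ".json":
		return true
	}
	return false
}

// filterByMode keeps code only for "strict_code", docs only for "strict_docs",
// and returns the input unchanged for any other mode, including "all".
func filterByMode(results []Result, mode string) []Result {
	if mode != "strict_code" && mode != "strict_docs" {
		return results
	}
	kept := make([]Result, 0, len(results))
	for _, r := range results {
		// Keep the result when its doc/code kind matches the requested mode.
		if (mode == "strict_docs") == isDocExtension(r.FilePath) {
			kept = append(kept, r)
		}
	}
	return kept
}

func main() {
	hits := []Result{
		{FilePath: "README.md"},
		{FilePath: "internal/service/tools/smart_search.go"},
	}
	for _, r := range filterByMode(hits, "strict_code") {
		fmt.Println(r.FilePath)
	}
}
```

Filtering after retrieval keeps the retrieval path untouched, at the cost of possibly returning fewer results than requested when most hits are of the excluded kind.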
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `internal/service/tools/smart_search.go` | Adds `Mode` field to `SmartSearchInput`, mode-based result filtering in `Execute`, `readLines` helper, and `groupDocsByTree` merging function. |
| `internal/service/tools/smart_search_test.go` | Unit tests for `groupDocsByTree` covering code pass-through, doc grouping with gap fill, multi-file isolation, and disk-fallback behavior. |
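As a rough illustration of the merging step, the sketch below groups chunks by a struct key and merges adjacent or overlapping line ranges. The `Chunk` type and the `mergeChunks` name are hypothetical, and gap filling from disk is left out; this is not the PR's `groupDocsByTree` implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// Chunk is a hypothetical, simplified retrieved chunk (line numbers are inclusive).
type Chunk struct {
	FilePath  string
	Signature string
	StartLine int
	EndLine   int
}

// groupKey is a comparable struct used as a map key. Unlike a string joined
// with a separator, it cannot collide when a path or signature contains
// the separator character.
type groupKey struct {
	filePath  string
	signature string
}

// mergeChunks merges adjacent or overlapping chunks that share a file and signature.
func mergeChunks(chunks []Chunk) []Chunk {
	groups := make(map[groupKey][]Chunk)
	var order []groupKey // first-seen order keeps the output deterministic
	for _, c := range chunks {
		k := groupKey{filePath: c.FilePath, signature: c.Signature}
		if _, seen := groups[k]; !seen {
			order = append(order, k)
		}
		groups[k] = append(groups[k], c)
	}

	var merged []Chunk
	for _, k := range order {
		g := groups[k]
		sort.Slice(g, func(i, j int) bool { return g[i].StartLine < g[j].StartLine })
		cur := g[0]
		for _, c := range g[1:] {
			if c.StartLine <= cur.EndLine+1 { // overlapping or directly adjacent
				if c.EndLine > cur.EndLine {
					cur.EndLine = c.EndLine
				}
				continue
			}
			merged = append(merged, cur)
			cur = c
		}
		merged = append(merged, cur)
	}
	return merged
}

func main() {
	out := mergeChunks([]Chunk{
		{"a.md", "Intro", 1, 5},
		{"a.md", "Intro", 6, 10},
		{"b.md", "Intro", 3, 4},
	})
	for _, c := range out {
		fmt.Printf("%s %s %d-%d\n", c.FilePath, c.Signature, c.StartLine, c.EndLine)
	}
}
```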
- Auto-enable `IncludeDocs` when `mode=strict_docs`
- Use `bufio.Scanner` in `readLines` for memory efficiency
- Use struct `groupKey` instead of string separator for collision safety
- Extract shared `isDocSymbolType`/`isDocExtension` helpers for consistency
- Remove non-doc extensions (`.sh`, `.sql`, `.css`, `.scss`, `.svelte`) from `isDocFile`
- Fix misleading comment about `code_block` in test
- Add error check for `os.WriteFile` in test
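A minimal sketch of a scanner-based `readLines` with the `[...]` fallback might look like the following. The signatures, the `fillGap` helper, and the `runExample` demo are assumptions for illustration; the PR's actual helper may differ.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readLines returns lines start..end (1-indexed, inclusive) of the file at path.
// It scans line by line and stops after end, so the whole file is never loaded.
// Note: bufio.Scanner rejects lines longer than its buffer (64 KiB by default).
func readLines(path string, start, end int) ([]string, error) {
	if start < 1 || end < start {
		return nil, fmt.Errorf("invalid line range %d-%d", start, end)
	}
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var lines []string
	scanner := bufio.NewScanner(f)
	for n := 1; n <= end && scanner.Scan(); n++ {
		if n >= start {
			lines = append(lines, scanner.Text())
		}
	}
	if err := scanner.Err(); err != nil {
		return nil, err
	}
	return lines, nil
}

// fillGap (hypothetical helper) returns the text between two chunks, or the
// "[...]" marker when the gap cannot be read from disk.
func fillGap(path string, start, end int) string {
	lines, err := readLines(path, start, end)
	if err != nil || len(lines) == 0 {
		return "[...]"
	}
	return strings.Join(lines, "\n")
}

// runExample writes a small temp file, then fills one gap from it and one
// from a missing file to show the fallback.
func runExample() (fromDisk, fallback string, err error) {
	dir, err := os.MkdirTemp("", "gapfill")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "doc.md")
	if err := os.WriteFile(path, []byte("a\nb\nc\nd\n"), 0o644); err != nil {
		return "", "", err
	}
	return fillGap(path, 2, 3), fillGap(filepath.Join(dir, "missing.md"), 2, 3), nil
}

func main() {
	fromDisk, fallback, err := runExample()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q %q\n", fromDisk, fallback)
}
```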
doITmagic pushed a commit that referenced this pull request Mar 8, 2026
Description
This PR addresses the issue of search result "pollution" when querying large documents. Previously, large Markdown, HTML, YAML, or JSON files indexed with TreeSitter would dominate the top search results with multiple fragmented chunks, pushing out relevant backend code.
Changes included:
- `groupDocsByTree`, which merges adjacent or overlapping chunks from the same file and AST signature back into a single cohesive block. It loads any missing gap lines directly from disk.
- A `Mode` field on the `rag_search` tool (`strict_code`, `strict_docs`, `all`), allowing AI agents to explicitly filter out documentation or code to avoid context pollution.

Type of change
Checklist:
- `go fmt ./...`
- `go test ./...` and they pass