feat(code-index): add Semble as a local on-the-fly embedding provider#399
feat(code-index): add Semble as a local on-the-fly embedding provider#399navedmerchant wants to merge 1 commit into
Conversation
📝 WalkthroughWalkthroughThis PR adds Semble, a local hybrid code search tool, as a new embedder provider for code indexing. The implementation spans type definitions, configuration management, a CLI wrapper with process spawning, provider orchestration, manager integration, service guards, webview UI controls, and comprehensive test coverage. ChangesSemble local indexing support
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
packages/types/src/codebase-index.tsESLint skipped: missing config or dependency (missing-dependency). The ESLint configuration references a package that is not available in the sandbox. packages/types/src/embedding.tsESLint skipped: the ESLint configuration for this file references a package that is not available in the sandbox. packages/types/src/vscode-extension-host.tsESLint skipped: the ESLint configuration for this file references a package that is not available in the sandbox.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Not ready for review yet, the code is being reviewed and evolved |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
src/services/code-index/interfaces/manager.ts (1)
78-87: ⚡ Quick winConsider consolidating the duplicated EmbedderProvider type definition.
EmbedderProvideris defined both here and inpackages/types/src/embedding.ts. Different parts of the codebase import from different locations (e.g.,config-manager.tsimports from this file, whileservice-factory.tsimports from@roo-code/types). While currently in sync, maintaining two sources of truth creates risk of drift and inconsistency.♻️ Suggested consolidation approach
Consider removing the local type definition and importing from the shared package instead:
+import { EmbedderProvider } from "`@roo-code/types`" import { VectorStoreSearchResult } from "./vector-store" import * as vscode from "vscode" /** * Interface for the code index manager */ export interface ICodeIndexManager { // ... rest of interface -export type IndexingState = "Standby" | "Indexing" | "Indexed" | "Error" | "Stopping" -export type EmbedderProvider = - | "openai" - | "ollama" - | "openai-compatible" - | "gemini" - | "mistral" - | "vercel-ai-gateway" - | "bedrock" - | "openrouter" - | "semble"This ensures a single source of truth and prevents accidental divergence.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/services/code-index/interfaces/manager.ts` around lines 78 - 87, Duplicate EmbedderProvider type exists locally and in the shared package; remove the local type definition (EmbedderProvider) from this file and replace usages/imports (e.g., in config-manager.ts) to import the type from the canonical package (`@roo-code/types` or packages/types/src/embedding.ts), update any export or re-export as needed so service-factory.ts and other consumers use the single source of truth, run TypeScript build/typecheck to ensure no unresolved imports remain, and delete the now-unused local declaration to avoid drift.webview-ui/src/components/chat/CodeIndexPopover.tsx (1)
774-776: ⚡ Quick winConsider adding test coverage for semble provider UI.
Per coding guidelines for
webview-ui/src/**/*.{ts,tsx}, prefer local webview-ui tests for React/webview behavior. Consider adding tests underwebview-ui/src/**/__tests__to cover:
- Selecting semble from the provider dropdown
- Qdrant URL/API key fields hidden when semble is selected
- Validation passes with only
codebaseIndexEnabledfor semble- Semble path input field renders and updates state correctly
- Installation instructions panel renders
As per coding guidelines: "Prefer local
webview-uitests for React/webview behavior such as component rendering, local state, hooks, form dirty-state, validation, or prop wiring."Also applies to: 1446-1482, 1484-1537
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@webview-ui/src/components/chat/CodeIndexPopover.tsx` around lines 774 - 776, Add local webview-ui tests for the CodeIndexPopover component to cover the semble provider UI: create tests under webview-ui/src/**/__tests__ that render CodeIndexPopover, select the SelectItem with value "semble", assert that Qdrant URL/API key fields are hidden, confirm validation passes when only codebaseIndexEnabled is set for semble, verify the semble path input field appears and updates component state on change, and assert the installation instructions panel is rendered; use the existing test utilities (render, fireEvent/userEvent, assertions) consistent with other webview-ui tests to locate and interact with the provider dropdown and semble-specific inputs.webview-ui/src/i18n/locales/en/settings.json (1)
204-204: 💤 Low valueConsider adding a Windows path example.
The
semblePathDescriptionprovides a helpful Unix path example ("/usr/local/bin/semble"), but users on Windows might benefit from seeing a Windows path example as well, such as"C:\\Python310\\Scripts\\semble.exe".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@webview-ui/src/i18n/locales/en/settings.json` at line 204, The translation entry semblePathDescription currently shows only a Unix example; update its text to include a Windows path example (e.g., "C:\\Python310\\Scripts\\semble.exe") so Windows users have guidance—locate the semblePathDescription key in the locales/en/settings.json and add a short Windows example alongside the existing Unix example, keeping escaping consistent for backslashes and quotes.src/services/code-index/semble/semble-cli.ts (1)
226-256: ⚡ Quick winConsider adding runtime validation for CLI output structure.
Lines 242 and 247 use type assertions (
as SembleSearchResult[]) without runtime validation. If the semble CLI output format changes or contains unexpected data, this could cause runtime errors downstream when code expects the SembleSearchResult shape.Consider adding a simple validation helper or at least checking for required fields before casting.
🛡️ Example validation approach
// Handle successful response: {query, results: [{chunk, score}]} if (parsed.results && Array.isArray(parsed.results)) { - return parsed.results as SembleSearchResult[] + // Basic validation: check that results have expected shape + return parsed.results.filter((r: any) => + r?.chunk?.file_path && + r?.chunk?.content && + typeof r?.score === 'number' + ) as SembleSearchResult[] } // Fallback: if it's a flat array (older format) if (Array.isArray(parsed)) { - return parsed as SembleSearchResult[] + return parsed.filter((r: any) => + r?.chunk?.file_path && + r?.chunk?.content && + typeof r?.score === 'number' + ) as SembleSearchResult[] }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/services/code-index/semble/semble-cli.ts` around lines 226 - 256, The _parseOutput method casts parsed.results and parsed arrays to SembleSearchResult[] without runtime checks; add a lightweight validator (e.g., validateSembleResult) and use it to filter/transform parsed.results and parsed before returning to ensure each item has required properties (e.g., chunk is string, score is number, and any other SembleSearchResult fields); replace the direct casts in _parseOutput with Array.isArray checks plus items.filter(validateSembleResult) (or map-to-safe-shape) so only well-formed SembleSearchResult objects are returned and malformed entries are dropped or logged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/services/code-index/semble/semble-cli.ts`:
- Around line 150-157: The spawn call in _spawn uses an unsupported maxBuffer
option and silences the TS error with an as any cast; remove the maxBuffer
property from the options passed to spawn(this.semblePath, args, { shell: false,
timeout: options.timeout, stdio: [...] }) and drop the as any cast, and if you
need to limit output implement manual buffering and size checks inside
child.stdout?.on("data", ...) and child.stderr?.on("data", ...) (e.g.,
accumulate into stdout/stderr strings, enforce a MAX_BUFFER threshold, kill the
child and reject if exceeded) while keeping the rest of _spawn (child handling,
resolve/reject) intact.
---
Nitpick comments:
In `@src/services/code-index/interfaces/manager.ts`:
- Around line 78-87: Duplicate EmbedderProvider type exists locally and in the
shared package; remove the local type definition (EmbedderProvider) from this
file and replace usages/imports (e.g., in config-manager.ts) to import the type
from the canonical package (`@roo-code/types` or packages/types/src/embedding.ts),
update any export or re-export as needed so service-factory.ts and other
consumers use the single source of truth, run TypeScript build/typecheck to
ensure no unresolved imports remain, and delete the now-unused local declaration
to avoid drift.
In `@src/services/code-index/semble/semble-cli.ts`:
- Around line 226-256: The _parseOutput method casts parsed.results and parsed
arrays to SembleSearchResult[] without runtime checks; add a lightweight
validator (e.g., validateSembleResult) and use it to filter/transform
parsed.results and parsed before returning to ensure each item has required
properties (e.g., chunk is string, score is number, and any other
SembleSearchResult fields); replace the direct casts in _parseOutput with
Array.isArray checks plus items.filter(validateSembleResult) (or
map-to-safe-shape) so only well-formed SembleSearchResult objects are returned
and malformed entries are dropped or logged.
In `@webview-ui/src/components/chat/CodeIndexPopover.tsx`:
- Around line 774-776: Add local webview-ui tests for the CodeIndexPopover
component to cover the semble provider UI: create tests under
webview-ui/src/**/__tests__ that render CodeIndexPopover, select the SelectItem
with value "semble", assert that Qdrant URL/API key fields are hidden, confirm
validation passes when only codebaseIndexEnabled is set for semble, verify the
semble path input field appears and updates component state on change, and
assert the installation instructions panel is rendered; use the existing test
utilities (render, fireEvent/userEvent, assertions) consistent with other
webview-ui tests to locate and interact with the provider dropdown and
semble-specific inputs.
In `@webview-ui/src/i18n/locales/en/settings.json`:
- Line 204: The translation entry semblePathDescription currently shows only a
Unix example; update its text to include a Windows path example (e.g.,
"C:\\Python310\\Scripts\\semble.exe") so Windows users have guidance—locate the
semblePathDescription key in the locales/en/settings.json and add a short
Windows example alongside the existing Unix example, keeping escaping consistent
for backslashes and quotes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 332e2e21-894e-4a2d-9e0f-464ce90a3d05
📒 Files selected for processing (19)
packages/types/src/codebase-index.tspackages/types/src/embedding.tspackages/types/src/vscode-extension-host.tssrc/core/webview/ClineProvider.tssrc/core/webview/webviewMessageHandler.tssrc/services/code-index/config-manager.tssrc/services/code-index/interfaces/config.tssrc/services/code-index/interfaces/manager.tssrc/services/code-index/manager.tssrc/services/code-index/semble/__tests__/provider.spec.tssrc/services/code-index/semble/__tests__/semble-cli.spec.tssrc/services/code-index/semble/index.tssrc/services/code-index/semble/provider.tssrc/services/code-index/semble/semble-cli.tssrc/services/code-index/semble/types.tssrc/services/code-index/service-factory.tssrc/shared/embeddingModels.tswebview-ui/src/components/chat/CodeIndexPopover.tsxwebview-ui/src/i18n/locales/en/settings.json
| private _spawn(args: string[], options: { timeout: number }): Promise<{ stdout: string; stderr: string }> { | ||
| return new Promise((resolve, reject) => { | ||
| const child = spawn(this.semblePath, args, { | ||
| shell: false, | ||
| timeout: options.timeout, | ||
| maxBuffer: 10 * 1024 * 1024, | ||
| stdio: ["ignore", "pipe", "pipe"], | ||
| } as any) |
There was a problem hiding this comment.
Remove invalid maxBuffer option from spawn() call.
The maxBuffer option on line 155 is not supported by child_process.spawn(). The maxBuffer option only exists for exec() and execFile(). Setting it here has no effect, and the as any cast on line 157 suppresses the TypeScript error.
If buffer limiting is needed, consider implementing manual buffer size tracking in the data event handlers.
🔧 Proposed fix to remove unsupported option
private _spawn(args: string[], options: { timeout: number }): Promise<{ stdout: string; stderr: string }> {
return new Promise((resolve, reject) => {
const child = spawn(this.semblePath, args, {
shell: false,
timeout: options.timeout,
- maxBuffer: 10 * 1024 * 1024,
stdio: ["ignore", "pipe", "pipe"],
- } as any)
+ })If you need to enforce buffer limits, add size checks in the data handlers:
const MAX_BUFFER = 10 * 1024 * 1024
let stdout = ""
let stderr = ""
child.stdout?.on("data", (data: Buffer) => {
stdout += data.toString()
if (stdout.length > MAX_BUFFER) {
child.kill()
reject(new Error("Output exceeded buffer limit"))
}
})📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| private _spawn(args: string[], options: { timeout: number }): Promise<{ stdout: string; stderr: string }> { | |
| return new Promise((resolve, reject) => { | |
| const child = spawn(this.semblePath, args, { | |
| shell: false, | |
| timeout: options.timeout, | |
| maxBuffer: 10 * 1024 * 1024, | |
| stdio: ["ignore", "pipe", "pipe"], | |
| } as any) | |
| private _spawn(args: string[], options: { timeout: number }): Promise<{ stdout: string; stderr: string }> { | |
| return new Promise((resolve, reject) => { | |
| const child = spawn(this.semblePath, args, { | |
| shell: false, | |
| timeout: options.timeout, | |
| stdio: ["ignore", "pipe", "pipe"], | |
| }) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/services/code-index/semble/semble-cli.ts` around lines 150 - 157, The
spawn call in _spawn uses an unsupported maxBuffer option and silences the TS
error with an as any cast; remove the maxBuffer property from the options passed
to spawn(this.semblePath, args, { shell: false, timeout: options.timeout, stdio:
[...] }) and drop the as any cast, and if you need to limit output implement
manual buffering and size checks inside child.stdout?.on("data", ...) and
child.stderr?.on("data", ...) (e.g., accumulate into stdout/stderr strings,
enforce a MAX_BUFFER threshold, kill the child and reject if exceeded) while
keeping the rest of _spawn (child handling, resolve/reject) intact.
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Related GitHub Issue
Closes: #200
Description
This PR integrates [semble](https://pypi.org/project/semble/) as a new local embedding provider for code indexing. Semble performs hybrid semantic search on-the-fly — no API keys, no Qdrant instance, and no separate indexing step required.
How it works:
SembleCLIwraps thesembleCLI usingchild_process.spawnwith array arguments (no shell) to prevent injection. It handlessearch,find-related, version checking viapip show, and JSON output parsing for the v0.3.0+ response format.SembleProviderimplements the existingCodeIndexprovider interface. Oninitialize()it checks that semble is installed and meets the>= 0.3.0version requirement. Since semble indexes on-the-fly per search call,startIndexing()is a no-op that just marks the provider as ready.CodeIndexManagerdetectsembedderProvider === "semble"and delegates directly toSembleProvider, bypassing theServiceFactory → orchestratorpipeline (which requires Qdrant). Guards were added toServiceFactoryto throw clearly if it's called for semble by mistake.CodeIndexPopoverUI adds semble to the provider dropdown with a minimal config section (just an optional executable path field). Validation for semble skips the Qdrant URL and API key requirements that other providers need.Key design choices:
Indexedafter install check — no progress bar needed.searchPath(notworkspacePath) to handledirectoryPrefixcorrectly.pip show(falling back topip3) rather than parsing--versionoutput, which is more reliable across semble releases.Test Procedure
Automated tests (run with
pnpm testin the repo root):src/services/code-index/semble/__tests__/semble-cli.spec.ts— unit tests forSembleCLI: spawn arg construction, shell injection safety, JSON parsing (v0.3.0+ format, empty results, error responses), version validation, pip fallback to pip3.src/services/code-index/semble/__tests__/provider.spec.ts— unit tests forSembleProvider: initialization, state transitions, search delegation, result conversion, error handling.Manual testing steps:
pip install semble(requires Python 3.10+)F5)semble(or provide a full path if needed) and click SaveTo verify the no-semble error path without uninstalling:
codebaseIndexSemblePathtosemble-does-not-existin settings and confirm the error message appears.Pre-Submission Checklist
Screenshots / Videos
Additional Notes
findRelatedis implemented inSembleCLIbut not yet exposed throughSembleProviderorCodeIndexManager. It's reserved for a future "find similar code" action.~/Library/Caches/semble/(macOS) or equivalent.clearIndexData()resets provider state but does not delete the cache — users can runsemble clear-cachemanually.Summary by CodeRabbit
New Features
Documentation