
feat(opencode-plugin,aft): add optional external semantic embedding backends#11

Merged
ualtinok merged 2 commits into cortexkit:main from freelanceagent1:fix/windows-aft-semantic-runtime
Apr 16, 2026

Conversation

@freelanceagent1
Contributor

Summary

This adds optional external semantic embedding backends to AFT while keeping the current default behavior unchanged.

Default behavior is still:

  • local fastembed + all-MiniLM-L6-v2

New opt-in backends:

  • openai_compatible for LM Studio / OpenAI-compatible embeddings endpoints
  • ollama for Ollama native embeddings API

What changed

Plugin/config

  • added nested semantic config block
  • forwards semantic backend/model/base_url/api_key_env/timeout_ms/max_batch_size through configure
  • keeps ONNX runtime setup only for fastembed
  • external backends do not trigger ONNX setup
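
The config shape and the ONNX gating described above can be sketched as follows. Field names match the ones forwarded through configure in this PR, but the types, defaults, and the helper function are illustrative, not the actual plugin source:

```typescript
// Hypothetical shape of the nested `semantic` config block.
type SemanticBackend = "fastembed" | "openai_compatible" | "ollama";

interface SemanticConfig {
  backend?: SemanticBackend;   // defaults to "fastembed"
  model?: string;              // e.g. "all-MiniLM-L6-v2"
  base_url?: string;           // external backends only
  api_key_env?: string;        // name of an env var, never the key itself
  timeout_ms?: number;
  max_batch_size?: number;
}

// ONNX runtime setup is only needed for the local fastembed path;
// the external HTTP backends skip it entirely.
function needsOnnxRuntime(cfg: SemanticConfig): boolean {
  return (cfg.backend ?? "fastembed") === "fastembed";
}
```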

Rust semantic backend

  • introduced backend selection for semantic embedding generation
  • implemented:
    • fastembed
    • openai_compatible
    • ollama
  • query embedding uses the same configured backend as index build
  • external backend failures are explicit; there is no silent fallback
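
The selection logic and the no-silent-fallback property can be illustrated with a sketch. The real implementation is Rust; this TypeScript version only shows the dispatch shape, with hypothetical function names:

```typescript
type Backend = "fastembed" | "openai_compatible" | "ollama";

// Dispatch one embedding batch to the configured backend. The same
// function serves both index build and query embedding, so both paths
// always use the same backend.
function embedBatch(
  backend: Backend,
  texts: string[],
  embedRemote: (texts: string[]) => number[][],
  embedLocal: (texts: string[]) => number[][],
): number[][] {
  switch (backend) {
    case "fastembed":
      return embedLocal(texts);
    case "openai_compatible":
    case "ollama":
      // Any throw from the remote path reaches the caller unchanged;
      // there is deliberately no catch-and-fallback to fastembed here.
      return embedRemote(texts);
  }
}
```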

Persistence/status

  • semantic cache now rebuilds when backend/model/base_url/dimension changes
  • status now exposes semantic backend + model
  • preserves the existing semantic progress/status work already on this branch
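
The rebuild trigger amounts to a fingerprint comparison over the fields listed above. A minimal sketch, with illustrative names (the real cache key lives on the Rust side):

```typescript
interface SemanticFingerprint {
  backend: string;
  model: string;
  baseUrl: string;
  dimension: number;
}

// The cache is rebuilt when there is no cached fingerprint or when any
// of backend/model/base_url/dimension differs from the current config.
function needsRebuild(
  cached: SemanticFingerprint | null,
  current: SemanticFingerprint,
): boolean {
  if (cached === null) return true;
  return (
    cached.backend !== current.backend ||
    cached.model !== current.model ||
    cached.baseUrl !== current.baseUrl ||
    cached.dimension !== current.dimension
  );
}
```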

Why there is no split query/index backend config

This PR intentionally uses one semantic backend for both:

  • index build
  • query embedding

Using one backend to build the index and another to embed queries is unsafe unless both produce vectors in the exact same embedding space. That is not guaranteed across runtimes, even for the same model family.
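
To make the hazard concrete: similarity search scores a query vector against index vectors, typically with cosine similarity. The arithmetic still produces numbers when the two vectors come from different embedding spaces; they are just not meaningful, and no runtime check can detect that. This standalone helper illustrates the scoring step (it is not project code):

```typescript
// Cosine similarity between two vectors. Only meaningful when both
// vectors were produced by the same model in the same runtime; a
// dimension check catches the obvious mismatch but cannot detect two
// different 384-dim embedding spaces.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```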

Validation

Automated checks

  • cargo test -p agent-file-tools semantic_ -- --nocapture
  • bun test packages/opencode-plugin/src/__tests__/config.test.ts packages/opencode-plugin/src/__tests__/shared-status.test.ts packages/opencode-plugin/src/__tests__/bridge.test.ts
  • bun run typecheck in packages/opencode-plugin

Provider validation

  • OpenAI-compatible provider path tested against a live LM Studio-compatible endpoint
  • Ollama provider path tested against:
    • mock-server protocol tests
    • live local Ollama endpoint

Compatibility note

  • Runtime-validated on Windows
  • External-backend implementation uses cross-platform Rust HTTP/JSON codepaths and does not introduce Windows-specific logic
  • Linux/macOS were not runtime-tested in this branch from this machine

Benchmark appendix

These numbers are for the new semantic backend feature, not a replacement for the README trigram-search benchmark.

Benchmark environments

Local benchmark machine:

  • Windows 11
  • AMD Radeon RX 6800M
  • AMD Radeon(TM) Graphics

Remote LM Studio host:

  • user-reported Windows 11 + RTX 5070 Ti

Backends tested:

  • fastembed local CPU
  • ollama local CPU
  • ollama local with Vulkan forced
  • LM Studio over LAN

Large-repo semantic build comparison (codex-fresh)

  • fastembed local CPU: 11m 33s
  • ollama local CPU: 8m 15s
  • ollama local Vulkan GPU: 3m 08s
  • LM Studio LAN: 2m 13s

All successful runs ended with:

  • 23,638 semantic entries
  • 384 dimensions

README-style query sets

I also reran the README query sets on current local snapshots of:

  • opencode-aft
  • reth
  • Chromium/base

Important caveat:

  • these use the same query sets, but they are not guaranteed to run against the exact same repo snapshots as the published README tables

Indexed grep vs ripgrep

Observed indexed-grep speedups remained strong:

  • opencode-aft: ~18x to 40x
  • reth: ~12x to 20x
  • Chromium/base: ~26x to 71x

Semantic query latency

Cached semantic query latency favored local fastembed:

  • local fastembed was fastest
  • LM Studio was usually second
  • Ollama was slower on query latency even with Vulkan enabled

This suggests:

  • external backends improve semantic build throughput
  • local fastembed still gives the best query-time latency

Same-model note

LM Studio and Ollama were benchmarked with MiniLM-family embedding models, but the model artifacts are not guaranteed to be byte-identical.

A stricter follow-up benchmark would import the same second-state MiniLM GGUF into both runtimes and rerun the comparison.

@ualtinok ualtinok merged commit 1414835 into cortexkit:main Apr 16, 2026
2 checks passed
ualtinok added a commit that referenced this pull request Apr 16, 2026
P0: Fix config merge security bypass — project config can no longer
leak sensitive semantic fields (backend/base_url/api_key_env) through
the ...override spread when mergeSemanticConfig returns undefined.

P1: Fix deep-merge — only defined safe fields from project config
overwrite user values (prevents silent erasure of user model/backend).

P1: Fix OpenAI index-field bug — use enumerate() fallback for
providers that omit the index field in embedding responses.

P1: Add cross-batch dimension tracking — error on inconsistent
embedding dimensions across batches.

P2: Clamp semantic timeout below bridge timeout (25s).
P2: Add HTTP retry with backoff (2 retries, 5xx/429 only).
P2: SSRF validation — http/https only, no-redirect policy.
P2: Stale semantic cache detection via file mtime verification.
P2: Normalize base_url before fingerprinting.
P2: Reject unknown fastembed model names explicitly.
P2: Change fallback timeout constant from 60s to 25s.

13 council audit findings addressed with 7 new tests.
@ualtinok
Collaborator

Merged and shipped in v0.13.0 with additional hardening from a council security audit. Thanks @freelanceagent1 for the excellent feature work — the external backends architecture is clean and well-designed.

Follow-up improvements we added on top of your PR:

Security (P0):

  • Project config can no longer set semantic.backend, semantic.base_url, or semantic.api_key_env — those are user-only config now. Prevents a supply-chain attack where a malicious .opencode/aft.jsonc could exfiltrate env secrets to an attacker's server.
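
The user-only vs project-overridable split above can be sketched as an allow-list merge. Field names follow the discussion in this thread; the exact implementation in the plugin may differ:

```typescript
// Fields a project-level .opencode/aft.jsonc is allowed to override.
// backend, base_url, and api_key_env are deliberately absent: a
// malicious project config must not be able to redirect embeddings
// (and the env secret named by api_key_env) to an attacker's server.
const PROJECT_SAFE_FIELDS = ["model", "timeout_ms", "max_batch_size"] as const;

function mergeSemanticConfig(
  user: Record<string, unknown>,
  project: Record<string, unknown>,
): Record<string, unknown> {
  const merged = { ...user };
  for (const key of PROJECT_SAFE_FIELDS) {
    // Only defined safe fields override; everything else in the
    // project config is ignored entirely.
    if (project[key] !== undefined) merged[key] = project[key];
  }
  return merged;
}
```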

Correctness (P1):

  • Deep-merge now only overwrites defined safe fields — a project config with just { semantic: { timeout_ms: 5000 } } no longer erases the user's model/backend
  • OpenAI-compatible response parser uses enumerate() fallback for providers that omit the index field (was causing all items to map to slot 0)
  • Cross-batch embedding dimension tracking — errors on inconsistent dimensions instead of silently corrupting the index
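
The index-field fix is easy to picture: some OpenAI-compatible servers omit `index` on embedding items, which previously collapsed every item into slot 0. Falling back to the enumeration position preserves response order. A sketch with illustrative types (the real parser is Rust):

```typescript
interface EmbeddingItem {
  index?: number;        // some providers omit this field
  embedding: number[];
}

// Place each embedding at its declared index, or at its position in
// the response array when the provider omits the index field.
function orderEmbeddings(items: EmbeddingItem[]): number[][] {
  const out: number[][] = new Array(items.length);
  items.forEach((item, pos) => {
    out[item.index ?? pos] = item.embedding;
  });
  return out;
}
```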

Robustness (P2):

  • HTTP retry with exponential backoff (2 retries for 5xx/429 errors)
  • SSRF validation: only http:// and https:// schemes, no-redirect policy
  • Semantic timeout clamped to bridge timeout (25s max)
  • Stale semantic cache detection via file mtime verification on restart
  • Normalized base_url in fingerprints to avoid spurious rebuilds
  • Unknown fastembed model names now return explicit errors
  • 7 new tests covering all the above edge cases
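
The retry policy above can be sketched as follows. The PR specifies the retry count (2) and the retryable status classes (5xx/429); the backoff schedule here is an assumption, and both function names are illustrative:

```typescript
// Only transient server errors and rate limits are worth retrying.
function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status < 600);
}

// Retry a request up to maxRetries times with exponential backoff,
// but only for retryable statuses; 4xx client errors fail immediately.
async function withRetry<T>(
  attempt: () => Promise<{ status: number; value?: T }>,
  maxRetries = 2,
  baseDelayMs = 250, // assumed schedule: 250ms, 500ms
): Promise<T> {
  for (let tries = 0; ; tries++) {
    const res = await attempt();
    if (res.status < 400 && res.value !== undefined) return res.value;
    if (tries >= maxRetries || !isRetryable(res.status)) {
      throw new Error(`request failed with status ${res.status}`);
    }
    await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** tries));
  }
}
```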

All 689 Rust tests + 211 TypeScript tests pass.
