Todos
- Test MCP server
- Publish MCP server to npm
Describe a vulnerability in plain English. Find every relevant CVE.
Semantic CVE is a vulnerability intelligence platform that combines local Ollama embeddings with full-text and regex search across the entire CVE database (354k+ records). Instead of keyword matching, you describe what you're looking for in natural language and the platform finds semantically similar CVEs.
The project ships as three deployable applications — a Fastify API, a Next.js frontend, and an MCP server for AI agent tool integration — orchestrated as a Turborepo monorepo.
- Semantic search — Describe a vulnerability naturally (e.g. "Windows privilege escalation using symbolic links") and get relevant CVEs ranked by cosine similarity against 1024-dim Ollama embeddings
- Full-text search — SQLite FTS5-based search on description, vendor, product, and CWE fields
- Regex search — SQLite REGEXP pattern matching for power users
- Hybrid search — Weighted combination of semantic (70%) + full-text (30%) for best results by default
- MCP server — Exposes all search capabilities as MCP tools for AI agent (Claude, Cursor, etc.) integration
- Similar CVEs — For any CVE, find semantically similar ones via in-memory cosine similarity against the full vector store
- Daily sync — Incremental sync from cvelistV5 with SHA256 change tracking; only re-embeds what changed
- Synthetic document embedding — Combines CVE description, vendor, product, CWE, severity, and affected versions into a single document before embedding, dramatically improving match quality
User query (plain English)
│
▼
┌──────────────────────────────────────────────────────┐
│ search query │
│ ┌──────────┐ ┌──────────┐ ┌────────┐ ┌────────┐ │
│ │ Semantic │ │ Full-text│ │ Regex │ │ Hybrid │ │
│ │ (Ollama │ │ (FTS5) │ │(REGEXP)│ │ 0.7/+ │ │
│ │ cosine) │ │ │ │ │ │ 0.3 │ │
│ └────┬─────┘ └────┬─────┘ └───┬────┘ └────┬───┘ │
│ │ │ │ │ │
│ └─────────────┴────────────┴────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ SQLite DB │ │
│ │ (better-sqlite3)│ │
│ │ CVEs │ Vectors │ │
│ │ FTS │ Hashes │ │
│ └─────────────────┘ │
└──────────────────────────────────────────────────────┘
│
┌─────────┴─────────┐
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Fastify API │ │ MCP Server │
│ (port 4000) │ │ (stdio) │
└──────┬───────┘ └──────┬───────┘
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Next.js 15 │ │ AI Agents │
│ (port 3000) │ │ (Claude etc)│
└──────────────┘ └──────────────┘
semantic-cve/
├── apps/
│ ├── web/ # Next.js 15 frontend (Tailwind v4)
│ ├── api/ # Fastify TypeScript API server
│ └── mcp/ # MCP server (stdio JSON-RPC 2.0)
├── packages/
│ ├── shared/ # Types, constants, enums
│ ├── db/ # SQLite schema, queries, better-sqlite3
│ ├── cve-parser/ # cvelistV5 JSON parser + synthetic doc builder
│ ├── embeddings/ # Ollama embedding client (single + batch)
│ ├── search/ # Four search engine implementations
│ └── vector-store/ # Float32Array ops, cosine similarity
├── scripts/
│ ├── sync-cves.ts # Clone/pull cvelistV5, SHA256 tracking
│ ├── embed-cves.ts # Batch-embed CVEs into vector store
│ ├── daily-update.ts # Cron entry point: sync → embed
│ ├── inspect.ts # DB inspection utility
│ ├── drop-embeddings.ts # Reset embedding state
│ ├── fix-cveids.ts # Repair CVE ID mismatches
│ └── speedtest.ts # Search performance benchmarking
├── turbo.json # Turborepo build orchestration
└── package.json # Bun workspace root
Next.js 15 static export styled with Tailwind CSS v4 and deployed to Cloudflare Pages. The frontend reads the API base URL from NEXT_PUBLIC_API_URL.
| Route | Purpose |
|---|---|
| / | Hero search bar with example query chips |
| /search?q=&mode= | Search results with pagination and mode selector |
| /cve?id= | CVE detail with metadata, similar CVEs, references, raw JSON (query-param for static export) |
bun run --filter @semantic-cve/web dev # development on :3000
bun run --filter @semantic-cve/web build # static export to apps/web/out/
bun run --filter @semantic-cve/web deploy # wrangler deploy to Cloudflare Pages

Fastify 5 server with CORS enabled, serving as the backend for both the web frontend and direct HTTP clients. Routes are registered under the /api prefix.
Routes:
| Method | Route | Description |
|---|---|---|
| GET | /api/health | Health check returning { status, timestamp } |
| GET | /api/search?query=&mode=&limit=&offset=&cvssMin=&cvssMax=&cwe=&vendor=&product= | Search CVEs across all four modes |
| GET | /api/cves/:id | Single CVE detail by CVE ID |
| GET | /api/cves/:id/similar?limit= | Semantically similar CVEs via vector proximity |
| GET | /api/cves?days=&limit=&offset= | Latest CVEs filtered by recency |
Search query parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| query | string | - | Required. Natural language or keyword query |
| mode | string | hybrid | One of: semantic, fulltext, regex, hybrid |
| limit | number | 20 | Max results to return |
| offset | number | 0 | Result offset for pagination |
| cvssMin | number | - | Minimum CVSS score filter |
| cvssMax | number | - | Maximum CVSS score filter |
| cwe | string | - | CWE ID filter (e.g. CWE-79) |
| vendor | string | - | Vendor name filter |
| product | string | - | Product name filter |
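
As an illustration of the parameters above, a client could assemble a search request like this. This is a minimal sketch: `buildSearchUrl` and the `SearchParams` type are hypothetical helpers, not part of the project's exported API, and the base URL assumes the default local API port.

```typescript
// Hypothetical client-side helper: builds a /api/search URL from the
// documented query parameters. Undefined values are simply omitted.
type SearchParams = {
  query: string;
  mode?: "semantic" | "fulltext" | "regex" | "hybrid";
  limit?: number;
  offset?: number;
  cvssMin?: number;
  cvssMax?: number;
  cwe?: string;
  vendor?: string;
  product?: string;
};

function buildSearchUrl(baseUrl: string, params: SearchParams): string {
  const url = new URL("/api/search", baseUrl);
  for (const [key, value] of Object.entries(params)) {
    if (value !== undefined) url.searchParams.set(key, String(value));
  }
  return url.toString();
}

// Example: hybrid search restricted to high-severity results.
const exampleSearchUrl = buildSearchUrl("http://localhost:4000", {
  query: "symlink privilege escalation",
  mode: "hybrid",
  cvssMin: 7,
  limit: 5,
});
console.log(exampleSearchUrl);
```

The resulting string can be passed to `fetch` or `curl`; the shape of the response body is not shown here because it is not specified in this section.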
Dependencies: fastify@5, @fastify/cors, @semantic-cve/search, @semantic-cve/db, @semantic-cve/shared
bun run --filter @semantic-cve/api dev # development on :4000 (hot reload via tsx watch)
bun run --filter @semantic-cve/api build # tsc compilation

The MCP (Model Context Protocol) server exposes CVE search as AI agent tools over stdio JSON-RPC 2.0. It is framework-free with zero external MCP SDK dependencies: the protocol is implemented in ~160 lines of TypeScript.
How it works:
The server reads JSON-RPC requests from stdin, processes them, and writes responses to stdout. It implements three protocol methods:
- initialize: Handshake returning protocol version, server capabilities, and server metadata
- tools/list: Returns the full tool schema definitions (names, descriptions, input JSON Schemas)
- tools/call: Dispatches tool invocations to the underlying search/DB functions
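
The three methods above could be dispatched roughly as follows. This is a sketch under stated assumptions, not the project's actual ~160-line implementation: the `handleLine` function, the protocol version string, the server version, and the stubbed `search_cves` handler are all illustrative.

```typescript
type JsonRpcRequest = { jsonrpc: "2.0"; id?: number | string; method: string; params?: unknown };
type JsonRpcResponse = {
  jsonrpc: "2.0";
  id: number | string | null;
  result?: unknown;
  error?: { code: number; message: string };
};

// Assumed tool schema; the real server exposes six tools.
const toolDefs = [
  {
    name: "search_cves",
    description: "Search CVEs by natural language, keyword, or regex",
    inputSchema: { type: "object", properties: { query: { type: "string" } }, required: ["query"] },
  },
];

function dispatch(req: JsonRpcRequest): JsonRpcResponse {
  const id = req.id ?? null;
  switch (req.method) {
    case "initialize":
      return {
        jsonrpc: "2.0",
        id,
        result: {
          protocolVersion: "2024-11-05", // assumed; use the version your client negotiates
          capabilities: { tools: {} },
          serverInfo: { name: "semantic-cve", version: "0.0.0" }, // placeholder version
        },
      };
    case "tools/list":
      return { jsonrpc: "2.0", id, result: { tools: toolDefs } };
    case "tools/call": {
      const call = (req.params ?? {}) as { name?: string; arguments?: Record<string, unknown> };
      if (call.name !== "search_cves") {
        return { jsonrpc: "2.0", id, error: { code: -32602, message: `Unknown tool: ${call.name}` } };
      }
      // Stub: a real handler would delegate to the search package here.
      const text = JSON.stringify({ query: call.arguments?.query, results: [] });
      return { jsonrpc: "2.0", id, result: { content: [{ type: "text", text }] } };
    }
    default:
      return { jsonrpc: "2.0", id, error: { code: -32601, message: "Method not found" } };
  }
}

// One JSON-RPC message per line in, one response line out.
// Notifications (requests without an id) get no response.
function handleLine(line: string): string | null {
  let req: JsonRpcRequest;
  try {
    req = JSON.parse(line) as JsonRpcRequest;
  } catch {
    return JSON.stringify({ jsonrpc: "2.0", id: null, error: { code: -32700, message: "Parse error" } });
  }
  if (req.id === undefined) return null;
  return JSON.stringify(dispatch(req));
}
```

To serve over stdio, `handleLine` would be called for each line read from stdin (for example via `node:readline`) and every non-null return value written to stdout followed by a newline.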
Available Tools:
| Tool | Input | Returns | Implementation |
|---|---|---|---|
| search_cves | { query: string, mode?: 'semantic' \| 'fulltext' \| 'regex' \| 'hybrid', limit?: number } | Ranked CVE results with similarity scores and match types | Delegates to @semantic-cve/search |
| get_cve | { id: string } | Full CVE detail including description, CVSS, vendor, product, references | Calls getCveById() from @semantic-cve/db |
| related_cves | { id: string, limit?: number } | Semantically similar CVEs via vector proximity | Loads source CVE embedding, runs in-memory topK against all other vectors |
| search_by_product | { vendor?: string, product?: string, limit?: number } | CVEs matching vendor/product | SQL LIKE query via searchByVendorProduct() |
| search_by_cwe | { cwe: string, limit?: number } | CVEs matching a CWE ID | Direct SQL lookup via searchByCwe() |
| latest_cves | { days?: number } | Recent CVEs from last N days | Date-filtered query via getLatestCves() |
Integration with AI agents:
The MCP server can be configured as a tool provider in any MCP-compatible client:
{
  "mcpServers": {
    "semantic-cve": {
      "command": "bun",
      "args": ["run", "--filter", "@semantic-cve/mcp", "dev"],
      "cwd": "/path/to/semantic-cve"
    }
  }
}

Dependencies: @semantic-cve/search, @semantic-cve/db, @semantic-cve/shared (zero additional runtime dependencies)
bun run --filter @semantic-cve/mcp dev # development (restarts on file changes)
bun run --filter @semantic-cve/mcp build # tsc compilation

Shared TypeScript types, constants, and enums used across all apps and packages.
Exports:
- CveRecord: Full CVE database row type
- SearchResult: Search result with score, match type, and CVE data
- SearchQuery: Standardized search query parameters
- EMBEDDING_DIMENSION: 1024 (matches qwen3-embedding:0.6b)
- EMBEDDING_MODEL: 'qwen3-embedding:0.6b'
- HYBRID_WEIGHTS: { semantic: 0.7, fts: 0.3 }
- CVE_CACHE_DIR / CVE_LIST_REPO: Path and repo constants for sync scripts
SQLite database layer using better-sqlite3. Provides schema initialization (via automatic migration on first connection), query functions, and connection management.
Schema:
| Table | Purpose |
|---|---|
| cves | Core CVE metadata: CVE ID, description, CWE, vendor, product, affected versions, CVSS score/severity, dates, raw JSON |
| cve_fts | FTS5 virtual table over cves for full-text search on description, vendor, product, and CWE |
| embeddings | BLOB vector storage (1024-dim Float32Array per CVE) |
| references_ | CVE reference URLs with source and tags |
| hashes | SHA256 hashes for incremental sync change tracking |
Key queries:
- upsertCve() / getCveById() / getCvesByIds(): CRUD operations
- searchFts(): FTS5 full-text search with rank ordering
- searchRegex(): REGEXP pattern matching on description/vendor/product
- searchSemantic(): Loads all vectors, runs topK cosine similarity
- getSimilarCves(): Loads source CVE vector, compares against all others
- upsertEmbedding() / getHash() / upsertHash(): Embedding and sync state management
Triggers: Automatic FTS sync on CVE insert/update/delete via AFTER INSERT, AFTER DELETE, and AFTER UPDATE triggers.
Parses CVE JSON files from the cvelistV5 repository format.
Exports:
- parseCveFile(raw: string): Parses a CVE JSON string into structured fields: CVE ID, description, CWE, vendor, product, affected versions, CVSS score/severity, dates, references
- syntheticDoc(parsed): Generates a combined embedding document from all fields: "CVE-2025-1234 Vendor: Apache Product: HTTP Server CWE: CWE-287 Severity: Critical Description: ... Affected Versions: 2.4.0 - 2.4.59"
The synthetic document approach dramatically improves semantic matching over description-only embeddings by encoding vendor, product, CWE, and severity context directly into the vector space.
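
A builder that produces a document in the format quoted above might look like the following. This is a sketch only: the `ParsedCveSketch` field names and the rule of skipping empty fields are assumptions, and the real `syntheticDoc()` may differ.

```typescript
// Assumed shape of the parsed CVE; field names are illustrative.
interface ParsedCveSketch {
  cveId: string;
  description: string;
  vendor?: string;
  product?: string;
  cwe?: string;
  severity?: string;
  affectedVersions?: string;
}

// Joins labeled fields into one string, in the order shown in the example
// document above. Missing fields are skipped rather than emitted empty.
function syntheticDocSketch(cve: ParsedCveSketch): string {
  const parts: string[] = [cve.cveId];
  if (cve.vendor) parts.push(`Vendor: ${cve.vendor}`);
  if (cve.product) parts.push(`Product: ${cve.product}`);
  if (cve.cwe) parts.push(`CWE: ${cve.cwe}`);
  if (cve.severity) parts.push(`Severity: ${cve.severity}`);
  parts.push(`Description: ${cve.description}`);
  if (cve.affectedVersions) parts.push(`Affected Versions: ${cve.affectedVersions}`);
  return parts.join(" ");
}

const exampleDoc = syntheticDocSketch({
  cveId: "CVE-2025-1234",
  vendor: "Apache",
  product: "HTTP Server",
  cwe: "CWE-287",
  severity: "Critical",
  description: "Authentication bypass",
  affectedVersions: "2.4.0 - 2.4.59",
});
console.log(exampleDoc);
```

Keeping the labels ("Vendor:", "CWE:") in the text gives the embedding model explicit context for each field instead of a bare list of values.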
Ollama embedding client. Connects to a local Ollama instance to generate 1024-dim embeddings.
Exports:
- embed(text): Single text embedding via POST /api/embeddings
- embedBatch(texts): Batch embedding via POST /api/embed (supports multiple inputs per request)
- healthCheck(): Checks if Ollama is reachable via GET /api/tags
Configured via OLLAMA_URL environment variable (default: http://localhost:11434). Uses qwen3-embedding:0.6b (1024-dim) by default.
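
For orientation, a batch call against Ollama's /api/embed endpoint could be written as below. This is a hedged sketch, not the package's code: `embedBatchSketch` and the injectable `FetchLike` parameter are hypothetical (the injection only makes the function testable without a running Ollama instance). The request body `{ model, input }` and the `embeddings` response field follow Ollama's documented API.

```typescript
// Minimal structural type so any fetch-compatible function can be passed in.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

const DEFAULT_OLLAMA_URL = "http://localhost:11434";
const EMBEDDING_MODEL_NAME = "qwen3-embedding:0.6b";

// Sends all texts in one request and returns one Float32Array per input.
async function embedBatchSketch(
  texts: string[],
  fetchImpl: FetchLike,
  baseUrl: string = DEFAULT_OLLAMA_URL,
): Promise<Float32Array[]> {
  const res = await fetchImpl(`${baseUrl}/api/embed`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: EMBEDDING_MODEL_NAME, input: texts }),
  });
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const data = (await res.json()) as { embeddings: number[][] };
  return data.embeddings.map((values) => Float32Array.from(values));
}
```

In a runtime with a global `fetch`, the call would be `await embedBatchSketch(["some text"], fetch)`, optionally passing the value of OLLAMA_URL as the third argument.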
Four search engine implementations that combine the DB queries and embedding pipeline:
| Engine | Implementation | Latency Profile |
|---|---|---|
| Semantic | Embed query via Ollama → cosine similarity against all 354k vectors | ~500ms first query (model warmup), ~20ms subsequent (embedding cache) |
| Full-text | Direct FTS5 query | <50ms |
| Regex | SQL REGEXP on description/vendor/product | <200ms |
| Hybrid | Runs semantic + FTS in parallel, scores combined as semantic * 0.7 + fts * 0.3 | Same as semantic |
Includes an in-memory embedding cache (Map<string, Float32Array>) that deduplicates identical queries.
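
The cache described above can be pictured as a thin wrapper around the embedding call. The sketch below is illustrative: `withEmbeddingCache` is a hypothetical name, and the real cache may use different keys or an eviction policy.

```typescript
type EmbedFn = (text: string) => Promise<Float32Array>;

// Wraps an embedding function so identical query strings are embedded once.
// The Map grows without bound here; a long-running server may want eviction.
function withEmbeddingCache(embedFn: EmbedFn): EmbedFn {
  const cache = new Map<string, Float32Array>();
  return async (text: string): Promise<Float32Array> => {
    const hit = cache.get(text);
    if (hit !== undefined) return hit;
    const vec = await embedFn(text);
    cache.set(text, vec);
    return vec;
  };
}
```

With this in place, the second identical query skips the Ollama round trip entirely, which is consistent with the lower latency listed for subsequent semantic queries.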
Low-level vector math operations over Float32Array:
- cosineSim(a, b): Cosine similarity between two vectors
- arrToVec(arr) / bufToVec(buf) / vecToBuf(vec): Conversion between number arrays, Float32Array, and Node.js Buffer
- topK(vecs, query, k): Find top-K by cosine similarity from an array of { id, vec } objects
- topKFromBuf(items, query, k, dim): Same, but reads vectors directly from Buffer (avoids per-row Float32Array allocation)
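
To make these operations concrete, here is one straightforward way to write them. The function names carry a `Sketch` suffix because they are illustrative re-implementations, not the package's source; in particular, the real `topK` may avoid a full sort.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for a zero vector.
function cosineSimSketch(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Float32Array -> Buffer for BLOB storage (copies the bytes).
function vecToBufSketch(vec: Float32Array): Buffer {
  return Buffer.from(new Uint8Array(vec.buffer, vec.byteOffset, vec.byteLength));
}

// Buffer -> Float32Array. Copies into a fresh ArrayBuffer because a Buffer's
// byteOffset is not guaranteed to be 4-byte aligned.
function bufToVecSketch(buf: Buffer): Float32Array {
  const copy = new Uint8Array(buf.byteLength);
  copy.set(buf);
  return new Float32Array(copy.buffer);
}

// Scores every candidate, sorts descending, and keeps the first k.
function topKSketch(
  vecs: { id: string; vec: Float32Array }[],
  query: Float32Array,
  k: number,
): { id: string; score: number }[] {
  return vecs
    .map(({ id, vec }) => ({ id, score: cosineSimSketch(vec, query) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const candidates = [
  { id: "CVE-A", vec: Float32Array.from([1, 0]) },
  { id: "CVE-B", vec: Float32Array.from([0, 1]) },
  { id: "CVE-C", vec: Float32Array.from([1, 1]) },
];
// CVE-A points the same way as the query, CVE-C is 45 degrees off, CVE-B is orthogonal.
console.log(topKSketch(candidates, Float32Array.from([1, 0]), 2).map((r) => r.id)); // [ 'CVE-A', 'CVE-C' ]
```

Sorting all scores is O(n log n); for a few hundred thousand vectors and small k, a bounded heap or a single pass that tracks the current best k would do less work.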
Clones or pulls the cvelistV5 repository, walks all JSON files under cves/, computes SHA256 hashes, and upserts only changed CVEs into the database.
bun run sync:cves

Key behaviors:
- First run: git clone --depth 1 (downloads the full CVE dataset)
- Subsequent runs: git pull, then compares SHA256 hashes against stored values
- Skips unchanged files entirely (fast incremental sync)
- Processes 354k+ CVEs with progress logging every 1000 records
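
The hash comparison behind the incremental sync can be sketched as follows. This is an assumption-laden illustration: `findChangedFiles` and the in-memory maps stand in for the real file walk and the hashes table.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA256 of a file's contents.
function sha256Hex(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

// Given current file contents and previously stored hashes (both keyed by
// path), returns the paths that are new or whose content changed.
function findChangedFiles(
  files: Map<string, string>,
  storedHashes: Map<string, string>,
): { path: string; hash: string }[] {
  const changed: { path: string; hash: string }[] = [];
  files.forEach((content, path) => {
    const hash = sha256Hex(content);
    if (storedHashes.get(path) !== hash) changed.push({ path, hash });
  });
  return changed;
}

// Example: one unchanged file, one modified file, one new file.
const stored = new Map<string, string>([
  ["cves/a.json", sha256Hex('{"v":1}')],
  ["cves/b.json", sha256Hex('{"v":1}')],
]);
const current = new Map<string, string>([
  ["cves/a.json", '{"v":1}'],
  ["cves/b.json", '{"v":2}'],
  ["cves/c.json", '{"v":1}'],
]);
console.log(findChangedFiles(current, stored).map((f) => f.path)); // [ 'cves/b.json', 'cves/c.json' ]
```

Only the returned paths would need to be re-parsed and upserted, and only those CVEs would need new embeddings.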
Generates and stores embeddings for all CVEs that don't already have them.
bun run embed:cves

Key behaviors:
- Warms up the Ollama model on start
- Processes in batches of 500 (configurable via the BATCH constant)
- Uses embedBatch() for efficient bulk embedding
- Falls back to individual embedding per CVE on batch failure
- Designed to be interrupt-safe: resumes from where it left off
- Has a runtime limit (~9 minutes) for scheduled runs
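
The batching, fallback, and runtime-limit behaviors listed above could fit together roughly like this. It is a sketch under assumptions: `embedWithFallback`, its parameters, and the decision to skip items that fail individually are illustrative, and persistence to the embeddings table is left out.

```typescript
interface EmbedDeps {
  embedBatch: (texts: string[]) => Promise<Float32Array[]>;
  embedOne: (text: string) => Promise<Float32Array>;
}

// Embeds texts in fixed-size batches. If a whole batch fails, each item in
// that batch is retried individually; items that still fail are skipped.
// Stops starting new batches once maxRuntimeMs has elapsed.
async function embedWithFallback(
  texts: string[],
  deps: EmbedDeps,
  batchSize: number = 500,
  maxRuntimeMs: number = 9 * 60 * 1000,
): Promise<Map<number, Float32Array>> {
  const startedAt = Date.now();
  const done = new Map<number, Float32Array>(); // input index -> vector
  for (let start = 0; start < texts.length; start += batchSize) {
    if (Date.now() - startedAt > maxRuntimeMs) break;
    const batch = texts.slice(start, start + batchSize);
    try {
      const vecs = await deps.embedBatch(batch);
      vecs.forEach((vec, i) => done.set(start + i, vec));
    } catch {
      for (let i = 0; i < batch.length; i++) {
        try {
          done.set(start + i, await deps.embedOne(batch[i]));
        } catch {
          // Leave this item without an embedding; a later run can retry it.
        }
      }
    }
  }
  return done;
}
```

Because the script only targets CVEs without a stored embedding, anything skipped or cut off by the time limit is picked up naturally on the next run.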
Cron entry point that chains sync → embed:
bun run update:daily

Suitable for cron or CI/CD scheduling (e.g., GitHub Actions):

name: Daily CVE Sync
on:
  schedule: [{ cron: '0 6 * * *' }]
  workflow_dispatch:
jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v2 # install Bun before running bun commands
      - run: bun install
      - run: bun run update:daily

| Script | Purpose |
|---|---|
| scripts/inspect.ts | Quick DB inspection: prints CVE count, embedding count, hash count |
| scripts/drop-embeddings.ts | Resets the embeddings table (for full re-embedding) |
| scripts/fix-cveids.ts | Repairs CVE ID mismatches between cves and embeddings tables |
| scripts/speedtest.ts | Benchmarks search performance across all four modes |
git clone https://github.com/forloopcodes/semantic-cve
cd semantic-cve
bun install
cp .env.example .env

ollama pull qwen3-embedding:0.6b

# Sync all CVE records from cvelistV5 into SQLite
bun run sync:cves
# Generate 1024-dim embeddings for every CVE
bun run embed:cves

The sync processes 354k+ CVEs. The embedding step batch-processes them in groups of 500. Both scripts are resumable: interrupting and re-running picks up where you left off.

bun run --filter @semantic-cve/api dev

The API starts on http://localhost:4000. Health check: curl http://localhost:4000/api/health.

bun run --filter @semantic-cve/web dev

Opens on http://localhost:3000.
bun run --filter @semantic-cve/mcp dev

The MCP server listens on stdin/stdout. Configure it in any MCP-compatible client (Claude, Cursor, etc.):

{
  "mcpServers": {
    "semantic-cve": {
      "command": "bun",
      "args": ["run", "--filter", "@semantic-cve/mcp", "dev"],
      "cwd": "/path/to/semantic-cve"
    }
  }
}

Search mode determines how queries are matched against the CVE database:
The query is embedded via Ollama's qwen3-embedding:0.6b model (1024 dimensions), then compared against all stored CVE vectors using cosine similarity. Results are ranked by similarity score (0 to 1). Best for natural language descriptions.
Example: "Windows privilege escalation using symbolic links" — finds CVEs related to symlink-based privilege escalation even if the exact words differ.
Uses SQLite FTS5 to match query terms against description, vendor, product, and CWE fields. Results are ranked by FTS5 relevance score. Best for exact keyword or product name lookups.
Example: "CWE-79 XSS" — returns all CVEs with XSS-related descriptions tagged with CWE-79.
Applies SQL REGEXP pattern matching against description, vendor, and product fields. Results are unranked (score = 1). Best for power users with exact pattern requirements.
Example: "apache.*2\.4\.[0-9]+" — finds all CVEs mentioning Apache 2.4.x versions.
The default mode. Runs semantic and full-text search in parallel, then combines scores using a weighted formula:
final_score = semantic_score × 0.7 + fts_score × 0.3
This provides the best balance of semantic understanding and keyword precision. Recommended for general use.
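
The weighted formula can be applied to two result lists as in the sketch below. Two points are assumptions rather than documented behavior: both score sets are taken to be normalized to the 0 to 1 range before combining, and a CVE that appears in only one list contributes 0 for the missing component. `combineHybrid` is a hypothetical name.

```typescript
interface ScoredCve {
  id: string;
  score: number; // assumed normalized to [0, 1]
}

const HYBRID_WEIGHTS_SKETCH = { semantic: 0.7, fts: 0.3 };

// Merges semantic and full-text results by CVE ID using
// final = semantic * 0.7 + fts * 0.3, then ranks by the combined score.
function combineHybrid(semantic: ScoredCve[], fts: ScoredCve[], limit: number = 20): ScoredCve[] {
  const combined = new Map<string, number>();
  for (const { id, score } of semantic) {
    combined.set(id, score * HYBRID_WEIGHTS_SKETCH.semantic);
  }
  for (const { id, score } of fts) {
    combined.set(id, (combined.get(id) ?? 0) + score * HYBRID_WEIGHTS_SKETCH.fts);
  }
  return Array.from(combined, ([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

const hybridExample = combineHybrid(
  [
    { id: "CVE-A", score: 0.9 },
    { id: "CVE-B", score: 0.5 },
  ],
  [
    { id: "CVE-B", score: 1.0 },
    { id: "CVE-C", score: 0.8 },
  ],
);
// CVE-B: 0.5 * 0.7 + 1.0 * 0.3 = 0.65, CVE-A: 0.9 * 0.7 = 0.63, CVE-C: 0.8 * 0.3 = 0.24
console.log(hybridExample.map((r) => r.id)); // [ 'CVE-B', 'CVE-A', 'CVE-C' ]
```

The example shows why hybrid mode can reorder results: CVE-B is only a moderate semantic match, but an exact keyword hit lifts it above CVE-A.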
| Component | Recommendation | Notes |
|---|---|---|
| Frontend | Cloudflare Pages | Static export via Next.js output: 'export'. Set NEXT_PUBLIC_API_URL to your tunnel or API endpoint |
| API | Local / VPS | Requires Ollama + SQLite. 2GB RAM minimum for embedding model |
| MCP Server | Co-located with API | Needs access to the same SQLite database and Ollama instance |
The frontend is deployed as a static site to Cloudflare Pages. Since the API runs locally (Ollama + SQLite), a Cloudflare Tunnel exposes it:
# Build and deploy frontend
cd apps/web
bun run build # static export to out/
bunx wrangler pages deploy out --branch production
# Expose local API via tunnel
cloudflared tunnel --url http://localhost:4000

Set NEXT_PUBLIC_API_URL to the tunnel URL before building. Rebuild and redeploy when the tunnel URL changes.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3-embedding:0.6b
# Clone and run
git clone https://github.com/forloopcodes/semantic-cve
cd semantic-cve
bun install
cp .env.example .env
bun run sync:cves
bun run embed:cves
bun run --filter @semantic-cve/api dev

| Variable | Default | Description |
| --- | --- | --- |
| OLLAMA_URL | http://localhost:11434 | Ollama server URL |
| API_PORT | 4000 | Fastify listen port |
| API_HOST | 0.0.0.0 | Fastify listen host |
| NEXT_PUBLIC_API_URL | http://localhost:4000 | API base URL (set to tunnel URL for production) |
| Component | Technology |
|---|---|
| Runtime | Node.js via Bun |
| Monorepo | Turborepo |
| API Framework | Fastify 5 |
| Frontend | Next.js 15 (static export) |
| Hosting | Cloudflare Pages |
| Styling | Tailwind CSS v4 |
| Database | SQLite via better-sqlite3 |
| Embedding Model | Ollama + qwen3-embedding:0.6b (1024-dim) |
| Vector Operations | Float32Array, cosine similarity (in-process) |
| Full-Text Search | SQLite FTS5 |
| MCP Protocol | stdio JSON-RPC 2.0 (zero external deps) |
| Language | TypeScript (strict mode) |
The CVE dataset is sourced from the official cvelistV5 repository maintained by the CVE Program. The sync script clones this repository (depth 1, ~1.5GB) and parses CVE JSON files following the CVE 5.0 schema. Incremental updates use SHA256 hash comparison to detect changes, so daily syncs are fast after the initial clone.
The project currently indexes 354,000+ CVE records with 1024-dimensional embeddings stored as BLOBs in SQLite (~1.4GB total for the vector store).
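
The vector store size follows directly from the record count and embedding dimension. The short calculation below assumes 4 bytes per float32 component and ignores SQLite row and page overhead.

```typescript
// Raw size of the embeddings: records * dimensions * bytes per float32.
const CVE_COUNT = 354000;
const EMBEDDING_DIM = 1024;
const BYTES_PER_FLOAT32 = Float32Array.BYTES_PER_ELEMENT; // 4

const rawVectorBytes = CVE_COUNT * EMBEDDING_DIM * BYTES_PER_FLOAT32;
console.log(rawVectorBytes); // 1449984000
console.log((rawVectorBytes / 1e9).toFixed(2), "GB"); // 1.45 GB
console.log((rawVectorBytes / 2 ** 30).toFixed(2), "GiB"); // 1.35 GiB
```

Each CVE therefore costs 4096 bytes of raw vector data, which is in line with the figure of roughly 1.4GB quoted above.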