feat: PGVector-backed log indexing and semantic search (#24) #80
khat190 wants to merge 2 commits into
@khat190 is attempting to deploy a commit to the Deekshith Gowda HS's projects Team on Vercel. A member of the Team first needs to authorize it.
📝 Walkthrough
This PR implements semantic search over deployment logs by combining pgvector storage, Cohere embeddings, Groq summarization, and Inngest background jobs. Logs are automatically indexed when deployments go live and can be searched via authenticated API endpoints. A scheduled cron job re-indexes active sandboxes weekly.
Changes
Semantic Log Search
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 3
🧹 Nitpick comments (1)
lib/vector-indexer.ts (1)
65-88: ⚡ Quick win: Consider adding a timeout to external API calls.
The `fetch` calls to Cohere lack a timeout. If the Cohere API is slow or unresponsive, the Inngest function could hang until the platform timeout kicks in, wasting resources. Adding `AbortSignal.timeout()` would allow graceful failure and retry.
♻️ Proposed fix
```diff
 async function embedTexts(texts: string[]): Promise<number[][]> {
   const response = await fetch("https://api.cohere.com/v1/embed", {
     method: "POST",
     headers: {
       Authorization: `Bearer ${getCohereApiKey()}`,
       "Content-Type": "application/json",
       "X-Client-Name": "secdev",
     },
     body: JSON.stringify({
       model: "embed-english-v3.0",
       texts,
       input_type: "search_document",
       truncate: "END",
     }),
+    signal: AbortSignal.timeout(30_000), // 30s timeout
   });
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/vector-indexer.ts` around lines 65 - 88, The embedTexts function currently calls fetch without a timeout; update it to create an AbortSignal via AbortSignal.timeout(...) (or an AbortController with setTimeout) and pass the signal option into fetch to enforce a timeout (choose an appropriate duration, e.g., 10s). Ensure you clear any timers if using AbortController, handle aborts and propagate a clear error message (e.g., "Cohere embed request timed out") alongside other non-OK responses, and keep the rest of the response handling in embedTexts unchanged.
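The timeout handling requested above could look roughly like the sketch below. The helper names (`describeFetchError`, `postJsonWithTimeout`) and the 30-second budget are assumptions made for illustration and are not taken from the PR; only `fetch` and `AbortSignal.timeout()` are standard APIs. A fetch aborted by `AbortSignal.timeout()` rejects with an error named `TimeoutError`, which is what the mapping relies on.

```typescript
// Hypothetical helpers for illustration; not the PR's actual embedTexts code.
const EMBED_TIMEOUT_MS = 30_000; // assumed budget, matching the proposed fix

// Map a fetch failure to a message that makes timeouts easy to spot in logs.
function describeFetchError(err: unknown): string {
  const name =
    typeof err === "object" && err !== null
      ? (err as { name?: unknown }).name
      : undefined;
  if (name === "TimeoutError") {
    return "Cohere embed request timed out";
  }
  return err instanceof Error ? err.message : String(err);
}

// POST a JSON body and give up after timeoutMs instead of hanging.
async function postJsonWithTimeout(
  url: string,
  body: unknown,
  headers: Record<string, string>,
  timeoutMs: number = EMBED_TIMEOUT_MS,
): Promise<unknown> {
  let response: Response;
  try {
    response = await fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/json", ...headers },
      body: JSON.stringify(body),
      signal: AbortSignal.timeout(timeoutMs), // aborts the request on expiry
    });
  } catch (err) {
    // Network failures and timeouts both land here; keep the message clear.
    throw new Error(describeFetchError(err));
  }
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}
```

Because the failure is rethrown as an ordinary error, a retrying job runner such as Inngest would presumably treat a timeout like any other failed attempt.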
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/api/logs/search/route.ts`:
- Around line 135-160: The handleTrigger function currently allows any
authenticated user to trigger indexing for any sandboxId; before calling
inngest.send, query your sandbox/store to verify ownership (e.g., fetch sandbox
by sandboxId and confirm its ownerId or userId matches the authenticated userId)
and return a 403/400 JSON response if the sandbox is missing or not owned by the
requester; ensure this ownership check happens in handleTrigger immediately
after validating sandboxId and before invoking inngest.send so only owners can
enqueue indexing events.
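A minimal sketch of the ownership check described above, under stated assumptions: the `SandboxRecord` shape, the `ownerId` field, and the synchronous `lookup` callback are hypothetical stand-ins for whatever store the project actually uses (a real lookup would likely be async). The decision is computed before any event would be sent.

```typescript
// Hypothetical record shape; the real sandbox schema is not shown in this PR page.
interface SandboxRecord {
  id: string;
  ownerId: string;
}

type TriggerDecision =
  | { allowed: true }
  | { allowed: false; status: 400 | 403; error: string };

// Decide whether userId may enqueue indexing for sandboxId.
function authorizeTrigger(
  sandboxId: unknown,
  userId: string,
  lookup: (id: string) => SandboxRecord | undefined,
): TriggerDecision {
  if (typeof sandboxId !== "string" || sandboxId.length === 0) {
    return { allowed: false, status: 400, error: "sandboxId is required" };
  }
  const sandbox = lookup(sandboxId);
  // Missing and foreign sandboxes get the same answer, so the response
  // does not reveal whether another user's sandbox exists.
  if (!sandbox || sandbox.ownerId !== userId) {
    return { allowed: false, status: 403, error: "Forbidden" };
  }
  return { allowed: true };
}
```

In a route handler, the `inngest.send` call would run only when `allowed` is true; otherwise the handler would return the given status with the error as JSON.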
In `@lib/db.ts`:
- Around line 266-267: Update the outdated comment in the Log vector index table
block to reflect that the schema uses Cohere 1024-dim embeddings (vector(1024))
instead of "1536-dim OpenAI embeddings"; locate the comment text near the "Log
vector index table" header in lib/db.ts and change the wording to mention Cohere
1024-dim embeddings so it matches the schema and avoids confusion.
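One way to keep the code and the `vector(1024)` column from drifting apart is to check the dimension before a vector is written. The sketch below is illustrative only: `EMBEDDING_DIM` and `toVectorLiteral` are hypothetical names, and it assumes vectors are passed to PostgreSQL in pgvector's bracketed text form (`[v1,v2,...]`).

```typescript
// Assumed constant mirroring the vector(1024) column mentioned in the review.
const EMBEDDING_DIM = 1024;

// Serialize an embedding to pgvector's text form, rejecting bad input early.
function toVectorLiteral(
  embedding: number[],
  expectedDim: number = EMBEDDING_DIM,
): string {
  if (embedding.length !== expectedDim) {
    throw new Error(
      `Expected ${expectedDim}-dim embedding, got ${embedding.length}`,
    );
  }
  if (!embedding.every((v) => Number.isFinite(v))) {
    throw new Error("Embedding contains a non-finite value");
  }
  return `[${embedding.join(",")}]`;
}
```

Failing at this point gives a clearer error than a database rejection would, and it keeps the dimension in a single constant next to the schema comment.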
In `@lib/vector-indexer.ts`:
- Around line 156-175: indexSandbox currently calls getDb() and runs queries
without ensuring the schema exists; add an awaited call to ensureTables() at the
start of indexSandbox (before calling getDb()) so the DB schema is created
before any queries run, and import/require ensureTables (the function referenced
in lib/deployer.ts) if it isn’t already imported; ensureTables is awaited (await
ensureTables()) and then proceed to const sql = getDb().
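If `ensureTables()` ends up being awaited at the start of every indexing run, it may be worth making repeated calls cheap. The sketch below shows one possible pattern and is an assumption, not the PR's implementation: `once` is a hypothetical wrapper that shares a single in-flight initialization and allows a retry after a failure.

```typescript
// Wrap an async initializer so repeated and concurrent callers share one run.
function once<T>(init: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (!pending) {
      pending = init().catch((err) => {
        pending = undefined; // forget the failure so a later call can retry
        throw err;
      });
    }
    return pending;
  };
}

// Usage sketch with a stand-in initializer (the real one would run DDL):
let schemaRuns = 0;
const ensureTablesOnce = once(async () => {
  schemaRuns += 1;
});
```

A function like `indexSandbox` would then begin with `await ensureTablesOnce()` before obtaining the database handle, which preserves the ordering the review asks for.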
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 65369052-7980-4098-b703-3bac9f7189d6
📒 Files selected for processing (5)
- app/api/inngest/route.ts
- app/api/logs/search/route.ts
- lib/db.ts
- lib/deployer.ts
- lib/vector-indexer.ts
What changed
- Enabled the pgvector extension and added a new `deployment_log_vectors` table in `lib/db.ts`
- Added database helper functions:
  - `insertLogVector`
  - `searchLogVectors`
  - `pruneExpiredLogVectors`
  - `getLastIndexedLogId`
- Created `lib/vector-indexer.ts` with two Inngest functions:
  - `indexDeploymentLogs` (on-demand indexing)
  - `cronReindexLogs` (daily re-indexing job)
- Created `app/api/logs/search/route.ts` with:
- Updated `lib/deployer.ts` to trigger log indexing when a deployment becomes live
- Updated `app/api/inngest/route.ts` to register the new Inngest functions
Why
Deployment logs were already being stored, but searching them required exact keyword matches. This feature introduces semantic search, allowing developers to find relevant logs using natural-language queries such as:
Logs are chunked, embedded using Cohere embeddings, and stored as vectors in PostgreSQL using pgvector. A retention policy automatically removes vectors older than 30 days to keep storage usage under control.
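The chunking and retention steps described above can be sketched as follows. This is a minimal illustration under assumptions: the `LogLine` and `LogChunk` shapes, the character-based chunk limit, and the function names are hypothetical, and the PR's actual chunking strategy is not shown here. Only the 30-day retention window comes from the description.

```typescript
// Hypothetical shapes; the real log row and chunk schema may differ.
interface LogLine {
  id: number;
  message: string;
}

interface LogChunk {
  firstLogId: number;
  lastLogId: number;
  text: string;
}

// Group consecutive log lines into chunks of at most maxChars characters.
// A single line longer than maxChars becomes its own oversized chunk.
function chunkLogs(lines: LogLine[], maxChars: number): LogChunk[] {
  const chunks: LogChunk[] = [];
  let current: LogLine[] = [];
  let size = 0;

  const flush = (): void => {
    const first = current[0];
    const last = current[current.length - 1];
    if (!first || !last) return; // nothing buffered
    chunks.push({
      firstLogId: first.id,
      lastLogId: last.id,
      text: current.map((l) => l.message).join("\n"),
    });
    current = [];
    size = 0;
  };

  for (const line of lines) {
    let separator = current.length > 0 ? 1 : 0; // newline between lines
    if (current.length > 0 && size + separator + line.message.length > maxChars) {
      flush();
      separator = 0;
    }
    size += separator + line.message.length;
    current.push(line);
  }
  flush();
  return chunks;
}

const RETENTION_DAYS = 30; // retention window stated in the PR description

// Vectors created before this instant would be eligible for pruning.
function retentionCutoff(now: Date, days: number = RETENTION_DAYS): Date {
  return new Date(now.getTime() - days * 24 * 60 * 60 * 1000);
}
```

Recording the first and last log ID per chunk is one way an incremental indexer could resume from the last indexed log rather than re-embedding everything.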
How to test
Add `COHERE_API_KEY` to `.env`
Request body:
{ "sandboxId": "<sandbox-id>" }
AI Assistance
I used Claude AI as a learning and implementation aid while working with pgvector, embeddings, and Inngest workflows. All generated code was reviewed, tested, and modified as needed before submission.
The semantic search flow was tested locally using real Cohere embeddings and returned relevant results for sample deployment log queries.
Known trade-offs
Closes #24