
feat: PGVector-backed log indexing and semantic search (#24) #80

Open
khat190 wants to merge 2 commits into deekshithgowda85:prod from khat190:feat/pgvector-log-search

@khat190

@khat190 khat190 commented Jun 1, 2026

What changed

  • Enabled pgvector extension and added a new deployment_log_vectors table in lib/db.ts

  • Added database helper functions:

    • insertLogVector
    • searchLogVectors
    • pruneExpiredLogVectors
    • getLastIndexedLogId
  • Created lib/vector-indexer.ts with two Inngest functions:

    • indexDeploymentLogs (on-demand indexing)
    • cronReindexLogs (daily re-indexing job)
  • Created app/api/logs/search/route.ts with:

    • GET semantic search endpoint
    • POST semantic search endpoint
    • Manual indexing trigger endpoint
  • Updated lib/deployer.ts to trigger log indexing when a deployment becomes live

  • Updated app/api/inngest/route.ts to register the new Inngest functions
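The deployer-to-indexer handoff listed above can be sketched as a small event builder plus a fire-and-forget emitter. This is a minimal illustration, not the PR's code: the event name `log/index.requested` comes from the review summary further down, while the payload field names (`sandboxId`, `userId`, `ttlDays`) and the `send` callback are assumptions made for the example.

```typescript
// Hypothetical payload shape; field names are assumptions, not the PR's actual schema.
interface IndexRequestedEvent {
  name: "log/index.requested";
  data: { sandboxId: string; userId: string; ttlDays: number };
}

// Mirrors the 30-day retention described in the PR.
const DEFAULT_TTL_DAYS = 30;

function buildIndexRequestedEvent(
  sandboxId: string,
  userId: string,
  ttlDays: number = DEFAULT_TTL_DAYS,
): IndexRequestedEvent {
  if (!sandboxId || !userId) {
    throw new Error("sandboxId and userId are required");
  }
  return { name: "log/index.requested", data: { sandboxId, userId, ttlDays } };
}

// Fire-and-forget: an enqueue failure is logged and never thrown into the deploy path.
// `send` stands in for whatever client actually delivers the event.
function emitIndexRequest(
  send: (event: IndexRequestedEvent) => Promise<unknown>,
  sandboxId: string,
  userId: string,
): void {
  void send(buildIndexRequestedEvent(sandboxId, userId)).catch((err) => {
    console.error("log indexing enqueue failed", err);
  });
}
```

Keeping the builder separate from the emitter makes the payload easy to unit test without a running event queue.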

Why

Deployment logs were already being stored, but searching them required exact keyword matches. This feature introduces semantic search, allowing developers to find relevant logs using natural-language queries such as:

  • "memory error during build"
  • "deployment failed after install step"
  • "container startup issues"

Logs are chunked, embedded using Cohere embeddings, and stored as vectors in PostgreSQL using pgvector. A retention policy automatically removes vectors older than 30 days to keep storage usage under control.
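As a rough illustration of the chunking and retention steps described above, the sketch below groups log lines into size-bounded chunks and computes an expiry timestamp. The function names, the 1000-character limit, and the line-based grouping strategy are assumptions for the example; the PR's actual chunker may work differently.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Groups consecutive log lines into chunks of at most `maxChars` characters.
// A single line longer than `maxChars` is kept whole as its own oversized chunk.
function chunkLogLines(lines: string[], maxChars: number = 1000): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const line of lines) {
    const candidate = current ? `${current}\n${line}` : line;
    if (candidate.length > maxChars && current) {
      chunks.push(current);
      current = line;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Computes when a vector indexed at `indexedAt` becomes eligible for pruning.
function expiresAt(indexedAt: Date, ttlDays: number = 30): Date {
  return new Date(indexedAt.getTime() + ttlDays * MS_PER_DAY);
}
```

Each chunk would then be embedded and stored alongside its expiry timestamp, so a pruning query only has to compare that timestamp with the current time.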

How to test

  1. Add a valid COHERE_API_KEY to .env
  2. Trigger indexing:
POST /api/logs/search?action=trigger

Request body:

{
  "sandboxId": "<sandbox-id>"
}
  3. Perform a semantic search:
GET /api/logs/search?q=build+failed
  4. Verify that semantically related deployment logs are returned even when exact keywords do not match.

Tested locally: triggered indexing for a test sandbox, ran semantic
search for "build failed" and received correct matching log chunks
in the response.
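The two test requests above could be assembled with helpers like the following. This is a sketch under assumptions: the base URL, the helper names, and the optional `sandboxId` query parameter are illustrative, and authentication (which the endpoints require) is left out.

```typescript
// Builds the GET search URL; URLSearchParams encodes spaces as "+".
// The `sandboxId` query parameter name is an assumption for illustration.
function buildSearchUrl(baseUrl: string, query: string, sandboxId?: string): string {
  const params = new URLSearchParams({ q: query });
  if (sandboxId) params.set("sandboxId", sandboxId);
  return `${baseUrl}/api/logs/search?${params.toString()}`;
}

interface TriggerRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds the manual indexing trigger request shown in step 2.
function buildTriggerRequest(baseUrl: string, sandboxId: string): TriggerRequest {
  return {
    url: `${baseUrl}/api/logs/search?action=trigger`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ sandboxId }),
  };
}
```

With a local dev server, `buildSearchUrl("http://localhost:3000", "build failed")` yields the same query string as the GET request in step 3.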

AI Assistance

I used Claude AI as a learning and implementation aid while working with pgvector, embeddings, and Inngest workflows. All generated code was reviewed, tested, and modified as needed before submission.
The semantic search flow was tested locally using real Cohere embeddings and returned relevant results for sample deployment log queries.

Known trade-offs

  • Uses Cohere embeddings instead of OpenAI embeddings because Cohere provides a free tier without requiring a credit card.
  • Embeddings are 1024-dimensional instead of 1536-dimensional, reducing storage requirements at the cost of some representational capacity.
  • Daily re-indexing introduces a small amount of background processing overhead.
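To put the dimension trade-off in numbers: pgvector documents the storage cost of a `vector` value as 4 bytes per dimension plus 8 bytes of overhead. The helper below applies that formula; it ignores row headers, the other table columns, and the index, so treat the result as a lower bound rather than a measurement.

```typescript
// Per-value storage for a pgvector `vector(dims)` column: 4 * dims + 8 bytes.
function vectorBytes(dims: number): number {
  return 4 * dims + 8;
}

// Raw embedding storage for `rows` stored chunks (excludes row and index overhead).
function embeddingStorageBytes(rows: number, dims: number): number {
  return rows * vectorBytes(dims);
}

const cohereBytes = vectorBytes(1024); // 4104 bytes per chunk
const openAiBytes = vectorBytes(1536); // 6152 bytes per chunk
const savedFraction = 1 - cohereBytes / openAiBytes; // roughly one third
```

On embedding storage alone, the 1024-dimensional vectors take about a third less space per chunk than 1536-dimensional ones.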

Closes #24

Summary by CodeRabbit

  • New Features
    • Added semantic log search to find similar deployment logs by content similarity
    • Logs automatically indexed for search when deployments transition to live status
    • Manual log indexing trigger available with configurable retention settings

@vercel

vercel Bot commented Jun 1, 2026

@khat190 is attempting to deploy a commit to the Deekshith Gowda HS's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai

coderabbitai Bot commented Jun 1, 2026


📝 Walkthrough


This PR implements semantic search over deployment logs by combining pgvector storage, Cohere embeddings, Groq summarization, and Inngest background jobs. Logs are automatically indexed when deployments go live and can be searched via authenticated API endpoints. A scheduled cron job re-indexes active sandboxes weekly.

Changes

Semantic Log Search

| Layer / File(s) | Summary |
| --- | --- |
| Vector database schema and operations: lib/db.ts | Enables pgvector extension and creates deployment_log_vectors table with vector embeddings, TTL-based expiry, and HNSW indexing. Exports LogVector interface and CRUD functions: insertLogVector, searchLogVectors (cosine similarity with optional sandbox filter), pruneExpiredLogVectors, and getLastIndexedLogId for incremental indexing. |
| Indexing pipeline with embeddings: lib/vector-indexer.ts | Implements Cohere API integration for text embeddings (search_document mode for chunks, search_query mode for queries), optional Groq-based chunk summarization, and core indexSandbox routine that fetches new logs, chunks them, embeds in batch, and persists to database. Exports two Inngest functions: indexDeploymentLogs (on-demand) and cronReindexLogs (weekly re-index of active sandboxes). |
| Deployment-triggered indexing: lib/deployer.ts | Extends runDeploymentPipeline to accept userId and emits fire-and-forget Inngest event log/index.requested with sandbox, user, and 30-day TTL when deployment transitions to live status. |
| Search API and Inngest wiring: app/api/logs/search/route.ts, app/api/inngest/route.ts | Adds authenticated GET/POST /api/logs/search supporting query embedding, per-user vector search with optional sandbox scoping, and ?action=trigger to manually request indexing. Registers indexDeploymentLogs and cronReindexLogs with Inngest route handler. |
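For readers unfamiliar with the ranking that searchLogVectors delegates to the database, the sketch below reproduces it in memory. pgvector's `<=>` operator returns cosine distance, that is, 1 minus cosine similarity, and results are ordered by ascending distance. The `StoredChunk` shape and the `topK` helper are illustrative assumptions; the real query runs in SQL against the HNSW index and is approximate rather than exhaustive.

```typescript
// Illustrative row shape; the real table has more columns.
interface StoredChunk {
  id: number;
  text: string;
  embedding: number[];
}

// Cosine distance = 1 - (a . b) / (|a| * |b|); 0 means same direction.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) {
    throw new Error("cosine distance is undefined for a zero vector");
  }
  return 1 - dot / Math.sqrt(normA * normB);
}

// Exhaustive nearest-neighbour search: smallest distance first.
function topK(query: number[], rows: StoredChunk[], k: number): StoredChunk[] {
  return [...rows]
    .sort((x, y) => cosineDistance(query, x.embedding) - cosineDistance(query, y.embedding))
    .slice(0, k);
}
```

Because cosine distance ignores vector length, two chunks pointing in the same direction rank as identical even if their magnitudes differ.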

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related issues

  • deekshithgowda85/SecDev#24: The PR directly implements the pgvector-backed log indexing and search feature described in this issue, including vector embeddings, Inngest orchestration, and the search API.

Possibly related PRs

  • deekshithgowda85/SecDev#2: Both PRs modify app/api/inngest/route.ts to register additional Inngest functions with the Next.js route handler wiring.

Suggested labels

gssoc:approved, quality:clean

Poem

🐰 Log chunks meet embeddings bright,
Cohere dreams in vectors' light,
Inngest orchestrates the dance,
Search semantic at a glance—
Logs transformed to insight's flight! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely summarizes the primary feature: PGVector-backed log indexing and semantic search capabilities being added to the codebase.
Docstring Coverage ✅ Passed Docstring coverage is 90.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
lib/vector-indexer.ts (1)

65-88: ⚡ Quick win

Consider adding a timeout to external API calls.

The fetch calls to Cohere lack a timeout. If the Cohere API is slow or unresponsive, the Inngest function could hang until the platform timeout kicks in, wasting resources. Adding AbortSignal.timeout() would allow graceful failure and retry.

♻️ Proposed fix
 async function embedTexts(texts: string[]): Promise<number[][]> {
   const response = await fetch("https://api.cohere.com/v1/embed", {
     method: "POST",
     headers: {
       Authorization: `Bearer ${getCohereApiKey()}`,
       "Content-Type": "application/json",
       "X-Client-Name": "secdev",
     },
     body: JSON.stringify({
       model: "embed-english-v3.0",
       texts,
       input_type: "search_document",
       truncate: "END",
     }),
+    signal: AbortSignal.timeout(30_000), // 30s timeout
   });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/vector-indexer.ts` around lines 65 - 88, The embedTexts function
currently calls fetch without a timeout; update it to create an AbortSignal via
AbortSignal.timeout(...) (or an AbortController with setTimeout) and pass the
signal option into fetch to enforce a timeout (choose an appropriate duration,
e.g., 10s). Ensure you clear any timers if using AbortController, handle aborts
and propagate a clear error message (e.g., "Cohere embed request timed out")
alongside other non-OK responses, and keep the rest of the response handling in
embedTexts unchanged.
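One way the timeout handling described in this prompt might look, as a sketch rather than the PR's implementation: `AbortSignal.timeout()` makes `fetch` reject with an error whose `name` is `"TimeoutError"`, which can be mapped to the clearer message the prompt asks for. The helper names and the generic JSON wrapper are assumptions for the example, and it relies on a runtime with global `fetch` and `AbortSignal.timeout` (for example Node 18 or later).

```typescript
const EMBED_TIMEOUT_MS = 30_000;

// Maps a timeout rejection to a clear message; other errors keep their own message.
function describeEmbedError(err: unknown): string {
  if (typeof err === "object" && err !== null && "name" in err) {
    if ((err as { name: unknown }).name === "TimeoutError") {
      return "Cohere embed request timed out";
    }
  }
  return err instanceof Error ? err.message : String(err);
}

// Hypothetical wrapper: POSTs JSON and fails fast when the server is unresponsive.
async function postJsonWithTimeout(
  url: string,
  body: unknown,
  headers: Record<string, string>,
  timeoutMs: number = EMBED_TIMEOUT_MS,
): Promise<unknown> {
  try {
    const response = await fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/json", ...headers },
      body: JSON.stringify(body),
      signal: AbortSignal.timeout(timeoutMs),
    });
    if (!response.ok) {
      throw new Error(`request failed with status ${response.status}`);
    }
    return await response.json();
  } catch (err) {
    throw new Error(describeEmbedError(err));
  }
}
```

Since `AbortSignal.timeout()` manages its own timer, there is nothing to clear afterwards; the manual `AbortController` plus `setTimeout` variant is the one that needs cleanup.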
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/api/logs/search/route.ts`:
- Around line 135-160: The handleTrigger function currently allows any
authenticated user to trigger indexing for any sandboxId; before calling
inngest.send, query your sandbox/store to verify ownership (e.g., fetch sandbox
by sandboxId and confirm its ownerId or userId matches the authenticated userId)
and return a 403/400 JSON response if the sandbox is missing or not owned by the
requester; ensure this ownership check happens in handleTrigger immediately
after validating sandboxId and before invoking inngest.send so only owners can
enqueue indexing events.

In `@lib/db.ts`:
- Around line 266-267: Update the outdated comment in the Log vector index table
block to reflect that the schema uses Cohere 1024-dim embeddings (vector(1024))
instead of "1536-dim OpenAI embeddings"; locate the comment text near the "Log
vector index table" header in lib/db.ts and change the wording to mention Cohere
1024-dim embeddings so it matches the schema and avoids confusion.

In `@lib/vector-indexer.ts`:
- Around line 156-175: indexSandbox currently calls getDb() and runs queries
without ensuring the schema exists; add an awaited call to ensureTables() at the
start of indexSandbox (before calling getDb()) so the DB schema is created
before any queries run, and import/require ensureTables (the function referenced
in lib/deployer.ts) if it isn’t already imported; await the call (await
ensureTables()) and only then proceed to const sql = getDb().

---

Nitpick comments:
In `@lib/vector-indexer.ts`:
- Around line 65-88: The embedTexts function currently calls fetch without a
timeout; update it to create an AbortSignal via AbortSignal.timeout(...) (or an
AbortController with setTimeout) and pass the signal option into fetch to
enforce a timeout (choose an appropriate duration, e.g., 10s). Ensure you clear
any timers if using AbortController, handle aborts and propagate a clear error
message (e.g., "Cohere embed request timed out") alongside other non-OK
responses, and keep the rest of the response handling in embedTexts unchanged.
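The ownership check requested for handleTrigger could be isolated as a pure decision function, sketched below. The `SandboxRecord` shape, the `ownerId` field, and the function name are assumptions; how the sandbox is actually looked up depends on the project's data layer. Returning 403 for both a missing and a foreign sandbox avoids revealing which sandbox IDs exist.

```typescript
// Hypothetical sandbox record; the real schema may name these fields differently.
interface SandboxRecord {
  id: string;
  ownerId: string;
}

type TriggerDecision =
  | { ok: true }
  | { ok: false; status: 400 | 403; error: string };

// Decides whether `userId` may enqueue indexing for the requested sandbox.
// `sandbox` is the result of looking up `sandboxId`, or null when nothing was found.
function authorizeTrigger(
  sandboxId: unknown,
  sandbox: SandboxRecord | null,
  userId: string,
): TriggerDecision {
  if (typeof sandboxId !== "string" || sandboxId.trim() === "") {
    return { ok: false, status: 400, error: "sandboxId is required" };
  }
  if (!sandbox || sandbox.ownerId !== userId) {
    return { ok: false, status: 403, error: "sandbox not found or not owned by requester" };
  }
  return { ok: true };
}
```

The route handler would call this right after validating the request body and only send the indexing event when `ok` is true.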

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 65369052-7980-4098-b703-3bac9f7189d6

📥 Commits

Reviewing files that changed from the base of the PR and between 2d65a13 and d0ad7f2.

📒 Files selected for processing (5)
  • app/api/inngest/route.ts
  • app/api/logs/search/route.ts
  • lib/db.ts
  • lib/deployer.ts
  • lib/vector-indexer.ts


Development

Successfully merging this pull request may close these issues.

[Feature]: PGVector-backed log indexing and search