PromptOps is a production-minded prompt versioning, evaluation, and testing platform for teams shipping AI features. It combines Git-like prompt lifecycle management, a runnable playground, batch evaluations on datasets, pluggable scoring, queue-backed execution, and an operational UI for comparing versions and promoting winners safely.
Shipping LLM features without versioning and evaluation creates silent regressions: prompts drift, models change, and nobody can prove which variant actually worked. PromptOps makes prompt changes reviewable, measurable, and reversible with audit trails, structured runs, and explicit promotion rules.
- Email/password auth with JWT access tokens and rotating refresh sessions
- Prompt registry with tags, categories, and ownership
- Versioned prompts with variables, model settings, changelog, and lifecycle states
- Datasets and test cases with optional expectations (keywords, JSON schema, reference text)
- Playground runs with compiled preview, async execution via BullMQ, persisted runs
- Batch evaluations across many cases and prompt versions with progress tracking
- Pluggable evaluators (keyword, regex, JSON schema, similarity, heuristic, LLM-judge adapter)
- Manual review scores stored per case result
- Experiment views with charts and per-case inspection
- Single active version per prompt with promotion history and audit logs
- Health, readiness, and liveness endpoints; structured logging; centralized API errors
                     +------------------+
                     |  Next.js (web)   |
                     +--------+---------+
                              | REST + JWT
                              v
+----------------+    +-------+--------+     +----------------+
| Redis          <----+  Fastify API   +---->|   PostgreSQL   |
| (BullMQ broker)|    +-------+--------+     |    (Prisma)    |
+-------+--------+            |              +----------------+
        ^                     |
        |            +--------v---------+
        +------------+  Worker service  |
                     |  (BullMQ cons.)  |
                     +--------+---------+
                              |
                     +--------v---------+
                     | OpenAI-compatible|
                     | provider (HTTP)  |
                     +------------------+
apps/api Fastify REST API (/api/v1)
apps/worker BullMQ workers (prompt-run, evaluation-run, evaluation-case)
apps/web Next.js App Router UI (shadcn/ui, Recharts)
packages/db Prisma schema + client export
packages/shared Template compilation, diffs, promotion helpers, scoring utils
packages/evals Evaluator implementations + runner
packages/ai OpenAI-compatible HTTP provider abstraction
packages/config Zod-backed env loading + pagination helpers
packages/logger Pino logger factory
packages/types Shared API typing helpers
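
The layout above lists template compilation under packages/shared. As a rough illustration of what compiling a versioned prompt with variables could involve, here is a minimal sketch. The `{{name}}` placeholder syntax, the `compileTemplate` name, and the choice to leave unknown placeholders in place are all assumptions for illustration, not the project's actual implementation.

```typescript
// Hypothetical sketch: the {{name}} placeholder syntax and this function's
// name and behavior are assumptions, not the real packages/shared API.
type TemplateVariables = Record<string, string>;

interface CompileResult {
  text: string; // template with every supplied variable substituted
  missing: string[]; // placeholder names that had no value supplied
}

function compileTemplate(template: string, variables: TemplateVariables): CompileResult {
  const missing = new Set<string>();
  const text = template.replace(
    /\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g,
    (match: string, name: string): string => {
      const value = Object.prototype.hasOwnProperty.call(variables, name)
        ? variables[name]
        : undefined;
      if (value !== undefined) {
        return value;
      }
      missing.add(name);
      return match; // keep unknown placeholders visible in the compiled preview
    },
  );
  return { text, missing: Array.from(missing) };
}

const preview = compileTemplate("Summarize {{ topic }} for {{audience}}.", {
  topic: "release notes",
});
console.log(preview.text, preview.missing);
```

Reporting missing variables instead of throwing is one possible design: it lets a compiled preview render while still flagging incomplete input before a run is queued.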
- Node.js 20+
- pnpm 9+
- Docker (optional, for compose stack)
cp .env.example .env
pnpm install
docker compose up -d postgres redis
pnpm db:generate
pnpm db:migrate
pnpm dev
- API: http://localhost:4000
- Web: http://localhost:3000
| Script | Purpose |
|---|---|
| pnpm dev | API + worker + web concurrently |
| pnpm build | Production build for all packages |
| pnpm lint / pnpm typecheck | Quality gates |
| pnpm test | Unit tests (Vitest) |
| pnpm test:e2e | Playwright (install browsers first) |
| pnpm db:migrate | Prisma migrate (dev) |
| pnpm db:migrate:deploy | Prisma migrate (CI/prod) |
| pnpm db:generate | Regenerate Prisma client |
See .env.example. Critical values:
- DATABASE_URL, REDIS_URL
- JWT_ACCESS_SECRET, JWT_REFRESH_SECRET (each ≥ 32 characters)
- OPENAI_API_KEY / OPENAI_BASE_URL for real model calls (worker + playground)
- NEXT_PUBLIC_API_URL for the browser-facing API base URL
- CORS_ORIGIN (comma-separated allowed origins)
Prisma migrations live in packages/db/prisma/migrations. For a fresh database:
pnpm db:migrate:deploy
- Unit tests: template compilation, weighted scoring, promotion rules, diff helpers, password hashing, JWT signing, evaluator primitives (packages/shared, packages/evals, apps/api).
- Integration tests: run against a real Postgres + Redis instance (see CI workflow). Extend apps/api with fastify.inject suites wired to DATABASE_URL when you add full HTTP integration coverage.
- E2E: Playwright smoke under apps/web/e2e (pnpm test:e2e after pnpm exec playwright install).
Base path: /api/v1
- Auth: register, login, refresh, logout, me
- Prompts + versions: CRUD, clone, promote, diff
- Datasets + cases: CRUD
- Playground run, runs list/detail
- Evaluations: create, detail, results pagination, manual review
- System: /health, /live, /ready
- Dashboard aggregate: /stats
All list endpoints accept pagination (page, pageSize, sort, order, q where applicable).
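
To make the pagination parameters concrete, the sketch below normalizes raw query strings into values a list handler could use. The defaults (page 1, 20 items per page, a cap of 100), the `parsePagination` name, and the sortable-field allowlist are assumptions; the project's own helpers in packages/config are Zod-backed and may behave differently.

```typescript
// Hypothetical sketch: defaults, limits, and names below are assumptions,
// not the actual helpers from packages/config.
interface PaginationQuery {
  page?: string;
  pageSize?: string;
  sort?: string;
  order?: string;
  q?: string;
}

interface Pagination {
  page: number;
  pageSize: number;
  skip: number; // rows to skip for an offset-based query
  take: number; // rows to fetch
  sort: string | undefined;
  order: "asc" | "desc";
  q: string | undefined;
}

const DEFAULT_PAGE_SIZE = 20;
const MAX_PAGE_SIZE = 100;

// Parses a positive integer, falling back when the value is absent or invalid.
function toPositiveInt(raw: string | undefined, fallback: number): number {
  const n = Number.parseInt(raw ?? "", 10);
  return Number.isFinite(n) && n > 0 ? n : fallback;
}

function parsePagination(query: PaginationQuery, sortable: readonly string[]): Pagination {
  const page = toPositiveInt(query.page, 1);
  const pageSize = Math.min(toPositiveInt(query.pageSize, DEFAULT_PAGE_SIZE), MAX_PAGE_SIZE);
  // Only accept sort fields from an allowlist so callers cannot sort on arbitrary columns.
  const sort =
    query.sort !== undefined && sortable.indexOf(query.sort) !== -1 ? query.sort : undefined;
  const order: "asc" | "desc" = query.order === "asc" ? "asc" : "desc";
  const trimmed = query.q === undefined ? "" : query.q.trim();
  const q = trimmed.length > 0 ? trimmed : undefined;
  return { page, pageSize, skip: (page - 1) * pageSize, take: pageSize, sort, order, q };
}

console.log(parsePagination({ page: "2", pageSize: "10", order: "asc" }, ["createdAt"]));
```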
Evaluators implement a small interface: they accept structured context (output, case input, expectations) and return weighted score components. runEvaluators aggregates with configurable weights stored on each EvaluationRun in evaluatorConfig. Adding a new evaluator means a new module under packages/evals plus registration in the worker case processor.
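
The evaluator contract described above could look roughly like the following. This is a sketch under stated assumptions: the type names, the `[0, 1]` score range, the two example evaluators, and the weighted-mean aggregation are illustrative, and the `runEvaluators` signature here is a guess rather than the actual one in packages/evals.

```typescript
// Hypothetical sketch: type names, fields, and the aggregation rule are
// assumptions; only the general shape follows the description above.
interface EvaluatorContext {
  output: string; // model output under test
  input: Record<string, string>; // test case input variables
  expectations: { keywords?: string[] };
}

interface EvaluatorResult {
  name: string;
  score: number; // assumed to be normalized to [0, 1]
}

type Evaluator = (ctx: EvaluatorContext) => EvaluatorResult;

// Fraction of expected keywords found in the output (case-insensitive).
const keywordEvaluator: Evaluator = (ctx) => {
  const keywords = ctx.expectations.keywords ?? [];
  if (keywords.length === 0) return { name: "keyword", score: 1 };
  const haystack = ctx.output.toLowerCase();
  const hits = keywords.filter((k) => haystack.indexOf(k.toLowerCase()) !== -1).length;
  return { name: "keyword", score: hits / keywords.length };
};

// Scores 1 when the output is non-empty and within a length budget, else 0.
const lengthEvaluator: Evaluator = (ctx) => ({
  name: "length",
  score: ctx.output.length > 0 && ctx.output.length <= 280 ? 1 : 0,
});

// Weighted mean of evaluator scores; evaluators without a weight default to 1.
function runEvaluators(
  evaluators: Evaluator[],
  ctx: EvaluatorContext,
  weights: Record<string, number>,
): { scores: EvaluatorResult[]; total: number } {
  const scores = evaluators.map((evaluate) => evaluate(ctx));
  let weightedSum = 0;
  let weightSum = 0;
  for (const s of scores) {
    const w = weights[s.name] ?? 1;
    weightedSum += w * s.score;
    weightSum += w;
  }
  return { scores, total: weightSum > 0 ? weightedSum / weightSum : 0 };
}

const result = runEvaluators(
  [keywordEvaluator, lengthEvaluator],
  { output: "Refund issued within 5 days", input: {}, expectations: { keywords: ["refund", "apology"] } },
  { keyword: 3, length: 1 },
);
console.log(result.total);
```

In this example the keyword evaluator finds one of two keywords (0.5) and the length check passes (1), so with weights 3 and 1 the weighted mean is (3 × 0.5 + 1 × 1) / 4 = 0.625.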
sequenceDiagram
participant U as User
participant API as Fastify API
participant DB as Postgres
U->>API: POST /prompts/:id/versions
API->>DB: insert PromptVersion + variables
API->>DB: audit VERSION_CREATE
API-->>U: 201 + version payload
sequenceDiagram
participant U as User
participant API as API
participant Q as Redis/BullMQ
participant W as Worker
participant P as Provider
U->>API: POST /evaluations
API->>Q: enqueue evaluation-run
API-->>U: 201 + run id
W->>Q: fan out evaluation-case jobs
W->>P: completions per case/version
W->>W: run evaluators + persist EvaluationCaseResult
sequenceDiagram
participant W as Worker
participant E as Evaluators
participant DB as Postgres
W->>E: runEvaluators(...)
E-->>W: per-evaluator scores + weighted total
W->>DB: upsert EvaluationCaseResult (autoScores, total)
sequenceDiagram
participant U as User
participant API as API
participant DB as Postgres
U->>API: POST /prompt-versions/:id/promote
API->>DB: transaction demote other ACTIVE, set target ACTIVE
API->>DB: insert PromotionHistory + audit VERSION_PROMOTE
API-->>U: updated version
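
The promotion flow above keeps a single ACTIVE version per prompt. The in-memory sketch below illustrates that invariant only; the real flow runs inside a database transaction. The `promoteVersion` name, the record shapes, and the choice of ARCHIVED as the demoted state are assumptions, since the lifecycle states other than ACTIVE are not spelled out here.

```typescript
// Hypothetical in-memory sketch of the single-active invariant. Names, record
// shapes, and the ARCHIVED demotion state are assumptions for illustration.
type VersionStatus = "DRAFT" | "ACTIVE" | "ARCHIVED";

interface PromptVersion {
  id: string;
  promptId: string;
  status: VersionStatus;
}

interface PromotionRecord {
  promptId: string;
  fromVersionId: string | null; // previously active version, if any
  toVersionId: string;
}

function promoteVersion(
  versions: PromptVersion[],
  targetId: string,
): { versions: PromptVersion[]; history: PromotionRecord } {
  const target = versions.find((v) => v.id === targetId);
  if (!target) throw new Error(`Unknown version: ${targetId}`);

  const previous = versions.find(
    (v) => v.promptId === target.promptId && v.status === "ACTIVE" && v.id !== targetId,
  );

  // Demote every other ACTIVE version of the same prompt, then activate the target.
  const next = versions.map((v): PromptVersion => {
    if (v.id === targetId) return { ...v, status: "ACTIVE" };
    if (v.promptId === target.promptId && v.status === "ACTIVE") {
      return { ...v, status: "ARCHIVED" };
    }
    return v;
  });

  return {
    versions: next,
    history: {
      promptId: target.promptId,
      fromVersionId: previous ? previous.id : null,
      toVersionId: targetId,
    },
  };
}

const promoted = promoteVersion(
  [
    { id: "v1", promptId: "p1", status: "ACTIVE" },
    { id: "v2", promptId: "p1", status: "DRAFT" },
  ],
  "v2",
);
console.log(promoted.versions, promoted.history);
```

Returning the history record alongside the updated versions mirrors the diagram, where the status change and the PromotionHistory insert belong to the same unit of work.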
docker compose up --build starts Postgres, Redis, API, worker, and web. Run migrations inside the API container on first boot:
docker compose run --rm api sh -lc "pnpm db:migrate:deploy"
- Passwords hashed with bcrypt; refresh tokens stored hashed (SHA-256) server-side
- JWT secrets must be strong and unique per environment
- Rate limiting + request size limits on the API
- Structured logging redacts common secret-bearing fields
- Provider keys belong in environment or a future secrets vault—never in prompts or datasets
- Horizontally scale apps/worker behind the same Redis URL; BullMQ coordinates concurrency
- Read replicas can back analytics-style queries (dashboard aggregates) as traffic grows
- Partition evaluation case jobs by tenant or prompt for noisy-neighbor isolation
- Move provider calls to dedicated inference gateways for token accounting and policy enforcement
- Postgres JSON fields for flexible evaluator payloads instead of over-normalized score tables—fast to ship, easy to query for v1, can be split later
- OpenAI-compatible HTTP keeps the dependency surface small; native SDKs can wrap the same interface later
- LocalStorage tokens in the web app simplify the portfolio demo; production would move access tokens to HTTP-only cookies
- Multi-tenant workspaces and RBAC
- Encrypted provider credential storage per user/team
- Real-time evaluation progress via WebSocket or SSE
- Golden-run baselines and automatic regression alerts
- Durable idempotency keys for provider calls
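
The security notes above mention that refresh tokens are stored server-side as SHA-256 hashes. A minimal sketch of that idea, using only Node's built-in crypto module, might look like this. The function names, the 32-byte token size, and the hex storage format are assumptions, not the project's actual code.

```typescript
// Hypothetical sketch: function names, token size, and hex storage format
// are assumptions. Only node:crypto built-ins are used.
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

// Generates an opaque, URL-safe refresh token to hand to the client.
function generateRefreshToken(): string {
  return randomBytes(32).toString("base64url");
}

// Hashes the token so only the digest needs to be persisted.
function hashRefreshToken(token: string): string {
  return createHash("sha256").update(token, "utf8").digest("hex");
}

// Compares a presented token against a stored hash in constant time.
function verifyRefreshToken(presented: string, storedHash: string): boolean {
  const a = Buffer.from(hashRefreshToken(presented), "hex");
  const b = Buffer.from(storedHash, "hex");
  return a.length === b.length && timingSafeEqual(a, b);
}

const token = generateRefreshToken();
const stored = hashRefreshToken(token); // this digest is what would be saved
console.log(verifyRefreshToken(token, stored));
```

Because the tokens are high-entropy random values, a fast hash is a reasonable fit here, unlike for passwords, where a slow hash such as bcrypt is used. With rotation, each successful refresh would replace the stored hash with the hash of a newly issued token.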
Add screenshots under docs/screenshots/ (dashboard, evaluation detail, playground) when you capture them.
GitHub Actions workflow (.github/workflows/ci.yml) installs dependencies, applies migrations to a service container database, runs lint, typecheck, tests, and production builds.