batincelik/prompt-version-testing

PromptOps

PromptOps is a production-minded prompt versioning, evaluation, and testing platform for teams shipping AI features. It combines Git-like prompt lifecycle management, a runnable playground, batch evaluations on datasets, pluggable scoring, queue-backed execution, and an operational UI for comparing versions and promoting winners safely.

Why it matters

Shipping LLM features without versioning and evaluation creates silent regressions: prompts drift, models change, and nobody can prove which variant actually worked. PromptOps makes prompt changes reviewable, measurable, and reversible with audit trails, structured runs, and explicit promotion rules.

Key features

  • Email/password auth with JWT access tokens and rotating refresh sessions
  • Prompt registry with tags, categories, and ownership
  • Versioned prompts with variables, model settings, changelog, and lifecycle states
  • Datasets and test cases with optional expectations (keywords, JSON schema, reference text)
  • Playground runs with a compiled preview, async execution via BullMQ, and persisted run records
  • Batch evaluations across many cases and prompt versions with progress tracking
  • Pluggable evaluators (keyword, regex, JSON schema, similarity, heuristic, LLM-judge adapter)
  • Manual review scores stored per case result
  • Experiment views with charts and per-case inspection
  • Single active version per prompt with promotion history and audit logs
  • Health, readiness, and liveness endpoints; structured logging; centralized API errors

Architecture (ASCII)

                     +------------------+
                     |   Next.js (web)  |
                     +--------+---------+
                              |  REST + JWT
                              v
+----------------+    +-------+--------+     +----------------+
|     Redis      <----+  Fastify API   +---->|  PostgreSQL    |
| (BullMQ broker)|    +-------+--------+     |  (Prisma)      |
+-------+--------+            |              +----------------+
        ^                     |
        |            +--------v---------+
        +------------+  Worker service  |
                     |  (BullMQ cons.)  |
                     +--------+---------+
                              |
                     +--------v---------+
                     | OpenAI-compatible|
                     | provider (HTTP)  |
                     +------------------+

Monorepo layout

apps/api          Fastify REST API (/api/v1)
apps/worker       BullMQ workers (prompt-run, evaluation-run, evaluation-case)
apps/web          Next.js App Router UI (shadcn/ui, Recharts)
packages/db       Prisma schema + client export
packages/shared   Template compilation, diffs, promotion helpers, scoring utils
packages/evals    Evaluator implementations + runner
packages/ai       OpenAI-compatible HTTP provider abstraction
packages/config   Zod-backed env loading + pagination helpers
packages/logger   Pino logger factory
packages/types    Shared API typing helpers

Prerequisites

  • Node.js 20+
  • pnpm 9+
  • Docker (optional, for compose stack)

Setup

cp .env.example .env
pnpm install
docker compose up -d postgres redis
pnpm db:generate
pnpm db:migrate
pnpm dev

Local endpoints once pnpm dev is running:

  • API: http://localhost:4000
  • Web: http://localhost:3000

Scripts

Script                        Purpose
pnpm dev                      API + worker + web concurrently
pnpm build                    Production build for all packages
pnpm lint / pnpm typecheck    Quality gates
pnpm test                     Unit tests (Vitest)
pnpm test:e2e                 Playwright (install browsers first)
pnpm db:migrate               Prisma migrate (dev)
pnpm db:migrate:deploy        Prisma migrate (CI/prod)
pnpm db:generate              Regenerate Prisma client

Environment variables

See .env.example. Critical values:

  • DATABASE_URL, REDIS_URL
  • JWT_ACCESS_SECRET, JWT_REFRESH_SECRET (each ≥ 32 characters)
  • OPENAI_API_KEY / OPENAI_BASE_URL for real model calls (worker + playground)
  • NEXT_PUBLIC_API_URL for the browser-facing API base URL
  • CORS_ORIGIN (comma-separated allowed origins)
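
The constraints listed above could be checked at startup roughly as follows. This is a dependency-free sketch for illustration only: the repository's packages/config is described as Zod-backed, and the `loadEnv` function, the `AppEnv` shape, and the error messages here are assumptions rather than the project's actual code.

```typescript
// Hypothetical, dependency-free sketch of startup env validation.
// The real loader in packages/config is Zod-backed; all names here are illustrative.
interface AppEnv {
  databaseUrl: string;
  redisUrl: string;
  jwtAccessSecret: string;
  jwtRefreshSecret: string;
  corsOrigins: string[];
}

function loadEnv(source: Record<string, string | undefined>): AppEnv {
  const errors: string[] = [];

  // Record a problem instead of throwing so every issue is reported at once.
  const required = (key: string): string => {
    const value = source[key];
    if (!value) errors.push(`${key} is required`);
    return value ?? "";
  };

  // JWT secrets must be at least 32 characters, per the README.
  const secret = (key: string): string => {
    const value = required(key);
    if (value && value.length < 32) {
      errors.push(`${key} must be at least 32 characters`);
    }
    return value;
  };

  const env: AppEnv = {
    databaseUrl: required("DATABASE_URL"),
    redisUrl: required("REDIS_URL"),
    jwtAccessSecret: secret("JWT_ACCESS_SECRET"),
    jwtRefreshSecret: secret("JWT_REFRESH_SECRET"),
    // CORS_ORIGIN is a comma-separated list of allowed origins.
    corsOrigins: (source["CORS_ORIGIN"] ?? "")
      .split(",")
      .map((origin) => origin.trim())
      .filter((origin) => origin.length > 0),
  };

  if (errors.length > 0) {
    throw new Error(`Invalid environment: ${errors.join("; ")}`);
  }
  return env;
}
```

Calling `loadEnv(process.env)` once at process start would make a misconfigured deployment fail fast instead of surfacing later as an auth or connection error.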

Database migrations

Prisma migrations live in packages/db/prisma/migrations. For a fresh database:

pnpm db:migrate:deploy

Testing strategy

  • Unit tests: template compilation, weighted scoring, promotion rules, diff helpers, password hashing, JWT signing, evaluator primitives (packages/shared, packages/evals, apps/api).
  • Integration tests: run against a real Postgres + Redis instance (see CI workflow). Extend apps/api with fastify.inject suites wired to DATABASE_URL when you add full HTTP integration coverage.
  • E2E: Playwright smoke under apps/web/e2e (pnpm test:e2e after pnpm exec playwright install).

API surface

Base path: /api/v1

  • Auth: register, login, refresh, logout, me
  • Prompts + versions: CRUD, clone, promote, diff
  • Datasets + cases: CRUD
  • Playground run, runs list/detail
  • Evaluations: create, detail, results pagination, manual review
  • System: /health, /live, /ready
  • Dashboard aggregate: /stats

All list endpoints accept the pagination parameters page, pageSize, sort, and order, plus q where applicable.
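
A handler might normalize these parameters along the following lines. The parameter names come from the list above; the defaults (page 1, pageSize 20), the cap of 100, the default sort order, and the treatment of q as a trimmed free-text term are all assumptions made for this sketch, as are the function names.

```typescript
// Hypothetical sketch: normalizing list-endpoint query parameters.
// Defaults and the pageSize cap are assumptions, not values from the repository.
interface ListQuery {
  page: number;
  pageSize: number;
  sort: string | undefined;
  order: "asc" | "desc";
  q: string | undefined;
}

function parseListQuery(raw: Record<string, string | undefined>): ListQuery {
  // Parse a positive integer, falling back when the value is missing or invalid.
  const toInt = (value: string | undefined, fallback: number): number => {
    const n = Number.parseInt(value ?? "", 10);
    return Number.isFinite(n) && n > 0 ? n : fallback;
  };

  return {
    page: toInt(raw["page"], 1),
    pageSize: Math.min(toInt(raw["pageSize"], 20), 100),
    sort: raw["sort"],
    order: raw["order"] === "asc" ? "asc" : "desc",
    q: raw["q"]?.trim() || undefined,
  };
}

// Translate page/pageSize into the skip/take pair that Prisma's findMany accepts.
function toSkipTake(query: ListQuery): { skip: number; take: number } {
  return { skip: (query.page - 1) * query.pageSize, take: query.pageSize };
}
```

Capping pageSize on the server keeps a single request from pulling an unbounded result set, whatever the client asks for.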

Evaluation engine

Evaluators implement a small interface: each accepts a structured context (model output, case input, expectations) and returns weighted score components. runEvaluators aggregates those components using the configurable weights stored in evaluatorConfig on each EvaluationRun. Adding a new evaluator means adding a module under packages/evals and registering it in the worker's case processor.
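
To make the shape of that interface concrete, here is a minimal sketch with two simple evaluators and a weighted aggregate. Only the name runEvaluators and the idea of per-run weights come from this README; the `EvalContext` and `Evaluator` types, the [0, 1] score range, the null-means-not-applicable convention, and the weighted-mean formula are assumptions, and the actual code in packages/evals may differ.

```typescript
// Hypothetical shapes; the real interface lives in packages/evals and may differ.
interface EvalContext {
  output: string;
  input: Record<string, unknown>;
  expectations: { keywords?: string[]; regex?: string };
}

interface Evaluator {
  name: string;
  // Returns a score in [0, 1], or null when the evaluator does not apply.
  evaluate(ctx: EvalContext): number | null;
}

// Fraction of expected keywords found in the output (case-insensitive).
const keywordEvaluator: Evaluator = {
  name: "keyword",
  evaluate(ctx) {
    const keywords = ctx.expectations.keywords ?? [];
    if (keywords.length === 0) return null;
    const text = ctx.output.toLowerCase();
    const hits = keywords.filter((k) => text.includes(k.toLowerCase())).length;
    return hits / keywords.length;
  },
};

// 1 when the output matches the expected pattern, otherwise 0.
const regexEvaluator: Evaluator = {
  name: "regex",
  evaluate(ctx) {
    if (!ctx.expectations.regex) return null;
    return new RegExp(ctx.expectations.regex).test(ctx.output) ? 1 : 0;
  },
};

// Weighted mean over the evaluators that applied; missing weights default to 1.
function runEvaluators(
  evaluators: Evaluator[],
  ctx: EvalContext,
  weights: Record<string, number> = {},
): { scores: Record<string, number>; total: number } {
  const scores: Record<string, number> = {};
  let weightedSum = 0;
  let weightSum = 0;
  for (const evaluator of evaluators) {
    const score = evaluator.evaluate(ctx);
    if (score === null) continue;
    const weight = weights[evaluator.name] ?? 1;
    scores[evaluator.name] = score;
    weightedSum += score * weight;
    weightSum += weight;
  }
  return { scores, total: weightSum > 0 ? weightedSum / weightSum : 0 };
}
```

Skipping evaluators that return null means a case without, say, a regex expectation is not penalized for it; the total is normalized only over the evaluators that actually ran.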

Sequence flows

Create prompt version

sequenceDiagram
  participant U as User
  participant API as Fastify API
  participant DB as Postgres
  U->>API: POST /prompts/:id/versions
  API->>DB: insert PromptVersion + variables
  API->>DB: audit VERSION_CREATE
  API-->>U: 201 + version payload

Run evaluation job

sequenceDiagram
  participant U as User
  participant API as API
  participant Q as Redis/BullMQ
  participant W as Worker
  participant P as Provider
  U->>API: POST /evaluations
  API->>Q: enqueue evaluation-run
  API-->>U: 201 + run id
  W->>Q: fan out evaluation-case jobs
  W->>P: completions per case/version
  W->>W: run evaluators + persist EvaluationCaseResult
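
The fan-out step in this flow can be sketched as a pure function that expands one evaluation run into one job per (prompt version, case) pair. Only the queue name evaluation-case comes from this README; the payload fields, the deterministic job id scheme, and the `progress` helper are assumptions for illustration.

```typescript
// Hypothetical payload shapes; only the name "evaluation-case" comes from the README.
interface EvaluationRunJob {
  evaluationRunId: string;
  promptVersionIds: string[];
  caseIds: string[];
}

interface EvaluationCaseJob {
  name: "evaluation-case";
  data: { evaluationRunId: string; promptVersionId: string; caseId: string };
  opts: { jobId: string };
}

// Build one job per (version, case) pair. A deterministic jobId means a retried
// parent job produces the same ids instead of new ones.
function fanOut(run: EvaluationRunJob): EvaluationCaseJob[] {
  return run.promptVersionIds.flatMap((promptVersionId) =>
    run.caseIds.map((caseId) => ({
      name: "evaluation-case" as const,
      data: { evaluationRunId: run.evaluationRunId, promptVersionId, caseId },
      opts: { jobId: `${run.evaluationRunId}-${promptVersionId}-${caseId}` },
    })),
  );
}

// Progress as a whole-number percentage of completed case jobs.
function progress(completed: number, total: number): number {
  return total === 0 ? 100 : Math.round((completed / total) * 100);
}
```

The returned objects follow the `{ name, data, opts }` shape that BullMQ's `Queue.addBulk` takes, so a worker could plausibly enqueue them in one call; whether the repository does it this way is not stated here.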

Score outputs

sequenceDiagram
  participant W as Worker
  participant E as Evaluators
  participant DB as Postgres
  W->>E: runEvaluators(...)
  E-->>W: per-evaluator scores + weighted total
  W->>DB: upsert EvaluationCaseResult (autoScores, total)

Promote version

sequenceDiagram
  participant U as User
  participant API as API
  participant DB as Postgres
  U->>API: POST /prompt-versions/:id/promote
  API->>DB: transaction demote other ACTIVE, set target ACTIVE
  API->>DB: insert PromotionHistory + audit VERSION_PROMOTE
  API-->>U: updated version
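
The invariant behind this flow (at most one ACTIVE version per prompt) can be illustrated with an in-memory sketch. Only the ACTIVE state and the PromotionHistory concept appear in this README; the DRAFT and ARCHIVED status names, the record shapes, and the choice to archive the previously active version are assumptions. In the real API these updates run inside a single database transaction.

```typescript
// In-memory sketch of the promotion invariant. Only ACTIVE appears in the README;
// the other status names and the record shapes are assumptions.
type Status = "DRAFT" | "ACTIVE" | "ARCHIVED";

interface PromptVersion {
  id: string;
  promptId: string;
  status: Status;
}

interface PromotionRecord {
  promptId: string;
  fromVersionId: string | null;
  toVersionId: string;
}

function promote(
  versions: PromptVersion[],
  targetId: string,
): { versions: PromptVersion[]; history: PromotionRecord } {
  const target = versions.find((v) => v.id === targetId);
  if (!target) throw new Error(`Unknown version ${targetId}`);

  // The version being replaced, if the prompt already had an active one.
  const previous = versions.find(
    (v) => v.promptId === target.promptId && v.status === "ACTIVE" && v.id !== targetId,
  );

  // Demotion and promotion are computed together, mirroring the single transaction.
  const next = versions.map((v): PromptVersion => {
    if (v.id === targetId) return { ...v, status: "ACTIVE" };
    if (v.promptId === target.promptId && v.status === "ACTIVE") {
      return { ...v, status: "ARCHIVED" };
    }
    return v;
  });

  return {
    versions: next,
    history: {
      promptId: target.promptId,
      fromVersionId: previous?.id ?? null,
      toVersionId: targetId,
    },
  };
}
```

Versions belonging to other prompts are left untouched, and the history record keeps the previous version id so a rollback can promote it again.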

Docker

docker compose up --build starts Postgres, Redis, API, worker, and web. Run migrations inside the API container on first boot:

docker compose run --rm api sh -lc "pnpm db:migrate:deploy"

Security notes

  • Passwords hashed with bcrypt; refresh tokens stored hashed (SHA-256) server-side
  • JWT secrets must be strong and unique per environment
  • Rate limiting + request size limits on the API
  • Structured logging redacts common secret-bearing fields
  • Provider keys belong in environment or a future secrets vault—never in prompts or datasets
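
As an illustration of the hashed refresh token storage mentioned above, the sketch below uses Node's built-in crypto module. The function names, the 32-byte token length, and the base64url encoding are assumptions; only the use of SHA-256 for the stored value comes from this README.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

// Sketch only: function names and token length are assumptions.

// SHA-256 digest of the raw token; this is the only value persisted server-side.
function hashToken(token: string): string {
  return createHash("sha256").update(token).digest("hex");
}

// Generate a random token for the client and the hash to store for it.
function issueRefreshToken(): { token: string; tokenHash: string } {
  const token = randomBytes(32).toString("base64url");
  return { token, tokenHash: hashToken(token) };
}

// Compare digests in constant time to avoid leaking how many bytes matched.
function verifyRefreshToken(presented: string, storedHash: string): boolean {
  const a = Buffer.from(hashToken(presented), "hex");
  const b = Buffer.from(storedHash, "hex");
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Because refresh tokens are long random values rather than user-chosen secrets, a fast hash such as SHA-256 is a common choice here, while passwords still go through bcrypt.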

Scaling considerations

  • Horizontally scale apps/worker behind the same Redis URL; BullMQ coordinates concurrency
  • Read replicas can back analytics-style queries (dashboard aggregates) as traffic grows
  • Partition evaluation case jobs by tenant or prompt for noisy-neighbor isolation
  • Move provider calls to dedicated inference gateways for token accounting and policy enforcement

Tradeoffs

  • Postgres JSON fields for flexible evaluator payloads instead of over-normalized score tables—fast to ship, easy to query for v1, can be split later
  • OpenAI-compatible HTTP keeps the dependency surface small; native SDKs can wrap the same interface later
  • LocalStorage tokens in the web app simplify the portfolio demo; production would move access tokens to HTTP-only cookies

Future improvements

  • Multi-tenant workspaces and RBAC
  • Encrypted provider credential storage per user/team
  • Real-time evaluation progress via WebSocket or SSE
  • Golden-run baselines and automatic regression alerts
  • Durable idempotency keys for provider calls

Screenshots

Add screenshots under docs/screenshots/ (dashboard, evaluation detail, playground) when you capture them.

CI

GitHub Actions workflow (.github/workflows/ci.yml) installs dependencies, applies migrations to a service container database, runs lint, typecheck, tests, and production builds.
