Stop LLM regressions in CI in 2 minutes. 🚀
No infra. No lock-in. Remove anytime.
EvalGate = CI for AI behavior. Block regressions before they reach production.
LLMs don't fail like traditional software — they drift silently. A prompt tweak or model swap can degrade quality by 15% and you won't notice until users complain. EvalGate turns evaluations into CI gates so regressions never reach production.
Choose one of two paths based on your needs:
- Path A: Local Gate (no account, no API key)
- Path B: Platform Gate (dashboard, history, PR annotations)
```bash
npx @evalgate/sdk init
git push
```

That's it. `evalgate init` detects your Node project, runs your tests to create a baseline, installs a GitHub Actions workflow, and prints what to commit. Open a PR and CI blocks regressions automatically.
Prove it: `examples/init-demo` shows the exact files generated, a sample baseline artifact, and the step summary — verifiable in under 2 minutes.
```bash
pip install pauly4010-evalgate-sdk
```

```python
from evalgate_sdk import AIEvalClient, expect
from evalgate_sdk.types import CreateTraceParams

# Local assertions — no API key needed
result = expect("The capital of France is Paris.").to_contain("Paris")
print(result.passed)  # True

# Platform: trace and evaluate with API key
client = AIEvalClient(api_key="sk-...")
trace = await client.traces.create(CreateTraceParams(name="chat-quality"))
```

Same CI gate, same quality checks. The Python SDK has full parity with TypeScript: assertions, test suites, OpenAI/Anthropic tracing, LangChain/CrewAI/AutoGen integrations, and regression gates. Python CLI: `pip install "pauly4010-evalgate-sdk[cli]"` → `evalgate init`, `evalgate run`, `evalgate gate`, `evalgate ci` (docs).
- CI runs `npx evalgate gate`
- Gate runs your tests and compares against the baseline
- If tests regress → CI blocks the merge
- If tests pass → merge proceeds
- A regression report is uploaded as a CI artifact
```bash
npx @evalgate/sdk init          # scaffold everything
npx evalgate gate               # run gate locally
npx evalgate baseline update    # update baseline after intentional changes
```

Works for any Node.js project with a `test` script.
```bash
npx evalgate init    # creates evalgate.config.json
# paste evaluationId from dashboard
npx evalgate check --format github --onFail import
```

Adds quality score tracking, baseline comparisons, trace coverage, and PR annotations.
When CI fails, don't guess — follow the guided flow:
```bash
npx evalgate doctor    # preflight: is everything wired correctly?
npx evalgate check     # run the gate (writes .evalgate/last-report.json)
npx evalgate explain   # what failed, why, and how to fix it
```

`check` automatically saves a report artifact. `explain` reads it with zero flags and prints:

- Top failing test cases with input/expected/actual
- What changed from baseline (score, pass rate, safety)
- Root cause classification (prompt drift, retrieval drift, safety regression, …)
- Suggested fixes with exact commands

Works offline. No API calls needed for `explain`.
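The offline lookup `explain` performs can be sketched as follows. The file paths are the documented ones; the lookup order and any report fields beyond them are assumptions, not EvalGate's actual schema:

```python
import json
import pathlib

# Report locations explain reads, checked in order (paths per the docs;
# `evalgate check` writes the first one).
REPORT_PATHS = (".evalgate/last-report.json", "evals/regression-report.json")

def load_last_report(root: str = "."):
    """Return the first gate report found on disk — fully offline, no API calls."""
    for rel in REPORT_PATHS:
        path = pathlib.Path(root) / rel
        if path.exists():
            return json.loads(path.read_text())
    raise FileNotFoundError("no report found; run `evalgate check` first")
```

Because everything `explain` needs is already on disk, it works in air-gapped CI runners and on a laptop with no network.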
| Command | Network | Notes |
|---|---|---|
| `gate` | Offline | Runs tests locally, compares to `evals/baseline.json` |
| `check` | Online | Requires API key + evaluationId; fetches quality, posts annotations |
| `import` | Online | Sends run data to platform (e.g. `--onFail import`) |
| `traces` | Online | Sends spans to platform |
| `explain` | Offline | Reads `.evalgate/last-report.json` or `evals/regression-report.json` |
You can verify it yourself: `gate` and `explain` never phone home.
See it in action (click to expand)
GitHub Actions step summary — gate result at a glance:
evalgate explain terminal output — root causes + fix commands:
```bash
rm -r evalgate.config.json evals/ .github/workflows/evalgate-gate.yml
```

No account cancellation. No data export. Your tests keep working.
Live demo: https://evalgate.com
Open source. Production-ready. 1.4k+ npm downloads/month · Used by developers building AI systems that ship to production. Terms of Service · Privacy Policy
Python package note: the official Python SDK is currently published as `pauly4010-evalgate-sdk` under a personal publisher account while PyPI organization publishing is being configured. This does not affect package functionality, update delivery, or CI usage.
| Capability | Status |
|---|---|
| CI regression gate (`evalgate ci`, `evalgate gate`) | Production |
| TypeScript SDK (`@evalgate/sdk`) | Production (v3.0.2) |
| Python SDK (`pauly4010-evalgate-sdk`) | Production |
| Multi-tenant auth & RBAC | Production |
| Evaluation engine (template library across 17 categories, 4 types) | Production |
| Audit logging & governance presets | Production |
| Observability (traces, spans, cost tracking) | Production |
| Three-layer scoring (reasoning / action / outcome) | Beta |
| Multi-judge aggregation (6 strategies, transparency audit) | Beta |
| Behavioral drift detection (6 signal types) | Beta |
| Dataset coverage model (gap detection, configurable seed phrases) | Beta |
| EvalCase generation from traces (deduplication, quality scoring) | Beta |
| Failure detection + classification (8 categories, confidence) | Beta |
| Metric DAG safety validator | Beta |
| Regression attribution engine | Beta |
| Self-hosted Docker | Beta |
| Advanced product analytics | Planned |
| Additional SDKs (Go, Rust) | Roadmap |
Add to your .github/workflows/evalgate-gate.yml:
```yaml
name: EvalGate CI
on: [push, pull_request]
jobs:
  evalgate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: npx @evalgate/sdk ci --format github --write-results --base main
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: evalgate-results
          path: .evalgate/
```

That's it! Your CI now:
- ✅ Discovers evaluation specs automatically with `evalgate discover`
- ✅ Runs only impacted specs (smart caching)
- ✅ Compares results against the base branch with `evalgate impact-analysis`
- ✅ Posts a rich summary in the PR with regressions
- ✅ Exits with proper codes (0=clean, 1=regressions, 2=config)
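The exit-code contract is simple enough to encode directly. The 0/1/2 meanings come from the docs; the helper functions themselves are hypothetical, showing how a wrapper script might interpret the codes:

```python
def interpret_exit_code(code: int) -> str:
    """Translate the documented `evalgate ci` exit codes into outcomes."""
    outcomes = {0: "clean", 1: "regressions detected", 2: "configuration error"}
    return outcomes.get(code, f"unexpected exit code {code}")

def should_block_merge(code: int) -> bool:
    """Block on regressions AND on config errors — a broken gate should not
    silently pass (a design choice for the wrapper, not EvalGate-mandated)."""
    return code != 0
```

Treating a config error (2) as blocking is deliberate: a gate that fails to configure provides no signal, which is not the same as a clean run.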
Docs: Features · CI Quickstart · Quickstart · Architecture · Regression Gate · Baseline Contract · CI Artifacts · AI Assistant Integration · Contributor Map · Releasing · All Docs
EvalGate is CI for AI behavior. Same gates, same quality checks — whether you use Node, Python, or the REST API.
Advanced LLM judge reliability with statistical rigor:
- TPR/TNR Computation — Calculate true positive rate and true negative rate from labeled dataset vs judge verdicts
- Rogan-Gladen Correction — Apply statistical correction when discriminative power > 0.05 for more accurate judge assessments
- Bootstrap Confidence Intervals — Generate confidence intervals with n >= 30 samples using deterministic seeding for reproducible results
- Guardrails — Automatic skip of correction when near-random detection, skip CI when sample size insufficient
- Configurable Thresholds — Set `judge.tprMin`, `judge.tnrMin`, `judge.minLabeledSamples`, `judge.bootstrapSeed`
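The statistics named above are standard. The following is a standard-library sketch for reference — not EvalGate's implementation — with input shapes and seeding details assumed:

```python
import random

def tpr_tnr(labels, verdicts):
    """Sensitivity (TPR) and specificity (TNR) of judge verdicts
    against human golden labels; True means 'pass'."""
    tp = sum(1 for y, v in zip(labels, verdicts) if y and v)
    tn = sum(1 for y, v in zip(labels, verdicts) if not y and not v)
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    return tp / pos, tn / neg

def rogan_gladen(apparent_rate, tpr, tnr):
    """Correct the judge's apparent pass rate for imperfect TPR/TNR.
    Only meaningful when discriminative power (tpr + tnr - 1) is clearly
    above zero — matching the guardrail that skips correction near random."""
    corrected = (apparent_rate + tnr - 1) / (tpr + tnr - 1)
    return min(1.0, max(0.0, corrected))  # clamp to [0, 1]

def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for a mean, deterministically seeded
    so results are reproducible across runs."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

For example, a judge with TPR 2/3 and TNR 1/2 that reports a 60% apparent pass rate has discriminative power of only 1/6 — close enough to random that the guardrail described above would warn before trusting the corrected estimate.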
```bash
npx evalgate judge-credibility --labeled-dataset .evalgate/golden/labeled.jsonl
# Outputs: TPR, TNR, corrected estimates, confidence intervals, warnings
```

Systematic failure analysis and impact ranking:
- Failure-Modes Taxonomy — Define app-specific failure categories for consistent classification
- Interactive CLI Labeling — `evalgate label` command for pass/fail + failure mode classification
- Impact Ranking — Frequency × weight = impact prioritization for systematic triage
- Canonical Labeled Dataset — Standard `.evalgate/golden/labeled.jsonl` format for golden labeling
- Failure Mode Alerts — Configurable thresholds (maxPercent, maxCount, weight) for automated alerts
```bash
npx evalgate analyze --labeled-dataset .evalgate/golden/labeled.jsonl
# Outputs: Failure mode frequencies, impact rankings, alert recommendations
```

Budget control for evaluation pipelines:
- withCostTier() — Chain with any assertion method: `expect().withCostTier('code'|'medium'|'llm')`
- Budget Enforcement — Prevent cost overruns by setting maximum cost tiers per evaluation
- Works with .not — Full compatibility with negation: `expect().not.withCostTier('llm')`
```typescript
import { expect } from '@evalgate/sdk';

// Budget-controlled assertions
await expect(response).to.contain("Paris").withCostTier('medium');
await expect(code).to.beValid().withCostTier('code');
await expect(summary).to.beFactuallyConsistent().withCostTier('llm');
```

`evalgate discover` is the spec-compiler frontend: it finds evaluation specs, normalizes identities, and produces the metadata that powers manifest generation, impact analysis, and intelligent execution.
```bash
npx evalgate discover                     # Find all eval specs in project
npx evalgate discover --manifest          # Generate .evalgate/manifest.json
npx evalgate impact-analysis --base main  # Show what changed vs base branch
```

Key Features:
- File Discovery: Recursive search with cross-platform pattern matching
- Identity Normalization: Stable spec IDs and canonical file paths
- Metadata Generation: Project metadata, execution mode, categorization
- Incremental Caching: File hashes for smart re-execution
- Impact Analysis: near-instant perceived runs by re-executing only specs affected by a change
Architecture:
evalgate discover → manifest.json → impact-analysis → run → diff
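The incremental-caching idea — hash each spec file and re-run only what changed — can be sketched like this. The manifest shape is an assumption; EvalGate's actual `manifest.json` carries more metadata:

```python
import hashlib
import pathlib

def spec_hash(path):
    """Content hash of one spec file; the canonical file path serves as
    the stable spec identity."""
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

def build_manifest(spec_paths):
    """Map each spec path to its content hash (a simplified manifest)."""
    return {str(p): spec_hash(p) for p in spec_paths}

def impacted_specs(spec_paths, cached_manifest):
    """Specs whose hash differs from the cache — only these need re-running.
    A new spec (absent from the cache) is always impacted."""
    return [str(p) for p in spec_paths
            if cached_manifest.get(str(p)) != spec_hash(p)]
```

Comparing hashes rather than timestamps makes the cache robust to clean checkouts in CI, where every file's mtime is fresh but most content is unchanged.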
- Zero-config scaffolder — `npx evalgate init` detects the repo, creates a baseline, installs the CI workflow
- Built-in gate — works with any `npm test` / `pnpm test` / `yarn test`
- Advanced gate — golden eval scores, confidence tests, p95 latency, cost tracking
- GitHub Step Summary — delta tables, pass/fail icons, artifact upload
- Baseline governance — CODEOWNERS, label gates, anti-cheat guards (Baseline Contract)
- Four evaluation types: Unit Tests, Human Evaluation, LLM Judge, A/B Testing
- Template library across chatbots, RAG, code-gen, adversarial, multimodal, and industry domains
- Visual evaluation builder — compose evals with drag-and-drop, no code required
- Quality score dashboard — pass rates, trends, and drill-down into failures
- Full TypeScript SDK — `@evalgate/sdk` with CLI, regression gate, traces, evaluations, LLM judge
- Python SDK — `pauly4010-evalgate-sdk` with assertions, test workflows, OpenAI/Anthropic/LangChain/CrewAI/AutoGen integrations, and CLI (`evalgate run`, `evalgate gate`, `evalgate ci`)
- CLI commands — `evalgate init`, `evalgate gate`, `evalgate baseline`, `evalgate discover`, `evalgate impact-analysis`, `evalgate check`, `evalgate doctor`, `evalgate explain`, `evalgate print-config`, `evalgate share`
- Programmatic exports — gate exit codes, categories, report types via `@evalgate/sdk/regression`
- API keys — scoped keys for CI/CD and production
- Node.js >= 20
- pnpm >= 10 (`npm install -g pnpm`)
```bash
git clone https://github.com/evalgate/ai-evaluation-platform.git
cd ai-evaluation-platform
pnpm install
cp .env.example .env.local
# Edit .env.local with your PostgreSQL, OAuth, and auth secrets
pnpm db:migrate
pnpm dev
```

The app will be available at http://localhost:3000.
Note: The TypeScript SDK (`@evalgate/sdk`) is published to npm separately. For SDK consumers, `npm install @evalgate/sdk` is the correct install command. The Python SDK is available via `pip install pauly4010-evalgate-sdk`.
```
ai-evaluation-platform/
├── src/app/                  # Next.js App Router pages
│   ├── api/                  # REST API routes (55+ endpoints)
│   │   ├── evaluations/      # Eval CRUD, runs, test-cases, publish
│   │   ├── llm-judge/        # LLM Judge evaluate, configs, alignment
│   │   ├── traces/           # Distributed tracing + spans
│   │   └── ...
├── src/packages/sdk/         # TypeScript SDK (@evalgate/sdk)
├── src/packages/sdk-python/  # Python SDK (pauly4010-evalgate-sdk on PyPI)
├── src/lib/                  # Core services, utilities, templates
├── src/db/                   # Database layer (Drizzle ORM schema)
└── drizzle/                  # Database migrations
```
Contributions are welcome! Please use pnpm for all local development. Run tests with pnpm test before submitting.
```bash
pnpm install  # Install dependencies
pnpm dev      # Start dev server
pnpm test     # Run tests (temp DB per worker, migrations in setup)
pnpm build    # Production build
```

Open an issue or submit a pull request at https://github.com/evalgate/ai-evaluation-platform.
MIT