Evaluation framework for TypeScript AI agents — define suites, run batch evaluations, and report accuracy, latency, cost, and more.
agent-eval-ts helps you measure and compare AI agent behavior: exact output checks, semantic similarity (bag-of-words cosine by default, or your own embeddings), JSON Schema validation, tool-call sequences, latency, token usage, and cost logging. It runs locally, produces JSON / Markdown / HTML / JUnit reports, supports optional LLM-as-judge (OpenAI-compatible), caching, multi-model comparison, and regression detection against a saved baseline.
```bash
cd agent-eval-ts
npm install
```

Requires Node.js 22+. In CI, use `npm ci` with the committed `package-lock.json` for reproducible installs.
Optional API keys (see .env.example):
```bash
cp .env.example .env
# edit .env — used by CLI live mode, REST API, and LLM-as-judge
```

See tests/agent.suite.ts:
```ts
import { defineSuite } from "./src/index.js"; // NodeNext: `.js` resolves to `.ts`; or use "agent-eval-ts" when published

export const mathSuite = defineSuite({
  name: "Math Agent",
  description: "Tests basic arithmetic capabilities",
  testCases: [
    {
      id: "add",
      input: "What is 2 + 2?",
      expected: "4",
      metrics: ["exactMatch"],
    },
    {
      id: "multiply",
      input: "Calculate 15 * 3",
      expected: "45",
      metrics: ["exactMatch", "latency"],
    },
    {
      id: "wordProblem",
      input: "If a train travels 60 miles per hour for 2.5 hours, how far does it go?",
      expected: "150 miles",
      metrics: ["semanticSimilarity"],
      threshold: 0.85,
    },
  ],
});
```

```ts
import { evaluate } from "./src/index.js";
import { mathSuite } from "./tests/agent.suite.js";

const results = await evaluate({
  suite: mathSuite,
  agent: async (input) => {
    // plug in your agent; return string or { text, toolCalls?, usage?, costUsd? }
    return "4";
  },
  options: {
    concurrency: 5,
    cacheResults: true,
    cacheDir: "cache",
  },
});

console.log(`Pass rate: ${results.passRate}%`);
console.table(results.summary);
```

For assertions in your own unit tests, use Vitest's `expect` (`import { expect } from "vitest"`); it is not re-exported from this package, so CLI and production bundles do not load Vitest at runtime.
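An agent does not have to return a plain string — the comment above mentions a richer `{ text, toolCalls?, usage?, costUsd? }` shape. A sketch of such an agent follows; the nested field names (`name`, `arguments`, `inputTokens`, `outputTokens`) are assumptions for illustration, so check the package's exported types for the real shape:

```typescript
// Sketch only: a typed agent returning the structured output shape.
// The inner field names of toolCalls and usage are assumptions, not
// the package's actual type definitions.
type AgentOutput = {
  text: string; // what the metrics compare against `expected`
  toolCalls?: { name: string; arguments?: unknown }[];
  usage?: { inputTokens: number; outputTokens: number };
  costUsd?: number;
};

const richAgent = async (input: string): Promise<AgentOutput> => ({
  text: "4",
  toolCalls: [{ name: "calculator", arguments: { expression: "2 + 2" } }],
  usage: { inputTokens: 12, outputTokens: 1 },
  costUsd: 0.00002,
});
```

Returning `toolCalls`, `usage`, and `costUsd` is what makes the `toolCalls`, `tokenUsage`, and `cost` metrics meaningful for a case.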
```bash
npm run dev
# or one-shot (no watch): npm start
```

Use `npm start` or `npm run dev` so the server runs with tsx and can load TypeScript suite files. Running `node dist/api/server.js` directly is only suitable if your suite file imports compiled output (for example from `dist/index.js`).
- `GET /health` — liveness check.
- `POST /evaluate` — JSON body: `{ "suitePath": "./tests/agent.suite.ts", "model": "openai/gpt-4o", "mock": true }`
  - `mock: true` — always uses the built-in mock agent.
  - `mock: false` — requires `OPENAI_API_KEY`; otherwise returns 500.
  - `mock` omitted — uses OpenAI when the key is set, otherwise the mock agent.
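Calling the endpoint from TypeScript can be sketched with the built-in `fetch` (Node 18+). The helper name `runRemoteEval` is hypothetical, and the sketch assumes the server is listening on port 3000 (the port the Docker setup exposes):

```typescript
// Sketch: POST /evaluate from TypeScript. runRemoteEval is a
// hypothetical helper name; the request body mirrors the JSON
// documented above.
type EvaluateRequest = { suitePath: string; model?: string; mock?: boolean };

const body: EvaluateRequest = {
  suitePath: "./tests/agent.suite.ts",
  model: "openai/gpt-4o",
  mock: true, // force the built-in mock agent, so no API key is needed
};

async function runRemoteEval(baseUrl: string, req: EvaluateRequest) {
  const res = await fetch(`${baseUrl}/evaluate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`evaluate failed: ${res.status}`);
  return res.json();
}

// With the server running:
// const report = await runRemoteEval("http://localhost:3000", body);
```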
```bash
# Mock agent (no API keys)
npm run cli -- eval --suite tests/agent.suite.ts --mock

# Live OpenAI (requires OPENAI_API_KEY)
npm run cli -- eval --suite tests/agent.suite.ts --model gpt-4o-mini

# Write JUnit for CI
npm run cli -- eval --suite tests/agent.suite.ts --mock --format junit --out reports/junit.xml

# Write a report with default text format (Markdown file)
npm run cli -- eval --suite tests/agent.suite.ts --mock --out reports/summary.md
```

The CLI is implemented with tsx so TypeScript suite files resolve cleanly. The `--out` path is resolved from the current working directory; parent directories are created as needed. If you pass `--out` without `--format`, the file is Markdown. After `npm run build`, you can run `node dist/cli/index.js` with suites that import from `../dist/index.js` (or your package name).
| Feature | Description |
|---|---|
| Test suite runner | defineSuite with cases, expected values, and per-case metrics |
| Metrics | exactMatch, includes, semanticSimilarity, jsonSchema, toolCalls, latency, tokenUsage, cost |
| Model comparison | compareModels across multiple model IDs with the same suite |
| Reports | reportJson, reportMarkdown, reportHtml, reportJUnit |
| Caching | Optional disk cache keyed by suite name, test id, and input |
| LLM judge | Optional OpenAI-compatible judge (judge + OPENAI_API_KEY) |
| Regression | detectRegression(baseline, current) for pass→fail changes |
| Exporters | LangfuseExporter (optional debug hook when keys set), WandbExporter (no-op stub; extend with W&B SDK if needed) |
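The jsonSchema and toolCalls metrics in the table above need extra fields on the test case. A sketch of plausible case shapes follows — the JSON Schema layout and the `{ name, arguments }` tool-call shape are assumptions, so check the package's TestCase type:

```typescript
// Sketch only: cases for the jsonSchema and toolCalls metrics.
// The schema format and tool-call shape below are assumptions.
const structuredCase = {
  id: "userJson",
  input: "Return the user as JSON with name and age.",
  metrics: ["jsonSchema"],
  jsonSchema: {
    type: "object",
    properties: { name: { type: "string" }, age: { type: "number" } },
    required: ["name", "age"],
  },
};

const toolCase = {
  id: "weatherTool",
  input: "What is the weather in Paris?",
  metrics: ["toolCalls"],
  expectedToolCalls: [{ name: "getWeather", arguments: { city: "Paris" } }],
};
```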
- `TestCase.metrics` — defaults to `["exactMatch"]` if omitted.
- `threshold` — used by `semanticSimilarity` (default 0.85 if omitted).
- `evaluate({ options: { embed } })` — optional `embed(text) => Promise<number[]>`. When set, `semanticSimilarity` uses cosine similarity on embeddings; shorter vectors are zero-padded to the longer length. When omitted, a bag-of-words cosine is used instead.
- `jsonSchema` — required on the case when using the `jsonSchema` metric.
- `expectedToolCalls` — required for `toolCalls`; the runner compares them to `toolCalls` returned from the agent (`AgentOutput.toolCalls`).
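The zero-padding behavior described above can be illustrated with a self-contained helper — a sketch of the semantics, not the library's internals:

```typescript
// Cosine similarity with zero-padding to the longer vector, mirroring
// the embed-based semanticSimilarity semantics described above.
function cosineSimilarity(a: number[], b: number[]): number {
  const len = Math.max(a.length, b.length);
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < len; i++) {
    const x = a[i] ?? 0; // missing entries act as zero-padding
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA && normB ? dot / (Math.sqrt(normA) * Math.sqrt(normB)) : 0;
}

console.log(cosineSimilarity([1, 0, 0], [1, 0])); // identical directions stay at similarity 1 after padding
```

A custom `embed` function plugged into `evaluate({ options: { embed } })` only needs to return `number[]` vectors; mismatched lengths are reconciled as above.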
```ts
import { registerMetric } from "./src/index.js";

registerMetric("startsWithCapital", {
  compute: (output) => {
    const passed = /^[A-Z]/.test(output);
    return { passed, score: passed ? 1 : 0 };
  },
});
```

```ts
await evaluate({
  suite,
  agent,
  judge: {
    judgeModel: "gpt-4o",
    rubric: "Rate helpfulness 0–1. Pass if score >= 0.7.",
  },
});
```

If `OPENAI_API_KEY` is unset, the judge does not call the API and the run is treated as passing the judge step (so local runs without keys still complete).
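An OpenAI-compatible judge typically works by sending a chat-completions payload pairing the rubric with the agent's output. The sketch below is hypothetical — the prompt wording and helper name are assumptions, not this package's internals:

```typescript
// Hypothetical sketch of a judge request payload; the exact prompt
// this package sends is not shown here.
function buildJudgeRequest(
  judgeModel: string,
  rubric: string,
  input: string,
  output: string,
) {
  return {
    model: judgeModel,
    messages: [
      { role: "system", content: rubric },
      {
        role: "user",
        content: `Input:\n${input}\n\nAgent output:\n${output}\n\nReply with a score between 0 and 1.`,
      },
    ],
  };
}
```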
```ts
import { compareModels } from "./src/index.js";

const { summary } = await compareModels({
  suite: mathSuite,
  models: ["openai/gpt-4o", "openai/gpt-4o-mini"],
  agentFactory: (model) => async (input) => {
    /* build agent for `model` */
    return "…";
  },
});

console.table(summary);
```

```ts
import { detectRegression } from "./src/index.js";
import { readFileSync } from "node:fs";

const baseline = JSON.parse(readFileSync("baseline.json", "utf8"));
const current = await evaluate({ suite, agent });
const regressions = detectRegression(baseline, current);
```

```bash
docker compose up --build
```
```bash
curl -s -X POST http://localhost:3000/evaluate \
  -H "Content-Type: application/json" \
  -d '{"suitePath": "./tests/agent.suite.ts", "model": "openai/gpt-4o", "mock": true}'
```

| Script | Purpose |
|---|---|
| npm run build | Compile src/ → dist/ |
| npm run dev | tsx watch REST API |
| npm start | REST API (tsx, no watch) |
| npm run cli | CLI entry (tsx src/cli/index.ts) |
| npm test | Unit tests (suite loader, metrics, runner, reporters, compare, regression) |
| npm run test:watch | Vitest in watch mode |
| npm run test:integration | Optional live OpenAI test (skipped without OPENAI_API_KEY) |
| npm run test:regression | Regression tests only |
| npm run eval:ci | Tests + build + mock CLI eval (CI-friendly) |
| npm run clean | Remove dist/ (run npm run build to restore) |
| npm publish | Runs prepublishOnly → npm run build, then packs files from package.json |
```
agent-eval-ts/
├── src/
│   ├── evaluator/
│   │   ├── Runner.ts
│   │   ├── metrics/
│   │   ├── judge/
│   │   ├── reporters/
│   │   ├── regression/
│   │   └── exporters/
│   ├── cli/
│   ├── api/
│   ├── runtime/          # shared suite loading + mock / OpenAI agents
│   └── index.ts
├── tests/
├── reports/              # generated reports (gitignored except .gitkeep)
├── .env.example
├── .github/workflows/eval.yml
├── docker-compose.yml
├── Dockerfile
├── LICENSE
├── package.json
├── package-lock.json
├── tsconfig.json
├── vitest.config.ts
└── vitest.integration.config.ts
```
See LICENSE (MIT).
GitHub Actions (.github/workflows/eval.yml) runs `npm run eval:ci` on pushes and pull requests, and uploads the reports/ folder as an artifact when present.