Evaluation framework for TypeScript AI agents — define suites, run batch evaluations, and report accuracy, latency, cost, and more.
agent-eval-ts helps you measure and compare AI agent behavior: exact output checks, semantic similarity (bag-of-words cosine by default, or your own embeddings), JSON Schema validation, tool-call sequences, latency, token usage, and cost logging. It runs locally, produces JSON / Markdown / HTML / JUnit reports, supports optional LLM-as-judge (OpenAI-compatible), caching, multi-model comparison, and regression detection against a saved baseline.
```bash
cd agent-eval-ts
npm install
```

Requires Node.js 22+. In CI, use `npm ci` with the committed `package-lock.json` for reproducible installs.
Optional API keys (see .env.example):
```bash
cp .env.example .env
# edit .env — used by CLI live mode, REST API, and LLM-as-judge
```

See tests/agent.suite.ts:
```ts
import { defineSuite } from "./src/index.js"; // NodeNext: `.js` resolves to `.ts`; or use "agent-eval-ts" when published

export const mathSuite = defineSuite({
  name: "Math Agent",
  description: "Tests basic arithmetic capabilities",
  testCases: [
    {
      id: "add",
      input: "What is 2 + 2?",
      expected: "4",
      metrics: ["exactMatch"],
    },
    {
      id: "multiply",
      input: "Calculate 15 * 3",
      expected: "45",
      metrics: ["exactMatch", "latency"],
    },
    {
      id: "wordProblem",
      input: "If a train travels 60 miles per hour for 2.5 hours, how far does it go?",
      expected: "150 miles",
      metrics: ["semanticSimilarity"],
      threshold: 0.85,
    },
  ],
});
```

```ts
import { evaluate } from "./src/index.js";
import { mathSuite } from "./tests/agent.suite.js";

const results = await evaluate({
  suite: mathSuite,
  agent: async (input) => {
    // plug in your agent; return string or { text, toolCalls?, usage?, costUsd? }
    return "4";
  },
  options: {
    concurrency: 5,
    cacheResults: true,
    cacheDir: "cache",
  },
});

console.log(`Pass rate: ${results.passRate}%`);
console.table(results.summary);
```

For assertions in your own unit tests, use Vitest's `expect` (`import { expect } from "vitest"`); it is not re-exported from this package, so CLI and production bundles do not load Vitest at runtime.
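An agent does not have to return a plain string — the comment above mentions a richer `{ text, toolCalls?, usage?, costUsd? }` shape. A sketch of such an agent follows; the nested field names (`name`, `arguments`, `inputTokens`, `outputTokens`) are assumptions for illustration, so check the package's exported types for the real shape:

```typescript
// Sketch only: a typed agent returning the structured output shape.
// The inner field names of toolCalls and usage are assumptions, not
// the package's actual type definitions.
type AgentOutput = {
  text: string; // what the metrics compare against `expected`
  toolCalls?: { name: string; arguments?: unknown }[];
  usage?: { inputTokens: number; outputTokens: number };
  costUsd?: number;
};

const richAgent = async (input: string): Promise<AgentOutput> => ({
  text: "4",
  toolCalls: [{ name: "calculator", arguments: { expression: "2 + 2" } }],
  usage: { inputTokens: 12, outputTokens: 1 },
  costUsd: 0.00002,
});
```

Returning `toolCalls`, `usage`, and `costUsd` is what makes the `toolCalls`, `tokenUsage`, and `cost` metrics meaningful for a case.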
```bash
npm run dev
# or one-shot (no watch): npm start
```

Use `npm start` or `npm run dev` so the server runs with tsx and can load TypeScript suite files. Running `node dist/api/server.js` directly is only suitable if your suite file imports compiled output (for example from `dist/index.js`).
- `GET /health` — liveness check.
- `POST /evaluate` — JSON body: `{ "suitePath": "./tests/agent.suite.ts", "model": "openai/gpt-4o", "mock": true }`
  - `mock: true` — always uses the built-in mock agent.
  - `mock: false` — requires `OPENAI_API_KEY`; otherwise returns 500.
  - `mock` omitted — uses OpenAI when the key is set, otherwise the mock agent.
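Calling the endpoint from TypeScript can be sketched with the built-in `fetch` (Node 18+). The helper name `runRemoteEval` is hypothetical, and the sketch assumes the server is listening on port 3000 (the port the Docker setup exposes):

```typescript
// Sketch: POST /evaluate from TypeScript. runRemoteEval is a
// hypothetical helper name; the request body mirrors the JSON
// documented above.
type EvaluateRequest = { suitePath: string; model?: string; mock?: boolean };

const body: EvaluateRequest = {
  suitePath: "./tests/agent.suite.ts",
  model: "openai/gpt-4o",
  mock: true, // force the built-in mock agent, so no API key is needed
};

async function runRemoteEval(baseUrl: string, req: EvaluateRequest) {
  const res = await fetch(`${baseUrl}/evaluate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`evaluate failed: ${res.status}`);
  return res.json();
}

// With the server running:
// const report = await runRemoteEval("http://localhost:3000", body);
```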
```bash
# Mock agent (no API keys)
npm run cli -- eval --suite tests/agent.suite.ts --mock

# Live OpenAI (requires OPENAI_API_KEY)
npm run cli -- eval --suite tests/agent.suite.ts --model gpt-4o-mini

# Write JUnit for CI
npm run cli -- eval --suite tests/agent.suite.ts --mock --format junit --out reports/junit.xml

# Write a report with default text format (Markdown file)
npm run cli -- eval --suite tests/agent.suite.ts --mock --out reports/summary.md
```

The CLI is implemented with tsx so TypeScript suite files resolve cleanly. The `--out` path is resolved from the current working directory; parent directories are created as needed. If you pass `--out` without `--format`, the file is Markdown. After `npm run build`, you can run `node dist/cli/index.js` with suites that import from `../dist/index.js` (or your package name).
| Feature | Description |
|---|---|
| Test suite runner | defineSuite with cases, expected values, and per-case metrics |
| Metrics | exactMatch, includes, semanticSimilarity, jsonSchema, toolCalls, latency, tokenUsage, cost |
| Model comparison | compareModels across multiple model IDs with the same suite |
| Reports | reportJson, reportMarkdown, reportHtml, reportJUnit |
| Caching | Optional disk cache keyed by suite name, test id, and input |
| LLM judge | Optional OpenAI-compatible judge (judge + OPENAI_API_KEY) |
| Regression | detectRegression(baseline, current) for pass→fail changes |
| Exporters | LangfuseExporter (optional debug hook when keys set), WandbExporter (no-op stub; extend with W&B SDK if needed) |
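The jsonSchema and toolCalls metrics in the table above need extra fields on the test case. A sketch of plausible case shapes follows — the JSON Schema layout and the `{ name, arguments }` tool-call shape are assumptions, so check the package's TestCase type:

```typescript
// Sketch only: cases for the jsonSchema and toolCalls metrics.
// The schema format and tool-call shape below are assumptions.
const structuredCase = {
  id: "userJson",
  input: "Return the user as JSON with name and age.",
  metrics: ["jsonSchema"],
  jsonSchema: {
    type: "object",
    properties: { name: { type: "string" }, age: { type: "number" } },
    required: ["name", "age"],
  },
};

const toolCase = {
  id: "weatherTool",
  input: "What is the weather in Paris?",
  metrics: ["toolCalls"],
  expectedToolCalls: [{ name: "getWeather", arguments: { city: "Paris" } }],
};
```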
- `TestCase.metrics` — defaults to `["exactMatch"]` if omitted.
- `threshold` — used by `semanticSimilarity` (default 0.85 if omitted).
- `evaluate({ options: { embed } })` — optional `embed(text) => Promise<number[]>`. When set, `semanticSimilarity` uses cosine similarity on embeddings; shorter vectors are zero-padded to the longer length. When omitted, a bag-of-words cosine is used instead.
- `jsonSchema` — required on the case when using the `jsonSchema` metric.
- `expectedToolCalls` — required for `toolCalls`; the runner compares them to `toolCalls` returned from the agent (`AgentOutput.toolCalls`).
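The zero-padding behavior described above can be illustrated with a self-contained helper — a sketch of the semantics, not the library's internals:

```typescript
// Cosine similarity with zero-padding to the longer vector, mirroring
// the embed-based semanticSimilarity semantics described above.
function cosineSimilarity(a: number[], b: number[]): number {
  const len = Math.max(a.length, b.length);
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < len; i++) {
    const x = a[i] ?? 0; // missing entries act as zero-padding
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA && normB ? dot / (Math.sqrt(normA) * Math.sqrt(normB)) : 0;
}

console.log(cosineSimilarity([1, 0, 0], [1, 0])); // identical directions stay at similarity 1 after padding
```

A custom `embed` function plugged into `evaluate({ options: { embed } })` only needs to return `number[]` vectors; mismatched lengths are reconciled as above.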
```ts
import { registerMetric } from "./src/index.js";

registerMetric("startsWithCapital", {
  compute: (output) => {
    const passed = /^[A-Z]/.test(output);
    return { passed, score: passed ? 1 : 0 };
  },
});
```

```ts
await evaluate({
  suite,
  agent,
  judge: {
    judgeModel: "gpt-4o",
    rubric: "Rate helpfulness 0–1. Pass if score >= 0.7.",
  },
});
```

If `OPENAI_API_KEY` is unset, the judge does not call the API and the run is treated as passing the judge step (so local runs without keys still complete).
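An OpenAI-compatible judge typically works by sending a chat-completions payload pairing the rubric with the agent's output. The sketch below is hypothetical — the prompt wording and helper name are assumptions, not this package's internals:

```typescript
// Hypothetical sketch of a judge request payload; the exact prompt
// this package sends is not shown here.
function buildJudgeRequest(
  judgeModel: string,
  rubric: string,
  input: string,
  output: string,
) {
  return {
    model: judgeModel,
    messages: [
      { role: "system", content: rubric },
      {
        role: "user",
        content: `Input:\n${input}\n\nAgent output:\n${output}\n\nReply with a score between 0 and 1.`,
      },
    ],
  };
}
```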
```ts
import { compareModels } from "./src/index.js";

const { summary } = await compareModels({
  suite: mathSuite,
  models: ["openai/gpt-4o", "openai/gpt-4o-mini"],
  agentFactory: (model) => async (input) => {
    /* build agent for `model` */
    return "…";
  },
});

console.table(summary);
```

```ts
import { detectRegression } from "./src/index.js";
import { readFileSync } from "node:fs";

const baseline = JSON.parse(readFileSync("baseline.json", "utf8"));
const current = await evaluate({ suite, agent });
const regressions = detectRegression(baseline, current);
```

```bash
docker compose up --build
```
```bash
curl -s -X POST http://localhost:3000/evaluate \
  -H "Content-Type: application/json" \
  -d '{"suitePath": "./tests/agent.suite.ts", "model": "openai/gpt-4o", "mock": true}'
```

| Script | Purpose |
|---|---|
| npm run build | Compile src/ → dist/ |
| npm run dev | tsx watch REST API |
| npm start | REST API (tsx, no watch) |
| npm run cli | CLI entry (tsx src/cli/index.ts) |
| npm test | Unit tests (suite loader, metrics, runner, reporters, compare, regression) |
| npm run test:watch | Vitest in watch mode |
| npm run test:integration | Optional live OpenAI test (skipped without OPENAI_API_KEY) |
| npm run test:regression | Regression tests only |
| npm run eval:ci | Tests + build + mock CLI eval (CI-friendly) |
| npm run clean | Remove dist/ (run npm run build to restore) |
| npm publish | Runs prepublishOnly → npm run build, then packs files from package.json |
```
agent-eval-ts/
├── src/
│   ├── evaluator/
│   │   ├── Runner.ts
│   │   ├── metrics/
│   │   ├── judge/
│   │   ├── reporters/
│   │   ├── regression/
│   │   └── exporters/
│   ├── cli/
│   ├── api/
│   ├── runtime/          # shared suite loading + mock / OpenAI agents
│   └── index.ts
├── tests/
├── reports/              # generated reports (gitignored except .gitkeep)
├── .env.example
├── .github/workflows/eval.yml
├── docker-compose.yml
├── Dockerfile
├── LICENSE
├── package.json
├── package-lock.json
├── tsconfig.json
├── vitest.config.ts
└── vitest.integration.config.ts
```
See LICENSE (MIT).
GitHub Actions (.github/workflows/eval.yml) runs `npm run eval:ci` on pushes and pull requests, and uploads the reports/ folder as an artifact when present.