catello09/agent-eval-ts

Repository files navigation

agent-eval-ts

Evaluation framework for TypeScript AI agents — define suites, run batch evaluations, and report accuracy, latency, cost, and more.


What is this?

agent-eval-ts helps you measure and compare AI agent behavior: exact output checks, semantic similarity (bag-of-words cosine by default, or your own embeddings), JSON Schema validation, tool-call sequence checks, latency, token usage, and cost logging. It runs locally, produces JSON, Markdown, HTML, and JUnit reports, and supports optional LLM-as-judge scoring (OpenAI-compatible), result caching, multi-model comparison, and regression detection against a saved baseline.

Installation

cd agent-eval-ts
npm install

Requires Node.js 22+. In CI, use npm ci with the committed package-lock.json for reproducible installs.

Optional API keys (see .env.example):

cp .env.example .env
# edit .env — used by CLI live mode, REST API, and LLM-as-judge

Quick start

1. Define a test suite

See tests/agent.suite.ts:

import { defineSuite } from "./src/index.js"; // NodeNext: `.js` resolves to `.ts`; or use "agent-eval-ts" when published

export const mathSuite = defineSuite({
  name: "Math Agent",
  description: "Tests basic arithmetic capabilities",
  testCases: [
    {
      id: "add",
      input: "What is 2 + 2?",
      expected: "4",
      metrics: ["exactMatch"],
    },
    {
      id: "multiply",
      input: "Calculate 15 * 3",
      expected: "45",
      metrics: ["exactMatch", "latency"],
    },
    {
      id: "wordProblem",
      input:
        "If a train travels 60 miles per hour for 2.5 hours, how far does it go?",
      expected: "150 miles",
      metrics: ["semanticSimilarity"],
      threshold: 0.85,
    },
  ],
});

2. Run an evaluation in code

import { evaluate } from "./src/index.js";
import { mathSuite } from "./tests/agent.suite.js";

const results = await evaluate({
  suite: mathSuite,
  agent: async (input) => {
    // plug in your agent; return string or { text, toolCalls?, usage?, costUsd? }
    return "4";
  },
  options: {
    concurrency: 5,
    cacheResults: true,
    cacheDir: "cache",
  },
});

console.log(`Pass rate: ${results.passRate}%`);
console.table(results.summary);

For assertions in your own unit tests, use Vitest’s expect (import { expect } from "vitest") — it is not re-exported from this package so CLI and production bundles do not load Vitest at runtime.

3. Development server (REST API)

npm run dev
# or one-shot (no watch): npm start

Use npm start or npm run dev so the server runs with tsx and can load TypeScript suite files. Running node dist/api/server.js directly is only suitable if your suite file imports compiled output (for example from dist/index.js).

  • GET /health — liveness check
  • POST /evaluate — JSON body: { "suitePath": "./tests/agent.suite.ts", "model": "openai/gpt-4o", "mock": true }
    • mock: true — always uses the built-in mock agent.
    • mock: false — requires OPENAI_API_KEY; otherwise returns 500.
    • mock omitted — uses OpenAI when the key is set, otherwise the mock agent.
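The three `mock` cases above can be sketched as a small decision function. This is an illustration of the documented behavior only; `selectAgent` is not part of the package API.

```typescript
// Sketch of the mock/live selection rules for POST /evaluate (illustrative).
type AgentChoice = "mock" | "openai" | "error";

function selectAgent(mock: boolean | undefined, hasOpenAiKey: boolean): AgentChoice {
  if (mock === true) return "mock";                             // always the mock agent
  if (mock === false) return hasOpenAiKey ? "openai" : "error"; // 500 without a key
  return hasOpenAiKey ? "openai" : "mock";                      // omitted: the key decides
}

console.log(selectAgent(true, false));     // "mock"
console.log(selectAgent(false, false));    // "error"
console.log(selectAgent(undefined, true)); // "openai"
```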

4. CLI

# Mock agent (no API keys)
npm run cli -- eval --suite tests/agent.suite.ts --mock

# Live OpenAI (requires OPENAI_API_KEY)
npm run cli -- eval --suite tests/agent.suite.ts --model gpt-4o-mini

# Write JUnit for CI
npm run cli -- eval --suite tests/agent.suite.ts --mock --format junit --out reports/junit.xml

# Write a report with default text format (Markdown file)
npm run cli -- eval --suite tests/agent.suite.ts --mock --out reports/summary.md

The CLI is implemented with tsx so TypeScript suite files resolve cleanly. The --out path is resolved from the current working directory; parent directories are created as needed. If you pass --out without --format, the file is Markdown. After npm run build, you can run node dist/cli/index.js with suites that import from ../dist/index.js (or your package name).
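The `--out`/`--format` interaction described above can be summarized as a tiny resolver. This is a sketch of the documented defaults, not the CLI's actual internals.

```typescript
// Illustrative: how --format and --out combine per the CLI notes above.
type Format = "json" | "markdown" | "html" | "junit";

function resolveFormat(format: Format | undefined, out: string | undefined): Format | undefined {
  if (format) return format;  // explicit --format wins
  if (out) return "markdown"; // --out without --format writes a Markdown file
  return undefined;           // neither flag: no report file
}

console.log(resolveFormat(undefined, "reports/summary.md")); // "markdown"
console.log(resolveFormat("junit", "reports/junit.xml"));    // "junit"
```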

Features

| Feature | Description |
| --- | --- |
| Test suite runner | defineSuite with cases, expected values, and per-case metrics |
| Metrics | exactMatch, includes, semanticSimilarity, jsonSchema, toolCalls, latency, tokenUsage, cost |
| Model comparison | compareModels across multiple model IDs with the same suite |
| Reports | reportJson, reportMarkdown, reportHtml, reportJUnit |
| Caching | Optional disk cache keyed by suite name, test id, and input |
| LLM judge | Optional OpenAI-compatible judge (judge + OPENAI_API_KEY) |
| Regression | detectRegression(baseline, current) for pass→fail changes |
| Exporters | LangfuseExporter (optional debug hook when keys set), WandbExporter (no-op stub; extend with the W&B SDK if needed) |
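A cache keyed by suite name, test id, and input can be sketched as a hash over those three fields. This is an illustration only; the package's actual key derivation and cache layout may differ.

```typescript
import { createHash } from "node:crypto";

// Illustrative cache key: same (suite, test, input) triple always maps to
// the same cache entry, so repeated runs can skip the agent call.
function cacheKey(suiteName: string, testId: string, input: string): string {
  return createHash("sha256")
    .update(JSON.stringify([suiteName, testId, input]))
    .digest("hex");
}

console.log(cacheKey("Math Agent", "add", "What is 2 + 2?"));
```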

API reference

defineSuite

  • TestCase.metrics — defaults to ["exactMatch"] if omitted.
  • threshold — used by semanticSimilarity (default 0.85 if omitted).
  • evaluate({ options: { embed } }) — optional embed(text) => Promise<number[]>. When set, semanticSimilarity uses cosine similarity on embeddings; shorter vectors are zero-padded to the longer length. When omitted, a bag-of-words cosine is used instead.
  • jsonSchema — required on the case when using the jsonSchema metric.
  • expectedToolCalls — required for toolCalls; the runner compares them to toolCalls returned from the agent (AgentOutput.toolCalls).
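The default bag-of-words cosine mentioned above can be sketched in a few lines. This is a minimal illustration in the spirit of the default semanticSimilarity metric; the package's actual tokenization and scoring may differ.

```typescript
// Illustrative bag-of-words cosine similarity (not the package's exact code).
function bagOfWords(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const tok of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    counts.set(tok, (counts.get(tok) ?? 0) + 1);
  }
  return counts;
}

function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [tok, va] of a) {
    dot += va * (b.get(tok) ?? 0);
    na += va * va;
  }
  for (const vb of b.values()) nb += vb * vb;
  return na && nb ? dot / (Math.sqrt(na) * Math.sqrt(nb)) : 0;
}

// Shared tokens ("150", "miles") give a partial-overlap score below 1.
const score = cosine(bagOfWords("150 miles"), bagOfWords("It travels 150 miles."));
console.log(score);
```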

Custom metrics

import { registerMetric } from "./src/index.js";

registerMetric("startsWithCapital", {
  compute: (output) => {
    const passed = /^[A-Z]/.test(output);
    return { passed, score: passed ? 1 : 0 };
  },
});

LLM-as-judge

await evaluate({
  suite,
  agent,
  judge: {
    judgeModel: "gpt-4o",
    rubric: "Rate helpfulness 0–1. Pass if score >= 0.7.",
  },
});

If OPENAI_API_KEY is unset, the judge does not call the API and the run is treated as passing the judge step (so local runs without keys still complete).
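That no-key fallback can be expressed as a small guard. Names here are illustrative, not the package API; the point is only that the API is never called without a key and the step defaults to passing.

```typescript
// Sketch of the judge fallback described above (illustrative names).
async function runJudgeStep(
  apiKey: string | undefined,
  callJudge: () => Promise<boolean>,
): Promise<boolean> {
  if (!apiKey) return true; // no key: skip the API call, treat the step as passing
  return callJudge();
}

const passed = await runJudgeStep(undefined, async () => {
  throw new Error("should not be called without a key");
});
console.log(passed); // true
```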

Compare models

import { compareModels } from "./src/index.js";

const { summary } = await compareModels({
  suite: mathSuite,
  models: ["openai/gpt-4o", "openai/gpt-4o-mini"],
  agentFactory: (model) => async (input) => {
    /* build agent for `model` */
    return "…";
  },
});
console.table(summary);

Regression detection

import { detectRegression } from "./src/index.js";
import { readFileSync } from "node:fs";

const baseline = JSON.parse(readFileSync("baseline.json", "utf8"));
const current = await evaluate({ suite, agent });
const regressions = detectRegression(baseline, current);
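The pass→fail comparison that detectRegression performs can be sketched as a per-test diff. The real return shape may differ; this only illustrates the idea of flagging tests that passed in the baseline but fail now.

```typescript
// Illustrative pass→fail diff in the spirit of detectRegression.
type RunResults = { results: Array<{ id: string; passed: boolean }> };

function passToFail(baseline: RunResults, current: RunResults): string[] {
  const was = new Map(baseline.results.map((r) => [r.id, r.passed] as [string, boolean]));
  return current.results
    .filter((r) => was.get(r.id) === true && !r.passed)
    .map((r) => r.id);
}

const baseRun = { results: [{ id: "add", passed: true }, { id: "multiply", passed: true }] };
const currentRun = { results: [{ id: "add", passed: true }, { id: "multiply", passed: false }] };
console.log(passToFail(baseRun, currentRun)); // ["multiply"]
```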

Docker

docker compose up --build
curl -s -X POST http://localhost:3000/evaluate \
  -H "Content-Type: application/json" \
  -d '{"suitePath": "./tests/agent.suite.ts", "model": "openai/gpt-4o", "mock": true}'

Scripts

| Script | Purpose |
| --- | --- |
| npm run build | Compile src/ to dist/ |
| npm run dev | REST API in tsx watch mode |
| npm start | REST API (tsx, no watch) |
| npm run cli | CLI entry (tsx src/cli/index.ts) |
| npm test | Unit tests (suite loader, metrics, runner, reporters, compare, regression) |
| npm run test:watch | Vitest in watch mode |
| npm run test:integration | Optional live OpenAI test (skipped without OPENAI_API_KEY) |
| npm run test:regression | Regression tests only |
| npm run eval:ci | Tests + build + mock CLI eval (CI-friendly) |
| npm run clean | Remove dist/ (run npm run build to restore) |
| npm publish | Runs prepublishOnly (npm run build), then packs the files listed in package.json |

Project layout

agent-eval-ts/
├── src/
│   ├── evaluator/
│   │   ├── Runner.ts
│   │   ├── metrics/
│   │   ├── judge/
│   │   ├── reporters/
│   │   ├── regression/
│   │   └── exporters/
│   ├── cli/
│   ├── api/
│   ├── runtime/       # shared suite loading + mock / OpenAI agents
│   └── index.ts
├── tests/
├── reports/          # generated reports (gitignored except .gitkeep)
├── .env.example
├── .github/workflows/eval.yml
├── docker-compose.yml
├── Dockerfile
├── LICENSE
├── package.json
├── package-lock.json
├── tsconfig.json
├── vitest.config.ts
└── vitest.integration.config.ts

License

See LICENSE (MIT).

CI

GitHub Actions (.github/workflows/eval.yml) runs npm run eval:ci on pushes and pull requests, and uploads the reports/ folder as an artifact when present.
