Catch tool-call regressions before they hit production.
Most agent failures don't look like bad text. They look like this:
- yesterday your agent read context, validated input, then wrote changes
- today, after a prompt tweak, it writes too early
- the final answer still looks plausible
- production is now broken
TracePact catches that. Record a known-good run, replay it in CI without API calls, and diff against new runs to see exactly what changed — which tools, in what order, with what arguments.
# 1. Record a baseline (one-time, live)
npx tracepact run --live --record
# 2. Change your prompt, model, or tool wiring
# 3. Record again and diff
npx tracepact run --live --record
npx tracepact diff cassettes/before.json cassettes/after.json
# 4. CI: fail on behavioral regressions
npx tracepact diff baseline.json latest.json --fail-on warn
# Ignore noisy args (timestamps, request IDs)
npx tracepact diff baseline.json latest.json --ignore-keys timestamp,requestId
# Ignore tools you don't care about
npx tracepact diff baseline.json latest.json --ignore-tools read_file

Comparing cassettes
A: cassettes/before.json
B: cassettes/after.json
3 changes detected:
- read_file (seq 1) (removed)
+ write_file (seq 3) (added)
~ bash.cmd: "npm test" -> "npm run build"
Summary: 1 removed, 1 added, 1 arg changed [BLOCK]
You changed the prompt. The output still looks fine. But the agent stopped reading the config before deploying and switched from running tests to running builds. TracePact caught it.
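To make the report above concrete, here is a minimal sketch of how a diff over two tool-call traces could work. It is not TracePact's implementation: the `ToolCall` and `Change` shapes, the `diffTraces` function, and the pairing rule (the n-th call of a tool in A is compared with the n-th call of the same tool in B) are assumptions made for illustration. The sketch does not detect pure reordering.

```typescript
// Hypothetical trace shape; TracePact's real cassette format may differ.
type ToolCall = { tool: string; args: Record<string, unknown> };

type Change =
  | { kind: "removed"; tool: string }
  | { kind: "added"; tool: string }
  | { kind: "arg_changed"; tool: string; key: string; before: unknown; after: unknown };

interface DiffOptions {
  ignoreKeys?: string[];  // argument names to skip (timestamps, request IDs)
  ignoreTools?: string[]; // tool names to drop from both traces
}

// Return the distinct strings of a list, preserving first-seen order.
function unique(values: string[]): string[] {
  const seen: string[] = [];
  for (const value of values) {
    if (seen.indexOf(value) === -1) seen.push(value);
  }
  return seen;
}

function diffTraces(before: ToolCall[], after: ToolCall[], opts: DiffOptions = {}): Change[] {
  const ignoreKeys = opts.ignoreKeys ?? [];
  const ignoreTools = opts.ignoreTools ?? [];
  const keep = (call: ToolCall) => ignoreTools.indexOf(call.tool) === -1;
  const a = before.filter(keep);
  const b = after.filter(keep);
  const changes: Change[] = [];

  for (const tool of unique(a.concat(b).map((call) => call.tool))) {
    const callsA = a.filter((call) => call.tool === tool);
    const callsB = b.filter((call) => call.tool === tool);
    const paired = Math.min(callsA.length, callsB.length);

    // Compare arguments of paired calls key by key.
    for (let i = 0; i < paired; i++) {
      const keys = unique(Object.keys(callsA[i].args).concat(Object.keys(callsB[i].args)));
      for (const key of keys) {
        if (ignoreKeys.indexOf(key) !== -1) continue;
        const x = callsA[i].args[key];
        const y = callsB[i].args[key];
        if (JSON.stringify(x) !== JSON.stringify(y)) {
          changes.push({ kind: "arg_changed", tool, key, before: x, after: y });
        }
      }
    }

    // Unpaired calls are reported as removed (only in A) or added (only in B).
    for (let i = paired; i < callsA.length; i++) changes.push({ kind: "removed", tool });
    for (let i = paired; i < callsB.length; i++) changes.push({ kind: "added", tool });
  }
  return changes;
}

const baseline: ToolCall[] = [
  { tool: "read_file", args: { path: "config.json" } },
  { tool: "bash", args: { cmd: "npm test" } },
];
const latest: ToolCall[] = [
  { tool: "bash", args: { cmd: "npm run build" } },
  { tool: "write_file", args: { path: "deploy.log" } },
];
const report = diffTraces(baseline, latest);
console.log(report.map((change) => `${change.kind} ${change.tool}`).join("\n"));
```

For these two illustrative traces the sketch reports one removed `read_file`, one changed `bash.cmd`, and one added `write_file`, the same three kinds of change shown in the report above.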
Teams already try to catch this, but usually in fragile ways:
- manually reviewing traces in agent UIs
- parsing raw session logs after tests
- writing custom hooks to extract tool calls
- comparing old vs new runs by hand
- debugging regressions only after a user reports them
TracePact turns that into deterministic tests and replayable behavior contracts.
TracePact is designed for assertions like these:
- read before write
- validate input before mutation
- never call shell for read-only tasks
- never call destructive tools without confirmation
- look up an existing record before creating a new one
- query the database before writing cache
- run tests before finishing a code-editing task
- inspect logs before restart actions
- do not write outside allowed paths
- do not call sensitive tools in low-trust flows
These contracts are often easier to write and more stable than assertions that an entire response is "good."
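A contract of this kind is just a predicate over the ordered list of tool calls. The sketch below is a standalone illustration, not TracePact's API: `ToolCall`, `Contract`, `mustPrecede`, `neverCalls`, `writesOnlyUnder`, and `violations` are hypothetical names, and the path check is a naive prefix comparison that does not normalize `..` segments.

```typescript
// Hypothetical trace shape used only for this sketch.
type ToolCall = { tool: string; args: Record<string, unknown> };

// A contract returns null when it holds, or a message describing the violation.
type Contract = (trace: ToolCall[]) => string | null;

// "read before write": `then` must not appear before the first `first`.
const mustPrecede = (first: string, then: string): Contract => (trace) => {
  let seenFirst = false;
  for (const call of trace) {
    if (call.tool === first) seenFirst = true;
    if (call.tool === then && !seenFirst) return `${then} called before any ${first}`;
  }
  return null;
};

// "never call shell for read-only tasks": the tool must not appear at all.
const neverCalls = (tool: string): Contract => (trace) =>
  trace.some((call) => call.tool === tool) ? `${tool} must not be called` : null;

// "do not write outside allowed paths": naive prefix check on the path argument.
const writesOnlyUnder = (prefix: string): Contract => (trace) => {
  for (const call of trace) {
    if (call.tool !== "write_file") continue;
    const path = call.args["path"];
    if (typeof path !== "string" || path.indexOf(prefix) !== 0) {
      return `write outside ${prefix}: ${String(path)}`;
    }
  }
  return null;
};

// Collect the messages of every contract that fails on the trace.
function violations(trace: ToolCall[], contracts: Contract[]): string[] {
  const found: string[] = [];
  for (const contract of contracts) {
    const message = contract(trace);
    if (message !== null) found.push(message);
  }
  return found;
}

const contracts = [
  mustPrecede("read_file", "write_file"),
  neverCalls("bash"),
  writesOnlyUnder("src/"),
];
const goodRun: ToolCall[] = [
  { tool: "read_file", args: { path: "src/service.ts" } },
  { tool: "write_file", args: { path: "src/service.ts" } },
  { tool: "run_tests", args: {} },
];
console.log(violations(goodRun, contracts));
```

Because each contract looks only at tool names, order, and arguments, it stays valid when the wording of the final answer changes.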
A coding agent should read enough context before editing code.
import { describe, expect, test } from 'vitest';
import { TraceBuilder } from '@tracepact/vitest';

describe('refactor agent', () => {
  test('reads context before editing code', () => {
    const trace = new TraceBuilder()
      .addCall('read_file', { path: 'src/service.ts' }, '...')
      .addCall('read_file', { path: 'src/types.ts' }, '...')
      .addCall('write_file', { path: 'src/service.ts', content: '...' })
      .addCall('run_tests', {}, 'PASS')
      .build();

    expect(trace).toHaveCalledToolsInOrder([
      'read_file', 'read_file', 'write_file', 'run_tests',
    ]);
    expect(trace).toHaveToolCallCount('read_file', 2);
    expect(trace).toNotHaveCalledTool('bash');
  });
});

This test fails immediately if a prompt or model change causes the agent to write before reading, skip required steps, or introduce a forbidden tool call.
No API calls. No tokens. Deterministic. Runs in milliseconds.
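An order assertion can be read in more than one way, and this document does not spell out which reading each matcher uses. The sketch below shows two plausible semantics as plain functions: `isSubsequence` allows other calls in between, while `isExactSequence` requires the call list to match exactly. Both names are hypothetical and are not part of TracePact.

```typescript
// Loose ordering: every expected tool appears, in this order,
// but other tool calls may occur in between.
function isSubsequence(actual: string[], expected: string[]): boolean {
  let next = 0; // index of the next expected tool we are waiting for
  for (const name of actual) {
    if (next < expected.length && name === expected[next]) next++;
  }
  return next === expected.length;
}

// Strict ordering (one possible reading): the call list is exactly the expected list.
function isExactSequence(actual: string[], expected: string[]): boolean {
  return actual.length === expected.length && actual.every((name, i) => name === expected[i]);
}

const observed = ["read_file", "list_dir", "write_file"];
console.log(isSubsequence(observed, ["read_file", "write_file"]));   // true
console.log(isExactSequence(observed, ["read_file", "write_file"])); // false
```

Both checks are pure functions over an in-memory list, which is why tests of this kind need no network access and give the same result on every run.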
Capture a known-good run once, then replay it to detect drift caused by changes to system prompts, model choice, tool descriptions, agent logic, or MCP server wiring.
import { expect } from 'vitest';
import { runSkill } from '@tracepact/vitest';

// `skill` and `sandbox` are assumed to be defined elsewhere in your test setup.

// Record (requires TRACEPACT_LIVE=1)
const result = await runSkill(skill, {
  prompt: 'deploy to staging',
  record: './cassettes/deploy.json',
  sandbox,
});

// Replay (no API key needed, instant)
const replayed = await runSkill(skill, {
  prompt: 'deploy to staging',
  replay: './cassettes/deploy.json',
});
expect(replayed.trace).toHaveCalledTool('deploy', { env: 'staging' });

Or use the CLI with automatic cassette recording:
# Record all tests (cassettes saved automatically)
npx tracepact run --live --record
# Replay all tests (zero tokens, instant)
npx tracepact run --replay ./cassettes

TracePact is especially useful for agents that use multiple tools, operate across several steps, mutate files or systems, and can silently regress after prompt or model updates.
Agents that read files, search code, edit code, run tests, use shell, open PRs.
Typical contracts: read context before writing, do not use shell unless required, run tests before completion, never edit restricted files.
Agents that use GitHub, Jira, Slack, docs, or internal APIs via MCP servers.
Typical contracts: use the correct system for the correct task, do not update tickets before validating context.
Agents that inspect logs, query metrics, read runbooks, restart services.
Typical contracts: inspect before acting, never restart before checking evidence, require confirmation for destructive steps.
Agents that create tickets, update CRM records, reconcile data, route tasks.
Typical contracts: validate required fields before mutation, look up existing records before creating new ones, avoid duplicate side effects.
TracePact is not primarily for pure chatbots, style or tone evaluation, open-ended creative tasks, or systems where only the final text matters.
Use TracePact for behavioral guarantees. Use semantic or judge-based evals for response quality. They complement each other — TracePact includes a Promptfoo adapter for exactly this.
// Did it call the right tools?
expect(trace).toHaveCalledTool('read_file', { path: 'config.json' });
expect(trace).toNotHaveCalledTool('bash');
expect(trace).toHaveToolCallCount('read_file', 3);
// In the right order?
expect(trace).toHaveCalledToolsInOrder(['read_file', 'write_file']);
expect(trace).toHaveCalledToolsInStrictOrder(['read_file', 'write_file']);
expect(trace).toHaveFirstCalledTool('read_file');
expect(trace).toHaveLastCalledTool('write_file');
// With the right side effects?
expect(trace).toHaveFileWritten('output.ts', /export/);
// Conditional contracts
import { when, calledTool } from '@tracepact/core';
when(trace, calledTool('write_file'), toHaveCalledTool(trace, 'read_file'));

If your agent calls MCP servers, TracePact traces which server handled each tool call:
expect(trace).toHaveCalledMcpTool('filesystem', 'read_text_file');
expect(trace).toHaveCalledMcpServer('database');
expect(trace).toHaveCalledMcpToolsInOrder([
  { server: 'filesystem', tool: 'read_text_file' },
  { server: 'database', tool: 'query' },
]);
expect(trace).toNotHaveCalledMcpTool('filesystem', 'write_file');

You can also connect to real MCP servers for integration tests:
import { connectMcp, MockSandbox } from '@tracepact/vitest';

const fs = await connectMcp({
  server: 'filesystem',
  command: 'npx',
  args: ['-y', '@modelcontextprotocol/server-filesystem', '/project'],
});
const sandbox = new MockSandbox(fs.handlers, fs.sources);
const result = await sandbox.executeTool('read_text_file', {
  path: '/project/README.md',
});
expect(sandbox.getTrace()).toHaveCalledMcpServer('filesystem');

Because outputs can remain plausible while behavior quietly changes.
You can change the prompt, the model, the tool schema, the routing logic, or the retry policy — and still get a response that looks acceptable. Meanwhile the agent may have skipped validation, touched the wrong tool, reordered critical steps, or mutated state too early.
Output evals answer "did it say the right thing?" TracePact answers "did it do the right thing?"
For simple flows, you can. But the hard part is not a single mock. The hard part is having a reusable way to work with traces, assertions, replay, diffing, MCP workflows, and CI-friendly regression checks across an agent that keeps evolving.
TracePact becomes useful when your agent behavior stops being trivial.
npm install @tracepact/core @tracepact/vitest @tracepact/cli
npx tracepact init # interactive setup
npx tracepact # run all tests

npx tracepact # run all tests
npx tracepact run --live # run against real LLM APIs
npx tracepact run --record # record cassettes (implies --live)
npx tracepact run --replay ./dir # replay without API calls
npx tracepact diff a.json b.json # compare two cassettes
npx tracepact diff a.json b.json --fail-on warn # fail CI on any drift
npx tracepact diff a.json b.json --fail-on block # fail only on structural changes
npx tracepact diff a.json b.json --ignore-keys timestamp # skip noisy args
npx tracepact diff a.json b.json --ignore-tools read_file # skip tools you don't care about
npx tracepact audit # static analysis (no API key)
npx tracepact capture # auto-generate tests from live run
npx tracepact init # interactive setup
npx tracepact doctor # environment health check
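The `--fail-on` levels above can be thought of as a threshold on the most severe change found. The sketch below illustrates that idea only. The mapping is an assumption: added or removed calls are treated as structural ("block"), and argument changes as "warn". The names `severityOf`, `worstSeverity`, and `exitCode` are hypothetical, and TracePact's real classification may differ.

```typescript
type Severity = "none" | "warn" | "block";
type ChangeKind = "added" | "removed" | "arg_changed";

const rank: Record<Severity, number> = { none: 0, warn: 1, block: 2 };

// Assumed mapping: structural changes block, argument drift only warns.
function severityOf(kind: ChangeKind): Severity {
  return kind === "arg_changed" ? "warn" : "block";
}

// The most severe level across all detected changes ("none" if there are no changes).
function worstSeverity(kinds: ChangeKind[]): Severity {
  let worst: Severity = "none";
  for (const kind of kinds) {
    const severity = severityOf(kind);
    if (rank[severity] > rank[worst]) worst = severity;
  }
  return worst;
}

// Exit code 1 when the worst change reaches the configured threshold, else 0.
function exitCode(kinds: ChangeKind[], failOn: "warn" | "block"): number {
  return rank[worstSeverity(kinds)] >= rank[failOn] ? 1 : 0;
}

console.log(exitCode(["arg_changed"], "warn"));  // 1
console.log(exitCode(["arg_changed"], "block")); // 0
console.log(exitCode(["removed"], "block"));     // 1
```

Under this reading, `--fail-on warn` is the stricter setting because any drift reaches the threshold, while `--fail-on block` lets argument-only drift pass.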
TracePact ships an MCP server for Claude Code, Cursor, and Windsurf:
{
  "mcpServers": {
    "tracepact": {
      "command": "npx",
      "args": ["@tracepact/mcp-server"]
    }
  }
}

| Tool | What it does |
|---|---|
| tracepact_audit | Static analysis of tool definitions |
| tracepact_run | Execute the test suite |
| tracepact_capture | Auto-generate tests from a cassette |
| tracepact_replay | Replay a cassette without API calls |
| tracepact_diff | Compare two cassettes for behavioral drift |
| tracepact_list_tests | Discover test files and cassettes |
| Package | Description |
|---|---|
| @tracepact/core | Trace model, matchers, sandboxes, drivers, cassettes, redaction |
| @tracepact/vitest | Vitest plugin, runSkill(), test annotations |
| @tracepact/cli | CLI commands |
| @tracepact/promptfoo | Promptfoo provider + assertion adapter |
| @tracepact/mcp-server | MCP server for IDE integration |
- Record & replay (cassettes)
- Tool call matchers and MCP tracing
- Behavioral drift detection (diff)
- Promptfoo integration
- MCP server for IDEs
- tracepact diff CLI command
- Diff policy: --ignore-keys, --ignore-tools, severity levels (--fail-on warn|block)
- tracepact show: visual trace timeline
- Invariant discovery (analyze traces, suggest contracts)
- SDK adapters (Vercel AI SDK, Anthropic SDK, OpenAI SDK)
- Quick Start
- Mock vs Live Testing
- CI Integration
- Cassettes (Record & Replay)
- Assertion Reference
- CLI Reference
- Promptfoo Integration
- IDE Setup (MCP)
See CONTRIBUTING.md for development setup and coding standards.