TracePact

Catch tool-call regressions before they hit production.

Most agent failures don't look like bad text. They look like this:

yesterday your agent read context, validated input, then wrote changes
today, after a prompt tweak, it writes too early
the final answer still looks plausible
production is now broken

TracePact catches that. Record a known-good run, replay it in CI without API calls, and diff against new runs to see exactly what changed — which tools, in what order, with what arguments.

# 1. Record a baseline (one-time, live)
npx tracepact run --live --record

# 2. Change your prompt, model, or tool wiring

# 3. Record again and diff
npx tracepact run --live --record
npx tracepact diff cassettes/before.json cassettes/after.json

# 4. CI: fail on behavioral regressions
npx tracepact diff baseline.json latest.json --fail-on warn

# Ignore noisy args (timestamps, request IDs)
npx tracepact diff baseline.json latest.json --ignore-keys timestamp,requestId

# Ignore tools you don't care about
npx tracepact diff baseline.json latest.json --ignore-tools read_file

  Comparing cassettes
  A: cassettes/before.json
  B: cassettes/after.json

  3 changes detected:

  - read_file (seq 1) (removed)
  + write_file (seq 3) (added)
  ~ bash.cmd: "npm test" -> "npm run build"

  Summary: 1 removed, 1 added, 1 arg changed  [BLOCK]

You changed the prompt. The output still looks fine. But the agent stopped reading the config before deploying, added a write, and switched from running tests to running a build. TracePact caught it.
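
To make the diff above concrete, here is a minimal TypeScript sketch of how two recorded tool-call sequences could be compared. It is not TracePact's implementation: the `ToolCall` and `Change` shapes, the `diffCalls` function, and the pairing strategy (each baseline call is matched to the first unmatched call to the same tool) are all assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration; not TracePact's cassette format.
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

type Change =
  | { kind: "removed"; tool: string; seq: number }
  | { kind: "added"; tool: string; seq: number }
  | { kind: "arg_changed"; tool: string; key: string; before: unknown; after: unknown };

function diffCalls(before: ToolCall[], after: ToolCall[], ignoreKeys: string[] = []): Change[] {
  const changes: Change[] = [];
  // Track which calls in `after` have already been paired with a baseline call.
  const matched: boolean[] = after.map(() => false);

  before.forEach((call, i) => {
    // Pair each baseline call with the first unmatched call to the same tool.
    let j = -1;
    for (let k = 0; k < after.length; k++) {
      if (!matched[k] && after[k].tool === call.tool) {
        j = k;
        break;
      }
    }
    if (j === -1) {
      changes.push({ kind: "removed", tool: call.tool, seq: i + 1 });
      return;
    }
    matched[j] = true;

    // Compare arguments key by key, skipping keys the caller marked as noisy.
    const keys = Object.keys(call.args);
    for (const key of Object.keys(after[j].args)) {
      if (keys.indexOf(key) === -1) keys.push(key);
    }
    for (const key of keys) {
      if (ignoreKeys.indexOf(key) !== -1) continue;
      const a = call.args[key];
      const b = after[j].args[key];
      if (JSON.stringify(a) !== JSON.stringify(b)) {
        changes.push({ kind: "arg_changed", tool: call.tool, key, before: a, after: b });
      }
    }
  });

  // Anything left unmatched in `after` is a newly added call.
  after.forEach((call, k) => {
    if (!matched[k]) changes.push({ kind: "added", tool: call.tool, seq: k + 1 });
  });
  return changes;
}

const baseline: ToolCall[] = [
  { tool: "read_file", args: { path: "config.json" } },
  { tool: "bash", args: { cmd: "npm test", requestId: "a1" } },
];
const latest: ToolCall[] = [
  { tool: "bash", args: { cmd: "npm run build", requestId: "b2" } },
  { tool: "write_file", args: { path: "out.txt" } },
];

const changes = diffCalls(baseline, latest, ["requestId"]);
console.log(changes);
```

Ignoring `requestId` here mirrors the idea behind `--ignore-keys`: without it, every run would report a spurious argument change.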

The problem teams solve manually today

Teams already try to catch this, but usually in fragile ways:

manually reviewing traces in agent UIs
parsing raw session logs after tests
writing custom hooks to extract tool calls
comparing old vs new runs by hand
debugging regressions only after a user reports them

TracePact turns that into deterministic tests and replayable behavior contracts.

Typical behavior contracts

TracePact is designed for assertions like these:

read before write
validate input before mutation
never call shell for read-only tasks
never call destructive tools without confirmation
look up an existing record before creating a new one
query the database before writing cache
run tests before finishing a code-editing task
inspect logs before restart actions
do not write outside allowed paths
do not call sensitive tools in low-trust flows

These are often easier to write and more stable over time than assertions that an entire response is "good."
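
As a rough illustration of why such contracts are easy to state, the sketch below checks two of them ("read before write" and "never call shell") against a plain list of tool names. The helpers `calledBefore` and `neverCalled` are hypothetical and written only for this example; they are not part of TracePact, whose own matchers appear in the sections that follow.

```typescript
// Hypothetical helpers for illustration; not part of the TracePact API.

// "earlier before later": if `later` was called at all, `earlier` must have
// been called at least once before the first call to `later`.
function calledBefore(calls: string[], earlier: string, later: string): boolean {
  const firstLater = calls.indexOf(later);
  if (firstLater === -1) return true; // nothing to guard, contract holds vacuously
  const firstEarlier = calls.indexOf(earlier);
  return firstEarlier !== -1 && firstEarlier < firstLater;
}

// "never call X": the tool must not appear anywhere in the run.
function neverCalled(calls: string[], tool: string): boolean {
  return calls.indexOf(tool) === -1;
}

const goodRun = ["read_file", "validate_input", "write_file"];
const regressedRun = ["write_file", "read_file"];

console.log(calledBefore(goodRun, "read_file", "write_file"));      // true
console.log(calledBefore(regressedRun, "read_file", "write_file")); // false
console.log(neverCalled(goodRun, "bash"));                          // true
```

Each check depends only on which tools ran and in what order, so it stays valid even when the wording of the final answer changes.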

Example: catch a regression after a prompt change

A coding agent should read enough context before editing code.

import { describe, expect, test } from 'vitest';
import { TraceBuilder } from '@tracepact/vitest';

describe('refactor agent', () => {
  test('reads context before editing code', () => {
    const trace = new TraceBuilder()
      .addCall('read_file', { path: 'src/service.ts' }, '...')
      .addCall('read_file', { path: 'src/types.ts' }, '...')
      .addCall('write_file', { path: 'src/service.ts', content: '...' })
      .addCall('run_tests', {}, 'PASS')
      .build();

    expect(trace).toHaveCalledToolsInOrder([
      'read_file', 'read_file', 'write_file', 'run_tests',
    ]);
    expect(trace).toHaveToolCallCount('read_file', 2);
    expect(trace).toNotHaveCalledTool('bash');
  });
});

This test fails immediately if a prompt or model change causes the agent to write before reading, skip required steps, or introduce a forbidden tool call.

No API calls. No tokens. Deterministic. Runs in milliseconds.

Record once, replay in CI

Capture a known-good run once, then replay it to detect drift caused by changes to system prompts, model choice, tool descriptions, agent logic, or MCP server wiring.

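import { expect } from 'vitest';
import { runSkill } from '@tracepact/vitest';

// Record (requires TRACEPACT_LIVE=1).
// `skill` and `sandbox` are assumed to be defined earlier in your test setup.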
const result = await runSkill(skill, {
  prompt: 'deploy to staging',
  record: './cassettes/deploy.json',
  sandbox,
});

// Replay (no API key needed, instant)
const replayed = await runSkill(skill, {
  prompt: 'deploy to staging',
  replay: './cassettes/deploy.json',
});

expect(replayed.trace).toHaveCalledTool('deploy', { env: 'staging' });

Or use the CLI with automatic cassette recording:

# Record all tests (cassettes saved automatically)
npx tracepact run --live --record

# Replay all tests (zero tokens, instant)
npx tracepact run --replay ./cassettes
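
The record/replay idea itself can be sketched in a few lines of TypeScript. This is a simplified mental model, not TracePact's cassette format or driver: the `CassetteEntry` shape and the `Recorder` and `Replayer` classes are hypothetical, and for brevity the sketch records tool results rather than model responses.

```typescript
// Hypothetical cassette shape for illustration; not TracePact's format.
interface CassetteEntry {
  tool: string;
  args: string; // JSON-encoded arguments
  result: string;
}

type ToolFn = (args: Record<string, unknown>) => string;

// Recording: call the real tool and remember what it returned.
class Recorder {
  readonly entries: CassetteEntry[] = [];
  private readonly tools: Record<string, ToolFn>;

  constructor(tools: Record<string, ToolFn>) {
    this.tools = tools;
  }

  call(tool: string, args: Record<string, unknown>): string {
    const fn = this.tools[tool];
    if (!fn) throw new Error(`Unknown tool: ${tool}`);
    const result = fn(args);
    this.entries.push({ tool, args: JSON.stringify(args), result });
    return result;
  }
}

// Replaying: serve recorded results in order and never touch the real tool.
class Replayer {
  private cursor = 0;
  private readonly entries: CassetteEntry[];

  constructor(entries: CassetteEntry[]) {
    this.entries = entries;
  }

  call(tool: string, args: Record<string, unknown>): string {
    const entry = this.entries[this.cursor];
    // Any mismatch with the recording is treated as behavioral drift.
    if (!entry || entry.tool !== tool || entry.args !== JSON.stringify(args)) {
      throw new Error(`Replay drift at call ${this.cursor + 1}: unexpected ${tool}`);
    }
    this.cursor += 1;
    return entry.result;
  }
}

let liveCalls = 0;
const tools: Record<string, ToolFn> = {
  deploy: (args) => {
    liveCalls += 1; // stands in for an expensive live call
    return `deployed to ${String(args.env)}`;
  },
};

const recorder = new Recorder(tools);
recorder.call("deploy", { env: "staging" });

const replayer = new Replayer(recorder.entries);
const replayed = replayer.call("deploy", { env: "staging" });
console.log(replayed, liveCalls);
```

The replayed call returns the recorded result while `liveCalls` stays at 1, which is the property that makes replay cheap and deterministic in CI. Comparing arguments by their JSON encoding is a shortcut that is sensitive to key order; a real implementation would likely normalize arguments first.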

Good fit

TracePact is especially useful for agents that use multiple tools, operate across several steps, mutate files or systems, and can silently regress after prompt or model updates.

Coding agents

Agents that read files, search code, edit code, run tests, use shell, open PRs.

Typical contracts: read context before writing, do not use shell unless required, run tests before completion, never edit restricted files.

Internal developer assistants

Agents that use GitHub, Jira, Slack, docs, or internal APIs via MCP servers.

Typical contracts: use the correct system for the correct task, do not update tickets before validating context.

Ops and incident-response agents

Agents that inspect logs, query metrics, read runbooks, restart services.

Typical contracts: inspect before acting, never restart before checking evidence, require confirmation for destructive steps.

Workflow automation agents

Agents that create tickets, update CRM records, reconcile data, route tasks.

Typical contracts: validate required fields before mutation, look up existing records before creating new ones, avoid duplicate side effects.

Less useful for

TracePact is not primarily for pure chatbots, style or tone evaluation, open-ended creative tasks, or systems where only the final text matters.

Use TracePact for behavioral guarantees. Use semantic or judge-based evals for response quality. They complement each other — TracePact includes a Promptfoo adapter for exactly this.

Tool call assertions

// Did it call the right tools?
expect(trace).toHaveCalledTool('read_file', { path: 'config.json' });
expect(trace).toNotHaveCalledTool('bash');
expect(trace).toHaveToolCallCount('read_file', 3);

// In the right order?
expect(trace).toHaveCalledToolsInOrder(['read_file', 'write_file']);
expect(trace).toHaveCalledToolsInStrictOrder(['read_file', 'write_file']);
expect(trace).toHaveFirstCalledTool('read_file');
expect(trace).toHaveLastCalledTool('write_file');

// With the right side effects?
expect(trace).toHaveFileWritten('output.ts', /export/);

// Conditional contracts
import { when, calledTool } from '@tracepact/core';
when(trace, calledTool('write_file'), toHaveCalledTool(trace, 'read_file'));
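
A conditional contract is a logical implication: the requirement only applies when the condition holds. The sketch below shows that logic on a plain list of tool names. The names `Predicate`, `called`, and `whenThen` are hypothetical and are not the `when` and `calledTool` exports shown above.

```typescript
// Hypothetical helpers for illustration; not the @tracepact/core API.
type Predicate = (calls: string[]) => boolean;

// Builds a predicate that is true when the given tool appears in the run.
const called = (tool: string): Predicate => (calls) => calls.indexOf(tool) !== -1;

// Implication: passes when the condition is false, or when the requirement holds.
function whenThen(calls: string[], condition: Predicate, requirement: Predicate): boolean {
  return !condition(calls) || requirement(calls);
}

const wroteAfterReading = ["read_file", "write_file"];
const wroteBlind = ["write_file"];
const readOnly = ["read_file"];

console.log(whenThen(wroteAfterReading, called("write_file"), called("read_file"))); // true
console.log(whenThen(wroteBlind, called("write_file"), called("read_file")));        // false
console.log(whenThen(readOnly, called("write_file"), called("read_file")));          // true
```

The read-only run passes because the condition never triggers, which is what lets one contract cover both mutating and non-mutating flows.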

MCP tracing

If your agent calls MCP servers, TracePact traces which server handled each tool call:

expect(trace).toHaveCalledMcpTool('filesystem', 'read_text_file');
expect(trace).toHaveCalledMcpServer('database');
expect(trace).toHaveCalledMcpToolsInOrder([
  { server: 'filesystem', tool: 'read_text_file' },
  { server: 'database', tool: 'query' },
]);
expect(trace).toNotHaveCalledMcpTool('filesystem', 'write_file');
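
One plausible reading of an ordered MCP assertion is a subsequence match over (server, tool) pairs: the expected calls must appear in the same relative order, while unrelated calls may be interleaved. The sketch below implements that reading. The `McpCall` shape and the `inOrder` function are assumptions for illustration, and the exact semantics of `toHaveCalledMcpToolsInOrder` may differ.

```typescript
// Hypothetical shape for illustration; not TracePact's trace model.
interface McpCall {
  server: string;
  tool: string;
}

// True when every expected call appears in `actual` in the same relative
// order. Other calls may sit between the expected ones.
function inOrder(actual: McpCall[], expected: McpCall[]): boolean {
  let next = 0; // index of the next expected call still to be found
  for (const call of actual) {
    if (
      next < expected.length &&
      call.server === expected[next].server &&
      call.tool === expected[next].tool
    ) {
      next += 1;
    }
  }
  return next === expected.length;
}

const observed: McpCall[] = [
  { server: "filesystem", tool: "read_text_file" },
  { server: "filesystem", tool: "list_directory" },
  { server: "database", tool: "query" },
];

const readThenQuery: McpCall[] = [
  { server: "filesystem", tool: "read_text_file" },
  { server: "database", tool: "query" },
];

console.log(inOrder(observed, readThenQuery));                  // true
console.log(inOrder(observed, readThenQuery.slice().reverse())); // false
```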

You can also connect to real MCP servers for integration tests:

import { connectMcp, MockSandbox } from '@tracepact/vitest';

const fs = await connectMcp({
  server: 'filesystem',
  command: 'npx',
  args: ['-y', '@modelcontextprotocol/server-filesystem', '/project'],
});

const sandbox = new MockSandbox(fs.handlers, fs.sources);
const result = await sandbox.executeTool('read_text_file', {
  path: '/project/README.md',
});

expect(sandbox.getTrace()).toHaveCalledMcpServer('filesystem');

Why not just use output evals?

Because outputs can remain plausible while behavior quietly changes.

You can change the prompt, the model, the tool schema, the routing logic, or the retry policy — and still get a response that looks acceptable. Meanwhile the agent may have skipped validation, touched the wrong tool, reordered critical steps, or mutated state too early.

Output evals answer "did it say the right thing?" TracePact answers "did it do the right thing?"

Why not just hand-roll mocks?

For simple flows, you can. But the hard part is not a single mock. The hard part is having a reusable way to work with traces, assertions, replay, diffing, MCP workflows, and CI-friendly regression checks across an agent that keeps evolving.

TracePact becomes useful when your agent behavior stops being trivial.

Quickstart

npm install @tracepact/core @tracepact/vitest @tracepact/cli
npx tracepact init          # interactive setup
npx tracepact               # run all tests

CLI

npx tracepact                         # run all tests
npx tracepact run --live              # run against real LLM APIs
npx tracepact run --record            # record cassettes (implies --live)
npx tracepact run --replay ./dir      # replay without API calls
npx tracepact diff a.json b.json      # compare two cassettes
npx tracepact diff a.json b.json --fail-on warn    # fail CI on any drift
npx tracepact diff a.json b.json --fail-on block   # fail only on structural changes
npx tracepact diff a.json b.json --ignore-keys timestamp  # skip noisy args
npx tracepact diff a.json b.json --ignore-tools read_file # skip tools you don't care about
npx tracepact audit                   # static analysis (no API key)
npx tracepact capture                 # auto-generate tests from live run
npx tracepact init                    # interactive setup
npx tracepact doctor                  # environment health check
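
The two `--fail-on` levels can be thought of as a threshold on the most severe change in a diff. The sketch below assumes that added or removed tool calls count as structural ("block") and that argument changes count as "warn"; that mapping, and the names `severityOf` and `exitCode`, are assumptions for illustration rather than documented TracePact behavior.

```typescript
// Hypothetical model for illustration; the real severity rules may differ.
type Severity = "none" | "warn" | "block";
type ChangeKind = "added" | "removed" | "arg_changed";

// Assumed mapping: structural changes block, argument changes warn.
function severityOf(kinds: ChangeKind[]): Severity {
  if (kinds.some((k) => k === "added" || k === "removed")) return "block";
  return kinds.length > 0 ? "warn" : "none";
}

const rank: Record<Severity, number> = { none: 0, warn: 1, block: 2 };

// Returns a process exit code: 1 when the diff meets or exceeds the threshold.
function exitCode(kinds: ChangeKind[], failOn: "warn" | "block"): number {
  return rank[severityOf(kinds)] >= rank[failOn] ? 1 : 0;
}

console.log(exitCode(["arg_changed"], "warn"));  // 1
console.log(exitCode(["arg_changed"], "block")); // 0
console.log(exitCode(["removed", "arg_changed"], "block")); // 1
console.log(exitCode([], "warn")); // 0
```

Under this model, `--fail-on warn` is the stricter gate and `--fail-on block` tolerates argument-level drift.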

IDE integration (MCP)

TracePact ships an MCP server for Claude Code, Cursor, and Windsurf:

{
  "mcpServers": {
    "tracepact": {
      "command": "npx",
      "args": ["@tracepact/mcp-server"]
    }
  }
}

| Tool | What it does |
| --- | --- |
| `tracepact_audit` | Static analysis of tool definitions |
| `tracepact_run` | Execute the test suite |
| `tracepact_capture` | Auto-generate tests from a cassette |
| `tracepact_replay` | Replay a cassette without API calls |
| `tracepact_diff` | Compare two cassettes for behavioral drift |
| `tracepact_list_tests` | Discover test files and cassettes |

Packages

| Package | Description |
| --- | --- |
| `@tracepact/core` | Trace model, matchers, sandboxes, drivers, cassettes, redaction |
| `@tracepact/vitest` | Vitest plugin, `runSkill()`, test annotations |
| `@tracepact/cli` | CLI commands |
| `@tracepact/promptfoo` | Promptfoo provider + assertion adapter |
| `@tracepact/mcp-server` | MCP server for IDE integration |
