# agent-replay

Record, replay, and test AI agent sessions for behavioral regression testing.
## Why?

When you change a prompt, switch models, or update system instructions, you have no reliable way to know whether your agent still behaves correctly. Manual testing is slow and inconsistent, and unit tests catch code bugs but not behavioral regressions.

agent-replay solves this by letting you:
- Record real agent sessions as test fixtures
- Replay them against different models or prompts
- Compare behavioral differences automatically
- Test with assertions that catch regressions
This is like snapshot testing, but for AI agent behavior.
## Installation

```bash
npm install -g agent-replay
```

Or use it in a project:

```bash
npm install agent-replay
```

## Quick Start

### 1. Record a session

Start the recording proxy:

```bash
agent-replay record --port 8080 --name my-session
```

Point your application at the proxy:

```bash
OPENAI_BASE_URL=http://localhost:8080/v1 node my-agent.js
```

Press Ctrl+C to stop. Your session is saved to `./recordings/my-session.json`.
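The proxy is transparent to the agent: no code changes are needed beyond making the base URL configurable. A hypothetical minimal `my-agent.js` (the file name comes from the command above; the actual client setup is elided):

```javascript
// my-agent.js — hypothetical minimal agent entry point.
// Read the endpoint from OPENAI_BASE_URL instead of hard-coding it,
// so pointing the agent at the recording proxy needs no code changes.
const baseURL = process.env.OPENAI_BASE_URL || 'https://api.openai.com/v1';

console.log(`LLM requests will go to: ${baseURL}`);
// ...construct your OpenAI client with { baseURL } and run the agent loop...
```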
### 2. Replay the session

```bash
agent-replay replay ./recordings/my-session.json --model gpt-4o
```

This replays every turn and compares the responses.

### 3. Write behavioral tests

Initialize a test suite:

```bash
agent-replay init
```

Edit `tests/agent.test.yaml`:
```yaml
name: Agent Behavioral Tests
tests:
  - name: Should use search tool for factual questions
    session: ./recordings/factual-question.json
    assertions:
      - type: tool_called
        value: search
        message: Should call search for factual questions

  - name: Should not hallucinate URLs
    session: ./recordings/url-question.json
    assertions:
      - type: content_not_contains
        value: "example.com"

  - name: Should greet politely
    session: ./recordings/greeting.json
    assertions:
      - type: content_contains
        turn: 0
        value: "Hello"
```

### 4. Run the tests

```bash
agent-replay test "tests/*.yaml"
```

Output:
```
Running: Agent Behavioral Tests

  ✓ Should use search tool for factual questions (1234ms)
  ✓ Should not hallucinate URLs (987ms)
  ✗ Should greet politely (654ms)
      - Response should contain "Hello"

2 passed, 1 failed, 0 skipped
```
## CLI Reference

### `agent-replay record`

Start a recording proxy to capture LLM API calls.

```
agent-replay record [options]

Options:
  -p, --port <port>     Proxy port (default: 8080)
  -t, --target <url>    Target API URL (default: https://api.openai.com)
  -o, --output <dir>    Output directory (default: ./recordings)
  -n, --name <name>     Session name
```

### `agent-replay replay`

Replay a recorded session and compare results.
```
agent-replay replay <session> [options]

Options:
  -m, --model <model>      Model to use for replay
  -b, --base-url <url>     API base URL
  -k, --api-key <key>      API key
  -o, --output <file>      Output file for results
  --json                   Output as JSON
  --stop-at <turn>         Stop after N turns
```

### `agent-replay compare`

Compare two session recordings directly.
```
agent-replay compare <session1> <session2>
```

### `agent-replay test`

Run behavioral tests from YAML test suites.
```
agent-replay test <pattern> [options]

Options:
  -m, --model <model>     Model to use
  -f, --filter <name>     Filter tests by name
  --bail                  Stop on first failure
  --reporter <type>       Output format (console, json, junit)
  -o, --output <file>     Output file for reports
```

### `agent-replay info`

Show information about a recorded session.
```
agent-replay info <session>
```

## Programmatic API

```javascript
import { readFileSync } from 'fs';
import { Recorder, Replayer, TestRunner } from 'agent-replay';

// Record sessions programmatically
const recorder = new Recorder('my-session');
// ... make API calls and record them ...
recorder.recordTurn(request, response, latencyMs);
recorder.save();

// Load a recorded session (plain JSON; see Session Format below)
const session = JSON.parse(
  readFileSync('./recordings/my-session.json', 'utf8'),
);

// Replay it
const replayer = new Replayer({
  model: 'gpt-4o',
  apiKey: process.env.OPENAI_API_KEY,
});
const result = await replayer.replay(session);
console.log(`Match rate: ${result.summary.match_rate * 100}%`);

// Run tests
const runner = new TestRunner({ model: 'gpt-4' });
const results = await runner.runSuite(suite);
```

## Assertion Types

| Type | Description |
|---|---|
| `content_contains` | Response content includes the value |
| `content_not_contains` | Response content does not include the value |
| `tool_called` | A specific tool was called |
| `tool_not_called` | A specific tool was not called |
| `tool_args_match` | Tool arguments match expected values |
| `finish_reason` | Response has specific finish reason |
| `custom` | Custom JavaScript expression |
## Targeting Specific Turns

By default an assertion may match any turn. Use the `turn` field to target one:

- `turn: 0` - First turn
- `turn: 3` - Fourth turn (0-indexed)
- `turn: last` - Last turn
- `turn: any` - Any turn (default)
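For example, a suite entry that checks only the final response (a sketch reusing the schema above; the recording path and expected phrase are hypothetical):

```yaml
- name: Should end with a summary
  session: ./recordings/long-chat.json
  assertions:
    - type: content_contains
      turn: last
      value: "In summary"
```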
## Session Format

Sessions are stored as JSON:

```json
{
  "id": "rec-1234567890-abc123",
  "name": "my-session",
  "recorded_at": "2024-01-15T10:30:00Z",
  "provider": "openai",
  "turns": [
    {
      "index": 0,
      "timestamp": "2024-01-15T10:30:01Z",
      "request": {
        "model": "gpt-4",
        "messages": [...],
        "tools": [...]
      },
      "response": {
        "id": "chatcmpl-...",
        "choices": [...],
        "usage": {...}
      },
      "latency_ms": 1234
    }
  ]
}
```

## CI Integration

Generate JUnit reports for CI systems:
```bash
agent-replay test "tests/*.yaml" --reporter junit -o test-results.xml
```

Example GitHub Actions workflow:
```yaml
- name: Run Agent Tests
  run: |
    npm install -g agent-replay
    agent-replay test "tests/*.yaml" --reporter junit -o results.xml
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

- name: Upload Results
  uses: actions/upload-artifact@v3
  with:
    name: test-results
    path: results.xml
```

## Use Cases

- **Prompt Engineering**: Test prompt changes against recorded sessions to catch regressions before deploying.
- **Model Migration**: Compare behavior between GPT-3.5 and GPT-4, or between different providers.
- **Tool Testing**: Verify that agents call the right tools with correct arguments in various scenarios.
- **Compliance**: Ensure agents don't produce certain outputs (PII, hallucinated URLs, etc.).
- **Debugging**: Replay problematic sessions to reproduce and fix issues.
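Because recordings are plain JSON (see the session format above), debugging sessions are also easy to inspect outside the CLI. A sketch (the `summarize` helper is not part of agent-replay; it only reads the documented fields):

```javascript
// Summarize a recorded session using the documented format:
// turns[], each with request.model and latency_ms.
function summarize(session) {
  const turns = session.turns || [];
  const totalMs = turns.reduce((sum, t) => sum + (t.latency_ms || 0), 0);
  return {
    name: session.name,
    turns: turns.length,
    totalLatencyMs: totalMs,
    models: [...new Set(turns.map((t) => t.request.model))],
  };
}

// Example with an inline session object (normally you would load one
// with JSON.parse(fs.readFileSync('./recordings/my-session.json', 'utf8'))):
const session = {
  name: 'my-session',
  turns: [
    { index: 0, request: { model: 'gpt-4' }, latency_ms: 1234 },
    { index: 1, request: { model: 'gpt-4' }, latency_ms: 987 },
  ],
};

console.log(summarize(session));
// → { name: 'my-session', turns: 2, totalLatencyMs: 2221, models: [ 'gpt-4' ] }
```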
## License

MIT