
agent-replay

Record, replay, and test AI agent sessions for behavioral regression testing.

The Problem

When you change a prompt, switch models, or update system instructions, you have no reliable way to know if your agent still behaves correctly. Manual testing is slow and inconsistent. Unit tests catch code bugs but not behavioral regressions.

agent-replay solves this by letting you:

  1. Record real agent sessions as test fixtures
  2. Replay them against different models or prompts
  3. Compare behavioral differences automatically
  4. Test with assertions that catch regressions

This is like snapshot testing, but for AI agent behavior.

Installation

npm install -g agent-replay

Or use in a project:

npm install agent-replay

Quick Start

1. Record a Session

Start the recording proxy:

agent-replay record --port 8080 --name my-session

Point your application at the proxy:

OPENAI_BASE_URL=http://localhost:8080/v1 node my-agent.js

Press Ctrl+C to stop. Your session is saved to ./recordings/my-session.json.
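Conceptually, the proxy forwards each request to the real API and stores the request/response pair as a "turn". The sketch below illustrates that capture shape with a hypothetical `makeRecorder` helper and a fake model call — it is not the proxy's actual internals:

```javascript
// Illustrative: capture each model call as a "turn", the way the proxy does.
function makeRecorder(name) {
  const session = { name, recorded_at: new Date().toISOString(), turns: [] };
  return {
    session,
    record(callModel, request) {
      const started = Date.now();
      const response = callModel(request); // in the real proxy, an HTTP round trip
      session.turns.push({
        index: session.turns.length,
        request,
        response,
        latency_ms: Date.now() - started,
      });
      return response;
    },
  };
}

// Demo with a fake model so the sketch is self-contained.
const fakeModel = () => ({
  choices: [{ message: { role: 'assistant', content: 'Hi!' } }],
});
const rec = makeRecorder('my-session');
rec.record(fakeModel, { model: 'gpt-4', messages: [{ role: 'user', content: 'hello' }] });
console.log(rec.session.turns.length); // 1
```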

2. Replay Against a Different Model

agent-replay replay ./recordings/my-session.json --model gpt-4o

This replays every turn and compares the responses.
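A per-turn comparison can be pictured as checking two behavioral axes: the message content and which tools were called. This is a hypothetical sketch, not agent-replay's actual diff logic:

```javascript
// Compare a recorded response with a replayed one on two behavioral axes.
function compareTurn(recorded, replayed) {
  const msg = (r) => r.choices[0].message;
  const toolNames = (m) => (m.tool_calls ?? []).map((t) => t.function.name);
  return {
    content_match: msg(recorded).content === msg(replayed).content,
    tools_match:
      JSON.stringify(toolNames(msg(recorded))) ===
      JSON.stringify(toolNames(msg(replayed))),
  };
}

const recorded = {
  choices: [{ message: { content: 'Paris', tool_calls: [{ function: { name: 'search' } }] } }],
};
const replayed = {
  choices: [{ message: { content: 'Paris is the capital.', tool_calls: [{ function: { name: 'search' } }] } }],
};
console.log(compareTurn(recorded, replayed)); // { content_match: false, tools_match: true }
```

Exact-content matches are rare across models, which is why the YAML assertions below target behavior (tools called, substrings present) rather than full-text equality.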

3. Create Behavioral Tests

Initialize a test suite:

agent-replay init

Edit tests/agent.test.yaml:

name: Agent Behavioral Tests

tests:
  - name: Should use search tool for factual questions
    session: ./recordings/factual-question.json
    assertions:
      - type: tool_called
        value: search
        message: Should call search for factual questions

  - name: Should not hallucinate URLs
    session: ./recordings/url-question.json
    assertions:
      - type: content_not_contains
        value: "example.com"

  - name: Should greet politely
    session: ./recordings/greeting.json
    assertions:
      - type: content_contains
        turn: 0
        value: "Hello"

4. Run Tests

agent-replay test "tests/*.yaml"

Output:

Running: Agent Behavioral Tests
  ✓ Should use search tool for factual questions (1234ms)
  ✓ Should not hallucinate URLs (987ms)
  ✗ Should greet politely (654ms)
    - Response should contain "Hello"

  2 passed, 1 failed, 0 skipped

CLI Commands

record

Start a recording proxy to capture LLM API calls.

agent-replay record [options]

Options:
  -p, --port <port>     Proxy port (default: 8080)
  -t, --target <url>    Target API URL (default: https://api.openai.com)
  -o, --output <dir>    Output directory (default: ./recordings)
  -n, --name <name>     Session name

replay

Replay a recorded session and compare results.

agent-replay replay <session> [options]

Options:
  -m, --model <model>   Model to use for replay
  -b, --base-url <url>  API base URL
  -k, --api-key <key>   API key
  -o, --output <file>   Output file for results
  --json                Output as JSON
  --stop-at <turn>      Stop after N turns

compare

Compare two session recordings directly.

agent-replay compare <session1> <session2>

test

Run behavioral tests from YAML test suites.

agent-replay test <pattern> [options]

Options:
  -m, --model <model>      Model to use
  -f, --filter <name>      Filter tests by name
  --bail                   Stop on first failure
  --reporter <type>        Output format (console, json, junit)
  -o, --output <file>      Output file for reports

info

Show information about a recorded session.

agent-replay info <session>

Programmatic Usage

import { Recorder, Replayer, TestRunner } from 'agent-replay';

// Record sessions programmatically
const recorder = new Recorder('my-session');

// ... make API calls and record them ...
recorder.recordTurn(request, response, latencyMs);
recorder.save();

// Replay a session
const replayer = new Replayer({
  model: 'gpt-4o',
  apiKey: process.env.OPENAI_API_KEY,
});

const result = await replayer.replay(session);
console.log(`Match rate: ${result.summary.match_rate * 100}%`);

// Run tests
const runner = new TestRunner({ model: 'gpt-4' });
const results = await runner.runSuite(suite);

Test Assertions

Type Description
content_contains Response content includes the value
content_not_contains Response content does not include the value
tool_called A specific tool was called
tool_not_called A specific tool was not called
tool_args_match Tool arguments match expected values
finish_reason Response has specific finish reason
custom Custom JavaScript expression

Turn Selectors

  • turn: 0 - First turn
  • turn: 3 - Fourth turn (0-indexed)
  • turn: last - Last turn
  • turn: any - Any turn (default)
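Selector resolution amounts to mapping a selector onto the subset of turns an assertion applies to. A minimal sketch (illustrative, not the library's code):

```javascript
// Resolve a turn selector to the turns an assertion is checked against.
function selectTurns(turns, selector = 'any') {
  if (selector === 'any') return turns;           // assertion may pass on any turn
  if (selector === 'last') return turns.slice(-1);
  return [turns[selector]];                       // numeric selector, 0-indexed
}

const turns = ['t0', 't1', 't2', 't3'];
console.log(selectTurns(turns, 0));      // ['t0']
console.log(selectTurns(turns, 3));      // ['t3']
console.log(selectTurns(turns, 'last')); // ['t3']
console.log(selectTurns(turns).length);  // 4
```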

Session Format

Sessions are stored as JSON:

{
  "id": "rec-1234567890-abc123",
  "name": "my-session",
  "recorded_at": "2024-01-15T10:30:00Z",
  "provider": "openai",
  "turns": [
    {
      "index": 0,
      "timestamp": "2024-01-15T10:30:01Z",
      "request": {
        "model": "gpt-4",
        "messages": [...],
        "tools": [...]
      },
      "response": {
        "id": "chatcmpl-...",
        "choices": [...],
        "usage": {...}
      },
      "latency_ms": 1234
    }
  ]
}
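Because sessions are plain JSON, they are easy to inspect with ordinary scripts. For example, computing a few summary stats along the lines of what `agent-replay info` reports (the exact fields `info` prints are not guaranteed to match this sketch):

```javascript
// Summarize a session object: turn count, total latency, models used.
function summarize(session) {
  return {
    name: session.name,
    turn_count: session.turns.length,
    total_latency_ms: session.turns.reduce((sum, t) => sum + t.latency_ms, 0),
    models: [...new Set(session.turns.map((t) => t.request.model))],
  };
}

const session = {
  name: 'my-session',
  turns: [
    { index: 0, request: { model: 'gpt-4' }, latency_ms: 1234 },
    { index: 1, request: { model: 'gpt-4' }, latency_ms: 800 },
  ],
};
console.log(summarize(session));
// { name: 'my-session', turn_count: 2, total_latency_ms: 2034, models: ['gpt-4'] }
```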

CI Integration

Generate JUnit reports for CI systems:

agent-replay test "tests/*.yaml" --reporter junit -o test-results.xml

Example GitHub Actions workflow:

- name: Run Agent Tests
  run: |
    npm install -g agent-replay
    agent-replay test "tests/*.yaml" --reporter junit -o results.xml
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

- name: Upload Results
  uses: actions/upload-artifact@v3
  with:
    name: test-results
    path: results.xml

Use Cases

Prompt Engineering: Test prompt changes against recorded sessions to catch regressions before deploying.

Model Migration: Compare behavior between GPT-3.5 and GPT-4, or between different providers.

Tool Testing: Verify that agents call the right tools with correct arguments in various scenarios.

Compliance: Ensure agents don't produce certain outputs (PII, hallucinated URLs, etc.).

Debugging: Replay problematic sessions to reproduce and fix issues.

License

MIT
