# agent-replay

Record, replay, and test AI agent sessions for behavioral regression testing.
## Why?

When you change a prompt, switch models, or update system instructions, you have no reliable way to know whether your agent still behaves correctly. Manual testing is slow and inconsistent, and unit tests catch code bugs but not behavioral regressions.

agent-replay solves this by letting you:
- Record real agent sessions as test fixtures
- Replay them against different models or prompts
- Compare behavioral differences automatically
- Test with assertions that catch regressions
This is like snapshot testing, but for AI agent behavior.
## Installation

```bash
npm install -g agent-replay
```

Or use it in a project:

```bash
npm install agent-replay
```

## Quick Start

### 1. Record a session

Start the recording proxy:

```bash
agent-replay record --port 8080 --name my-session
```

Point your application at the proxy:

```bash
OPENAI_BASE_URL=http://localhost:8080/v1 node my-agent.js
```

Press Ctrl+C to stop. Your session is saved to `./recordings/my-session.json`.
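The proxy is transparent to the agent: no code changes are needed beyond making the base URL configurable. A hypothetical minimal `my-agent.js` (the file name comes from the command above; the actual client setup is elided):

```javascript
// my-agent.js — hypothetical minimal agent entry point.
// Read the endpoint from OPENAI_BASE_URL instead of hard-coding it,
// so pointing the agent at the recording proxy needs no code changes.
const baseURL = process.env.OPENAI_BASE_URL || 'https://api.openai.com/v1';

console.log(`LLM requests will go to: ${baseURL}`);
// ...construct your OpenAI client with { baseURL } and run the agent loop...
```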
### 2. Replay the session

```bash
agent-replay replay ./recordings/my-session.json --model gpt-4o
```

This replays every turn and compares the responses.

### 3. Write behavioral tests

Initialize a test suite:

```bash
agent-replay init
```

Edit `tests/agent.test.yaml`:
```yaml
name: Agent Behavioral Tests
tests:
  - name: Should use search tool for factual questions
    session: ./recordings/factual-question.json
    assertions:
      - type: tool_called
        value: search
        message: Should call search for factual questions

  - name: Should not hallucinate URLs
    session: ./recordings/url-question.json
    assertions:
      - type: content_not_contains
        value: "example.com"

  - name: Should greet politely
    session: ./recordings/greeting.json
    assertions:
      - type: content_contains
        turn: 0
        value: "Hello"
```

### 4. Run the tests

```bash
agent-replay test "tests/*.yaml"
```

Output:
```
Running: Agent Behavioral Tests

  ✓ Should use search tool for factual questions (1234ms)
  ✓ Should not hallucinate URLs (987ms)
  ✗ Should greet politely (654ms)
      - Response should contain "Hello"

2 passed, 1 failed, 0 skipped
```
## CLI Reference

### `agent-replay record`

Start a recording proxy to capture LLM API calls.

```
agent-replay record [options]

Options:
  -p, --port <port>     Proxy port (default: 8080)
  -t, --target <url>    Target API URL (default: https://api.openai.com)
  -o, --output <dir>    Output directory (default: ./recordings)
  -n, --name <name>     Session name
```

### `agent-replay replay`

Replay a recorded session and compare results.
```
agent-replay replay <session> [options]

Options:
  -m, --model <model>      Model to use for replay
  -b, --base-url <url>     API base URL
  -k, --api-key <key>      API key
  -o, --output <file>      Output file for results
  --json                   Output as JSON
  --stop-at <turn>         Stop after N turns
```

### `agent-replay compare`

Compare two session recordings directly.
```
agent-replay compare <session1> <session2>
```

### `agent-replay test`

Run behavioral tests from YAML test suites.
```
agent-replay test <pattern> [options]

Options:
  -m, --model <model>     Model to use
  -f, --filter <name>     Filter tests by name
  --bail                  Stop on first failure
  --reporter <type>       Output format (console, json, junit)
  -o, --output <file>     Output file for reports
```

### `agent-replay info`

Show information about a recorded session.
```
agent-replay info <session>
```

## Programmatic API

```javascript
import { readFileSync } from 'fs';
import { Recorder, Replayer, TestRunner } from 'agent-replay';

// Record sessions programmatically
const recorder = new Recorder('my-session');
// ... make API calls and record them ...
recorder.recordTurn(request, response, latencyMs);
recorder.save();

// Load a recorded session (plain JSON; see Session Format below)
const session = JSON.parse(
  readFileSync('./recordings/my-session.json', 'utf8'),
);

// Replay it
const replayer = new Replayer({
  model: 'gpt-4o',
  apiKey: process.env.OPENAI_API_KEY,
});
const result = await replayer.replay(session);
console.log(`Match rate: ${result.summary.match_rate * 100}%`);

// Run tests
const runner = new TestRunner({ model: 'gpt-4' });
const results = await runner.runSuite(suite);
```

## Assertion Types

| Type | Description |
|---|---|
| `content_contains` | Response content includes the value |
| `content_not_contains` | Response content does not include the value |
| `tool_called` | A specific tool was called |
| `tool_not_called` | A specific tool was not called |
| `tool_args_match` | Tool arguments match expected values |
| `finish_reason` | Response has specific finish reason |
| `custom` | Custom JavaScript expression |
## Targeting Specific Turns

By default an assertion may match any turn. Use the `turn` field to target one:

- `turn: 0` - First turn
- `turn: 3` - Fourth turn (0-indexed)
- `turn: last` - Last turn
- `turn: any` - Any turn (default)
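For example, a suite entry that checks only the final response (a sketch reusing the schema above; the recording path and expected phrase are hypothetical):

```yaml
- name: Should end with a summary
  session: ./recordings/long-chat.json
  assertions:
    - type: content_contains
      turn: last
      value: "In summary"
```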
## Session Format

Sessions are stored as JSON:

```json
{
  "id": "rec-1234567890-abc123",
  "name": "my-session",
  "recorded_at": "2024-01-15T10:30:00Z",
  "provider": "openai",
  "turns": [
    {
      "index": 0,
      "timestamp": "2024-01-15T10:30:01Z",
      "request": {
        "model": "gpt-4",
        "messages": [...],
        "tools": [...]
      },
      "response": {
        "id": "chatcmpl-...",
        "choices": [...],
        "usage": {...}
      },
      "latency_ms": 1234
    }
  ]
}
```

## CI Integration

Generate JUnit reports for CI systems:
```bash
agent-replay test "tests/*.yaml" --reporter junit -o test-results.xml
```

Example GitHub Actions workflow:
```yaml
- name: Run Agent Tests
  run: |
    npm install -g agent-replay
    agent-replay test "tests/*.yaml" --reporter junit -o results.xml
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

- name: Upload Results
  uses: actions/upload-artifact@v3
  with:
    name: test-results
    path: results.xml
```

## Use Cases

- **Prompt Engineering**: Test prompt changes against recorded sessions to catch regressions before deploying.
- **Model Migration**: Compare behavior between GPT-3.5 and GPT-4, or between different providers.
- **Tool Testing**: Verify that agents call the right tools with correct arguments in various scenarios.
- **Compliance**: Ensure agents don't produce certain outputs (PII, hallucinated URLs, etc.).
- **Debugging**: Replay problematic sessions to reproduce and fix issues.
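Because recordings are plain JSON (see the session format above), debugging sessions are also easy to inspect outside the CLI. A sketch (the `summarize` helper is not part of agent-replay; it only reads the documented fields):

```javascript
// Summarize a recorded session using the documented format:
// turns[], each with request.model and latency_ms.
function summarize(session) {
  const turns = session.turns || [];
  const totalMs = turns.reduce((sum, t) => sum + (t.latency_ms || 0), 0);
  return {
    name: session.name,
    turns: turns.length,
    totalLatencyMs: totalMs,
    models: [...new Set(turns.map((t) => t.request.model))],
  };
}

// Example with an inline session object (normally you would load one
// with JSON.parse(fs.readFileSync('./recordings/my-session.json', 'utf8'))):
const session = {
  name: 'my-session',
  turns: [
    { index: 0, request: { model: 'gpt-4' }, latency_ms: 1234 },
    { index: 1, request: { model: 'gpt-4' }, latency_ms: 987 },
  ],
};

console.log(summarize(session));
// → { name: 'my-session', turns: 2, totalLatencyMs: 2221, models: [ 'gpt-4' ] }
```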
## License

MIT