# mcp-eval-runner
A standardized testing harness for MCP servers and agent workflows. Define test cases as YAML fixtures (steps → expected tool calls → expected outputs), run regression suites directly from your MCP client, and get pass/fail results with diffs — without leaving Claude Code or Cursor.
Tool reference | Configuration | Fixture format | Contributing | Troubleshooting | Design principles
- YAML fixtures: Test cases are plain files in version control — diffable, reviewable, and shareable.
- Two execution modes: Live mode spawns a real MCP server and calls tools via stdio; simulation mode runs assertions against `expected_output` without a server.
- Composable assertions: Combine `output_contains`, `output_not_contains`, `output_equals`, `output_matches`, `schema_match`, `tool_called`, and `latency_under` per step.
- Step output piping: Reference a previous step's output in downstream inputs via `{{steps.<step_id>.output}}`.
- Regression reports: Compare the current run to any past run and surface what changed.
- Watch mode: Automatically reruns the affected fixture when files change.
- CI-ready: Includes a GitHub Action for running evals on every config change.
- Node.js v22.5.0 or newer.
- npm.
Add the following config to your MCP client:
```json
{
  "mcpServers": {
    "eval-runner": {
      "command": "npx",
      "args": ["-y", "mcp-eval-runner@latest"]
    }
  }
}
```

By default, eval fixtures are loaded from `./evals/` in the current working directory. To use a different path:
```json
{
  "mcpServers": {
    "eval-runner": {
      "command": "npx",
      "args": ["-y", "mcp-eval-runner@latest", "--fixtures=~/my-project/evals"]
    }
  }
}
```

Compatible clients: Amp · Claude Code · Cline · Cursor · VS Code · Windsurf · Zed
Create a file at evals/smoke.yaml. Use live mode (recommended) by including a server block:
```yaml
name: smoke
description: "Verify eval runner itself is working"
server:
  command: node
  args: ["dist/index.js"]
steps:
  - id: list_check
    description: "List available test cases"
    tool: list_cases
    input: {}
    expect:
      output_contains: "smoke"
```

Then enter the following in your MCP client:
Run the eval suite.
Your client should return a pass/fail result for the smoke test.
Fixtures are YAML (or JSON) files placed in the fixtures directory. Each file defines one test case.
| Field | Required | Description |
|---|---|---|
| `name` | Yes | Unique name for the test case |
| `description` | No | Human-readable description |
| `server` | No | Server config — if present, runs in live mode; if absent, runs in simulation mode |
| `steps` | Yes | Array of steps to execute |
```yaml
server:
  command: node            # executable to spawn
  args: ["dist/index.js"]  # arguments
  env:                     # optional environment variables
    MY_VAR: "value"
```

When `server` is present, the eval runner spawns the server as a child process, connects via the MCP stdio transport, and calls each step's tool against the live server.
Each step has the following fields:
| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique identifier within the fixture (used for output piping) |
| `tool` | Yes | MCP tool name to call |
| `description` | No | Human-readable step description |
| `input` | No | Key-value map of arguments passed to the tool (default: `{}`) |
| `expected_output` | No | Literal string used as output in simulation mode |
| `expect` | No | Assertions evaluated against the step output |
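A single step exercising every field above might look like this (the `get_user` tool, its input, and the outputs are illustrative, not part of the package):

```yaml
steps:
  - id: fetch_user
    tool: get_user                  # hypothetical tool on the server under test
    description: "Fetch one user record"
    input:
      user_id: 42
    expected_output: '{"id": 42}'   # used only when running in simulation mode
    expect:
      output_contains: "42"
```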
Live mode — fixture has a server block:
- The server is spawned and each step calls the named tool via MCP stdio.
- Assertions run against the real tool response.
- Errors from the server cause the step (and by default the case) to fail immediately.
Simulation mode — no server block:
- No server is started.
- Each step's output is taken from `expected_output` (or the empty string if absent).
- Assertions run against that static output.
- Useful for authoring and CI dry-runs, but `output_contains` assertions will always fail if `expected_output` is not set.
All assertions go inside a step's expect block:
```yaml
expect:
  output_contains: "substring"      # output includes this text
  output_not_contains: "error"      # output must NOT include this text
  output_equals: "exact string"     # output exactly matches
  output_matches: "regex pattern"   # output matches a regular expression
  tool_called: "tool_name"          # verifies which tool was called
  latency_under: 500                # latency in ms must be below this threshold
  schema_match:                     # output (parsed as JSON) matches JSON Schema
    type: object
    required: [id]
    properties:
      id:
        type: number
```

Multiple assertions in one `expect` block are all evaluated; the step fails if any assertion fails.
Reference the output of a previous step in a downstream step's `input` using `{{steps.<step_id>.output}}`:
```yaml
steps:
  - id: search_step
    tool: search
    input:
      query: "mcp eval runner"
    expected_output: "result: mcp-eval-runner v1.0"
    expect:
      output_contains: "mcp-eval-runner"
  - id: summarize_step
    tool: summarize
    input:
      text: "{{steps.search_step.output}}"
    expected_output: "Summary: mcp-eval-runner v1.0"
    expect:
      output_contains: "Summary"
```

Piping works in both live mode and simulation mode.
Fixtures created with the create_test_case tool do not include a server block. They always run in simulation mode. To use live mode, add a server block manually to the generated YAML file.
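For example, a generated fixture can be switched to live mode by inserting a top-level `server` block; the command and args below are illustrative:

```yaml
name: generated-case
server:                    # added by hand; enables live mode
  command: node
  args: ["dist/index.js"]
steps:
  - id: step_1
    tool: list_cases
    input: {}
    expect:
      output_contains: "smoke"
```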
- `run_suite`: execute all fixtures in the fixtures directory; returns a pass/fail summary
- `run_case`: run a single fixture by name
- `list_cases`: enumerate available fixtures with step counts and descriptions
- `create_test_case`: create a new YAML fixture file (simulation mode; no `server` block)
- `scaffold_fixture`: generate a boilerplate fixture with placeholder steps and pre-filled assertion comments
- `regression_report`: compare the current fixture state to the last run; surfaces regressions and fixes
- `compare_results`: diff two specific runs by run ID
- `generate_html_report`: generate a single-file HTML report for a completed run
- `evaluate_deployment_gate`: CI gate; fails if the recent pass rate drops below a configurable threshold
- `discover_fixtures`: discover fixture files across one or more directories (respects `FIXTURE_LIBRARY_DIRS`)
Directory to load YAML/JSON eval fixture files from.
Type: string
Default: ./evals
Path to the SQLite database file used to store run history.
Type: string
Default: ~/.mcp/evals.db
Maximum time in milliseconds to wait for a single step before marking it as failed.
Type: number
Default: 30000
Watch the fixtures directory and rerun the affected fixture automatically when files change.
Type: boolean
Default: false
Output format for eval results.
Type: string
Choices: console, json, html
Default: console
Number of test cases to run in parallel.
Type: number
Default: 1
Start an HTTP server on this port instead of stdio transport.
Type: number
Default: disabled (uses stdio)
Pass flags via the args property in your JSON config:
```json
{
  "mcpServers": {
    "eval-runner": {
      "command": "npx",
      "args": ["-y", "mcp-eval-runner@latest", "--watch", "--timeout=60000"]
    }
  }
}
```

- No mocking: Live mode evals run against real servers. Correctness is non-negotiable.
- Fixtures are text: YAML/JSON in version control; no proprietary formats or databases.
- Dogfood-first: The eval runner's own smoke fixture tests the eval runner itself.
Before publishing a new version, verify the server with MCP Inspector to confirm all tools are exposed correctly and the protocol handshake succeeds.
Interactive UI (opens browser):
```shell
npm run build && npm run inspect
```

CLI mode (scripted / CI-friendly):
```shell
# List all tools
npx @modelcontextprotocol/inspector --cli node dist/index.js --method tools/list

# List resources and prompts
npx @modelcontextprotocol/inspector --cli node dist/index.js --method resources/list
npx @modelcontextprotocol/inspector --cli node dist/index.js --method prompts/list

# Call a tool (example — replace with a relevant read-only tool for this plugin)
npx @modelcontextprotocol/inspector --cli node dist/index.js \
  --method tools/call --tool-name list_cases

# Call a tool with arguments
npx @modelcontextprotocol/inspector --cli node dist/index.js \
  --method tools/call --tool-name run_case --tool-arg name=smoke
```

Run this before publishing to catch regressions in tool registration and runtime startup.
New assertion types go in `src/assertions.ts`: implement the `Assertion` interface and add a test. Tests live under `tests/` (unit tests) and under `evals/` (eval fixtures).
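The exact `Assertion` interface lives in `src/assertions.ts` and may differ from what follows; as a rough sketch under assumed shapes (the interface, the `min_array_length` name, and the check signature are all illustrative), a new assertion might look like:

```typescript
// Sketch only: the real Assertion interface in src/assertions.ts may differ.
interface Assertion {
  name: string;
  // Returns an error message on failure, or null on success.
  check(output: string, expected: unknown): string | null;
}

// Hypothetical new assertion: output, parsed as JSON, must be an array
// with at least `expected` elements.
const minArrayLength: Assertion = {
  name: "min_array_length",
  check(output, expected) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(output);
    } catch {
      return "output is not valid JSON";
    }
    if (!Array.isArray(parsed)) return "output is not a JSON array";
    const min = Number(expected);
    return parsed.length >= min
      ? null
      : `expected at least ${min} elements, got ${parsed.length}`;
  },
};

console.log(minArrayLength.check("[1, 2, 3]", 2)); // → null (passes)
console.log(minArrayLength.check("[1]", 2));       // → failure message
```

Returning `null` on success and a message on failure keeps the runner's diff output readable; whatever the real interface is, each assertion should report why it failed, not just that it failed.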
```shell
npm install && npm test
```

This plugin is available on:
Search for mcp-eval-runner.