Standalone MCP server for agent evaluation using Strands Evals SDK. pip install and add to your mcp.json.

Evals Server

Standalone MCP server for agent evaluation, powered by the Strands Evals SDK.

Install

git clone https://github.com/bannff/evals-server.git
cd evals-server
pip install -e .

Setup

  1. Configure AWS credentials for Bedrock (default provider):
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
# or: aws configure
  2. Enable model access in the Bedrock console for Claude Sonnet (or your preferred model).
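
Before starting the server, it can help to confirm that credentials are actually discoverable. The following is a minimal sketch using only the standard library. `find_aws_credentials` is a hypothetical helper, not part of evals-server, and it only checks the two sources mentioned above (the environment variables and the file that `aws configure` writes), not the full AWS credential chain.

```python
import os
from pathlib import Path


def find_aws_credentials(env=None, credentials_file=None):
    """Return "env", "file", or None depending on where credentials were found.

    Hypothetical helper for illustration; it does not validate the keys.
    """
    env = os.environ if env is None else env
    if credentials_file is None:
        # Default location written by `aws configure`.
        credentials_file = Path.home() / ".aws" / "credentials"

    # Both variables must be set and non-empty to count as configured.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "env"
    if Path(credentials_file).is_file():
        return "file"
    return None


if __name__ == "__main__":
    source = find_aws_credentials()
    print(f"AWS credentials source: {source or 'not found'}")
```

A result of `None` only means neither of these two sources is present; credentials supplied another way (for example an instance role) would not be detected by this check.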

Usage

Add to your IDE's mcp.json:

{
  "mcpServers": {
    "evals": {
      "command": "evals-server",
      "args": []
    }
  }
}
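
If an mcp.json already exists with other servers in it, the entry above has to be merged rather than pasted over the file. A minimal sketch of that merge, assuming the file uses the `mcpServers` layout shown above; `add_evals_server` is a hypothetical helper, and the location of mcp.json depends on your IDE, so the path is left as a parameter.

```python
import json
from pathlib import Path


def add_evals_server(config_path):
    """Add or overwrite the "evals" entry in an mcp.json file, keeping other servers."""
    path = Path(config_path)
    # Start from the existing config if there is one, otherwise from an empty object.
    config = json.loads(path.read_text()) if path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers["evals"] = {"command": "evals-server", "args": []}
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling `add_evals_server("path/to/mcp.json")` leaves any other entries under `mcpServers` untouched and replaces only the `evals` entry.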

Available MCP Tools

| Tool | Description |
| --- | --- |
| evals_create_suite | Create an evaluation suite with test cases |
| evals_add_case | Add a test case to a suite |
| evals_list_suites | List all suites |
| evals_get_suite | Get suite details |
| evals_list_evaluators | List available LLM-as-a-judge (LLMAJ) evaluators |
| evals_run_experiment | Run an experiment with evaluators against an agent |
| evals_generate_experiment | Auto-generate test cases from context |
| evals_list_runs | List evaluation runs |
| evals_get_run | Get run details |

Quick Example

Once connected via MCP, an agent can call the tools like this (shown in Python-style call syntax):

# 1. List available evaluators
evals_list_evaluators()

# 2. Run an experiment
evals_run_experiment(
    cases=[
        {"name": "math", "input": {"query": "What is 2+2?"}, "expected_output": {"output": "4"}},
        {"name": "capital", "input": {"query": "Capital of France?"}, "expected_output": {"output": "Paris"}}
    ],
    evaluator_names=["output", "helpfulness"],
    model_id="us.anthropic.claude-sonnet-4-20250514",
    system_prompt="You are a helpful assistant."
)

# 3. Or auto-generate test cases
evals_generate_experiment(
    context="Agent with calculator and search tools",
    task_description="Math and research assistant",
    num_cases=10
)
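
The calls above are written in function-call shorthand. Over the wire, an MCP client sends each one as a JSON-RPC 2.0 `tools/call` request, and your IDE's client normally builds this for you after its initialization handshake. A minimal sketch of that payload, for readers debugging the connection: `build_tool_call` is a hypothetical helper, and the argument names are copied from the example above rather than from a verified tool schema.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message (plain dict)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Same arguments as step 3 of the example above.
request = build_tool_call(
    1,
    "evals_generate_experiment",
    {
        "context": "Agent with calculator and search tools",
        "task_description": "Math and research assistant",
        "num_cases": 10,
    },
)

if __name__ == "__main__":
    print(json.dumps(request, indent=2))
```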
