aMaze-Test

Behavioral testing for AI agents.

Define how your agent is allowed to behave, run it unchanged, and get a policy-aware audit report with execution traces, mocks, assertions, and pass/fail results.

⚠️ Current scope:

Supports LangChain / LangGraph agents

Each run evaluates a single isolated agent interaction (one run = one test case)

⚡ 30-second example

Run your existing agent under a policy:

export AGENT_PROMPT="Search PDFs for data governance frameworks"

PYTHONPATH=src python -m amaze.amaze_runner \
  examples/agents/one_conversation_agent.py \
  examples/policies/policy.json

That’s it.

No changes to your agent code
No special test harness
No custom wrappers

📊 What you get

And more ...

Full execution trace (agent → LLM → tools)
Policy validation (allowed tools, call limits, execution graph)
Token usage and counters
Mocked vs real calls clearly marked
Assertion failures with exact context

🧠 Why not just tracing or evals?

Most tools show what happened. aMaze-Test verifies:
Did the agent behave the way it was supposed to?
Prevent tool misuse
Lock execution paths
Enforce token budgets
Test deterministic scenarios with mocks
Catch regressions automatically

🧩 Core concepts

Control-plane policy

Define limits and boundaries:
allowed tools
max LLM/tool calls
token budgets

Graph policy

Define exact execution flow:
agent → llm → tool → finish

Mocks & assertions

mock LLM or tool responses
assert inputs/outputs
turn agent runs into deterministic tests

Overview

aMazeTest solves a core problem in LLM agent development: you can't unit-test agent behavior the same way you test regular code. LLM agents are non-deterministic, use external APIs, and their call sequences can vary per run.

aMazeTest provides:

Capability	Description
Policy enforcement	Declare which tools are allowed, how many LLM calls are permitted, token budgets
Graph validation	Verify the exact sequence of agent → LLM → tool → finish
Mocks	Replace LLM responses and tool outputs with deterministic values
Assertions	Check that LLM inputs/outputs and tool inputs/outputs match expected values
Audit reports	JSON + HTML reports of every call, with inputs, outputs, and token usage
GUI	Web interface for managing agents, policies, test cases, and running suites
Zero code changes	Standard LangChain agents run unmodified

⚠️ Current limitations:

Supports LangChain / LangGraph agents only

Single-conversation execution (one run = one test case)

How It Works

Your agent code
      │
      ▼
amaze_runner.py ──── loads policy.json
      │
      ├── mode=langchain  → monkey-patches BaseChatModel + BaseTool
      │                     (zero changes to agent code)
      │
      └── mode=annotations → reads @amaze_tool / @amaze_llm / @amaze_agent
                             (decorators on your functions)
      │
      ▼
RuntimeState ─── intercepts every LLM call and tool call
      │
      ├── applies mocks (replaces real calls with deterministic outputs)
      ├── runs assertions (checks inputs/outputs against expected values)
      ├── enforces limits (max LLM calls, max tool calls, token budgets)
      └── validates graph (exact call sequence)
      │
      ▼
audit_logs/ ─── JSON + HTML report per run

The runner is launched as:

export AGENT_PROMPT='read file noname.txt and if it contains word "bitcoin" search for bitcoin price today'
PYTHONPATH=src python -m amaze.amaze_runner examples/agents/my_agent.py examples/policies/my_policy.json

Exit code `0` = test passed. Exit code `1` = policy violated, assertion failed, or script error. The HTML report is written to the /aMazeTest/reports/ directory.
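
Because the result is carried by the exit code, the runner is easy to drive from a script. The following is a minimal sketch, assuming the runner is invoked exactly as shown above (with `PYTHONPATH=src` and `AGENT_PROMPT` set). The `run_case` helper and its `runner_cmd` parameter are hypothetical names for illustration, not part of aMazeTest:

```python
import os
import subprocess
import sys

# Default command mirrors the CLI invocation shown above.
DEFAULT_RUNNER = [sys.executable, "-m", "amaze.amaze_runner"]


def run_case(agent_path, policy_path, prompt, runner_cmd=None):
    """Run one test case and return True when the runner exits with code 0.

    `run_case` and `runner_cmd` are illustrative names, not aMazeTest APIs.
    """
    # Pass the prompt and module path through the environment, as the CLI does.
    env = dict(os.environ, AGENT_PROMPT=prompt, PYTHONPATH="src")
    cmd = list(runner_cmd or DEFAULT_RUNNER) + [agent_path, policy_path]
    completed = subprocess.run(cmd, env=env)
    # 0 = passed; 1 = policy violated, assertion failed, or script error.
    return completed.returncode == 0
```

Called from the project root, `run_case("examples/agents/my_agent.py", "examples/policies/my_policy.json", "your prompt here")` would return `True` for a passing run under these assumptions.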

Quick Start

1. Install dependencies

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

2. Configure environment

Copy the provided example file and fill in your values:

cp .env.example .env

.env is git-ignored and must never be committed. .env.example is committed and documents every variable:

Variable	Required	Description
`OPENAI_API_KEY`	Yes	OpenAI API key for LLM calls
`TAVILY_API_KEY`	No	Tavily key for the `web_search` tool
`LANGSMITH_API_KEY`	No	LangSmith tracing
`LANGSMITH_PROJECT`	No	LangSmith project name
`LANGSMITH_TRACING`	No	Set to `true` to enable LangSmith tracing
`DATA_DIR`	No	Directory containing PDFs for the `pdf_search` tool
`CHROMA_DIR`	No	Persistent Chroma vector store directory

3. Write a policy

{
  "mode": "control_plane",
  "allowed_tools": ["pdf_search"],
  "max_llm_calls": 2,
  "max_tool_calls": 1
}
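
If you prefer to generate policy files rather than write them by hand, something like the sketch below may help. It only uses the four fields shown above; the `write_policy` helper is a hypothetical name, not something aMazeTest provides:

```python
import json
from pathlib import Path


def write_policy(path, allowed_tools, max_llm_calls, max_tool_calls):
    """Write a control_plane policy file using the fields shown above.

    Illustrative helper; aMazeTest itself only needs the resulting JSON file.
    """
    policy = {
        "mode": "control_plane",
        "allowed_tools": list(allowed_tools),
        "max_llm_calls": max_llm_calls,
        "max_tool_calls": max_tool_calls,
    }
    Path(path).write_text(json.dumps(policy, indent=2) + "\n", encoding="utf-8")
    return policy
```

For example, `write_policy("examples/policies/policy.json", ["pdf_search"], 2, 1)` would reproduce the policy above.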

4. Run a test

export AGENT_PROMPT="Search PDFs for data governance frameworks"
PYTHONPATH=src python -m amaze.amaze_runner examples/agents/one_conversation_agent.py examples/policies/policy.json

Output:

[aMaze] runner started
[aMaze] mode=langchain
...
============================================================
aMazeTest Run Report  [trace: 4c6a6a9d]
============================================================
LLM calls (direct):   1
LLM calls (indirect): 1
Tool calls:           1
  pdf_search: 1
Total tokens:         391
Per-turn breakdown (1 turn(s)):
  Turn 1: llm=1 tool=1 (pdf_search:1) tokens=391 seq=[agent, llm, tool:pdf_search, finish]

Result: PASSED
============================================================

Policy Reference

Policies are JSON files stored in the examples/policies/ directory. The runner loads them at test time.

Control Plane Mode

Validates limits and allowed tools without enforcing call order.

{
  "mode": "control_plane",
  "allowed_tools": ["pdf_search", "web_search"],
  "max_llm_calls": 3,
  "max_tool_calls": 4,
  "max_tool_calls_per_tool": {
    "web_search": 1,
    "pdf_search": 2
  },
  "max_tokens": 8000,
  "audit_file": "agent_audit.json"
}

Field	Type	Description
`allowed_tools`	`string[]`	Tool names the agent may call. Any other tool triggers a policy violation.
`max_llm_calls`	`int`	Maximum direct LLM calls per turn. Indirect calls (post-tool) are not counted.
`max_tool_calls`	`int`	Maximum total tool calls across all tools per turn.
`max_tool_calls_per_tool`	`object`	Per-tool call limits. Key is tool name, value is max calls.
`max_tokens`	`int`	Maximum total tokens (input + output) per conversation.
`audit_file`	`string`	Base filename for the audit JSON (written to `audit_logs/`).
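
To make the semantics of these fields concrete, here is a rough sketch of how the limits could be checked against the counters from one turn. It is an illustration only, not aMazeTest's actual implementation; the function name, its arguments, and the violation messages are assumptions:

```python
from collections import Counter


def control_plane_violations(policy, llm_calls, tool_calls, total_tokens):
    """Return a list of human-readable violations for one turn.

    `llm_calls` is the number of direct LLM calls, `tool_calls` is a list of
    tool names in call order. Illustrative only, not the aMazeTest internals.
    """
    violations = []
    allowed = policy.get("allowed_tools")
    per_tool = Counter(tool_calls)

    if allowed is not None:
        for name in per_tool:
            if name not in allowed:
                violations.append(f"tool not allowed: {name}")
    if "max_llm_calls" in policy and llm_calls > policy["max_llm_calls"]:
        violations.append("max_llm_calls exceeded")
    if "max_tool_calls" in policy and len(tool_calls) > policy["max_tool_calls"]:
        violations.append("max_tool_calls exceeded")
    for name, limit in policy.get("max_tool_calls_per_tool", {}).items():
        if per_tool[name] > limit:
            violations.append(f"max_tool_calls_per_tool exceeded: {name}")
    if "max_tokens" in policy and total_tokens > policy["max_tokens"]:
        violations.append("max_tokens exceeded")
    return violations
```

An empty list corresponds to a run that stays inside the policy; any entry corresponds to a policy violation (exit code `1`).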

Graph Mode

Validates the exact sequence of nodes (agent → llm → tool → finish). Any deviation is a failure.

{
  "mode": "graph",
  "nodes": ["agent", "llm", "tool:pdf_search", "finish"],
  "edges": [
    ["agent", "llm"],
    ["llm", "tool:pdf_search"],
    ["tool:pdf_search", "finish"]
  ],
  "ignore_internal_llm": true,
  "max_tokens": 5000,
  "audit_file": "agent_audit.json"
}

Field	Type	Description
`nodes`	`string[]`	All valid nodes. Must include `"agent"` as first node and `"finish"` as terminal. Tool nodes are prefixed: `"tool:pdf_search"`.
`edges`	`[string, string][]`	Allowed transitions. Each edge is `[from, to]`.
`ignore_internal_llm`	`bool`	When `true` (recommended), indirect LLM calls (post-tool responses) are not validated against the graph. Default: `true`.
`max_tokens`	`int`	Maximum total tokens per conversation.
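
One way to read a graph policy is as a set of allowed transitions that an observed call sequence must follow. The sketch below illustrates that idea under stated assumptions: the `validate_sequence` function and its error strings are hypothetical, and aMazeTest's own validation may differ in detail:

```python
def validate_sequence(policy, sequence):
    """Return None if `sequence` follows the policy graph, else an error string.

    Illustrative only; `sequence` is a list of node names such as
    ["agent", "llm", "tool:pdf_search", "finish"].
    """
    nodes = set(policy["nodes"])
    edges = {tuple(edge) for edge in policy["edges"]}

    if not sequence or sequence[0] != "agent":
        return "sequence must start at 'agent'"
    if sequence[-1] != "finish":
        return "sequence must end at 'finish'"
    for node in sequence:
        if node not in nodes:
            return f"unknown node: {node}"
    # Every consecutive pair must be a declared edge.
    for src, dst in zip(sequence, sequence[1:]):
        if (src, dst) not in edges:
            return f"transition not allowed: {src} -> {dst}"
    return None
```

With the policy above, the sequence `agent, llm, tool:pdf_search, finish` is accepted, while skipping the tool (`agent, llm, finish`) is rejected because `["llm", "finish"]` is not a declared edge.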

Node naming convention:

"agent" — turn start (always first)
"llm" — any direct LLM call
"tool:name" — tool call, e.g. "tool:pdf_search", "tool:web_search"
"finish" — turn end (terminal node)

Mocks

Mocks replace real LLM/tool calls with deterministic outputs. This makes tests fast, free, and reproducible.

"mocks": [
  {
    "target": "llm",
    "match_contains": "pdf",
    "return_tool_call": {"tool": "pdf_search", "args": {"query": "data governance"}}
  },
  {
    "target": "tool:pdf_search",
    "output": "Data governance frameworks define policies. Source: framework.pdf, page 3."
  },
  {
    "target": "llm",
    "match_contains": "summarize",
    "return_ai_message": "Here is the summary..."
  }
]

Field	Type	Description
`target`	`string`	`"llm"` or `"tool:<name>"`
`match_contains`	`string`	Optional. Mock only applies when the call input contains this substring. Omit to match all calls.
`output`	`string`	For tool mocks: the string returned instead of calling the real tool.
`return_tool_call`	`object`	For LLM mocks: respond as if the LLM decided to call a tool. `{"tool": "...", "args": {...}}`
`return_ai_message`	`string`	For LLM mocks: respond with a plain text message (no tool call).

LLM mock behavior:

return_tool_call — the agent continues its ReAct loop and calls the named tool
return_ai_message — the agent produces a final answer and stops
Indirect LLM calls (after a tool executes) are never mocked — they always hit the real LLM
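
The matching rule described above (target plus optional substring) can be sketched as follows. This is a simplified illustration, not the library's code: the `find_mock` name is hypothetical, and the choices of first-match-wins ordering and case-sensitive substring matching are assumptions:

```python
def find_mock(mocks, target, call_input):
    """Return the first mock whose target and optional substring match.

    Illustrative only. Assumes mocks are tried in declaration order and
    that `match_contains` is a case-sensitive substring test.
    """
    for mock in mocks:
        if mock["target"] != target:
            continue
        needle = mock.get("match_contains")
        # A mock without `match_contains` matches every call to its target.
        if needle is None or needle in call_input:
            return mock
    return None
```

When no mock matches, the call would go through to the real LLM or tool.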

Assertions

Assertions check inputs and outputs at runtime. A failed assertion marks the run as FAILED.

"assertions": [
  {
    "target": "tool:pdf_search",
    "check": "input",
    "operator": "equals",
    "expected": "data governance frameworks",
    "description": "PDF search query must exactly match expected topic"
  },
  {
    "target": "tool:pdf_search",
    "check": "output",
    "operator": "contains",
    "expected": "page",
    "description": "Result must include a page reference"
  },
  {
    "target": "tool:pdf_search",
    "check": "output",
    "operator": "matches_regex",
    "expected": "Source: .+\\.pdf",
    "description": "Result must include a source filename"
  },
  {
    "target": "llm",
    "check": "input",
    "operator": "starts_with",
    "expected": "You are a helpful",
    "description": "System prompt must start correctly"
  }
]

Field	Type	Description
`target`	`string`	`"llm"` or `"tool:<name>"`
`check`	`"input"` \| `"output"`	Whether to check the call's input or output
`operator`	`string`	One of: `equals`, `contains`, `starts_with`, `matches_regex`
`expected`	`string`	The value to compare against
`description`	`string`	Optional human-readable label in the report
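
The four operators map naturally onto plain string checks. The sketch below shows one possible reading; it is not aMazeTest's implementation. In particular, treating `matches_regex` as a search anywhere in the text (rather than a full match) is an assumption, and the `evaluate` helper is a hypothetical name:

```python
import re

OPERATORS = {
    "equals": lambda actual, expected: actual == expected,
    "contains": lambda actual, expected: expected in actual,
    "starts_with": lambda actual, expected: actual.startswith(expected),
    # Assumption: the pattern may match anywhere in the text.
    "matches_regex": lambda actual, expected: re.search(expected, actual) is not None,
}


def evaluate(assertion, actual):
    """Return True when `actual` satisfies the assertion's operator."""
    return OPERATORS[assertion["operator"]](actual, assertion["expected"])
```

Under this reading, the mocked `pdf_search` output from the Mocks section satisfies both the `contains` and `matches_regex` assertions above.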

Running Tests

Single test via CLI

export AGENT_PROMPT="your prompt here"
PYTHONPATH=src python -m amaze.amaze_runner examples/agents/my_agent.py examples/policies/my_policy.json

System test scripts

The tests/system/ directory contains shell scripts for each scenario. Each script sets AGENT_PROMPT and runs the runner, then checks the exit code.

Before running, point the scripts at your Python interpreter. Open tests/system/run_all_tests.sh (and the autogen_system_test/ and crewai_system_test/ variants if needed) and update the PYTHON variable so it points to the interpreter inside your virtual environment:

# In tests/system/run_all_tests.sh — update this line:
PYTHON="${PYTHON:-/home/ubuntu/venv/bin/python}"

You can also override it inline without editing the file:

PYTHON=/path/to/your/venv/bin/python bash tests/system/run_all_tests.sh

Run the tests:

# Run all system tests
bash tests/system/run_all_tests.sh

# Run a single test
bash tests/system/test_01_policy_cp_pass.sh
bash tests/system/test_01_policy_cp_fail.sh   # expects exit code 1

Test scripts are named test_NN_<policy>_<pass|fail>.sh:

_pass.sh — the agent should satisfy the policy (exit 0)
_fail.sh — the agent should violate the policy (exit 1 expected, test passes if exit 1)

Unit tests

/path/to/venv/bin/pytest tests/unit/test_framework.py -v

Audit reports

After each run, two files are written to audit_logs/:

<agent>_audit_<timestamp>.json — machine-readable full trace
<agent>_audit_<timestamp>.html — human-readable report with call timeline, token counts, assertion results
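
For post-processing in CI, the most recent JSON report can be picked up with a few lines of Python. This is a sketch that assumes the filename pattern listed above; it does not assume anything about the JSON schema, which is not documented here. The `latest_audit` helper is a hypothetical name:

```python
import json
from pathlib import Path


def latest_audit(audit_dir="audit_logs"):
    """Return (path, parsed JSON) for the most recently modified audit file.

    Returns None when the directory holds no matching file.
    """
    candidates = sorted(
        Path(audit_dir).glob("*_audit_*.json"),
        key=lambda p: p.stat().st_mtime,
    )
    if not candidates:
        return None
    newest = candidates[-1]
    return newest, json.loads(newest.read_text(encoding="utf-8"))
```

Inspecting the parsed object (for example with `print(json.dumps(data, indent=2))`) is a reasonable first step before relying on any particular field.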

Annotation Mode

Standard usage requires zero changes to your agent code (monkey-patch mode). For agents where you control the LLM call explicitly (e.g. a custom ReAct loop), you can use annotation decorators instead:

from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

from amaze.annotations import amaze_tool, amaze_llm, amaze_agent

llm = ChatOpenAI(model="gpt-4.1-mini").bind_tools([...])

@amaze_tool("web_search", description="Search the web for recent information.")
def web_search(query: str) -> str:
    return tavily.search(query)

@amaze_llm("gpt-4.1-mini")
def call_llm(messages: list):
    return llm.invoke(messages)   # explicit call — interceptable

@amaze_agent
def run_turn(prompt: str) -> str:
    messages = [HumanMessage(content=prompt)]
    while True:
        response = call_llm(messages)
        if not response.tool_calls:
            return response.content
        # dispatch tool calls...

The runner auto-detects annotation imports and activates annotation mode. Monkey-patching is skipped.

When to use annotations vs. monkey-patching:

	Monkey-patch (default)	Annotations
Agent code changes	None required	Add decorators
LangChain `create_react_agent`	Full support	Not needed
Custom ReAct loop	Full support	Recommended
LLM mock interception	Yes	Yes
AutoGen / CrewAI	Not supported	Tool calls only (LLM cannot be intercepted)

See examples/agents/langchain_annotated_agent.py for a complete example.

GUI

A web interface for managing the full testing workflow.

Start the server

Must be run from the project root — the server resolves paths relative to the working directory.

cd /path/to/aMazeTest
/path/to/venv/bin/uvicorn gui.server:app --reload --port 8080 --host 0.0.0.0

Update /path/to/venv to your virtual environment (e.g. /home/ubuntu/venv).

Open http://<your-server-ip>:8080 in your browser.

Features

Agents — register agent scripts (name + file path)
Policies — create and edit policies with a JSON editor; auto-syncs from the examples/policies/ directory
Test Cases — define test cases (agent + policy + prompt + expected outcome)
Suites — group test cases into suites for batch execution
Runs — execute single tests or full suites with live SSE log streaming
Audit Reports — view HTML reports directly from the run results page
MCP Servers — register MCP tool servers and discover their tools

REST API

All GUI data is accessible via REST. Base URL: http://localhost:8080

Agents

Method	Path	Body	Description
`GET`	`/api/agents`	—	List all agents
`POST`	`/api/agents`	`{name, file_path, description}`	Create agent
`PUT`	`/api/agents/{name}`	`{name, file_path, description}`	Update agent
`DELETE`	`/api/agents/{name}`	—	Delete agent

Policies

Method	Path	Body	Description
`GET`	`/api/policies`	—	List all policies (auto-imports from `examples/policies/` dir)
`GET`	`/api/policies/{name}`	—	Get single policy
`POST`	`/api/policies`	`{name, description, policy_json}`	Create policy + write `.json` file
`PUT`	`/api/policies/{name}`	`{name, description, policy_json}`	Update policy + write `.json` file
`DELETE`	`/api/policies/{name}`	—	Delete policy + remove `.json` file

Test Cases

Method	Path	Body	Description
`GET`	`/api/test-cases`	—	List all test cases
`GET`	`/api/test-cases/{name}`	—	Get single test case
`POST`	`/api/test-cases`	`{name, agent_name, policy_name, prompt, expected_pass}`	Create test case
`PUT`	`/api/test-cases/{name}`	same	Update test case
`DELETE`	`/api/test-cases/{name}`	—	Delete test case

Suites

Method	Path	Body	Description
`GET`	`/api/suites`	—	List all suites
`GET`	`/api/suites/{name}`	—	Get suite with nested test cases
`POST`	`/api/suites`	`{name, description, test_case_names[]}`	Create suite
`PUT`	`/api/suites/{name}`	same	Update suite
`DELETE`	`/api/suites/{name}`	—	Delete suite

Runs

Method	Path	Body	Description
`POST`	`/api/runs/test`	`{test_case_name}`	Start a single test run; returns `run_id`
`GET`	`/api/runs/test/{run_id}`	—	Get run record and outcome
`GET`	`/api/runs/test/{run_id}/stream`	—	SSE stream of live log lines + final outcome
`POST`	`/api/runs/suite`	`{suite_name}`	Start a suite run; returns `suite_run_id`
`GET`	`/api/runs/suite/{suite_run_id}`	—	Get suite run with all child test runs
`GET`	`/api/runs/suite/{suite_run_id}/stream`	—	SSE stream with per-test events + summary
`GET`	`/api/runs/suite-history/{suite_name}`	—	Last 20 suite runs for a suite
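
The run endpoints can be called from any HTTP client. Below is a minimal standard-library sketch, assuming the server is running at the base URL above. The helper names are hypothetical, and the response field `run_id` is assumed to be a top-level JSON key, based on the table's "returns `run_id`" note:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # base URL stated in this section


def build_request(method, path, body=None):
    """Build (but do not send) a JSON request for the GUI's REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def send(request):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def start_test_run(test_case_name):
    """Start a single test run, then fetch its record.

    Assumes the response JSON carries the id under the key "run_id".
    """
    started = send(
        build_request("POST", "/api/runs/test", {"test_case_name": test_case_name})
    )
    return send(build_request("GET", f"/api/runs/test/{started['run_id']}"))
```

The run may still be in progress when the record is fetched; the `/stream` endpoints are the documented way to follow a run to completion.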

MCP Servers

Method	Path	Body	Description
`GET`	`/api/mcp-servers`	—	List all MCP servers
`POST`	`/api/mcp-servers`	`{name, url, transport, notes}`	Register MCP server
`PUT`	`/api/mcp-servers/{name}`	same	Update MCP server
`DELETE`	`/api/mcp-servers/{name}`	—	Delete MCP server
`POST`	`/api/mcp-servers/{name}/fetch-tools`	—	Connect and discover available tools

Audit Reports

Method	Path	Description
`GET`	`/audit/{filename}.html`	Serve an HTML audit report from `audit_logs/`

MCP Server

aMazeTest ships with a FastMCP server exposing the same tools as the example agents:

cd examples/mcp_server
uvicorn server:app --port 8000

Available tools: pdf_search, web_search, dummy_email, file_read, multiply.

Register it in the GUI under MCP Servers with URL http://127.0.0.1:8000/mcp and transport streamable_http.

Project Structure

aMazeTest/
├── src/
│   └── amaze/
│       ├── amaze_runner.py      # Entry point — runs agent with policy enforcement
│       ├── instrumentation.py   # Monkey-patches LangChain classes (LLM + tool hooks)
│       ├── annotations.py       # Decorator-based instrumentation (@amaze_tool, @amaze_llm, @amaze_agent)
│       ├── policy.py            # Policy dataclasses (GraphPolicy, ControlPlanePolicy, mocks, assertions)
│       ├── state.py             # RuntimeState — tracks calls, validates policy, records audit data
│       └── reporting.py         # Generates JSON + HTML audit reports
│
├── examples/
│   ├── agents/
│   │   ├── one_conversation_agent.py       # Standard LangChain agent (monkey-patch mode)
│   │   ├── langchain_annotated_agent.py    # LangChain agent with explicit annotations + manual ReAct loop
│   │   ├── autogen_annotated_agent.py      # AutoGen agent (tool interception only — LLM not supported)
│   │   └── crewai_annotated_agent.py       # CrewAI agent (tool interception only — LLM not supported)
│   ├── policies/
│   │   ├── policy.json                # control_plane: pdf_search only
│   │   ├── policy_graph.json          # graph: agent→llm→pdf_search→finish
│   │   ├── policy_assert_output.json  # graph + assertions on tool output
│   │   ├── policy_assert_input.json   # control_plane + assertions on tool input
│   │   ├── policy_cp_graph.json       # control_plane: web_search + dummy_email
│   │   ├── policy_token_graph.json    # graph with token limit
│   │   └── policy_token_strict.json   # control_plane with strict token limit
│   └── mcp_server/
│       ├── server.py            # FastMCP server definition
│       └── tools/               # Tool implementations
│
├── tests/
│   ├── unit/
│   │   └── test_framework.py    # Pytest unit test suite
│   └── system/
│       ├── run_all_tests.sh     # ← update PYTHON path here for your environment
│       ├── test_01_policy_cp_pass.sh   # control_plane — should pass
│       ├── test_01_policy_cp_fail.sh   # control_plane — should fail (disallowed tool)
│       ├── test_02_policy_graph_*.sh   # graph validation
│       ├── test_03_policy_multiply_*.sh
│       ├── test_04_policy_assert_input_*.sh
│       ├── test_05_policy_assert_output_*.sh
│       ├── test_06_policy_cp_graph_*.sh
│       ├── test_07_policy_token_graph_*.sh
│       ├── test_08_policy_token_strict_*.sh
│       ├── autogen_system_test/        # Same tests for AutoGen agent
│       └── crewai_system_test/         # Same tests for CrewAI agent
│
├── gui/
│   ├── server.py            # FastAPI app
│   ├── database.py          # SQLite setup
│   ├── models.py            # Pydantic request models
│   ├── runner.py            # Async subprocess runner for GUI-triggered tests
│   ├── static/              # SPA frontend (index.html)
│   └── routers/
│       ├── agents.py
│       ├── policies.py
│       ├── test_cases.py
│       ├── suites.py
│       ├── runs.py
│       └── mcp_servers.py
│
├── audit_logs/              # Generated JSON + HTML reports (git-ignored)
├── .env                     # API keys and environment config (git-ignored)
├── requirements.txt         # Pinned Python dependencies
└── CLAUDE.md                # Developer notes for Claude Code

Supported Frameworks

Framework	LLM interception	Tool interception	Mocks	Notes
LangChain (monkey-patch)	Full	Full	Full	Default mode, zero agent code changes
LangChain (annotations)	Full	Full	Full	Use `@amaze_llm` when LLM call is explicit
LangGraph	Full	Full	Full	Patched via `Pregel.invoke/ainvoke`
AutoGen	Not supported	Partial (`@amaze_tool`)	Tools only	LLM called internally via OpenAI SDK
CrewAI	Not supported	Partial (`@amaze_tool`)	Tools only	LLM called internally via LiteLLM

AutoGen and CrewAI limitations are by design: those frameworks call the LLM internally, bypassing BaseChatModel. Full support would require framework-specific adapters (AutoGen register_reply hook, CrewAI custom LLM class).
