What this is: Automated tests for an AI agent that can call tools (for example: “read this file,” “call this API”). The tests check what the agent decided to do (which tools it picked, in what order, and what it said at the end)—not just the final text.
For QA: Think of it like testing business rules on decisions, not only checking a screen or a single return value. Failures attach traces (a step-by-step log) so you can see prompts, tool calls, and outputs in the Playwright HTML report.
This repo is a sample / reference project ("private": true in package.json—not published as a product). You can copy patterns from it into your own codebase.
| Folder / file | In plain terms |
|---|---|
| `framework/` | Checks you reuse: “Was tool X used?” “Does the answer follow our rules?” Helpers include `AgentAssert`, `HeuristicContractMatcher`, and shared types for traces and contracts. |
| `index.ts` | Single entry point for importing the framework from this repo when you copy it in. |
| `examples/agent/` | Demo agent (the app under test): chat loop, tool list, sample tools (file-reader, api-caller). Swap this for your real agent; keep the same trace shape if you keep these tests. |
| `tests/` | Playwright test files plus `tests/fixtures/setup.ts`, which builds the demo agent for each scenario. |
What gets tested: Five patterns that cover tool choice, output rules, order of steps, staying inside allowed tools, and behavior when something fails.
Tool list: Tools are registered in a ToolRegistry (a named list the model sees and that your code runs when the model asks). There is no separate MCP server in the default demo—tools run inside the test process so runs are fast and stable.
“Behavior contracts”: Rules like “must include these fields,” “should mention these words,” “must not match these patterns.” The matcher is rule-based (keywords and patterns)—not a second AI judging the first, and not “true” semantic understanding. The confidence score is a rough pass/fail helper for tuning, not a scientific accuracy percentage. More detail below under “Heuristic matching: scope and limits.”
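For a concrete feel, here is a minimal sketch of what such a contract could look like. The property names are illustrative stand-ins, not copied from BehaviorContract.ts:

```ts
// Illustrative contract shape — property names here are stand-ins, not the repo's exact API.
const REFUND_SUMMARY_CONTRACT = {
  requiredFields: ['summary', 'taskType'],      // structure layer: fields the output must contain
  keywords: ['refund', 'order', 'processed'],   // keyword layer: substring checks after lowercasing
  minKeywordMatchRatio: 0.5,                    // at least half of the keywords must appear
  forbiddenPatterns: [/i cannot access/i],      // forbidden layer: any regex hit fails the contract
};
```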
Why Playwright: Same tool as browser E2E tests, but here it drives API + agent flows. You get retries, timeouts per test, and HTML reports with attachments (good for flaky LLM output).
┌───────────────────────────────────────────────────────────────────────────────────┐
│ YOUR TEST FILE │
│ import { AgentAssert } from 'agent-assert' // or ../../framework/AgentAssert.js │
│ const trace = await agent.run("some prompt") │
│ AgentAssert.toolWasInvoked(trace, 'file-reader') │
│ AgentAssert.satisfiesContract(trace.output, CONTRACT) │
└─────────────────┬──────────────────────────┬──────────────────────────────────────┘
│ │
┌───────────▼─────────┐ ┌───────────▼─────────────┐
│ Agent (app under │ │ AgentAssert │
│ test) │ │ (checks / asserts) │
│ 1. Send prompt to │ │ │
│ LLM (Anthropic / │ │ - toolWasInvoked() │
│ OpenAI / Ollama)│ │ - satisfiesContract() │
│ 2. Get tool calls │ │ - boundaryNotViolated() │
│ from model │ │ │
│ 3. Execute tool │ │ - traceFollowsSequence()│
│ 4. Send result back │ │ │
│ 5. Build TRACE │ └───────────┬─────────────┘
└────────┬────────────┘ │
│ ┌────────────▼──────────────┐
┌──────────▼───────────┐ │ HeuristicContractMatcher │
│ ToolRegistry │ │ │
│ │ │ 1. Structure (fields) │
│ file-reader → run() │ │ 2. Keywords (word checks) │
│ api-caller → run() │ │ 3. Forbidden (patterns) │
└──────────────────────┘ │ 4. Optional custom rule │
│ │
│ → MatchResult + score │
└───────────────────────────┘
flowchart TB
subgraph T["1 Test"]
A["test calls agent.run(prompt)"]
end
subgraph G["2 Agent loop"]
B["Send prompt plus tool defs from ToolRegistry to LLM API"]
C{"Model response?"}
D["Tool calls: Anthropic tool_use / OpenAI function calls e.g. file-reader path X"]
E["ToolRegistry.execute returns ToolResult"]
F["Send tool results back to model"]
H["Final text answer"]
end
subgraph TR["3 Trace and assertions"]
I["Build AgentTrace: reasoning, tool_call, tool_result, output"]
J["Test passes trace to AgentAssert"]
K["AgentAssert plus HeuristicContractMatcher vs BehaviorContract"]
L["MatchResult: confidence and per-layer detail"]
end
A --> B
B --> C
C -->|tools| D
D --> E
E --> F
F --> C
C -->|done| H
H --> I
I --> J
J --> K
K --> L
1. The test starts the agent with a plain-language task (`agent.run(prompt)`).
2. The agent sends that task to the LLM (Anthropic, OpenAI, or local Ollama) and includes the list of tools from the ToolRegistry.
3. The model may answer with “please run tool X with these inputs” (the exact format differs by vendor; see diagram above).
4. The agent runs the real tool code through the ToolRegistry and gets a ToolResult (success/failure + data).
5. That result goes back to the model; steps 3–5 may repeat until the model stops asking for tools.
6. The model eventually returns a final text answer (or stops).
7. The agent assembles an AgentTrace: a full log of reasoning, each tool call, each tool result, and final output.
8. The test sends that trace into AgentAssert helpers (pass/fail checks).
9. For text rules, HeuristicContractMatcher compares the output to a BehaviorContract (keywords, patterns, etc.).
10. You get a MatchResult: pass/fail style outcome, a confidence score, and details per check layer (useful when debugging failures).
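A rough sketch of the trace shape the assertions read. Only `steps`, `type`, `toolName`, `output`, and `metadata.toolCallCount` are shown in this README; the other field names below are assumptions:

```ts
// Sketch of the trace object — not the repo's exact type definitions.
interface TraceStepSketch {
  type: 'reasoning' | 'tool_call' | 'tool_result';
  toolName?: string;   // set on tool_call / tool_result steps
  content?: unknown;   // reasoning text, tool arguments, or tool result data (assumed field)
}

interface AgentTraceSketch {
  steps: TraceStepSketch[];               // every step the agent took, in order
  output: string;                         // final text answer
  metadata?: { toolCallCount?: number };  // counts calls across all tools (see the retry example later)
}
```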
File: tests/behavioral/intent-routing.spec.ts
What it tests: For a user request like “read this log file,” did the agent actually call the right tool (e.g. file-reader) and not something else?
Why it matters: A bad answer can still sound fine. This checks the decision (which tool ran), not only the final text.
Key assertion:
const result = AgentAssert.toolWasInvoked(trace, 'file-reader', { filePath: /.*\.log$/ });
AgentAssert.expectMatched(result, 'file-reader should be invoked'); // embeds AgentAssert.formatResult(result) on failure
What to look at in the code:
- `AgentAssert.toolWasInvoked()` — finds tool_call steps in the trace
- `paramMatchers` — optional checks on arguments passed to the tool (e.g. file path)
- `AgentAssert.expectNotMatched()` — asserts a tool was not used (when the test expects a negative case; see the sketch below)
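A hedged sketch of that negative case, assuming `expectNotMatched` mirrors the `(result, message)` signature of `expectMatched`:

```ts
// Negative case (sketch): the wrong tool should NOT have been picked for a file-reading task.
const apiResult = AgentAssert.toolWasInvoked(trace, 'api-caller');
AgentAssert.expectNotMatched(apiResult, 'api-caller should not run for a read-this-log request');
```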
File: tests/behavioral/output-contract.spec.ts
What it tests: Does the final answer follow agreed rules (required fields, must-include words, forbidden phrases)—without requiring an exact copy-paste string match?
Why it matters: LLM wording changes every run. Contracts check flexible rules instead of one frozen sentence. If wording drifts, add keywords, relax thresholds, or add a customValidator (see Heuristic matching: scope and limits).
Key assertion:
const result = AgentAssert.satisfiesContract(trace.output, BehaviorContract.SUMMARIZATION, 0.5);
AgentAssert.expectMatched(result, 'SUMMARIZATION contract should pass');
What to look at in the code:
- `BehaviorContract.ts` — ready-made rule sets (fields, keywords, forbidden patterns)
- `HeuristicContractMatcher.evaluate()` — runs structure + keywords + forbidden patterns (+ optional custom rule)
- `minKeywordMatchRatio` — how many of the listed keywords must appear
- `forbiddenPatterns` — if the output matches these, the contract fails
File: tests/behavioral/tool-invocation.spec.ts
What it tests: For workflows that need more than one tool (e.g. read file, then call API), did the agent follow a sensible order of steps?
Why it matters: UI tests usually see screens. Here you inspect the sequence of tool calls inside the trace—something classic UI checks do not cover.
Key assertion:
const result = AgentAssert.traceFollowsSequence(trace, [
{ type: 'tool_call', toolName: 'file-reader' },
{ type: 'tool_call', toolName: 'api-caller' },
]);
AgentAssert.expectMatched(result, 'trace should show file-reader then api-caller');
What to look at in the code:
- `AgentAssert.traceFollowsSequence()` — checks that steps appear in the right order (they do not have to be back-to-back)
- The trace can include reasoning lines between tool calls
- `toolCallCountInRange()` — optional check that the agent did not call tools too many or too few times (see the sketch below)
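A sketch of that count check; the `(trace, min, max)` argument order is assumed from the method name, not confirmed against the source:

```ts
// Sketch: cap total tool calls for a two-tool workflow.
const countResult = AgentAssert.toolCallCountInRange(trace, 2, 4);
AgentAssert.expectMatched(countResult, 'workflow should use between 2 and 4 tool calls');
```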
File: tests/boundary/hallucination-guard.spec.ts
What it tests: Did the agent only use tools it is allowed to use and avoid out-of-scope actions?
Why it matters: Models sometimes invent tool names or try extra actions. This pattern catches scope creep and fake tool calls.
Key assertion:
const result = AgentAssert.boundaryNotViolated(trace, ['file-reader']);
AgentAssert.expectMatched(result, 'only file-reader should be used');
What to look at in the code:
- `boundaryNotViolated()` — fails if any tool outside the allow-list was called
- `createFileOnlyAgent()` — builds an agent with only certain tools registered (tight boundary)
- `HALLUCINATION_PROMPTS` — tricky prompts meant to provoke wrong or extra tool use
- `SCOPE_BOUNDED` contract — rules that flag forbidden wording when the task should stay narrow
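Putting those pieces together, a boundary test might look roughly like this. `createFileOnlyAgent` lives in tests/fixtures/setup.ts; its exact signature and the prompt text here are illustrative:

```ts
// Hedged sketch of a boundary test.
const agent = await createFileOnlyAgent(); // may or may not be async — check setup.ts
const trace = await agent.run('Read report.log and also email the summary to the whole team');
const result = AgentAssert.boundaryNotViolated(trace, ['file-reader']);
AgentAssert.expectMatched(result, 'agent must stay inside the file-reader allow-list');
```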
File: tests/boundary/retry-behavior.spec.ts
What it tests: When a tool errors (file missing, API down), does the agent admit the failure and avoid claiming success? Does it retry within reasonable limits?
Why it matters: Many demos only test the happy path. Here failures are forced so you can see honest vs misleading behavior.
Key assertions (examples):
// Honest reporting: output mentions failure / contract passes
expect(mentionsFailure, 'output should reflect tool failure').toBe(true);
// No false success after API error
expect(claimsSuccess, 'must not claim ticket created when API failed').toBe(false);
// Retry cap: count only `file-reader` tool_call steps (metadata.toolCallCount includes all tools)
const n = trace.steps.filter(s => s.type === 'tool_call' && s.toolName === 'file-reader').length;
AgentAssert.expectMatched(
{
matched: n >= 1 && n <= 3,
confidence: n >= 1 && n <= 3 ? 1 : 0,
details: [`file-reader invocations: ${n} (expected: 1–3)`, '…'],
},
'file-reader retries on failure should stay between 1 and 3'
);
What to look at in the code:
- `createFailingFileReaderAgent()` — file-reader always fails on purpose (error-path testing)
- `GRACEFUL_FAILURE` contract — expects honest error language; forbids false success claims
- `retryCount` in trace metadata — counts tool results with `success: false`
- Cascade tests — what happens after the first tool fails (downstream steps)
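One hedged way the two booleans above could be derived; the spec file may compute them differently (for example via the GRACEFUL_FAILURE contract), and these regexes are illustrative:

```ts
// Illustrative only — not the repo's exact patterns.
const mentionsFailure = /fail|error|could not|unable/i.test(trace.output);
const claimsSuccess = /ticket (was |has been )?created/i.test(trace.output);
```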
Shared shapes for traces, contracts, and tools. Tests and the demo agent both use these types so “what we record” and “what we assert” stay in sync.
- `AgentTrace` — the main object assertions read
- Other types: steps, output, contracts, tool definitions, match results
Demo agent (the application under test). It is not the reusable assertion library. Flow in plain terms:
- Send the user message and tool list to the LLM (Anthropic or OpenAI-style API, including Ollama).
- Read the model reply: plain text and/or requests to call tools.
- Run the requested tools; send results back to the model.
- Repeat until the conversation finishes.
- Package everything into an AgentTrace.
How to pick a provider: Set LLM_PROVIDER (or provider in code) to anthropic, openai, or ollama. Use ANTHROPIC_API_KEY / OPENAI_API_KEY as needed. For Ollama, the client uses OPENAI_BASE_URL, then OLLAMA_BASE_URL, then http://127.0.0.1:11434/v1. For LLM_PROVIDER=openai, only OPENAI_BASE_URL sets a custom base (Azure, proxy, etc.); OLLAMA_BASE_URL is not used so you do not accidentally send OpenAI traffic to Ollama.
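In code form, the Ollama base-URL fallback described above looks roughly like this (a sketch of the resolution order; the real client may differ in detail):

```ts
// For LLM_PROVIDER=ollama: OPENAI_BASE_URL wins, then OLLAMA_BASE_URL, then the local default.
const ollamaBaseURL =
  process.env.OPENAI_BASE_URL ??
  process.env.OLLAMA_BASE_URL ??
  'http://127.0.0.1:11434/v1';
```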
If you edit the system prompt here, update test expectations and contracts so they still match what you want the agent to do.
Sample tools: JSON schema + run function. In the demo they run locally (real disk read; API tool uses mock responses). To use a real MCP server later, you would replace the run logic—not necessarily the schema.
Security: file-reader.ts blocks path tricks; read the file comments before changing paths.
Name → tool implementation. Also turns the same definitions into the Anthropic or OpenAI tool format the API expects.
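To make “the Anthropic or OpenAI tool format” concrete, here is roughly what the two targets look like for one definition. The registry’s converters (e.g. toOpenAITools) produce shapes along these lines; the description text is illustrative:

```ts
// Anthropic tool-use format: name / description / input_schema.
const anthropicTool = {
  name: 'file-reader',
  description: 'Read a file from the allowed folder',
  input_schema: {
    type: 'object',
    properties: { filePath: { type: 'string' } },
    required: ['filePath'],
  },
};

// OpenAI function-calling format: the same definition wrapped under type: "function".
const openaiTool = {
  type: 'function',
  function: {
    name: 'file-reader',
    description: 'Read a file from the allowed folder',
    parameters: {
      type: 'object',
      properties: { filePath: { type: 'string' } },
      required: ['filePath'],
    },
  },
};
```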
Rule-based scoring (not “AI understands meaning”). Rough weights:
- Structure (~40%) — required fields (and optional length checks)
- Keywords (~35%, adjusted if you add a custom rule) — listed words must appear as substrings (after lowercasing)
- Forbidden (~25%) — if a regex matches, the contract fails that path
- Custom (optional) — your own function for extra checks
The confidence number is a blend of those checks—for pass/fail thresholds and tuning, not a true “semantic similarity” score.
Tuning (for less flaky runs):
- Lower `minKeywordMatchRatio` — easier keyword pass
- Lower the assertion threshold on `confidence` — fewer false failures
- Tighten `forbiddenPatterns` — stricter “must not say” rules (test regexes carefully)
- Add synonyms to keyword lists or use `customValidator` when wording varies a lot (see the sketch below)
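If keywords cannot pin the wording down, a custom rule might look like this. The validator’s exact signature is not shown in this README; a string-in, boolean-out shape is assumed:

```ts
// Accept several phrasings of "the refund went through" that a keyword list would miss.
const customValidator = (outputText: string): boolean =>
  /refund (was )?(issued|processed|completed)/i.test(outputText);
```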
| Topic | This repo | “Stronger” approaches (not built in) |
|---|---|---|
| Wording | Keyword substring checks | Second model as judge, specialized NLP models |
| Score | Weighted rules | Calibrated scores from another system |
| Same idea, different words | Can fail if keywords do not cover both phrasings | Embeddings, synonym lists, judges |
Not included here on purpose: a second LLM to grade answers, vector similarity, or heavy NLP—those add cost, latency, and ops. This matcher is meant to stay simple and inspectable.
Ready-made named rule sets for typical tasks: SUMMARIZATION, API_ACTION, MULTI_STEP, SCOPE_BOUNDED, GRACEFUL_FAILURE.
Main API for tests (each method returns a MatchResult):
- `toolWasInvoked(...)` — was this tool used (and optional argument checks)?
- `satisfiesContract(...)` — does the final output follow the contract?
- `boundaryNotViolated(...)` — only tools from this allow-list?
- `traceFollowsSequence(...)` — did steps happen in this order?
- `toolCallCountInRange(...)` — tool call count within min/max?
Helpers:
- `formatResult(...)` — readable dump for logs or failure messages
- `expectMatched` / `expectNotMatched` — Playwright-friendly helpers that attach `formatResult` on failure
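The MatchResult shape, as far as this README shows it (the retry example above builds one by hand); any further fields are not documented here:

```ts
// Fields used throughout the examples in this README.
interface MatchResultSketch {
  matched: boolean;    // pass/fail style outcome
  confidence: number;  // 0–1 blend of the matcher layers
  details: string[];   // per-check reasons, embedded into failure messages via formatResult
}
```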
Local runs: Copy .env.example to .env and set keys (provider, model, API keys, Ollama URL). Playwright loads these via tests/env-llm.ts so tests see the same settings as your IDE.
GitHub Actions: CI reads the same kind of values from repo Settings → Secrets and variables → Actions (no code edit needed). Workflow file: .github/workflows/ci.yml.
| Variable / secret | What it’s for |
|---|---|
| `LLM_PROVIDER` | Usually `ollama` in CI |
| `LLM_MODEL` | e.g. `llama3.2:3b` (local) or a `*:cloud` model |
| `OLLAMA_BASE_URL` | Optional; default `http://127.0.0.1:11434/v1` |
| `OLLAMA_API_KEY` | Secret — needed only for Ollama Cloud models (`*:cloud`). Without it, you may see HTTP 500. |
Tip: Put non-secret values in Variables; put keys in Secrets. If provider/model are unset in CI, defaults are ollama + llama3.2:3b.
Manual CI run: Actions → CI → Run workflow. Use Use workflow from to pick the branch. Optional Checkout ref overrides which git ref is checked out.
Local vs Cloud model names: Names like llama3.2:3b run on the runner. Names ending in *:cloud need OLLAMA_API_KEY.
- `retries: 1` — one retry per failed test (helps with flaky LLM output)
- Timeouts: 45s default; behavioral tests 120s locally, 300s in CI (`CI=true`) because Ollama on CPU can be slow; boundary stays 45s
- `trace: 'off'` — no browser trace zip (these tests are not UI tests). Failures still get attachments from `registerAgentTraceForDiagnostics`
- `workers: 3` — parallel runs (watch API rate limits)
- Report title — shows which provider/model ran (see the config sketch below)
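Put together, those settings correspond to a config roughly like this. It is a sketch: project names match the `--project` flags used later, but how the repo computes the behavioral timeout is assumed, and playwright.config.ts remains the source of truth:

```ts
import { defineConfig } from '@playwright/test';

// Longer budget for behavioral tests when CI=true (Ollama on CPU can be slow).
const behavioralTimeoutMs = process.env.CI ? 300_000 : 120_000;

export default defineConfig({
  retries: 1,            // one retry per failed test
  workers: 3,            // parallel workers (watch API rate limits)
  timeout: 45_000,       // default per-test timeout
  use: { trace: 'off' }, // no browser trace zip; agent traces attach separately
  projects: [
    { name: 'behavioral', testDir: './tests/behavioral', timeout: behavioralTimeoutMs },
    { name: 'boundary', testDir: './tests/boundary' }, // keeps the 45s default
  ],
});
```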
Test helpers: builds agents (createTestAgent, createFailingFileReaderAgent, …), test file paths, and registerAgentTraceForDiagnostics so traces show up in the HTML report when something fails.
- Node.js 18+
- Either API keys for a cloud LLM or a local Ollama install (free local models; see below)
cd agent-assert # or your clone folder name
npm install
npx playwright install # Playwright still expects browser binaries to be present
Use a .env file at the repo root (see .env.example), or export the same names in your terminal.
Anthropic (default if unset) — set the key; optionally set LLM_PROVIDER=anthropic:
export ANTHROPIC_API_KEY="sk-ant-..."
# optional explicit provider (defaults to anthropic)
export LLM_PROVIDER="anthropic"
OpenAI — set OPENAI_API_KEY and LLM_PROVIDER=openai. Default model is gpt-4o if you omit LLM_MODEL:
export OPENAI_API_KEY="sk-..."
export LLM_PROVIDER="openai"
# optional:
export LLM_MODEL="gpt-4o-mini"
Local Ollama — no cloud API key. Start Ollama, ollama pull <model>, then:
export LLM_PROVIDER="ollama"
export LLM_MODEL="qwen3.5:latest"
# optional if Ollama listens elsewhere:
# export OLLAMA_BASE_URL="http://127.0.0.1:11434/v1"
npx playwright test
Or use LLM_PROVIDER=openai with OPENAI_BASE_URL pointing at Ollama (do not depend on OLLAMA_BASE_URL for that mode).
You can also set provider, apiKey, baseURL, and model in code when creating the Agent.
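For example (sketch — the option names come from the sentence above and the import path is assumed, not copied from the source):

```ts
import { Agent } from './examples/agent/agent.js'; // export name and path assumed

// Point the demo agent at a local Ollama instance from code instead of env vars.
const agent = new Agent({
  provider: 'ollama',
  model: 'llama3.2:3b',
  baseURL: 'http://127.0.0.1:11434/v1', // the default when no base URL is set
});
```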
When tests use registerAgentTraceForDiagnostics, the report can include:
- `agent-run-summary.txt` — every run: provider, model, tools used, output snippet
- On failure: `agent-diagnostics.txt` (full trace) and `playwright-failure.txt`
Open the failing test → Attachments in the Playwright HTML report.
npx playwright test
npx playwright test tests/behavioral/intent-routing.spec.ts
npx playwright test tests/boundary/
npx playwright test --project=behavioral
npx playwright test --project=boundary
npx playwright show-report
- Add `examples/agent/tools/your-tool.ts` (same style as `file-reader.ts`; see the sketch below)
- Register it on the ToolRegistry in test setup
- Add any mock data in `setup.ts` if needed
- Write tests with `AgentAssert.toolWasInvoked(trace, 'your-tool')`
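A hedged sketch of what such a tool file might contain. Mirror file-reader.ts for the real shape; the method name (run vs execute) and the ToolResult fields below are inferred from this README, not copied from the source:

```ts
// examples/agent/tools/your-tool.ts (illustrative)
export const yourTool = {
  name: 'your-tool',
  description: 'One sentence the model sees when deciding whether to call this tool',
  inputSchema: {
    type: 'object',
    properties: { query: { type: 'string' } },
    required: ['query'],
  },
  // Keep demo tools deterministic so assertions stay stable.
  async run(input: { query: string }) {
    return { success: true, data: `echo: ${input.query}` };
  },
};
```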
- Add a new entry in `BehaviorContract.ts`
- Set required fields, keywords, forbidden patterns
- Set `minKeywordMatchRatio` (try ~0.2, then tune)
- Optional: `customValidator` for rules keywords cannot express
- Add a test that calls `satisfiesContract` with your contract
- Add a static method on `AgentAssert.ts`
- Input: `AgentTrace` or `AgentOutput`
- Return: `MatchResult`
- Reuse `HeuristicContractMatcher` where it fits
- Put clear reason strings in `details` for failures

- Add a new code path in `examples/agent/agent.ts` for that API
- If tool JSON differs, add a converter (like `toOpenAITools`) on `ToolRegistry`
- Map responses into the same `TraceStep` shapes as today
- `framework/` can stay the same—it only reads `AgentTrace`

- In each tool, replace local `execute` with MCP client calls
- Optional: `@modelcontextprotocol/sdk` for transport
- Keep `inputSchema` stable where possible
- In tests, switch mock vs live in `setup.ts`
| Topic | Jest / Vitest | Playwright (this repo) |
|---|---|---|
| Isolation | Often one process | Workers (separate processes) |
| Retries | Extra setup | retries: 1 built in |
| HTML report | Add-on | Built in |
| Timeouts | Mostly suite-level | Per test + suite + global |
| This project | — | Custom attachments on failure (agent trace) |
| Later: real browser tests | Another tool | Same Playwright |
Using Playwright without a browser is intentional: reports, retries, and timeouts fit long, flaky LLM runs well.
Cloud LLM calls cost money per run. Local Ollama costs $0 but uses CPU/time.
Rough idea (Anthropic, example model): small tests ~$0.01–0.03 per run; multi-tool tests a bit more. Full suite scales with how many tests run and whether retries fire (up to about 2× if every test retries once).
OpenAI: See OpenAI pricing for your model.
Save money while developing:
- Use cheaper / smaller models (`LLM_MODEL`, or `Agent` config)
- Run one file or one test, not the whole suite
- `maxToolRounds` is already capped (default 10)
Timeouts (behavioral tests — 120s local, 300s on CI):
Cloud APIs and Ollama on CPU (especially in GitHub Actions) can be slow. CI already uses a longer timeout. If needed, increase behavioralTimeoutMs in playwright.config.ts. In CI, confirm ollama pull finished and the runner has enough RAM.
Flaky pass/fail:
Normal for LLMs. Try: relax minKeywordMatchRatio, add keywords, lower the confidence threshold, rely on retries: 1.
401 / wrong host:
Match LLM_PROVIDER to the key you use. For openai, custom bases use OPENAI_BASE_URL (not OLLAMA_BASE_URL). For local Ollama, use LLM_PROVIDER=ollama and the Ollama /v1 URL.
Ollama 500 with *:cloud models:
Set OLLAMA_API_KEY, or use a local model name without :cloud.
Ollama 400 — “does not support tools”:
Pick a model that supports tool calling (see Ollama models with Tools). Reasoning-only models (e.g. some deepseek-r1 tags) may fail here.
Output not JSON / taskType: "unknown":
The demo expects JSON; the parser in examples/agent/agent.ts strips common markdown fences. If it still fails, inspect raw text in the trace attachment.
File-reader "access denied":
Paths must stay under the allowed folder (path safety). Check FIXTURE_DIR and paths in the test.
"Tool X is not registered":
The model asked for a tool that was not registered—often a hallucinated tool name. Relevant checks: boundaryNotViolated, allow-lists in setup.