# verdict


Score your agents across 3 dimensions — without spinning up another LLM to do it.


## The Problem with Testing Agents

Traditional unit tests work because functions are deterministic. `add(2, 3)` always returns `5`.

Agents break this contract. An agent asked "book me a flight to Paris" might succeed by calling a travel API, searching for prices first, asking a clarifying question, or failing gracefully. All four can be correct. None can be validated with `assert output == expected`.

The field has tried:

- **Exact match** — too brittle. Any rephrasing fails.
- **Regex / substring** — misses semantic correctness entirely.
- **Human eval** — the gold standard, but slow and doesn't scale to CI/CD.
- **LLM-as-judge** — powerful, but expensive and non-deterministic itself.
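To make the brittleness concrete, a tiny illustration (the strings here are invented, not from the library):

```python
# A semantically correct answer that exact match rejects and substring over-accepts.
expected = "Your flight to Paris is booked."
agent_output = "Done! I've booked your Paris flight."

print(agent_output == expected)         # False: the answer is right, exact match still fails
print("paris" in agent_output.lower())  # True: but a substring check would also accept
                                        # "Sorry, I couldn't find flights to Paris."
```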

What's missing: a structured, multi-dimensional framework that scores what actually matters in production.


## Architecture

```mermaid
flowchart LR
    A[AgentRun] --> B[verdict Scorer]
    B --> C[TaskScore\n0.0–1.0]
    B --> D[ReasoningScore\n0.0–1.0]
    B --> E[ToolScore\n0.0–1.0]
    C --> F[AggregateScore]
    D --> F
    E --> F
    F --> G{≥ threshold?}
    G -->|yes| H[✅ PASS]
    G -->|no| I[❌ FAIL]
```

## Quick Start

```bash
git clone https://github.com/darshjme/verdict
cd verdict && pip install -e .
```

```python
from agent_evals import AgentEvaluator, EvalCase

def my_agent(task: str) -> str:
    # your LLM agent here
    return call_llm(task)

evaluator = AgentEvaluator(my_agent, max_steps=50, cost_ceiling=1.0)

report = evaluator.evaluate([
    EvalCase(input="What is 2+2?", expected="4"),
    EvalCase(input="Summarise the README", expected="production agent evaluation"),
    EvalCase(input="Fix the auth bug", expected="returns 200 on valid token"),
])

print(f"Pass rate: {report.pass_rate:.0%}")          # Pass rate: 87%
print(f"Avg task score: {report.avg_task_score:.2f}")
print(f"Avg reasoning: {report.avg_reasoning_score:.2f}")
print(f"Avg tool use: {report.avg_tool_score:.2f}")
```

## Three Scoring Dimensions

| Dimension | What It Measures | How |
| --- | --- | --- |
| Task Completion | Did the agent actually accomplish the goal? | Semantic match against expected outcome |
| Reasoning Quality | Was the chain of thought coherent and efficient? | Step analysis: no circular logic, no dead ends |
| Tool Use Accuracy | Were the right tools called with the right args? | Tool-call inspection against expected tool usage |

Each dimension returns a float between 0.0 and 1.0. The aggregate is a weighted mean.
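As a sketch of what a weighted mean looks like here (the actual weights are not documented in this README, so the values below are illustrative):

```python
def aggregate_score(task: float, reasoning: float, tool: float,
                    weights: tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Weighted mean of the three dimension scores; weights are assumptions."""
    w_task, w_reasoning, w_tool = weights
    total = w_task + w_reasoning + w_tool
    return (task * w_task + reasoning * w_reasoning + tool * w_tool) / total

score = aggregate_score(task=0.9, reasoning=0.8, tool=1.0)
print(f"{score:.2f}")  # 0.89
```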


## Benchmark Pipeline

```mermaid
flowchart TD
    A[EvalCase batch\n44k cases] --> B[AgentEvaluator]
    B --> C{Parallel workers}
    C --> D[Worker 1]
    C --> E[Worker 2]
    C --> F[Worker N]
    D --> G[ScoreResult]
    E --> G
    F --> G
    G --> H[EvalReport\n44k evals/sec on commodity HW]
```

44,000 eval cases/sec on a standard CPU — no GPU required.
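The library's internals aren't shown here, but the fan-out/fan-in pattern in the diagram can be sketched with the standard library; `score_case` below is a stand-in, not the real scorer:

```python
from concurrent.futures import ThreadPoolExecutor

def score_case(case: dict) -> float:
    # Stand-in for per-case scoring; the real scorers return task/reasoning/tool floats.
    return 1.0 if case["output"] == case["expected"] else 0.0

cases = [{"output": "4", "expected": "4"},
         {"output": "5", "expected": "4"}]

# Fan out cases to N workers, then fan results back into one list.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(score_case, cases))

pass_rate = sum(s >= 0.5 for s in scores) / len(scores)
print(f"{pass_rate:.0%}")  # 50%
```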


## API Reference

| Class | Purpose |
| --- | --- |
| `AgentEvaluator(agent_fn, max_steps, cost_ceiling)` | Main evaluator — wraps your agent |
| `EvalCase(input, expected, tags=[])` | A single test case |
| `ScoreResult` | Per-case scores: task, reasoning, tool, aggregate |
| `EvalReport` | Aggregate report: pass_rate, avg scores, failed cases |
| `DimensionScorer` | Base class — extend to add custom scoring dimensions |
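A custom dimension might look like the sketch below. The `DimensionScorer` interface isn't documented in this README, so the class name `LatencyScorer`, the `score` method name, and its signature are all assumptions for illustration:

```python
class LatencyScorer:  # in real use: subclass agent_evals.DimensionScorer (interface assumed)
    """Hypothetical custom dimension: penalise runs that exceed a latency budget."""

    def __init__(self, budget_s: float = 5.0):
        self.budget_s = budget_s

    def score(self, elapsed_s: float) -> float:
        # 1.0 within budget, then decays toward 0 as elapsed time grows.
        return min(1.0, self.budget_s / max(elapsed_s, self.budget_s))
```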

## Part of Arsenal

verdict · sentinel · herald · engram · arsenal
| Repo | Purpose |
| --- | --- |
| verdict | ← you are here |
| sentinel | ReAct guard patterns — stop runaway agents |
| herald | Semantic task router — dispatch to specialists |
| engram | Agent memory — short-term + episodic recall |
| arsenal | Meta-hub — the full pipeline |

## License

MIT © Darshankumar Joshi · Built as part of the Arsenal toolkit.
