Feature: performance evaluators for token efficiency, tool efficiency, and time-to-resolution #6

@henrikrexed

Description

Problem or motivation

agentevals currently focuses on correctness evaluation (tool trajectory matching, response quality). But in production, performance matters just as much — an agent that gets the right answer but burns 1M tokens and takes 5 minutes is not production-ready.

Proposed solution

Add three built-in performance evaluators that score agents automatically from trace data — no human in the loop, no LLM judge needed.

1. Token Efficiency (token_efficiency)

Scores how efficiently the agent used tokens relative to a budget.

evaluators:
  - name: token_efficiency
    type: builtin
    config:
      max_tokens: 200000        # budget
      weight_input: 0.7         # input tokens weighted more (they're the cost driver)
      weight_output: 0.3

Scoring:

  • Extracts gen_ai.usage.input_tokens + gen_ai.usage.output_tokens from trace spans
  • Score = 1.0 - (actual_tokens / max_tokens), clamped to [0, 1]
  • Score 1.0 = very efficient, 0.0 = budget exceeded
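The scoring steps above can be sketched as follows. This is a minimal sketch, not the actual implementation: the weighted combination of input and output tokens is an assumed interpretation of `weight_input`/`weight_output`, and the function name is illustrative.

```python
def token_efficiency_score(input_tokens: int, output_tokens: int,
                           max_tokens: int,
                           weight_input: float = 0.7,
                           weight_output: float = 0.3) -> float:
    # Weighted token total: input tokens count more because they drive
    # cost (assumed interpretation of the config weights).
    weighted = weight_input * input_tokens + weight_output * output_tokens
    # 1.0 at zero usage, 0.0 once the weighted total reaches the budget;
    # clamped to [0, 1].
    return max(0.0, min(1.0, 1.0 - weighted / max_tokens))
```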

Why it matters: From our AI Agent Benchmark, token usage varied 8x across solutions for the same task (185K vs 1.6M tokens). This evaluator catches regressions.

2. Tool Efficiency (tool_efficiency)

Scores whether the agent used tools effectively — penalizes waste.

evaluators:
  - name: tool_efficiency
    type: builtin
    config:
      max_tool_calls: 15        # budget
      penalize_duplicates: true  # repeated identical calls
      penalize_errors: true      # failed tool calls

Scoring:

  • Count total tool call spans from trace
  • Identify duplicates (same tool + same args called twice)
  • Identify errors (tool spans with error status)
  • Score = (useful_calls / total_calls) * (1.0 - budget_overrun_penalty)
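A minimal sketch of that scoring, assuming tool calls have been extracted from trace spans as `(tool_name, args, succeeded)` tuples — the extraction shape and function name are illustrative, not the library's API:

```python
def tool_efficiency_score(calls, max_tool_calls: int,
                          penalize_duplicates: bool = True,
                          penalize_errors: bool = True) -> float:
    # calls: list of (tool_name, args, succeeded) tuples from trace spans
    total = len(calls)
    if total == 0:
        return 1.0  # no tool use, nothing wasted
    seen = set()
    useful = 0
    for name, args, succeeded in calls:
        key = (name, args)
        is_duplicate = key in seen
        seen.add(key)
        if penalize_duplicates and is_duplicate:
            continue  # repeated identical call: not useful
        if penalize_errors and not succeeded:
            continue  # failed call: not useful
        useful += 1
    # Fraction by which the budget was exceeded, capped at 1 so the
    # score stays in [0, 1]
    overrun = min(1.0, max(0.0, (total - max_tool_calls) / max_tool_calls))
    return (useful / total) * (1.0 - overrun)
```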

What it catches:

  • Agent stuck in a loop calling the same tool repeatedly
  • Agent calling tools whose results it never uses
  • Agent exceeding reasonable tool call limits

3. Time Efficiency (time_efficiency)

Scores how quickly the agent resolved the task relative to a time budget.

evaluators:
  - name: time_efficiency
    type: builtin
    config:
      max_duration_s: 120       # budget in seconds

Scoring:

  • Extract root span duration from trace
  • Score = 1.0 - (actual_duration / max_duration), clamped to [0, 1]
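As a sketch (OTel span timestamps are in nanoseconds; the function name and parameters are illustrative):

```python
def time_efficiency_score(start_ns: int, end_ns: int,
                          max_duration_s: float) -> float:
    # Root span duration in seconds (OTel timestamps are nanoseconds)
    actual_s = (end_ns - start_ns) / 1e9
    # 1.0 for instant resolution, 0.0 at or past the budget
    return max(0.0, min(1.0, 1.0 - actual_s / max_duration_s))
```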

Eval Set Integration

These evaluators can be combined with performance_budget in eval cases:

{
  "eval_id": "crashloop_diagnosis",
  "conversation": [...],
  "performance_budget": {
    "max_tokens": 200000,
    "max_duration_s": 120,
    "max_tool_calls": 10
  }
}

CI/CD Gating

agentevals run trace.json \
  --eval-set k8s-sre.json \
  -m tool_trajectory_avg_score \
  -m token_efficiency \
  -m tool_efficiency \
  -m time_efficiency \
  --threshold 0.7

Alternatives considered

No response

Additional context

Three new evaluators following the custom evaluator protocol, but shipped as builtins:

  • src/agentevals/evaluator/token_efficiency.py
  • src/agentevals/evaluator/tool_efficiency.py
  • src/agentevals/evaluator/time_efficiency.py

They use extract_performance_metrics() from trace_metrics.py, which already extracts token counts, latencies, and tool calls from OTel spans.
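The shape of one such builtin might look like this. This is purely illustrative: the issue doesn't show the custom evaluator protocol or the field names returned by extract_performance_metrics(), so every name below is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class EvalResult:
    name: str
    score: float  # normalized to [0, 1]

class TokenEfficiencyEvaluator:
    """Illustrative shape only; the real protocol may differ."""
    name = "token_efficiency"

    def __init__(self, max_tokens: int,
                 weight_input: float = 0.7, weight_output: float = 0.3):
        self.max_tokens = max_tokens
        self.weight_input = weight_input
        self.weight_output = weight_output

    def evaluate(self, metrics: Dict[str, Any]) -> EvalResult:
        # `metrics` stands in for extract_performance_metrics() output;
        # the field names here are assumptions.
        weighted = (self.weight_input * metrics["input_tokens"]
                    + self.weight_output * metrics["output_tokens"])
        score = max(0.0, min(1.0, 1.0 - weighted / self.max_tokens))
        return EvalResult(self.name, score)
```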

Human confirmation

  • I am a human (not a bot, agent, or AI) filing this issue.

Labels: enhancement (New feature or request)