agentevals now has performance evaluators for tokens, tools, and time (PR #7). But none of them measure the environmental cost of an agent's work.
AI agents are heavy compute consumers — each LLM call burns GPU cycles, each tool call hits APIs, and inefficient agents multiply this waste. As AI agent adoption grows, we need a way to evaluate and gate deployments based on energy efficiency, not just correctness.
Teams building responsible AI need to answer: "Is this agent getting greener over time, or are we regressing?"
Proposed solution
Add an energy_efficiency evaluator that scores how energy-efficient an agent run was, using a compute cost model derived from trace data.
Scoring approach
Energy can't be measured directly from traces, but we can estimate relative compute cost from observable signals:
energy_score = 1.0 - (estimated_energy / energy_budget)
Where estimated_energy is calculated from:
Total tokens (input + output)
• Weight: High
• Rationale: Direct proxy for GPU compute time
Number of LLM calls
• Weight: Medium
• Rationale: Each call has fixed overhead (network, scheduling, KV cache init)
Model tier (small/medium/large)
• Weight: High
• Rationale: A 70B model uses ~10x the energy of a 7B model per token
Tool calls (external APIs)
• Weight: Low
• Rationale: Each HTTP call has network + server-side compute
Total duration
• Weight: Low
• Rationale: Longer runs = more idle compute reservation
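The scoring formula and weighted signals above can be sketched in a few lines of Python. This is an illustrative sketch, not the actual agentevals implementation: the trace and config keys mirror the field names proposed in this issue, and the tier multiplier table matches the values given below.

```python
# Illustrative sketch of the proposed scoring formula. Field names follow
# the config proposed in this issue; none of this is an existing API.
TIER_MULTIPLIERS = {"small": 1, "medium": 5, "large": 15}

def energy_score(trace: dict, config: dict) -> float:
    """Estimate relative energy use from trace signals, then score against budget."""
    tier = TIER_MULTIPLIERS[config.get("model_tier", "medium")]
    # Token compute: the dominant term, scaled by model tier.
    token_compute = (
        trace["input_tokens"] / 1000 * config["cost_per_1k_input_tokens"]
        + trace["output_tokens"] / 1000 * config["cost_per_1k_output_tokens"]
    ) * tier
    # Fixed per-call overheads (network, scheduling, KV cache init).
    call_overhead = trace["llm_calls"] * config["cost_per_llm_call"]
    tool_overhead = trace["tool_calls"] * config["cost_per_tool_call"]
    estimated_energy = token_compute + call_overhead + tool_overhead
    # Clamp at 0 so runs far over budget score 0 rather than going negative.
    return max(0.0, 1.0 - estimated_energy / config["energy_budget"])
```

The clamp is a design choice worth deciding explicitly: without it, a badly regressed agent would produce negative scores, which most threshold-based gates handle poorly.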
Config
evaluators:
  - name: energy_efficiency
    type: builtin
    config:
      energy_budget: 1.0                  # relative energy units (1.0 = baseline)
      model_tier: "large"                 # small|medium|large|custom
      cost_per_1k_input_tokens: 0.5       # relative energy units
      cost_per_1k_output_tokens: 1.5      # output is ~3x more expensive (generation vs prefill)
      cost_per_llm_call: 0.1              # fixed overhead per call
      cost_per_tool_call: 0.05            # external API cost
      carbon_intensity_gco2_kwh: null     # optional: grid carbon intensity for CO2 estimation
Model tier energy multipliers
Based on published GPU benchmarks and inference costs:
small
• Examples: GPT-4o-mini, Claude Haiku, Llama 8B
• Multiplier: 1x
medium
• Examples: GPT-4o, Claude Sonnet, Llama 70B
• Multiplier: 5x
large
• Examples: GPT-4, Claude Opus, Llama 405B
• Multiplier: 15x
These are rough but useful for relative comparison between agent versions.
Output
{
  "score": 0.72,
  "details": {
    "estimated_energy_units": 0.28,
    "energy_budget": 1.0,
    "breakdown": {
      "token_compute": 0.18,
      "llm_call_overhead": 0.05,
      "tool_call_overhead": 0.03,
      "model_tier_multiplier": 5
    },
    "estimated_co2_grams": 2.4,
    "equivalent": "~1 Google search"
  }
}
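The optional estimated_co2_grams field would require converting relative energy units into kWh before applying the configured grid carbon intensity. That conversion needs a calibration constant; in the sketch below, WH_PER_UNIT is a made-up anchor for illustration only, not a measured value.

```python
# Sketch of the CO2 conversion behind estimated_co2_grams.
# WH_PER_UNIT is an assumed calibration, not a measured figure: it maps
# the evaluator's relative energy units onto watt-hours of inference.
WH_PER_UNIT = 30.0  # assumption: 1.0 relative units ~ 30 Wh

def estimated_co2_grams(energy_units: float, carbon_intensity_gco2_kwh: float) -> float:
    """Convert relative energy units to grams of CO2 via grid carbon intensity."""
    kwh = energy_units * WH_PER_UNIT / 1000.0  # Wh -> kWh
    return kwh * carbon_intensity_gco2_kwh     # g CO2 per kWh
```

When carbon_intensity_gco2_kwh is null in the config, the evaluator would simply omit this field rather than guess a grid mix.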
CI/CD gating
agentevals run trace.json \
  --eval-set k8s-sre.json \
  -m energy_efficiency \
  --threshold 0.6
Teams can set energy budgets per task and fail builds when agents regress.
Future extensions
• Carbon intensity integration: Use real-time grid data (e.g., Electricity Maps API) to convert energy estimates to actual CO2
• Model-specific energy profiles: Allow custom Wh/token values from published model cards
• Comparative scoring: "This agent uses 3x less energy than baseline for the same task"
• Multi-run trending: Track energy efficiency over time across eval runs
• Hardware-aware scoring: Factor in GPU type (A100 vs H100 vs inference chips)
Why it matters
• Duplicate tool calls and retry loops multiply energy waste
• As agents scale to millions of daily invocations, this becomes significant
Additional context
This evaluator complements token_efficiency, tool_efficiency, and time_efficiency by combining multiple signals into a single energy-aware score. It goes beyond counting tokens to model the overall shape of an agent's compute cost.