agentevals now has performance evaluators for tokens, tools, and time (PR #7). But none of them measure the environmental cost of an agent's work.
AI agents are heavy compute consumers — each LLM call burns GPU cycles, each tool call hits APIs, and inefficient agents multiply this waste. As AI agent adoption grows, we need a way to evaluate and gate deployments based on energy efficiency, not just correctness.
Teams building responsible AI need to answer: "Is this agent getting greener over time, or are we regressing?"
Proposed solution
Add an energy_efficiency evaluator that scores how energy-efficient an agent run was, using a compute cost model derived from trace data.
Scoring approach
Energy can't be measured directly from traces, but we can estimate relative compute cost from observable signals:
energy_score = 1.0 - (estimated_energy / energy_budget)
Where estimated_energy is calculated from:
Total tokens (input + output)
• Weight: High
• Rationale: Direct proxy for GPU compute time
Number of LLM calls
• Weight: Medium
• Rationale: Each call has fixed overhead (network, scheduling, KV cache init)
Model tier (small/medium/large)
• Weight: High
• Rationale: A 70B model uses ~10x the energy of a 7B model per token
Tool calls (external APIs)
• Weight: Low
• Rationale: Each HTTP call has network + server-side compute
Total duration
• Weight: Low
• Rationale: Longer runs = more idle compute reservation
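The scoring formula and weighted signals above can be sketched in a few lines of Python. This is an illustrative sketch, not the actual agentevals implementation: the trace and config keys mirror the field names proposed in this issue, and the tier multiplier table matches the values given below.

```python
# Illustrative sketch of the proposed scoring formula. Field names follow
# the config proposed in this issue; none of this is an existing API.
TIER_MULTIPLIERS = {"small": 1, "medium": 5, "large": 15}

def energy_score(trace: dict, config: dict) -> float:
    """Estimate relative energy use from trace signals, then score against budget."""
    tier = TIER_MULTIPLIERS[config.get("model_tier", "medium")]
    # Token compute: the dominant term, scaled by model tier.
    token_compute = (
        trace["input_tokens"] / 1000 * config["cost_per_1k_input_tokens"]
        + trace["output_tokens"] / 1000 * config["cost_per_1k_output_tokens"]
    ) * tier
    # Fixed per-call overheads (network, scheduling, KV cache init).
    call_overhead = trace["llm_calls"] * config["cost_per_llm_call"]
    tool_overhead = trace["tool_calls"] * config["cost_per_tool_call"]
    estimated_energy = token_compute + call_overhead + tool_overhead
    # Clamp at 0 so runs far over budget score 0 rather than going negative.
    return max(0.0, 1.0 - estimated_energy / config["energy_budget"])
```

The clamp is a design choice worth deciding explicitly: without it, a badly regressed agent would produce negative scores, which most threshold-based gates handle poorly.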
Config
evaluators:
  - name: energy_efficiency
    type: builtin
    config:
      energy_budget: 1.0                  # relative energy units (1.0 = baseline)
      model_tier: "large"                 # small|medium|large|custom
      cost_per_1k_input_tokens: 0.5       # relative energy units
      cost_per_1k_output_tokens: 1.5      # output is ~3x more expensive (generation vs prefill)
      cost_per_llm_call: 0.1              # fixed overhead per call
      cost_per_tool_call: 0.05            # external API cost
      carbon_intensity_gco2_kwh: null     # optional: grid carbon intensity for CO2 estimation
Model tier energy multipliers
Based on published GPU benchmarks and inference costs:
small
• Examples: GPT-4o-mini, Claude Haiku, Llama 8B
• Multiplier: 1x
medium
• Examples: GPT-4o, Claude Sonnet, Llama 70B
• Multiplier: 5x
large
• Examples: GPT-4, Claude Opus, Llama 405B
• Multiplier: 15x
These are rough but useful for relative comparison between agent versions.
Output
{
  "score": 0.72,
  "details": {
    "estimated_energy_units": 0.28,
    "energy_budget": 1.0,
    "breakdown": {
      "token_compute": 0.18,
      "llm_call_overhead": 0.05,
      "tool_call_overhead": 0.03,
      "model_tier_multiplier": 5
    },
    "estimated_co2_grams": 2.4,
    "equivalent": "~1 Google search"
  }
}
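The optional estimated_co2_grams field would require converting relative energy units into kWh before applying the configured grid carbon intensity. That conversion needs a calibration constant; in the sketch below, WH_PER_UNIT is a made-up anchor for illustration only, not a measured value.

```python
# Sketch of the CO2 conversion behind estimated_co2_grams.
# WH_PER_UNIT is an assumed calibration, not a measured figure: it maps
# the evaluator's relative energy units onto watt-hours of inference.
WH_PER_UNIT = 30.0  # assumption: 1.0 relative units ~ 30 Wh

def estimated_co2_grams(energy_units: float, carbon_intensity_gco2_kwh: float) -> float:
    """Convert relative energy units to grams of CO2 via grid carbon intensity."""
    kwh = energy_units * WH_PER_UNIT / 1000.0  # Wh -> kWh
    return kwh * carbon_intensity_gco2_kwh     # g CO2 per kWh
```

When carbon_intensity_gco2_kwh is null in the config, the evaluator would simply omit this field rather than guess a grid mix.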
CI/CD gating
agentevals run trace.json \
  --eval-set k8s-sre.json \
  -m energy_efficiency \
  --threshold 0.6
Teams can set energy budgets per task and fail builds when agents regress.
Future extensions
• Carbon intensity integration: Use real-time grid data (e.g., Electricity Maps API) to convert energy estimates to actual CO2
• Model-specific energy profiles: Allow custom Wh/token values from published model cards
• Comparative scoring: "This agent uses 3x less energy than baseline for the same task"
• Multi-run trending: Track energy efficiency over time across eval runs
• Hardware-aware scoring: Factor in GPU type (A100 vs H100 vs inference chips)
Why it matters
• Duplicate tool calls and retry loops multiply energy waste
• As agents scale to millions of daily invocations, this becomes significant
Additional context
This evaluator complements token_efficiency, tool_efficiency, and time_efficiency by combining multiple signals into a single energy-aware score. It goes beyond counting tokens to model the overall shape of an agent's compute cost.