feat: track judge cost per report #174

@decko

Summary

RAKI doesn't track how much the LLM judge calls cost to generate a report. Users can't answer "how much did this evaluation cost?" without checking their provider billing.

Current State

  • JudgeLogger (src/raki/metrics/ragas/llm_setup.py:141-171) logs metric/input/score/reason per call — no token counts, no cost
  • LLM clients created via create_ragas_llm() using AsyncAnthropicVertex, AsyncAnthropic, or genai.Client
  • Token usage from judge calls is completely opaque — Ragas/instructor discard the raw API response and return only parsed Pydantic models
  • The Anthropic SDK returns usage.input_tokens + usage.output_tokens on every Message response

Implementation Approach

1. Token Accumulator (cross-cutting, engine-level)

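from dataclasses import dataclass

@dataclass
class TokenAccumulator:
    """Running token totals across all judge calls in one evaluation run."""
    input_tokens: int = 0
    output_tokens: int = 0
    calls: int = 0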

Owned by MetricsEngine.run(), not per-metric. Created once, injected into create_ragas_llm() via MetricConfig, read after all metrics complete.
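A minimal sketch of that ownership model, assuming simplified stand-ins: `MetricConfig` here is a hypothetical one-field version of the real class, `run_metrics` stands in for `MetricsEngine.run()`, and `fake_metric` stands in for a metric whose judge calls report usage. None of these reflect the actual RAKI signatures.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TokenAccumulator:
    input_tokens: int = 0
    output_tokens: int = 0
    calls: int = 0


@dataclass
class MetricConfig:
    # Hypothetical stand-in: the real MetricConfig has other fields too.
    token_accumulator: Optional[TokenAccumulator] = None


def run_metrics(metrics: list[Callable[[MetricConfig], None]], config: MetricConfig) -> dict:
    """Stand-in for MetricsEngine.run(): one accumulator shared by every metric."""
    accumulator = TokenAccumulator()
    config.token_accumulator = accumulator  # injected once, before any metric runs
    for metric in metrics:
        metric(config)
    # Totals are read only after all metrics have completed.
    return {
        "judge_cost": {
            "input_tokens": accumulator.input_tokens,
            "output_tokens": accumulator.output_tokens,
            "calls": accumulator.calls,
            "total_usd": None,  # filled in later by the report layer
        }
    }


def fake_metric(config: MetricConfig) -> None:
    # Simulates one judge call that reported 10 input and 2 output tokens.
    config.token_accumulator.input_tokens += 10
    config.token_accumulator.output_tokens += 2
    config.token_accumulator.calls += 1


report = run_metrics([fake_metric, fake_metric], MetricConfig())
print(report)
```

Because the metrics only ever see the accumulator through the config, none of them needs to know that cost tracking exists.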

2. Client monkey-patch (not a proxy class)

Patch client.messages.create in-place before passing to llm_factory(). This preserves client identity (instructor does structural checks on the type). The monkey-patch sits below both Ragas and instructor — transparent to both.

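def patch_client_for_token_tracking(client, accumulator: TokenAccumulator) -> None:
    # Keep a reference to the real method so the wrapper can delegate to it.
    original_create = client.messages.create

    async def tracked_create(*args, **kwargs):
        response = await original_create(*args, **kwargs)
        # Guard against a missing or None usage payload.
        usage = getattr(response, "usage", None)
        if usage is not None:
            accumulator.input_tokens += usage.input_tokens
            accumulator.output_tokens += usage.output_tokens
            accumulator.calls += 1
        return response

    # Replace the method on this instance only; the client's type is unchanged.
    client.messages.create = tracked_create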

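No lock is needed as long as all judge calls run on a single event loop: asyncio is single-threaded and cooperative, and the three increments run synchronously after the await with no suspension point between them, so no other coroutine can interleave. If judge calls are ever dispatched from multiple threads, this assumption no longer holds.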
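The patch and the no-lock claim can be exercised without network access. The sketch below uses a hypothetical `FakeClient` whose `messages.create` mimics only the shape of an Anthropic `Message` response (a `usage` object with `input_tokens` and `output_tokens`), then fires 50 concurrent calls through the patched method.

```python
import asyncio
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class TokenAccumulator:
    input_tokens: int = 0
    output_tokens: int = 0
    calls: int = 0


def patch_client_for_token_tracking(client, accumulator: TokenAccumulator) -> None:
    original_create = client.messages.create

    async def tracked_create(*args, **kwargs):
        response = await original_create(*args, **kwargs)
        usage = getattr(response, "usage", None)
        if usage is not None:
            accumulator.input_tokens += usage.input_tokens
            accumulator.output_tokens += usage.output_tokens
            accumulator.calls += 1
        return response

    client.messages.create = tracked_create


class FakeMessages:
    # Hypothetical stand-in for the SDK's messages resource.
    async def create(self, **kwargs):
        await asyncio.sleep(0)  # yield to the loop, as a real network call would
        return SimpleNamespace(usage=SimpleNamespace(input_tokens=100, output_tokens=20))


class FakeClient:
    def __init__(self):
        self.messages = FakeMessages()


async def demo() -> TokenAccumulator:
    client = FakeClient()
    accumulator = TokenAccumulator()
    patch_client_for_token_tracking(client, accumulator)
    # 50 concurrent calls on one event loop, no lock around the accumulator.
    await asyncio.gather(*(client.messages.create(model="fake") for _ in range(50)))
    return accumulator


totals = asyncio.run(demo())
print(totals)
```

A test along these lines would also cover the "client patch" item in the test plan without depending on any provider SDK.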

3. Report output

Add to EvalReport config dict in MetricsEngine.run():

"judge_cost": {
  "input_tokens": 15000,
  "output_tokens": 3000,
  "calls": 24,
  "total_usd": null
}

total_usd is computed in the report layer from a pricing lookup: input and output token counts are each multiplied by the model's per-token price and summed. It is set to null when pricing for the model is unknown.
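One possible shape for that lookup, as a sketch only: the function name `compute_total_usd`, the table `PRICES_PER_MTOK`, the model key, and the prices in it are all hypothetical placeholders, not real provider rates or existing RAKI code.

```python
from typing import Optional

# Hypothetical price table: model name -> (input USD, output USD) per million tokens.
# The values are placeholders for illustration, not real provider pricing.
PRICES_PER_MTOK: dict[str, tuple[float, float]] = {
    "example-judge-model": (3.00, 15.00),
}


def compute_total_usd(
    model: str,
    input_tokens: int,
    output_tokens: int,
    prices: dict[str, tuple[float, float]] = PRICES_PER_MTOK,
) -> Optional[float]:
    """Return the estimated cost in USD, or None when the model has no known price."""
    price = prices.get(model)
    if price is None:
        return None  # serialized as null in the report
    input_price, output_price = price
    cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return round(cost, 6)


known = compute_total_usd("example-judge-model", 15000, 3000)
unknown = compute_total_usd("some-unlisted-model", 15000, 3000)
print(known, unknown)
```

Pricing input and output tokens separately matters because providers typically charge different rates for each, so a single blended price would misstate the total.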

4. Google provider

For genai.Client, add a separate patch function targeting the equivalent method surface. Both patch functions write to the same TokenAccumulator, which is provider-agnostic.
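A sketch of what the Google-side patch might look like. It assumes the judge calls go through the async surface `client.aio.models.generate_content` and that usage is read from the response's `usage_metadata` (`prompt_token_count`, `candidates_token_count`); which method Ragas actually invokes is not confirmed here and should be verified before relying on this. The function name and the fake client are hypothetical.

```python
import asyncio
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class TokenAccumulator:
    input_tokens: int = 0
    output_tokens: int = 0
    calls: int = 0


def patch_genai_client_for_token_tracking(client, accumulator: TokenAccumulator) -> None:
    # Assumption: the async path client.aio.models.generate_content is the one in use.
    models = client.aio.models
    original_generate = models.generate_content

    async def tracked_generate(*args, **kwargs):
        response = await original_generate(*args, **kwargs)
        usage = getattr(response, "usage_metadata", None)
        if usage is not None:
            # Counts may be None on some responses, so fall back to 0.
            accumulator.input_tokens += usage.prompt_token_count or 0
            accumulator.output_tokens += usage.candidates_token_count or 0
            accumulator.calls += 1
        return response

    models.generate_content = tracked_generate


class FakeAsyncModels:
    # Hypothetical stand-in that mimics only the response shape used above.
    async def generate_content(self, **kwargs):
        return SimpleNamespace(
            usage_metadata=SimpleNamespace(prompt_token_count=40, candidates_token_count=None)
        )


fake_client = SimpleNamespace(aio=SimpleNamespace(models=FakeAsyncModels()))
google_totals = TokenAccumulator()
patch_genai_client_for_token_tracking(fake_client, google_totals)
asyncio.run(fake_client.aio.models.generate_content(model="fake", contents="hi"))
print(google_totals)
```

The accumulator fields stay the same as for Anthropic; only the attribute names read from the response differ, which is what keeps the accumulator provider-agnostic.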

Files to Change

  • src/raki/metrics/ragas/llm_setup.py — add TokenAccumulator, patch_client_for_token_tracking(), apply patch in create_ragas_llm()
  • src/raki/metrics/protocol.py — add token_accumulator: TokenAccumulator | None = None to MetricConfig
  • src/raki/metrics/engine.py — create accumulator, inject into config, read totals into report config
  • src/raki/report/cli_summary.py — display judge cost in CLI summary
  • src/raki/report/html_report.py — display judge cost in HTML report
  • Tests for: accumulator, client patch, engine aggregation, report display

What NOT to do

  • Don't thread tokens through ScoringState or MetricResult — token tracking is a cross-cutting concern, not per-metric
  • Don't subclass or __getattr__-proxy the client — breaks instructor's structural checks
  • Don't estimate tokens post-hoc from logged text — inaccurate, can't capture output tokens
  • Don't put USD pricing logic in the metrics layer — belongs in report layer
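The second point can be illustrated with plain Python. The exact check instructor performs is not shown in this issue, so `isinstance` is used below purely as a stand-in for a type-based check; `Client`, `Messages`, and `ClientProxy` are hypothetical classes.

```python
class Messages:
    async def create(self, **kwargs):
        return None


class Client:
    def __init__(self):
        self.messages = Messages()


class ClientProxy:
    """A __getattr__ proxy: forwards attribute access but is a different type."""

    def __init__(self, inner):
        self._inner = inner

    def __getattr__(self, name):
        return getattr(self._inner, name)


async def tracked_create(**kwargs):
    return None


# In-place patch: the object handed downstream is still a Client.
patched = Client()
patched.messages.create = tracked_create
patched_passes = isinstance(patched, Client)

# Proxy: attribute access works, but a type check against Client fails.
proxied = ClientProxy(Client())
proxy_passes = isinstance(proxied, Client)

print(patched_passes, proxy_passes)
```

Any library code that branches on the client's class sees the patched instance as the real client, while the proxy is rejected even though it forwards every attribute.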

Depends On

#182 (shared scoring loop) — already shipped (PR #201).

Labels: enhancement
