Summary
RAKI doesn't track how much the LLM judge calls cost to generate a report. Users can't answer "how much did this evaluation cost?" without checking their provider billing.
Current State
- JudgeLogger (src/raki/metrics/ragas/llm_setup.py:141-171) logs metric/input/score/reason per call, with no token counts and no cost
- LLM clients are created via create_ragas_llm() using AsyncAnthropicVertex, AsyncAnthropic, or genai.Client
- Token usage from judge calls is completely opaque: Ragas/instructor discard the raw API response and return only parsed Pydantic models
- The Anthropic SDK returns usage.input_tokens and usage.output_tokens on every Message response
Implementation Approach
1. Token Accumulator (cross-cutting, engine-level)
from dataclasses import dataclass

@dataclass
class TokenAccumulator:
    input_tokens: int = 0
    output_tokens: int = 0
    calls: int = 0
Owned by MetricsEngine.run(), not per-metric. Created once, injected into create_ragas_llm() via MetricConfig, read after all metrics complete.
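The ownership model above can be sketched as follows. This is a minimal illustration, not the actual engine code: `MetricConfig` here is a stand-in that shows only the proposed new field, and `run_metrics` is a hypothetical placeholder for `MetricsEngine.run()`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class TokenAccumulator:
    input_tokens: int = 0
    output_tokens: int = 0
    calls: int = 0


@dataclass
class MetricConfig:
    # Stand-in for the real MetricConfig; only the proposed field is shown.
    token_accumulator: Optional[TokenAccumulator] = None


def run_metrics(metrics: Iterable[Callable[[MetricConfig], None]],
                config: MetricConfig) -> dict:
    """Hypothetical stand-in for MetricsEngine.run()."""
    # Created once per run and shared by every metric through the config.
    accumulator = TokenAccumulator()
    config.token_accumulator = accumulator

    for metric in metrics:
        # Each metric's judge client would update the same accumulator.
        metric(config)

    # Totals are read only after all metrics have completed.
    return {
        "judge_cost": {
            "input_tokens": accumulator.input_tokens,
            "output_tokens": accumulator.output_tokens,
            "calls": accumulator.calls,
            "total_usd": None,  # filled in later by the report layer
        }
    }
```

Because the accumulator lives on the config rather than on any metric, individual metrics never need to know that token tracking exists.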
2. Client monkey-patch (not a proxy class)
Patch client.messages.create in-place before passing to llm_factory(). This preserves client identity (instructor does structural checks on the type). The monkey-patch sits below both Ragas and instructor — transparent to both.
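def patch_client_for_token_tracking(client, accumulator: TokenAccumulator) -> None:
    """Wrap client.messages.create in place so each response's usage is recorded."""
    original_create = client.messages.create

    async def tracked_create(*args, **kwargs):
        response = await original_create(*args, **kwargs)
        # usage may be missing or None on non-Message return values, so guard both
        usage = getattr(response, "usage", None)
        if usage is not None:
            accumulator.input_tokens += usage.input_tokens
            accumulator.output_tokens += usage.output_tokens
            accumulator.calls += 1
        return response

    # Rebind the method on the existing object; the client's type is unchanged
    client.messages.create = tracked_create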
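No lock is needed: asyncio runs coroutines cooperatively on a single thread, and the three += updates after the await contain no further await, so no other task can interleave with them. This assumes all judge calls run on the same event loop thread.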
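A test for the client patch could look roughly like the sketch below. It uses a fake client rather than the Anthropic SDK; `FakeClient`, `FakeMessages`, and `demo` are hypothetical names introduced only for illustration, and the patch function is repeated so the block runs on its own. Fifty concurrent calls are issued through `asyncio.gather` to exercise the no-lock claim.

```python
import asyncio
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class TokenAccumulator:
    input_tokens: int = 0
    output_tokens: int = 0
    calls: int = 0


def patch_client_for_token_tracking(client, accumulator: TokenAccumulator) -> None:
    original_create = client.messages.create

    async def tracked_create(*args, **kwargs):
        response = await original_create(*args, **kwargs)
        usage = getattr(response, "usage", None)
        if usage is not None:
            accumulator.input_tokens += usage.input_tokens
            accumulator.output_tokens += usage.output_tokens
            accumulator.calls += 1
        return response

    client.messages.create = tracked_create


class FakeMessages:
    """Hypothetical stand-in for the SDK's messages resource."""

    async def create(self, **kwargs):
        await asyncio.sleep(0)  # yield to the loop, as a real network call would
        return SimpleNamespace(
            usage=SimpleNamespace(input_tokens=10, output_tokens=3)
        )


class FakeClient:
    def __init__(self) -> None:
        self.messages = FakeMessages()


async def demo() -> TokenAccumulator:
    client = FakeClient()
    accumulator = TokenAccumulator()
    patch_client_for_token_tracking(client, accumulator)
    # Run many judge calls concurrently on one event loop.
    await asyncio.gather(
        *(client.messages.create(model="fake") for _ in range(50))
    )
    return accumulator


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The client object is still a `FakeClient` after patching, which mirrors the reason for patching in place instead of wrapping the client in a proxy.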
3. Report output
Add to EvalReport config dict in MetricsEngine.run():
"judge_cost": {
    "input_tokens": 15000,
    "output_tokens": 3000,
    "calls": 24,
    "total_usd": null
}
total_usd is computed in the report layer using a pricing lookup (tokens × model price). Set to null if pricing for the model is unknown.
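A report-layer pricing lookup might be sketched like this. The table name, the model key, and the rates are all placeholders invented for illustration, not real provider prices. The sketch also assumes input and output tokens are priced separately per million tokens.

```python
from typing import Optional

# Hypothetical price table: USD per million tokens as (input, output).
# The model name and rates are placeholders, not real provider prices.
PRICING_USD_PER_MTOK = {
    "example-judge-model": (3.00, 15.00),
}


def compute_total_usd(model: str, input_tokens: int,
                      output_tokens: int) -> Optional[float]:
    """Return the estimated cost in USD, or None if the model has no known price."""
    rates = PRICING_USD_PER_MTOK.get(model)
    if rates is None:
        # Unknown pricing is serialized as null in the report.
        return None
    input_rate, output_rate = rates
    cost = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return round(cost, 6)
```

Keeping this function in the report layer means the metrics layer only ever deals in token counts, and a stale price table cannot affect the recorded totals.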
4. Google provider
For genai.Client, add a separate patch function targeting the equivalent method surface. Both patch functions write to the same TokenAccumulator, which is provider-agnostic.
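One possible shape for the Google patch is sketched below. It rests on two assumptions that should be verified before implementing: that judge calls go through an async `generate_content` method on the client's models object (for example `client.aio.models`), and that responses carry `usage_metadata` with `prompt_token_count` and `candidates_token_count`. Which method Ragas actually calls is not confirmed here. The function name and the `FakeModels` stub are hypothetical.

```python
import asyncio
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class TokenAccumulator:
    input_tokens: int = 0
    output_tokens: int = 0
    calls: int = 0


def patch_genai_models_for_token_tracking(models, accumulator: TokenAccumulator) -> None:
    """Wrap an async generate_content method in place (assumed call surface)."""
    original_generate = models.generate_content

    async def tracked_generate(*args, **kwargs):
        response = await original_generate(*args, **kwargs)
        usage = getattr(response, "usage_metadata", None)
        if usage is not None:
            # Counts can be None on some responses, so treat missing values as zero.
            accumulator.input_tokens += usage.prompt_token_count or 0
            accumulator.output_tokens += usage.candidates_token_count or 0
            accumulator.calls += 1
        return response

    models.generate_content = tracked_generate


class FakeModels:
    """Hypothetical stub standing in for the SDK's async models object."""

    async def generate_content(self, **kwargs):
        return SimpleNamespace(
            usage_metadata=SimpleNamespace(
                prompt_token_count=40, candidates_token_count=None
            )
        )


async def demo_genai() -> TokenAccumulator:
    models = FakeModels()
    accumulator = TokenAccumulator()
    patch_genai_models_for_token_tracking(models, accumulator)
    await models.generate_content(model="fake", contents="hi")
    return accumulator
```

Only the field names differ from the Anthropic patch; the accumulator receives the same three counters either way.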
Files to Change
- src/raki/metrics/ragas/llm_setup.py: add TokenAccumulator, patch_client_for_token_tracking(), apply patch in create_ragas_llm()
- src/raki/metrics/protocol.py: add token_accumulator: TokenAccumulator | None = None to MetricConfig
- src/raki/metrics/engine.py: create accumulator, inject into config, read totals into report config
- src/raki/report/cli_summary.py: display judge cost in CLI summary
- src/raki/report/html_report.py: display judge cost in HTML report
- Tests for: accumulator, client patch, engine aggregation, report display
What NOT to do
- Don't thread tokens through ScoringState or MetricResult: token tracking is a cross-cutting concern, not per-metric
- Don't subclass or __getattr__-proxy the client: it breaks instructor's structural checks
- Don't estimate tokens post-hoc from logged text: it is inaccurate and can't capture output tokens
- Don't put USD pricing logic in the metrics layer: it belongs in the report layer
Depends On
#182 (shared scoring loop) — already shipped (PR #201).