feat: add prompt cache token support to cost telemetry #936
Conversation
…erative-computing#890) Record cache token costs accurately for Anthropic and OpenAI models.

Previously, cache reads and writes were excluded from cost estimates, and Anthropic cache_creation_input_tokens were excluded from the input token counter.

- TokenMetricsPlugin: adds cache_creation_input_tokens to prompt_tokens for Anthropic (additive; not included in prompt_tokens by the API)
- CostMetricsPlugin: extracts cached_tokens from prompt_tokens_details and prices cache reads and writes separately using the correct formula (prompt_tokens - cached_tokens) * full_rate + cached_tokens * cache_read_rate + cache_creation_tokens * cache_write_rate
- builtin_pricing.json: adds cache_write_per_1m and cache_read_per_1m for all current Anthropic and OpenAI models
- pricing.py: extends compute_cost() with cached_tokens and cache_creation_tokens params

Assisted-by: Claude Code
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
695bf10 to 9ccfabf
planetf1 left a comment
LGTM - just a few minor comments which could be follow-ups
Suggestions
test/telemetry/test_metrics_plugins.py — add OpenAI-shaped path to cost plugin tests
test_cost_plugin_cache_tokens_forwarded only tests the full Anthropic shape (both prompt_tokens_details.cached_tokens and cache_creation_input_tokens populated). Worth adding a case for the OpenAI shape where cache_creation_input_tokens is absent, to explicitly verify cache_creation_tokens=0 is passed through:
```python
async def test_cost_plugin_openai_cache_shape(cost_plugin):
    """OpenAI-shaped usage: cached_tokens present, no cache_creation_input_tokens."""
    payload = _make_cost_payload(
        usage={
            "prompt_tokens": 100,
            "completion_tokens": 20,
            "total_tokens": 120,
            "prompt_tokens_details": {"cached_tokens": 50},
        }
    )
    with (
        patch("mellea.telemetry.pricing.compute_cost", return_value=0.005) as mock_cost,
        patch("mellea.telemetry.metrics.record_cost"),
    ):
        await cost_plugin.record_cost_metrics(payload, {})
    mock_cost.assert_called_once_with(
        model="test-model",
        input_tokens=50,
        output_tokens=20,
        cached_tokens=50,
        cache_creation_tokens=0,
    )
```
test/telemetry/test_pricing.py — add combined all-four-components cost test
All four cost components are currently tested in isolation. A single combined test would catch double-counting regressions that isolated tests can't see:
```python
def test_compute_cost_all_components(fresh_registry):
    # claude-sonnet-4-6: input=3.0, output=15.0, cache_write=3.75, cache_read=0.30
    cost = fresh_registry.compute_cost(
        "claude-sonnet-4-6",
        input_tokens=1000,
        output_tokens=500,
        cached_tokens=200,
        cache_creation_tokens=100,
    )
    expected = (
        (1000 / 1e6) * 3.0  # input
        + (500 / 1e6) * 15.0  # output
        + (200 / 1e6) * 0.30  # cache read
        + (100 / 1e6) * 3.75  # cache write
    )
    assert cost is not None
    assert abs(cost - expected) < 1e-12
```
---
Follow-up (not blocking)
mellea/telemetry/pricing.py — custom pricing file cannot override only cache rates
_validate_pricing_entry requires at least one of input_per_1m/output_per_1m to be present, so a custom pricing file that only overrides cache rates for a built-in model (e.g. {"claude-sonnet-4-6": {"cache_read_per_1m": 0.50}}) is silently rejected. The entry is dropped with only a log warning, which will confuse anyone trying to adjust just their cache pricing. Worth a follow-up issue to either document the constraint explicitly or support partial overrides that merge onto the built-in entry.
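For illustration, a self-contained entry that would pass validation under the current behaviour might look like the sketch below. The dict mirrors the JSON file shape; the claude-sonnet-4-6 rates are the ones quoted in the combined-cost test suggestion above, and only cache_read_per_1m is the value actually being overridden.

```python
# Hypothetical content of a custom pricing file, shown as a Python dict (the file itself is JSON).
# Because a cache-only entry is rejected rather than merged onto the built-in one, the override
# has to restate the other rates it still wants to keep.
custom_pricing = {
    "claude-sonnet-4-6": {
        "input_per_1m": 3.0,
        "output_per_1m": 15.0,
        "cache_write_per_1m": 3.75,
        "cache_read_per_1m": 0.50,  # the only rate actually being changed
    }
}
```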
Following the above, I noticed a discrepancy with Anthropic models - specifically when calling the Anthropic API. Here are the details. I think there's a case here for some integration testing, though it may be tricky to automate (follow-up issue?) - this needs working through in a little more detail.

The problem

LiteLLM normalises Anthropic usage so that prompt_tokens = input_tokens + cache_creation_input_tokens + cache_read_input_tokens.
It also preserves the raw cache_creation_input_tokens field in the dict (Pydantic extra="allow"). Both plugins read both values, double-counting. LiteLLM's own UI hit the same issue — see litellm#25735.
TokenMetricsPlugin (L64–67)
Adds cache_creation to already-inflated prompt_tokens. Reports e.g. 9100 tokens when actual total is 7100.
CostMetricsPlugin (L176–182)
input_tokens = prompt_tokens - cached_tokens still includes cache_creation, which is then charged again at write_rate. Cache-write tokens get priced at full_rate + write_rate (~65% cost overestimate for write-heavy requests).
Scope
Only affects Anthropic via LiteLLMBackend. Bedrock (uses OpenAIBackend), OpenAI, Ollama, HF, WatsonX are all fine.
Suggested fix
- TokenMetricsPlugin: use prompt_tokens as-is (already the total) — drop the + cache_creation
- CostMetricsPlugin: input_tokens = max(0, prompt_tokens - cached_tokens - cache_creation)
- Docstring on GenerationMetadata.usage: document the invariant that prompt_tokens follows OpenAI convention (total, not remainder) and that cached_tokens/cache_creation_input_tokens are subsets, not additive
- Tests: existing tests encode the wrong assumption (prompt_tokens=100 + cache_creation=50 → expects 150). Rewrite with realistic LiteLLM-normalised shapes so the double-counting would be caught (a small worked example is sketched below).
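To make the double-counting concrete, here is an illustrative calculation. The 9100 vs 7100 figures come from the problem description above; the read/write split is made up, and the snippet is not code from the plugins themselves.

```python
# Hypothetical LiteLLM-normalised Anthropic usage: prompt_tokens is already the total.
prompt_tokens = 7100    # base input + cache reads + cache writes
cached_tokens = 5000    # prompt_tokens_details.cached_tokens (cache reads, hypothetical split)
cache_creation = 2000   # cache_creation_input_tokens (cache writes, hypothetical split)

# Current behaviour described above:
tokens_reported_old = prompt_tokens + cache_creation   # 9100, but only 7100 tokens were sent
input_billed_old = prompt_tokens - cached_tokens       # 2100: still contains the 2000 write tokens,
                                                        # which are then billed again at write_rate

# Suggested fix:
tokens_reported_new = prompt_tokens                                         # 7100
input_billed_new = max(0, prompt_tokens - cached_tokens - cache_creation)   # 100 base input tokens
```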
planetf1 left a comment
Looks like this won't work for the Anthropic API (via LiteLLM). As this is highlighted in this PR (cost files etc.), I will request changes, but it could be an additional PR if preferred.
Thanks for the thorough review!

Test suggestions

We looked at both suggested tests but found the paths are already covered:

Custom pricing partial override

The replace-entire-entry behaviour is intentional - we want custom files to be self-contained rather than implicitly inheriting built-in values that may change. We've improved the module docstring and the validation warning to make this constraint explicit.

LiteLLM double-counting

Good catch. We had caught that LiteLLM normalises
The existing test
LiteLLM normalises Anthropic usage so that prompt_tokens already includes cache_creation_input_tokens and cache_read_input_tokens. Both plugins were treating prompt_tokens as raw base input and adding cache fields on top, causing double-counting.

- TokenMetricsPlugin: drop the + cache_creation addition
- CostMetricsPlugin: subtract both cached_tokens and cache_creation from prompt_tokens so write tokens are not billed at full rate and write rate
- Update test_cost_plugin_cache_tokens_forwarded to use a realistic LiteLLM-normalised shape with the correct expected input_tokens value
- Remove the now-redundant with-cache-creation token metrics parametrize case
- Clarify pricing.py docs and validation warning around the replace-not-merge behaviour of custom pricing file entries

Assisted-by: Claude Code
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
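For illustration, a realistic LiteLLM-normalised shape of the kind this commit switches the test to might look like the sketch below; the token counts are made up, and the field names are the ones discussed in the review.

```python
# Hypothetical LiteLLM-normalised Anthropic usage: prompt_tokens is already the total,
# cache reads live in prompt_tokens_details.cached_tokens, and the raw
# cache_creation_input_tokens field is preserved alongside it.
usage = {
    "prompt_tokens": 100,                            # 50 base input + 30 cache reads + 20 cache writes
    "completion_tokens": 20,
    "total_tokens": 120,
    "prompt_tokens_details": {"cached_tokens": 30},  # cache reads
    "cache_creation_input_tokens": 20,               # cache writes
}

# After the fix, the base input billed at the full rate is:
expected_input_tokens = 100 - 30 - 20  # = 50
```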
…cope in docs

Assisted-by: Claude Code
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
jakelorocco left a comment
I'm okay with this. I think we should be careful going forward because these costs could get arbitrarily complex.
I'm thinking we should actually bake the price update times into the record somewhere, so we can say "priced as of".
I'm down to add this; Claude had a couple of ideas I'll share below. I also did some research, and Claude did not find any other examples of a "priced as of" feature in other projects; in fact, pricing metrics in other projects are not super common. So if we implemented this we would be the trailblazers.
As you mentioned, we could always do this in a follow-up. I'm mixed on which way to do it and whether it's worth the overhead given the novelty anyway.
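Purely as a sketch of what a "priced as of" record could look like (the class and field names below are invented for illustration and are not part of this PR):

```python
# Hypothetical shape only - nothing like this exists in the PR. It just illustrates the idea
# of carrying the pricing table's last-updated date alongside the cost estimate.
from dataclasses import dataclass
from datetime import date


@dataclass
class CostRecord:
    model: str
    cost_usd: float
    priced_as_of: date  # date the pricing entry used for this estimate was last updated


record = CostRecord(model="claude-sonnet-4-6", cost_usd=0.0123, priced_as_of=date(2025, 1, 1))
```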
planetf1 left a comment
Approved - one minor comment suggestion to try to clarify the Anthropic difference, but the fix looks good, thanks.
Assisted-by: Claude Code
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
c1622f5
Type of PR
Misc PR
Description
Prompt cache tokens were not correctly accounted for in the cost metrics pipeline, causing mellea.llm.cost.usd to be inaccurate: cache reads (discounted) and writes (premium) were all priced at the full input rate.

This PR fixes this by updating CostMetricsPlugin to correctly price cache tokens, adding cache pricing fields to builtin_pricing.json for all current Anthropic and OpenAI models, and extending compute_cost() to accept cached_tokens and cache_creation_tokens params.

The correct cost formula is:
(prompt_tokens - cached_tokens - cache_creation_tokens) * full_rate + cached_tokens * cache_read_rate + cache_creation_tokens * cache_write_rate
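As a worked example (the rates are the claude-sonnet-4-6 figures quoted in the review's combined-cost test suggestion; the token counts are illustrative, and the output term follows compute_cost's output pricing):

```python
# Worked example of the formula above.
full_rate, output_rate = 3.0, 15.0               # USD per 1M input / output tokens
cache_read_rate, cache_write_rate = 0.30, 3.75   # USD per 1M cache read / write tokens

prompt_tokens = 1000          # total input, including cache reads and writes
cached_tokens = 200           # cache reads
cache_creation_tokens = 100   # cache writes
completion_tokens = 500

cost = (
    ((prompt_tokens - cached_tokens - cache_creation_tokens) / 1e6) * full_rate   # 700 @ full rate
    + (cached_tokens / 1e6) * cache_read_rate                                     # 200 @ read rate
    + (cache_creation_tokens / 1e6) * cache_write_rate                            # 100 @ write rate
    + (completion_tokens / 1e6) * output_rate                                     # 500 @ output rate
)  # = 0.010035 USD
```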
Note for reviewers - Anthropic cache pricing and LiteLLM normalization:

CostMetricsPlugin reads cache read tokens from prompt_tokens_details.cached_tokens. This is intentional and correct for all backends, including Anthropic via LiteLLM. LiteLLM normalizes both cache_read_input_tokens and cache_creation_input_tokens into prompt_tokens (inflating it to the full total) and stores cache_read_input_tokens in prompt_tokens_details.cached_tokens before the response reaches the plugin layer. Subtracting both cached_tokens and cache_creation from prompt_tokens gives the base input tokens billed at the full rate.

Testing
Attribution