
feat: add prompt cache token support to cost telemetry#936

Merged
ajbozarth merged 4 commits into generative-computing:main from ajbozarth:feat/890-cache-token-telemetry
Apr 27, 2026

Conversation

@ajbozarth
Contributor

@ajbozarth ajbozarth commented Apr 25, 2026

Misc PR

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

Prompt cache tokens were not correctly accounted for in the cost metrics pipeline, causing mellea.llm.cost.usd to be inaccurate — cache reads (discounted) and writes (premium) were all priced at the full input rate.

This PR fixes this by updating CostMetricsPlugin to correctly price cache tokens, adding cache pricing fields to builtin_pricing.json for all current Anthropic and OpenAI models, and extending compute_cost() to accept cached_tokens and cache_creation_tokens params.

The correct cost formula is: (prompt_tokens - cached_tokens - cache_creation_tokens) * full_rate + cached_tokens * cache_read_rate + cache_creation_tokens * cache_write_rate

Note for reviewers — Anthropic cache pricing and LiteLLM normalization: CostMetricsPlugin reads cache read tokens from prompt_tokens_details.cached_tokens. This is intentional and correct for all backends, including Anthropic via LiteLLM. LiteLLM normalizes both cache_read_input_tokens and cache_creation_input_tokens into prompt_tokens (inflating it to the full total) and stores cache_read_input_tokens in prompt_tokens_details.cached_tokens before the response reaches the plugin layer. Subtracting both cached_tokens and cache_creation from prompt_tokens gives the base input tokens billed at the full rate.
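
To make the arithmetic concrete, here is a rough sketch of the pricing computation described above (the function name and rate parameters are illustrative, not the exact compute_cost() internals):

# Rough sketch of the pricing arithmetic described above. Rates are USD per 1M
# tokens; the real compute_cost() in mellea/telemetry/pricing.py may differ.
def compute_cost_sketch(
    prompt_tokens: int,
    completion_tokens: int,
    cached_tokens: int,
    cache_creation_tokens: int,
    input_per_1m: float,
    output_per_1m: float,
    cache_read_per_1m: float,
    cache_write_per_1m: float,
) -> float:
    # prompt_tokens follows the OpenAI/LiteLLM convention: it is the total, and
    # cache reads/writes are subsets of it, not additions on top.
    base_input = max(0, prompt_tokens - cached_tokens - cache_creation_tokens)
    return (
        base_input * input_per_1m
        + cached_tokens * cache_read_per_1m
        + cache_creation_tokens * cache_write_per_1m
        + completion_tokens * output_per_1m
    ) / 1_000_000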

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code was added
  • Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

Attribution

  • AI coding assistants used

@ajbozarth ajbozarth requested a review from nrfulton as a code owner April 25, 2026 00:47
@ajbozarth ajbozarth self-assigned this Apr 25, 2026
@ajbozarth ajbozarth requested review from a team and jakelorocco as code owners April 25, 2026 00:47
@github-actions github-actions Bot added the enhancement New feature or request label Apr 25, 2026
…erative-computing#890)

Record cache token costs accurately for Anthropic and OpenAI models.
Previously, cache reads and writes were excluded from cost estimates and
Anthropic cache_creation_input_tokens were excluded from the input token
counter.

- TokenMetricsPlugin: adds cache_creation_input_tokens to prompt_tokens
  for Anthropic (additive; not included in prompt_tokens by the API)
- CostMetricsPlugin: extracts cached_tokens from prompt_tokens_details
  and prices cache reads and writes separately using the correct formula
  (prompt_tokens - cached_tokens) * full_rate + cached_tokens * cache_read_rate
  + cache_creation_tokens * cache_write_rate
- builtin_pricing.json: adds cache_write_per_1m and cache_read_per_1m
  for all current Anthropic and OpenAI models
- pricing.py: extends compute_cost() with cached_tokens and
  cache_creation_tokens params

Assisted-by: Claude Code
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth ajbozarth force-pushed the feat/890-cache-token-telemetry branch from 695bf10 to 9ccfabf on April 25, 2026 01:00
Contributor

@planetf1 planetf1 left a comment


LGTM - just a few minor comments which could be followups

Suggestions

test/telemetry/test_metrics_plugins.py — add OpenAI-shaped path to cost plugin tests

test_cost_plugin_cache_tokens_forwarded only tests the full Anthropic shape (both prompt_tokens_details.cached_tokens and cache_creation_input_tokens populated). Worth adding a case for the OpenAI shape where cache_creation_input_tokens is absent, to explicitly verify cache_creation_tokens=0 is passed through:

async def test_cost_plugin_openai_cache_shape(cost_plugin):
    """OpenAI-shaped usage: cached_tokens present, no cache_creation_input_tokens."""
    payload = _make_cost_payload(
        usage={
            "prompt_tokens": 100,
            "completion_tokens": 20,
            "total_tokens": 120,
            "prompt_tokens_details": {"cached_tokens": 50},
        }
    )
    with (
        patch("mellea.telemetry.pricing.compute_cost", return_value=0.005) as mock_cost,
        patch("mellea.telemetry.metrics.record_cost"),
    ):
        await cost_plugin.record_cost_metrics(payload, {})
        mock_cost.assert_called_once_with(
            model="test-model",
            input_tokens=50,
            output_tokens=20,
            cached_tokens=50,
            cache_creation_tokens=0,
        )

test/telemetry/test_pricing.py - add combined all-four-components cost test

All four cost components are currently tested in isolation. A single combined test would catch double-counting regressions that isolated tests can't see:

def test_compute_cost_all_components(fresh_registry):
    # claude-sonnet-4-6: input=3.0, output=15.0, cache_write=3.75, cache_read=0.30
    cost = fresh_registry.compute_cost(
        "claude-sonnet-4-6",
        input_tokens=1000,
        output_tokens=500,
        cached_tokens=200,
        cache_creation_tokens=100,
    )
    expected = (
        (1000 / 1e6) * 3.0    # input
        + (500 / 1e6) * 15.0  # output
        + (200 / 1e6) * 0.30  # cache read
        + (100 / 1e6) * 3.75  # cache write
    )
    assert cost is not None
    assert abs(cost - expected) < 1e-12

---
Follow-up (not blocking)

mellea/telemetry/pricing.py - custom pricing file cannot override only cache rates

_validate_pricing_entry requires at least one of input_per_1m/output_per_1m to be present, so a custom pricing file that only overrides cache rates for a built-in model (e.g. {"claude-sonnet-4-6": {"cache_read_per_1m": 0.50}}) is silently rejected. The entry is dropped with only a log warning, which will confuse anyone trying to adjust just their cache pricing. Worth a follow-up issue to either document the constraint explicitly or support partial overrides that merge onto the built-in entry.
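
For illustration, a merge-onto-builtin partial override could look roughly like this (merge_pricing_entry is a hypothetical helper sketched here, not existing code):

# Hypothetical merge-onto-builtin behaviour: start from the built-in entry (if
# any) and overlay only the fields the custom pricing file provides.
def merge_pricing_entry(builtin: dict | None, custom: dict) -> dict:
    merged = dict(builtin or {})
    merged.update(custom)
    return merged

# A custom file containing only {"cache_read_per_1m": 0.50} would then keep the
# built-in input/output rates and change just the cache read rate.
merged = merge_pricing_entry(
    {"input_per_1m": 3.0, "output_per_1m": 15.0, "cache_read_per_1m": 0.30},
    {"cache_read_per_1m": 0.50},
)
assert merged["input_per_1m"] == 3.0 and merged["cache_read_per_1m"] == 0.50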

@planetf1 planetf1 self-requested a review April 27, 2026 15:03
@planetf1
Contributor

Following the above, I noticed a discrepancy with Anthropic models -- or rather, specifically when calling the Anthropic API via LiteLLM. Here are the details -- I think there's a case here for some integration testing, though it may be tricky to automate (followup issue?) - this needs working through in a little more detail:

The problem

The comment at metrics_plugins.py:64 says cache_creation_input_tokens is "not in prompt_tokens". That's true for the raw Anthropic API, but not after LiteLLM normalisation. LiteLLM inflates prompt_tokens to include both cache types before it reaches Mellea (transformation.py L1605–1628):

prompt_tokens = input_tokens + cache_creation_input_tokens + cache_read_input_tokens

It also preserves the raw cache_creation_input_tokens field in the dict (Pydantic extra="allow"). Both plugins read both values, double-counting. LiteLLM's own UI hit the same issue; see litellm#25735.

TokenMetricsPlugin (L64–67)

Adds cache_creation to already-inflated prompt_tokens. Reports e.g. 9100 tokens when actual total is 7100.

CostMetricsPlugin (L176–182)

input_tokens = prompt_tokens - cached_tokens still includes cache_creation, which is then charged again at write_rate. Cache-write tokens get priced at full_rate + write_rate (~65% cost overestimate for write-heavy requests).

Scope

Only affects Anthropic via LiteLLMBackend. Bedrock (uses OpenAIBackend), OpenAI, Ollama, HF, WatsonX are all fine.

Suggested fix

- TokenMetricsPlugin: use prompt_tokens as-is (already the total) — drop the + cache_creation
- CostMetricsPlugin: input_tokens = max(0, prompt_tokens - cached_tokens - cache_creation)
- Docstring on GenerationMetadata.usage: document the invariant that prompt_tokens follows OpenAI convention (total, not remainder) and that cached_tokens/cache_creation_input_tokens are subsets, not additive
- Tests: existing tests encode the wrong assumption (prompt_tokens=100 + cache_creation=50 expects 150). Rewrite with realistic LiteLLM-normalised shapes so the double-counting would be caught
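
Putting this together, a minimal sketch of the corrected accounting for a LiteLLM-normalised Anthropic response (illustrative numbers chosen so the totals match the 7100 vs 9100 example above; the actual plugin code will differ):

# Sketch only: corrected accounting for a LiteLLM-normalised Anthropic usage dict.
usage = {
    "prompt_tokens": 7100,  # LiteLLM total: base input + cache reads + cache writes
    "completion_tokens": 900,
    "prompt_tokens_details": {"cached_tokens": 5000},  # cache reads
    "cache_creation_input_tokens": 2000,  # cache writes, preserved by LiteLLM
}

cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
cache_creation = usage.get("cache_creation_input_tokens", 0)

# TokenMetricsPlugin: prompt_tokens is already the total; no "+ cache_creation".
prompt_total = usage["prompt_tokens"]  # 7100, not 9100

# CostMetricsPlugin: only the remainder is billed at the full input rate.
base_input = max(0, usage["prompt_tokens"] - cached - cache_creation)  # 100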

Contributor

@planetf1 planetf1 left a comment


Looks like this won't work for the Anthropic API (via LiteLLM). As this is highlighted in this PR (cost files etc.), I will request changes, but it could be an additional PR if preferred.

@ajbozarth
Contributor Author

Thanks for the thorough review!

Test suggestions

We looked at both suggested tests but found the paths are already covered: test_cost_plugin_records_cost_for_known_model already exercises the no-cache_creation_input_tokens path with cache_creation_tokens=0, and the per-component pricing tests in test_pricing.py already verify each rate in isolation. Adding the combined tests wouldn't catch anything new.

Custom pricing partial override

The replace-entire-entry behaviour is intentional — we want custom files to be self-contained rather than implicitly inheriting built-in values that may change. We've improved the module docstring and the validation warning to make this constraint explicit.

LiteLLM double-counting

Good catch. We had caught that LiteLLM normalises cache_read_input_tokens into prompt_tokens, but missed that it also folds in cache_creation_input_tokens (transformation.py L1612–1617). Fixed both plugins:

  • TokenMetricsPlugin: dropped the + cache_creation; prompt_tokens already is the total.
  • CostMetricsPlugin: changed input_tokens = prompt_tokens - cached_tokens to prompt_tokens - cached_tokens - cache_creation so cache-write tokens aren't double-billed.

The existing test test_cost_plugin_cache_tokens_forwarded was asserting the buggy expected value and has been corrected to use a realistic LiteLLM-normalised shape.

LiteLLM normalises Anthropic usage so that prompt_tokens already includes
cache_creation_input_tokens and cache_read_input_tokens. Both plugins were
treating prompt_tokens as raw base input and adding cache fields on top,
causing double-counting.

- TokenMetricsPlugin: drop the + cache_creation addition
- CostMetricsPlugin: subtract both cached_tokens and cache_creation from
  prompt_tokens so write tokens are not billed at full rate and write rate
- Update test_cost_plugin_cache_tokens_forwarded to use a realistic
  LiteLLM-normalised shape with the correct expected input_tokens value
- Remove the now-redundant with-cache-creation token metrics parametrize case
- Clarify pricing.py docs and validation warning around the replace-not-merge
  behaviour of custom pricing file entries

Assisted-by: Claude Code
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
…cope in docs

Assisted-by: Claude Code
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth ajbozarth changed the title feat: add prompt cache token support to cost and token telemetry feat: add prompt cache token support to cost telemetry Apr 27, 2026
Contributor

@jakelorocco jakelorocco left a comment


I'm okay with this. I think we should be careful going forward because these costs could get arbitrarily complex.

I'm thinking we should actually bake the price update times into the record somewhere, so we can say "priced as of".

@ajbozarth
Contributor Author

I'm thinking we should actually bake the price update times into the record somewhere, so we can say "priced as of".

I'm down to add this. Claude had a couple of ideas I'll share below, but I also did some research and Claude did not find any other examples of a "priced as of" feature in other projects; in fact, pricing metrics in other projects are not super common. So if we implemented this we would be the trailblazers.


  1. Top-level _updated: Add a single date to builtin_pricing.json representing when the whole file was last verified, emitted as a pricing.as_of OTel attribute on the cost counter. Simple to maintain, but becomes misleading once individual provider prices drift — the date would suggest the whole file is fresh even if only one provider was updated.
  2. Per-entry as_of: Add a date field to each model entry in builtin_pricing.json so individual provider prices carry their own freshness timestamp, emitted as a pricing.as_of OTel attribute on the cost counter. Stays accurate as providers update independently, but adds a field to maintain on every price update and risks dates going stale if contributors forget to update them.

As you mentioned, we could always do this in a followup. I'm mixed on which way to do it and whether it's worth the overhead given the novelty anyway.
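
For concreteness, a rough shape for option 2 could be the following (the as_of field and pricing.as_of attribute names are placeholders; nothing is implemented):

# Hypothetical per-entry shape for option 2 in builtin_pricing.json, surfaced as
# a pricing.as_of attribute on the cost counter. Field names are placeholders.
entry = {
    "claude-sonnet-4-6": {
        "input_per_1m": 3.0,
        "output_per_1m": 15.0,
        "cache_write_per_1m": 3.75,
        "cache_read_per_1m": 0.30,
        "as_of": "2026-04-27",  # when this provider's prices were last verified
    }
}
attributes = {"pricing.as_of": entry["claude-sonnet-4-6"]["as_of"]}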

Contributor

@planetf1 planetf1 left a comment


Approved - one minor comment suggestion to try and clarify the anthropic difference -- but the fix looks good thanks

Comment thread on mellea/core/base.py (outdated)
Assisted-by: Claude Code
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth ajbozarth enabled auto-merge April 27, 2026 21:03
@ajbozarth ajbozarth added this pull request to the merge queue Apr 27, 2026
Merged via the queue into generative-computing:main with commit c1622f5 Apr 27, 2026
7 checks passed
@ajbozarth ajbozarth deleted the feat/890-cache-token-telemetry branch April 27, 2026 21:41

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

feat: add prompt cache token support to cost and token telemetry

3 participants