
feat: add prompt cache token support to cost telemetry#936

Merged
ajbozarth merged 4 commits into generative-computing:main from ajbozarth:feat/890-cache-token-telemetry
Apr 27, 2026

Conversation

@ajbozarth
Contributor

@ajbozarth ajbozarth commented Apr 25, 2026

Misc PR

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

Prompt cache tokens were not correctly accounted for in the cost metrics pipeline, causing mellea.llm.cost.usd to be inaccurate — cache reads (discounted) and writes (premium) were all priced at the full input rate.

This PR fixes this by updating CostMetricsPlugin to correctly price cache tokens, adding cache pricing fields to builtin_pricing.json for all current Anthropic and OpenAI models, and extending compute_cost() to accept cached_tokens and cache_creation_tokens params.

The correct cost formula is: (prompt_tokens - cached_tokens - cache_creation_tokens) * full_rate + cached_tokens * cache_read_rate + cache_creation_tokens * cache_write_rate

Note for reviewers — Anthropic cache pricing and LiteLLM normalization: CostMetricsPlugin reads cache read tokens from prompt_tokens_details.cached_tokens. This is intentional and correct for all backends, including Anthropic via LiteLLM. LiteLLM normalizes both cache_read_input_tokens and cache_creation_input_tokens into prompt_tokens (inflating it to the full total) and stores cache_read_input_tokens in prompt_tokens_details.cached_tokens before the response reaches the plugin layer. Subtracting both cached_tokens and cache_creation from prompt_tokens gives the base input tokens billed at the full rate.
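
To make the arithmetic concrete, here is a rough sketch of the pricing computation described above (the function name and rate parameters are illustrative, not the exact compute_cost() internals):

# Rough sketch of the pricing arithmetic described above. Rates are USD per 1M
# tokens; the real compute_cost() in mellea/telemetry/pricing.py may differ.
def compute_cost_sketch(
    prompt_tokens: int,
    completion_tokens: int,
    cached_tokens: int,
    cache_creation_tokens: int,
    input_per_1m: float,
    output_per_1m: float,
    cache_read_per_1m: float,
    cache_write_per_1m: float,
) -> float:
    # prompt_tokens follows the OpenAI/LiteLLM convention: it is the total, and
    # cache reads/writes are subsets of it, not additions on top.
    base_input = max(0, prompt_tokens - cached_tokens - cache_creation_tokens)
    return (
        base_input * input_per_1m
        + cached_tokens * cache_read_per_1m
        + cache_creation_tokens * cache_write_per_1m
        + completion_tokens * output_per_1m
    ) / 1_000_000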

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code was added
  • Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

Attribution

  • AI coding assistants used

@ajbozarth ajbozarth requested a review from nrfulton as a code owner April 25, 2026 00:47
@ajbozarth ajbozarth self-assigned this Apr 25, 2026
@ajbozarth ajbozarth requested review from a team and jakelorocco as code owners April 25, 2026 00:47
@github-actions github-actions Bot added the enhancement New feature or request label Apr 25, 2026
…erative-computing#890)

Record cache token costs accurately for Anthropic and OpenAI models.
Previously, cache reads and writes were excluded from cost estimates and
Anthropic cache_creation_input_tokens were excluded from the input token
counter.

- TokenMetricsPlugin: adds cache_creation_input_tokens to prompt_tokens
  for Anthropic (additive; not included in prompt_tokens by the API)
- CostMetricsPlugin: extracts cached_tokens from prompt_tokens_details
  and prices cache reads and writes separately using the correct formula
  (prompt_tokens - cached_tokens) * full_rate + cached_tokens * cache_read_rate
  + cache_creation_tokens * cache_write_rate
- builtin_pricing.json: adds cache_write_per_1m and cache_read_per_1m
  for all current Anthropic and OpenAI models
- pricing.py: extends compute_cost() with cached_tokens and
  cache_creation_tokens params

Assisted-by: Claude Code
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth ajbozarth force-pushed the feat/890-cache-token-telemetry branch from 695bf10 to 9ccfabf on April 25, 2026 01:00
Contributor

@planetf1 planetf1 left a comment


LGTM - just a few minor comments which could be followups

Suggestions

test/telemetry/test_metrics_plugins.py — add OpenAI-shaped path to cost plugin tests

test_cost_plugin_cache_tokens_forwarded only tests the full Anthropic shape (both prompt_tokens_details.cached_tokens and cache_creation_input_tokens populated). Worth adding a case for the OpenAI shape where cache_creation_input_tokens is absent, to explicitly verify cache_creation_tokens=0 is passed through:

async def test_cost_plugin_openai_cache_shape(cost_plugin):
    """OpenAI-shaped usage: cached_tokens present, no cache_creation_input_tokens."""
    payload = _make_cost_payload(
        usage={
            "prompt_tokens": 100,
            "completion_tokens": 20,
            "total_tokens": 120,
            "prompt_tokens_details": {"cached_tokens": 50},
        }
    )
    with (
        patch("mellea.telemetry.pricing.compute_cost", return_value=0.005) as mock_cost,
        patch("mellea.telemetry.metrics.record_cost"),
    ):
        await cost_plugin.record_cost_metrics(payload, {})
        mock_cost.assert_called_once_with(
            model="test-model",
            input_tokens=50,
            output_tokens=20,
            cached_tokens=50,
            cache_creation_tokens=0,
        )

test/telemetry/test_pricing.py - add combined all-four-components cost test

All four cost components are currently tested in isolation. A single combined test would catch double-counting regressions that isolated tests can't see:

def test_compute_cost_all_components(fresh_registry):
    # claude-sonnet-4-6: input=3.0, output=15.0, cache_write=3.75, cache_read=0.30
    cost = fresh_registry.compute_cost(
        "claude-sonnet-4-6",
        input_tokens=1000,
        output_tokens=500,
        cached_tokens=200,
        cache_creation_tokens=100,
    )
    expected = (
        (1000 / 1e6) * 3.0    # input
        + (500 / 1e6) * 15.0  # output
        + (200 / 1e6) * 0.30  # cache read
        + (100 / 1e6) * 3.75  # cache write
    )
    assert cost is not None
    assert abs(cost - expected) < 1e-12

---
Follow-up (not blocking)

mellea/telemetry/pricing.py - custom pricing file cannot override only cache rates

_validate_pricing_entry requires at least one of input_per_1m/output_per_1m to be present, so a custom pricing file that only overrides cache rates for a built-in model (e.g. {"claude-sonnet-4-6": {"cache_read_per_1m": 0.50}}) is silently rejected. The entry is dropped with only a log warning, which will confuse anyone trying to adjust just their cache pricing. Worth a follow-up issue to either document the constraint explicitly or support partial overrides that merge onto the built-in entry.
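
For illustration, a merge-onto-builtin partial override could look roughly like this (merge_pricing_entry is a hypothetical helper sketched here, not existing code):

# Hypothetical merge-onto-builtin behaviour: start from the built-in entry (if
# any) and overlay only the fields the custom pricing file provides.
def merge_pricing_entry(builtin: dict | None, custom: dict) -> dict:
    merged = dict(builtin or {})
    merged.update(custom)
    return merged

# A custom file containing only {"cache_read_per_1m": 0.50} would then keep the
# built-in input/output rates and change just the cache read rate.
merged = merge_pricing_entry(
    {"input_per_1m": 3.0, "output_per_1m": 15.0, "cache_read_per_1m": 0.30},
    {"cache_read_per_1m": 0.50},
)
assert merged["input_per_1m"] == 3.0 and merged["cache_read_per_1m"] == 0.50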

@planetf1 planetf1 self-requested a review April 27, 2026 15:03
@planetf1
Contributor

Following the above, I noticed a discrepancy with Anthropic models -- or rather, specifically when calling the Anthropic API via LiteLLM. Here are the details -- I think there's a case here for some integration testing, though it may be tricky to automate (followup issue?) - this needs working through in a little more detail:

The problem

The comment at metrics_plugins.py:64 says cache_creation_input_tokens is "not in prompt_tokens". That's true for the raw Anthropic API, but not after LiteLLM normalisation. LiteLLM inflates prompt_tokens to include both cache types before it reaches Mellea (transformation.py L1605–1628):

prompt_tokens = input_tokens + cache_creation_input_tokens + cache_read_input_tokens

It also preserves the raw cache_creation_input_tokens field in the dict (Pydantic extra="allow"). Both plugins read both values, double-counting. LiteLLM's own UI hit the same issue; see litellm#25735.

TokenMetricsPlugin (L64–67)

Adds cache_creation to already-inflated prompt_tokens. Reports e.g. 9100 tokens when actual total is 7100.

CostMetricsPlugin (L176–182)

input_tokens = prompt_tokens - cached_tokens still includes cache_creation, which is then charged again at write_rate. Cache-write tokens get priced at full_rate + write_rate (~65% cost overestimate for write-heavy requests).

Scope

Only affects Anthropic via LiteLLMBackend. Bedrock (uses OpenAIBackend), OpenAI, Ollama, HF, WatsonX are all fine.

Suggested fix

- TokenMetricsPlugin: use prompt_tokens as-is (already the total) — drop the + cache_creation
- CostMetricsPlugin: input_tokens = max(0, prompt_tokens - cached_tokens - cache_creation)
- Docstring on GenerationMetadata.usage: document the invariant that prompt_tokens follows OpenAI convention (total, not remainder) and that cached_tokens/cache_creation_input_tokens are subsets, not additive
- Tests: existing tests encode the wrong assumption (prompt_tokens=100 + cache_creation=50 expects 150). Rewrite with realistic LiteLLM-normalised shapes so the double-counting would be caught
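
Putting this together, a minimal sketch of the corrected accounting for a LiteLLM-normalised Anthropic response (illustrative numbers chosen so the totals match the 7100 vs 9100 example above; the actual plugin code will differ):

# Sketch only: corrected accounting for a LiteLLM-normalised Anthropic usage dict.
usage = {
    "prompt_tokens": 7100,  # LiteLLM total: base input + cache reads + cache writes
    "completion_tokens": 900,
    "prompt_tokens_details": {"cached_tokens": 5000},  # cache reads
    "cache_creation_input_tokens": 2000,  # cache writes, preserved by LiteLLM
}

cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
cache_creation = usage.get("cache_creation_input_tokens", 0)

# TokenMetricsPlugin: prompt_tokens is already the total; no "+ cache_creation".
prompt_total = usage["prompt_tokens"]  # 7100, not 9100

# CostMetricsPlugin: only the remainder is billed at the full input rate.
base_input = max(0, usage["prompt_tokens"] - cached - cache_creation)  # 100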

Contributor

@planetf1 planetf1 left a comment


Looks like this won't work for the Anthropic API (via LiteLLM). As this is highlighted in this PR (cost files etc.), I will request changes, but it could be an additional PR if preferred.

@ajbozarth
Contributor Author

Thanks for the thorough review!

Test suggestions

We looked at both suggested tests but found the paths are already covered: test_cost_plugin_records_cost_for_known_model already exercises the no-cache_creation_input_tokens path with cache_creation_tokens=0, and the per-component pricing tests in test_pricing.py already verify each rate in isolation. Adding the combined tests wouldn't catch anything new.

Custom pricing partial override

The replace-entire-entry behaviour is intentional — we want custom files to be self-contained rather than implicitly inheriting built-in values that may change. We've improved the module docstring and the validation warning to make this constraint explicit.

LiteLLM double-counting

Good catch. We had caught that LiteLLM normalises cache_read_input_tokens into prompt_tokens, but missed that it also folds in cache_creation_input_tokens (transformation.py L1612–1617). Fixed both plugins:

  • TokenMetricsPlugin: dropped the + cache_creation; prompt_tokens already is the total.
  • CostMetricsPlugin: changed input_tokens = prompt_tokens - cached_tokens to prompt_tokens - cached_tokens - cache_creation so cache-write tokens aren't double-billed.

The existing test test_cost_plugin_cache_tokens_forwarded was asserting the buggy expected value and has been corrected to use a realistic LiteLLM-normalised shape.

LiteLLM normalises Anthropic usage so that prompt_tokens already includes
cache_creation_input_tokens and cache_read_input_tokens. Both plugins were
treating prompt_tokens as raw base input and adding cache fields on top,
causing double-counting.

- TokenMetricsPlugin: drop the + cache_creation addition
- CostMetricsPlugin: subtract both cached_tokens and cache_creation from
  prompt_tokens so write tokens are not billed at full rate and write rate
- Update test_cost_plugin_cache_tokens_forwarded to use a realistic
  LiteLLM-normalised shape with the correct expected input_tokens value
- Remove the now-redundant with-cache-creation token metrics parametrize case
- Clarify pricing.py docs and validation warning around the replace-not-merge
  behaviour of custom pricing file entries

Assisted-by: Claude Code
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
…cope in docs

Assisted-by: Claude Code
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth ajbozarth changed the title feat: add prompt cache token support to cost and token telemetry feat: add prompt cache token support to cost telemetry Apr 27, 2026
Contributor

@jakelorocco jakelorocco left a comment


I'm okay with this. I think we should be careful going forward because these costs could get arbitrarily complex.

I'm thinking we should actually bake the price update times into the record somewhere, so we can say "priced as of".

@ajbozarth
Contributor Author

I'm thinking we should actually bake the price update times into the record somewhere, so we can say "priced as of".

I'm down to add this. Claude had a couple of ideas I'll share below, but I also did some research and Claude did not find any other examples of a "priced as of" feature in other projects; in fact, pricing metrics in other projects are not super common. So if we implemented this we would be the trailblazers.


  1. Top-level _updated: Add a single date to builtin_pricing.json representing when the whole file was last verified, emitted as a pricing.as_of OTel attribute on the cost counter. Simple to maintain, but becomes misleading once individual provider prices drift — the date would suggest the whole file is fresh even if only one provider was updated.
  2. Per-entry as_of: Add a date field to each model entry in builtin_pricing.json so individual provider prices carry their own freshness timestamp, emitted as a pricing.as_of OTel attribute on the cost counter. Stays accurate as providers update independently, but adds a field to maintain on every price update and risks dates going stale if contributors forget to update them.

As you mentioned, we could always do this in a followup. I'm mixed on which way to do it and whether it's worth the overhead given the novelty anyway.
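
For concreteness, a rough shape for option 2 could be the following (the as_of field and pricing.as_of attribute names are placeholders; nothing is implemented):

# Hypothetical per-entry shape for option 2 in builtin_pricing.json, surfaced as
# a pricing.as_of attribute on the cost counter. Field names are placeholders.
entry = {
    "claude-sonnet-4-6": {
        "input_per_1m": 3.0,
        "output_per_1m": 15.0,
        "cache_write_per_1m": 3.75,
        "cache_read_per_1m": 0.30,
        "as_of": "2026-04-27",  # when this provider's prices were last verified
    }
}
attributes = {"pricing.as_of": entry["claude-sonnet-4-6"]["as_of"]}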

Contributor

@planetf1 planetf1 left a comment


Approved - one minor comment suggestion to try and clarify the anthropic difference -- but the fix looks good thanks

Comment thread on mellea/core/base.py (outdated)
Assisted-by: Claude Code
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth ajbozarth enabled auto-merge April 27, 2026 21:03
@ajbozarth ajbozarth added this pull request to the merge queue Apr 27, 2026
Merged via the queue into generative-computing:main with commit c1622f5 Apr 27, 2026
7 checks passed
@ajbozarth ajbozarth deleted the feat/890-cache-token-telemetry branch April 27, 2026 21:41

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

feat: add prompt cache token support to cost and token telemetry

3 participants