
Conversation

@ethanndickson (Member) commented on Dec 2, 2025

## Problem

The application was severely underestimating costs for conversations involving tool calls.

### Root Cause

The Vercel AI SDK provides two usage metrics:

- `streamResult.usage`: token usage from **the last step only**
- `streamResult.totalUsage`: the sum of token usage across **all steps**

The application was using `usage` instead of `totalUsage`. For a conversation with 10 tool calls, only ~1/10th of the actual consumption was reported. A $5 conversation would display as $0.50.
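For illustration, a minimal sketch of the distinction, assuming the AI SDK v5 `streamText` API (the model id and prompt are placeholders):

```ts
import { streamText, stepCountIs } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

const result = streamText({
  model: anthropic("claude-sonnet-4-5"),
  prompt: "Read the repo and summarize it.",
  // tools omitted for brevity; stopWhen allows a multi-step tool loop
  stopWhen: stepCountIs(10),
});

// Both promises resolve once the stream has finished.
const lastStep = await result.usage;      // last step only
const allSteps = await result.totalUsage; // summed across every step

// With 10 tool-call steps, allSteps.inputTokens can be ~10x lastStep.inputTokens.
console.log(lastStep.inputTokens, allSteps.inputTokens);
```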

### The Complicating Factor

Two UI elements use token data with different semantic requirements:

| Display | Needs | Why |
|---------|-------|-----|
| **Cost** | Sum of all steps | If the model read the context 10 times, you paid for 10 reads |
| **Context window** | Last step only | Shows "how full is the conversation now" for the next request |

Simply switching to `totalUsage` would fix costs but break the context display (showing 500% utilization after many tool calls).
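Sketched as code (function names are illustrative, not from this PR), the two consumers want different inputs:

```ts
import type { LanguageModelUsage } from "ai";

// Cost: every step's tokens were billed, so use the cumulative totals.
function estimateCostUsd(
  total: LanguageModelUsage,
  inputPricePerToken: number,
  outputPricePerToken: number,
): number {
  return (
    (total.inputTokens ?? 0) * inputPricePerToken +
    (total.outputTokens ?? 0) * outputPricePerToken
  );
}

// Context meter: only the last step reflects what the *next* request
// will carry; cumulative totals would overflow the gauge.
function contextUtilization(
  lastStep: LanguageModelUsage,
  contextWindow: number,
): number {
  return (lastStep.totalTokens ?? 0) / contextWindow;
}
```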

### Cache Creation Tokens

Anthropic's cache creation tokens (`cacheCreationInputTokens`):

- appear only in provider-specific metadata, not in the normalized usage
- need to be summed across all steps
- are not automatically aggregated by the AI SDK

Even with `totalUsage`, cache creation costs were lost unless manually aggregated from each step's provider metadata, as sketched below.
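A sketch of that manual aggregation, assuming the v5 `onStepFinish` callback and Anthropic's `anthropic.cacheCreationInputTokens` metadata key (the cast is because provider metadata is loosely typed JSON):

```ts
import { streamText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

let cacheCreationInputTokens = 0;

const result = streamText({
  model: anthropic("claude-sonnet-4-5"),
  prompt: "...",
  onStepFinish: ({ providerMetadata }) => {
    // Provider metadata is per-step and never folded into totalUsage,
    // so cache-creation tokens have to be summed by hand.
    cacheCreationInputTokens += Number(
      (providerMetadata as any)?.anthropic?.cacheCreationInputTokens ?? 0,
    );
  },
});

await result.consumeStream(); // drain the stream so every step runs
```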

## Solution

Track both values, each with its own semantic purpose:

**For cost calculation:**

- `usage` / `cumulativeUsage`: total across all steps
- `providerMetadata` / `cumulativeProviderMetadata`: aggregated cache creation tokens

**For context window display:**

- `contextUsage` / `lastContextUsage`: last step only
- `contextProviderMetadata`: last step only
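An illustrative shape for the dual fields (simplified; the real definitions live in this PR's type changes):

```ts
import type { LanguageModelUsage, ProviderMetadata } from "ai";

// Simplified sketch of StreamEndEvent after this PR.
interface StreamEndEvent {
  // For cost: cumulative across all steps.
  usage: LanguageModelUsage;
  providerMetadata?: ProviderMetadata;
  // For the context window: last step only.
  contextUsage?: LanguageModelUsage;
  contextProviderMetadata?: ProviderMetadata;
}
```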

### Key Changes

1. **Backend** (`streamManager.ts`): use `totalUsage` for cost, track `lastStepUsage` for context, and aggregate provider metadata across steps
2. **Types**: extended `StreamEndEvent`, `MuxMetadata`, and `UsageDeltaEvent` with dual fields
3. **Frontend**: `StreamingMessageAggregator` tracks both cumulative and per-step usage
4. **Store**: `WorkspaceUsageState` provides `usageHistory` (for cost) and `lastContextUsage` (for the context window)
5. **UI**: components use the field appropriate to their purpose

### Also Fixed

- **OpenAI cached token double-counting**: gateway models (`mux-gateway:openai/gpt-5.1`) weren't recognized as OpenAI, so cached tokens were counted in both "Cache Read" and "Input". Gateway model strings are now normalized before provider detection (see the sketch after this list).

- **Google/Gemini cached token double-counting**: Google, like OpenAI, reports `inputTokens` inclusive of `cachedInputTokens`. The subtraction logic was extended to handle Google models.
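Both fixes boil down to a few lines; a sketch with hypothetical helper names:

```ts
// Strip the gateway prefix so "mux-gateway:openai/gpt-5.1" is detected
// as an OpenAI model ("openai:gpt-5.1").
function normalizeGatewayModel(model: string): string {
  const prefix = "mux-gateway:";
  return model.startsWith(prefix)
    ? model.slice(prefix.length).replace("/", ":")
    : model;
}

// OpenAI and Google report inputTokens INCLUSIVE of cachedInputTokens;
// Anthropic reports them separately. Subtract to avoid counting cached
// tokens in both the "Cache Read" and "Input" buckets.
function billableInputTokens(
  model: string,
  inputTokens: number,
  cachedInputTokens: number,
): number {
  const provider = normalizeGatewayModel(model).split(":")[0];
  return provider === "openai" || provider === "google"
    ? inputTokens - cachedInputTokens
    : inputTokens;
}
```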


---

_Generated with `mux`_

When using mux-gateway (e.g., mux-gateway:openai/gpt-5.1), the OpenAI
provider detection failed because the model string doesn't start with
'openai:'. This caused cached tokens to be double-counted in cost
calculations, resulting in ~2.5x overestimation.

Fix: normalize gateway model strings before provider detection so
'mux-gateway:openai/gpt-5.1' correctly triggers OpenAI-specific
handling (subtract cachedInputTokens from inputTokens).
@ethanndickson changed the title from "🤖 fix: normalize gateway models for OpenAI cost calculation" to "🤖 fix: accurate cost estimation for multi-step tool usage" on Dec 2, 2025
Google/Gemini, like OpenAI, reports inputTokens INCLUSIVE of
cachedInputTokens. Extend the subtraction logic to also handle
Google models to avoid double-counting cached tokens.
@ethanndickson (Member, Author) commented:

@codex review

@chatgpt-codex-connector commented:

Codex Review: Didn't find any major issues. Keep them coming!


@ethanndickson added this pull request to the merge queue on Dec 2, 2025
github-merge-queue bot pushed a commit that referenced this pull request Dec 2, 2025
Merged via the queue into main with commit b3be437 Dec 2, 2025
13 checks passed
@ethanndickson deleted the fix-cost-estimation-tool-usage branch on December 2, 2025 at 03:46