feat: surface token usage metadata for billing (#907) #10
Merged
Add `Usage.reasoning: number` (required, defaults to 0) so billing/metering consumers can distinguish pure-completion tokens from reasoning tokens. Per the OpenAI wire contract, reasoning is a subset of `output` rather than a sibling, so `totalTokens` keeps the existing invariant (input + output + cacheRead + cacheWrite) with no double-counting.
Add two helpers:
- `snapshotUsage(usage)`: two-level shallow copy used by event emits so consumers receive a stable value independent of subsequent accumulation.
- `addUsage(target, delta)`: in-place additive accumulator for cumulative totals across turns. Returns the mutated target for chaining.
Align `calculateUsageCost` to return `void` (matches the per-provider local copies; no caller used the previous `Usage['cost']` return).
Export `ZERO_USAGE` from `tools/test/fixtures.ts` so workspace tests share a single canonical zero-usage literal instead of duplicating it.
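A minimal sketch of the two helpers described in this commit. The field names on `Usage` come from this PR; the per-category fields under `cost` (other than `total`) and the `zeroUsage` factory are assumptions made for illustration, not the repository's actual definitions.

```typescript
// Assumed shape: only `cost.total` is named in the PR; the other cost
// fields are guesses that mirror the token categories.
interface UsageCost {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
  total: number;
}

interface Usage {
  input: number;
  output: number;
  reasoning: number; // subset of `output`, not an extra bucket
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
  cost: UsageCost;
}

// Hypothetical factory standing in for a shared zero-usage literal.
function zeroUsage(): Usage {
  return {
    input: 0, output: 0, reasoning: 0, cacheRead: 0, cacheWrite: 0, totalTokens: 0,
    cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0, total: 0 },
  };
}

// Two-level shallow copy: the top-level object and the nested `cost` object.
function snapshotUsage(usage: Usage): Usage {
  return { ...usage, cost: { ...usage.cost } };
}

// In-place accumulator; returns `target` so calls can be chained.
function addUsage(target: Usage, delta: Usage): Usage {
  target.input += delta.input;
  target.output += delta.output;
  target.reasoning += delta.reasoning;
  target.cacheRead += delta.cacheRead;
  target.cacheWrite += delta.cacheWrite;
  target.totalTokens += delta.totalTokens;
  target.cost.input += delta.cost.input;
  target.cost.output += delta.cost.output;
  target.cost.cacheRead += delta.cost.cacheRead;
  target.cost.cacheWrite += delta.cost.cacheWrite;
  target.cost.total += delta.cost.total;
  return target;
}
```

Because the snapshot copies both levels, a consumer holding an emitted value does not see later `addUsage` calls on the rolling total.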
…okens fallback
- Stop double-counting reasoning by setting usage.output to completion_tokens (which already includes reasoning per OpenAI's wire contract)
- Expose reasoning as a separate read-only count on usage.reasoning
- Include cacheWrite in the totalTokens fallback when total_tokens is absent
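The output/reasoning mapping and the `totalTokens` fallback can be sketched as follows. The wire field names are the ones OpenAI documents for its usage object; `extractOutput` and `resolveTotalTokens` are hypothetical helper names, and how `input` relates to cached tokens is deliberately left out because this commit message does not state it.

```typescript
// Subset of the OpenAI usage payload that matters for this fix.
interface OpenAIWireUsage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens?: number;
  completion_tokens_details?: { reasoning_tokens?: number };
}

// `completion_tokens` already includes reasoning, so nothing is added to it.
function extractOutput(wire: OpenAIWireUsage): { output: number; reasoning: number } {
  return {
    output: wire.completion_tokens,
    reasoning: wire.completion_tokens_details?.reasoning_tokens ?? 0,
  };
}

// Hypothetical helper: prefer the provider's total, otherwise rebuild it
// from all four token categories, including cacheWrite.
function resolveTotalTokens(
  wireTotal: number | undefined,
  parts: { input: number; output: number; cacheRead: number; cacheWrite: number },
): number {
  return wireTotal ?? parts.input + parts.output + parts.cacheRead + parts.cacheWrite;
}
```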
The Ollama adapter previously assigned input/output/totalTokens but never ran the cost schedule, leaving cost.total at zero even when the model descriptor defined per-token rates. Apply the local calculateUsageCost helper after token assignment so the same Usage invariants hold across providers.
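As an illustration of the step the Ollama adapter was skipping, here is a rough sketch of a cost pass that mutates `usage.cost` in place and returns `void`. The `CostRates` shape, the parameter order, and the per-category cost fields are assumptions; only `calculateUsageCost`, its `void` return, per-token rates, and `cost.total` are taken from this PR.

```typescript
// Hypothetical per-token rate table carried by a model descriptor.
interface CostRates {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
}

interface CostedUsage {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
  cost: { input: number; output: number; cacheRead: number; cacheWrite: number; total: number };
}

// Applies the rate table to the token counts already assigned on `usage`.
function calculateUsageCost(rates: CostRates, usage: CostedUsage): void {
  usage.cost.input = usage.input * rates.input;
  usage.cost.output = usage.output * rates.output;
  usage.cost.cacheRead = usage.cacheRead * rates.cacheRead;
  usage.cost.cacheWrite = usage.cacheWrite * rates.cacheWrite;
  usage.cost.total =
    usage.cost.input + usage.cost.output + usage.cost.cacheRead + usage.cost.cacheWrite;
}
```

The call has to happen after token assignment; running it earlier would price zero tokens and leave `cost.total` at zero, which is the symptom this commit describes.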
The Anthropic API does not expose a reasoning-token count even when extended thinking is enabled; thinking tokens are folded into output_tokens on the server side. Initialize usage.reasoning to 0 so the field is always present, and add a regression guard so we do not later populate it from a hallucinated payload field.
…ent_end
Add totalUsage to AgentState and accumulate per-message Usage (tokens and cost.*) as each turn completes. Snapshot the rolling total onto turn_end and agent_end events so consumers can read a cumulative figure without re-walking messages[]. Reset on prompt() (matching stepCount semantics), preserve across continue().
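The reset/preserve semantics can be pictured with a small stand-in for the agent's bookkeeping. `UsageTracker`, `TokenTotals`, `completeTurn`, and `continueRun` are hypothetical names used only for this sketch (`continueRun` stands in for `continue()`), and the usage shape is trimmed to three fields to keep it short.

```typescript
interface TokenTotals {
  input: number;
  output: number;
  totalTokens: number;
}

const zeroTotals = (): TokenTotals => ({ input: 0, output: 0, totalTokens: 0 });

// Hypothetical stand-in for the totalUsage handling on AgentState.
class UsageTracker {
  private totalUsage: TokenTotals = zeroTotals();

  // A new prompt starts a fresh run, so the rolling total resets.
  prompt(): void {
    this.totalUsage = zeroTotals();
  }

  // Continuing an existing run keeps the rolling total untouched.
  continueRun(): void {}

  // Called when a turn completes; the returned copy is what an event
  // such as turn_end would carry.
  completeTurn(message: TokenTotals): TokenTotals {
    this.totalUsage.input += message.input;
    this.totalUsage.output += message.output;
    this.totalUsage.totalTokens += message.totalTokens;
    return { ...this.totalUsage };
  }
}
```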
… guard
Surface the cumulative Usage from agent_end/turn_end as a top-level useChat field, null before the first event. Guard the setter on totalTokens so re-render does not fire when nothing changed. Reset to null on every new prompt(), mirroring the agent-side reset.
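One way to express the `totalTokens` guard is as a pure reducer that hands back the previous object when nothing changed, which lets React's `useState` skip the re-render because the state identity is unchanged. `nextUsage` is a hypothetical name; the hook's real implementation is not shown in this PR text.

```typescript
interface UsageLike {
  totalTokens: number;
}

// Returns the previous reference when totalTokens is unchanged, so a
// functional state update such as setUsage(prev => nextUsage(prev, incoming))
// becomes a no-op for React.
function nextUsage<T extends UsageLike>(prev: T | null, incoming: T): T {
  if (prev !== null && prev.totalTokens === incoming.totalTokens) {
    return prev;
  }
  return incoming;
}
```

Keying the comparison on `totalTokens` alone is cheap, and it relies on the cumulative total only ever growing within a run.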
Capture the design decisions for the token usage metadata work that landed in this branch (reasoning-as-subset, no provider-named aliases, cumulative usage location, shared cost helper, etc.) as an append-only log next to REDESIGN_DECISIONS.md.
The merge order in createModel spread builtIn.compat first then this.compat,
so the adapter's generic default (maxTokensField: 'max_tokens') silently
clobbered the model-specific override that the built-in entry sets
('max_completion_tokens' for reasoning-capable models). Result: every
reasoning model loaded via createModel sent the wrong field name and OpenAI
returned 400 "Unsupported parameter: 'max_tokens'". Same bug applied to
headers. The mock-mode unit tests didn't catch it because the mocked fetch
never validated the request body — the live smoke test caught it on the
first real call.
Swap to: adapter defaults → built-in catalog → caller overrides, so the
most specific source wins. Adds two regression tests.
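The corrected precedence can be sketched with plain object spreads, where later sources override earlier ones. `Compat` and `resolveCompat` are hypothetical names, and merging `headers` key by key is an assumption about how the fix treats them.

```typescript
interface Compat {
  maxTokensField?: 'max_tokens' | 'max_completion_tokens';
  headers?: Record<string, string>;
}

// Least specific first, most specific last:
// adapter defaults -> built-in catalog entry -> caller overrides.
function resolveCompat(adapterDefaults: Compat, builtIn: Compat, overrides: Compat): Compat {
  return {
    ...adapterDefaults,
    ...builtIn,
    ...overrides,
    // Assumed behavior: headers merge per key in the same order.
    headers: { ...adapterDefaults.headers, ...builtIn.headers, ...overrides.headers },
  };
}
```

With the previous order (built-in first, adapter second) the adapter's generic `'max_tokens'` would be spread last and win, which is the bug described above.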
Empirical-verification pass for the LLM-call metadata work (issue #907).
Until now every assertion rested on mocked SSE streams; this adds live
provider eval suites that hit real endpoints and verify three load-bearing
claims against actual wire payloads:
- OpenAI: `completion_tokens` already includes `reasoning_tokens` (so
`output = completion_tokens`, no carve-out)
- OpenAI: `prompt_tokens_details.cached_tokens` populates on ≥1024-token
prefix matches and surfaces as `usage.cacheRead`
- Ollama: thinking content has no associated token count; `usage.reasoning`
must stay 0 even on a thinking-on Qwen3 turn
The reasoning-subset claim is the one that drove removing `+ reasoningTokens`
from the OpenAI output extractor — live verification confirms the wire shape
matches our assumption against `gpt-5.4-nano`.
Infrastructure:
- `tools/test/load-env.js` walks up to find a workspace `.env`; silent if
absent so CI is unaffected
- `tools/test/live.ts` provides `liveDescribe`/`requireEnv`/`suiteLevel`
helpers (OpenAI/Ollama/Agent suites use the equivalent inline pattern
for full TypeScript inference; the helpers are exported for future use)
- Each suite is gated by `<NAMESPACE>_LIVE_SUITE=smoke|extended`; runner
scripts set `*_LIVE_READY=1` which un-ignores the live test files and
disables the global `fetch` mock in jest setup
- New root scripts: `test:live:openai{,:smoke,:extended}`,
`test:live:agent{,:smoke,:extended}`
- Ollama smoke runner unchanged; ollama.live.test.ts extended with a new
`Ollama live token-usage audit` describe block (4 tests)
- `.gitignore` updated to cover `.env`/`.env.local` (secrets-leak gap)
Suites are excluded from default `pnpm test` via `testPathIgnorePatterns`,
require explicit env vars, and never run in CI. See
`LLM_METADATA_DECISIONS.md` #19 for the rationale and #20 for the
adapter→builtin→override precedence bug the live tests uncovered.
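The env-var gating might look roughly like this. The helper name `suiteLevel` appears in the commit message, but its signature here is a guess: this version takes the environment as a parameter so it can run without Node typings, and the `OPENAI` namespace in the usage note is likewise assumed.

```typescript
type SuiteLevel = 'smoke' | 'extended';

// Reads `<NAMESPACE>_LIVE_SUITE` and returns the requested level, or null
// when the variable is unset or holds an unrecognized value.
function suiteLevel(
  namespace: string,
  env: Record<string, string | undefined>,
): SuiteLevel | null {
  const raw = env[`${namespace}_LIVE_SUITE`];
  return raw === 'smoke' || raw === 'extended' ? raw : null;
}
```

A suite could then skip itself when `suiteLevel('OPENAI', process.env)` returns `null`, which keeps a default `pnpm test` run free of network calls.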
Verified locally:
- `pnpm test:live:openai:extended`: 5/5 pass
- `pnpm test:live:ollama:extended`: 12/12 pass
- `pnpm test:live:agent:extended`: 3/3 pass
- `pnpm test` (default): 109/109 pass with no live env vars set
Live-provider eval hardening pass
Pushed two follow-up commits:
Claims now empirically verified
Local results
Decision log updates
Deferred (known gap)
Summary
Producer side of the billing/metering hookup tracked in constructive-io/constructive-planning#907. Surfaces `prompt_tokens`, `completion_tokens`, `total_tokens`, and `reasoning_tokens` on agentic-kit responses, plus a cumulative rollup on the agent loop and React hook.

Field mapping (for the billing consumer)

| Billing field | agentic-kit field |
| --- | --- |
| `prompt_tokens` | `usage.input` |
| `completion_tokens` | `usage.output` |
| `reasoning_tokens` | `usage.reasoning` |
| `total_tokens` | `usage.totalTokens` |
| `cache_read_tokens` | `usage.cacheRead` |
| `cache_write_tokens` | `usage.cacheWrite` |

`reasoning` is a subset of `output` (matches OpenAI's wire contract: `completion_tokens` already includes reasoning). `totalTokens = input + output + cacheRead + cacheWrite`: invariant unchanged, no double-counting. Billing computes pure-completion as `output - reasoning`.

Commit structure

Each concern is its own commit so reviewers can land them independently if needed:

- `feat(agentic-kit)`: Add `reasoning` to `Usage`, add `addUsage`/`snapshotUsage` helpers, drop unused return from `calculateUsageCost`.
- `fix(openai)`: Stop double-counting reasoning into `output`, expose `usage.reasoning`, include `cacheWrite` in the `totalTokens` fallback.
- `fix(ollama)`: Actually invoke `calculateUsageCost` so `cost.total` populates (was silently zero).
- `fix(anthropic)`: Initialize `reasoning: 0` (API does not expose this field).
- `feat(agent)`: Accumulate `totalUsage` on `AgentState`, snapshot onto `turn_end`/`agent_end` events, reset on `prompt()`, preserve across `continue()`.
- `feat(react)`: Surface `usage: Usage | null` on `useChat` with a `totalTokens`-keyed change-detection guard to avoid no-op re-renders.
- `docs`: Append-only `LLM_METADATA_DECISIONS.md` recording the design decisions.

Out of scope (explicit)

- Provider-named aliases (`promptTokens`, etc.): single canonical shape; consumers translate at the boundary.
- `prompt_tokens_details.cache_write_tokens` ingestion: no consumer yet.

Test plan

- `pnpm build` succeeds across the workspace (ESM/CJS dual output)
- `pnpm -r test`: new assertions pass, existing usage assertions still pass
- `output > 0`, `reasoning > 0`, `totalTokens === input + output + cacheRead + cacheWrite`
- `state.totalUsage` matches manual sum across turns field-for-field
- `reasoning === 0`, `cacheRead`/`cacheWrite` non-zero
- `cost` set on the descriptor: `cost.total > 0`
- `useChat().usage` populates on `agent_end` and resets on new `prompt()`
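For the billing consumer, the two relationships stated in the summary reduce to a pair of small checks. This is a sketch under the field names given in the mapping; `pureCompletionTokens` and `holdsTotalInvariant` are hypothetical helper names, not part of agentic-kit.

```typescript
interface UsageTokens {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
}

// Reasoning is a subset of output, so the visible completion is the remainder.
function pureCompletionTokens(usage: UsageTokens): number {
  return usage.output - usage.reasoning;
}

// Reasoning is intentionally absent from the sum: it is already inside output.
function holdsTotalInvariant(usage: UsageTokens): boolean {
  return usage.totalTokens === usage.input + usage.output + usage.cacheRead + usage.cacheWrite;
}
```

Adding `reasoning` to the sum would count those tokens twice, which is the double-counting the OpenAI fix in this PR removes.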