feat(events): emit 5-way token breakdown + context-window utilization in message_complete#87
feat(events): emit 5-way token breakdown + context-window utilization in message_complete#87byapparov wants to merge 2 commits into
Conversation
… in message_complete (#86) - Expand `tokens` in `message_complete` from an opaque `info.tokens` passthrough to an explicit object with all 5 fields: input / output / reasoning / cache.read / cache.write — mirroring upstream LLM.Usage shape. The data was already captured in MessageV2.Assistant.tokens via StepFinishPart accumulation; this change surfaces it explicitly. - Add `context: { used, limit, ratio }` to `message_complete`: - `used = input + cache.read` (tokens occupying the context window this turn) - `limit` sourced from Provider.getModel() → model.limit.context (models.dev) - `ratio = used / limit`; emits `null` when limit is unknown (unregistered endpoint) - Cost kept correct: `info.cost` accumulates real per-step cost from StepFinishPart, NOT from the new step.ended event which emits cost:0 (the cost:0 trap). - Update EVENTS.md with the extended schema and field-by-field documentation. - Add TDD test file (RED→GREEN): `test/cli/usage-token-breakdown.test.ts`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0187MsfK1upr6K2BKVbmaebQ
Code reviewVerdict: Address the major findings before merging. · 🔴 0 · 🟠 1 · 🟡 1 · ⚪ 0 · 0/2 resolved
🤖 Fix all 2 open findings with your agent📋 Out-of-diff findings (2)
Reviewed 3 files · 0 inline · view all 2 findings ↗ aictrl · AI code review for fast-moving teams · aictrl.dev |
…dContextWindow
Custom models without a registered context limit default to `limit.context = 0`
(provider.ts:929). The old guard `contextLimit != null` passed for 0, causing
`ratio = used / 0 = Infinity`, which JSON.stringify serialises as `null` *inside*
the context object — diverging from the documented top-level null contract in
EVENTS.md ("null — emitted when the model's context limit is not known").
Fix: extract pure helper `buildContextWindow(limit, used)` that returns null when
limit is null or <=0. This also makes the computation unit-testable.
Replace source-grep tests (which could pass even with wrong logic, per bot review)
with 10 behavioural unit tests of `buildContextWindow` covering: null limit, zero
limit (🟠 regression case), ratio computation, JSON-serialisability, and the
top-level-null contract. Retain slim source-text checks for structural wiring.
Fixes review findings from PR #87 (aictrl-dev bot):
- 🟠 limit:0 yields Infinity ratio, breaks null contract
- 🟡 Tests grep source text instead of running emit path
Review response — PR #87Triaged 2 findings (🟠 1, 🟡 1). Both verified TRUE; both fixed. Issues addressed (pushed to this PR)
Review claims verified false (no change needed)(none — both findings were genuine) Not addressed here(none — all findings fixed) |
| if (contextLimit == null || contextLimit <= 0) return null | ||
| return { | ||
| used: contextUsed, | ||
| limit: contextLimit, |
There was a problem hiding this comment.
🟡 ratio unclamped but documented as 0–1.
| limit: contextLimit, | |
| In buildContextWindow: `ratio: Math.min(1, contextUsed / contextLimit),` — or update EVENTS.md from "ratio (0–1)" to "ratio (0–1; may exceed 1 if usage exceeds the registered limit)". |
🤖 Fix with your agent
Fix this code review finding (aictrl-dev/cli PR #87, packages/cli/src/cli/cmd/run.ts:102):
Problem: ratio unclamped but documented as 0–1
Detail: `buildContextWindow` computes `ratio: contextUsed / contextLimit` with no clamp, but EVENTS.md documents `ratio` as `(0–1)`. The code can legitimately produce `ratio > 1`: if `contextUsed` ever exceeds the registry limit (stale/lowered `model.limit.context` after a model update, or a provider whose actual billed prompt exceeds the registered window), `ratio` exceeds 1 and contradicts the documented range. Consumers relying on `0 ≤ ratio ≤ 1` (e.g. for a utilization bar) will break. Either clamp (`Math.min(1, used/limit)`) or relax the doc to note values may exceed 1 when usage exceeds the registered limit.
Suggested fix: In buildContextWindow: `ratio: Math.min(1, contextUsed / contextLimit),` — or update EVENTS.md from "ratio (0–1)" to "ratio (0–1; may exceed 1 if usage exceeds the registered limit)".
Implement the fix on the PR head branch and add a regression test that fails before the fix and passes after.
Why this matters
buildContextWindow computes ratio: contextUsed / contextLimit with no clamp, but EVENTS.md documents ratio as (0–1). The code can legitimately produce ratio > 1: if contextUsed ever exceeds the registry limit (stale/lowered model.limit.context after a model update, or a provider whose actual billed prompt exceeds the registered window), ratio exceeds 1 and contradicts the documented range. Consumers relying on 0 ≤ ratio ≤ 1 (e.g. for a utilization bar) will break. Either clamp (Math.min(1, used/limit)) or relax the doc to note values may exceed 1 when usage exceeds the registered limit.
if (contextLimit == null || contextLimit <= 0) return null
return {
used: contextUsed,
limit: contextLimit,
ratio: contextUsed / contextLimit,
}
}| const contextLimit = await Provider.getModel(info.providerID, info.modelID) | ||
| .then((m) => m.limit.context) | ||
| .catch(() => null) | ||
| const contextUsed = tokens.input + tokens.cache.read |
There was a problem hiding this comment.
🟡 Verify upstream tokens shape for all providers.
🤖 Fix with your agent
Fix this code review finding (aictrl-dev/cli PR #87, packages/cli/src/cli/cmd/run.ts:510):
Problem: Verify upstream tokens shape for all providers
Detail: The correctness of both `context.used` and the `reasoning` field hinges on the upstream `info.tokens` shape that this diff cannot show:
1. `contextUsed = tokens.input + tokens.cache.read` is correct only if `input` EXCLUDES cached tokens (Anthropic semantics). For OpenAI, raw `prompt_tokens` historically INCLUDES cached tokens — if the upstream `LLM.Usage` isn't normalized per-provider, `used` double-counts cache reads, and the EVENTS.md "a token is counted in exactly one bucket" guarantee silently fails for those providers.
2. `reasoning: info.tokens.reasoning` emits `undefined` (dropped by JSON.stringify → missing field) if any provider's `StepFinishPart` doesn't populate `reasoning`.
Please confirm the upstream normalization covers every provider before relying on these metrics; otherwise guard with `(info.tokens.reasoning ?? 0)` and verify `input` excludes cache for OpenAI.
Implement the fix on the PR head branch and add a regression test that fails before the fix and passes after.
Why this matters
The correctness of both context.used and the reasoning field hinges on the upstream info.tokens shape that this diff cannot show:
contextUsed = tokens.input + tokens.cache.readis correct only ifinputEXCLUDES cached tokens (Anthropic semantics). For OpenAI, rawprompt_tokenshistorically INCLUDES cached tokens — if the upstreamLLM.Usageisn't normalized per-provider,useddouble-counts cache reads, and the EVENTS.md "a token is counted in exactly one bucket" guarantee silently fails for those providers.reasoning: info.tokens.reasoningemitsundefined(dropped by JSON.stringify → missing field) if any provider'sStepFinishPartdoesn't populatereasoning.
Please confirm the upstream normalization covers every provider before relying on these metrics; otherwise guard with(info.tokens.reasoning ?? 0)and verifyinputexcludes cache for OpenAI.
const contextLimit = await Provider.getModel(info.providerID, info.modelID)
.then((m) => m.limit.context)
.catch(() => null)
const contextUsed = tokens.input + tokens.cache.read
const context = buildContextWindow(contextLimit, contextUsed)| // custom models) buildContextWindow returns null to signal "unknown". | ||
| const contextLimit = await Provider.getModel(info.providerID, info.modelID) | ||
| .then((m) => m.limit.context) | ||
| .catch(() => null) |
There was a problem hiding this comment.
🟡 .catch(()=>null) swallows non-not-found errors.
🤖 Fix with your agent
Fix this code review finding (aictrl-dev/cli PR #87, packages/cli/src/cli/cmd/run.ts:509):
Problem: .catch(()=>null) swallows non-not-found errors
Detail: `Provider.getModel(...).catch(() => null)` turns *every* rejection into the "unknown context" sentinel — not only the intended "model not registered" case. A transient registry fetch failure, a programming error in `getModel`, or a malformed `m.limit` shape would all silently degrade `message_complete.context` to `null` with zero signal, making regressions invisible. Consider catching only the expected not-found error type, or logging the unexpected error before falling back to `null`.
Implement the fix on the PR head branch and add a regression test that fails before the fix and passes after.
Why this matters
Provider.getModel(...).catch(() => null) turns every rejection into the "unknown context" sentinel — not only the intended "model not registered" case. A transient registry fetch failure, a programming error in getModel, or a malformed m.limit shape would all silently degrade message_complete.context to null with zero signal, making regressions invisible. Consider catching only the expected not-found error type, or logging the unexpected error before falling back to null.
const contextLimit = await Provider.getModel(info.providerID, info.modelID)
.then((m) => m.limit.context)
.catch(() => null)
const contextUsed = tokens.input + tokens.cache.read| "reasoning": 0, | ||
| "cache": { "read": 8800, "write": 1024 } | ||
| }, | ||
| "context": { "used": 9824, "limit": 200000, "ratio": 0.049 }, |
There was a problem hiding this comment.
⚪ Example ratio 0.049 ≠ emitted 0.04912.
🤖 Fix with your agent
Fix this code review finding (aictrl-dev/cli PR #87, EVENTS.md:91):
Problem: Example ratio 0.049 ≠ emitted 0.04912
Detail: The `message_complete` example shows `"context": { "used": 9824, "limit": 200000, "ratio": 0.049 }`, but `buildContextWindow` emits the unrounded float: `9824 / 200000 = 0.04912`. The example rounds to 3 decimals, which may mislead consumers who treat the doc as a byte-exact spec. Either note that `ratio` is an unrounded float, or pick an example whose `used/limit` is exact (e.g. used 9800 → 0.049).
Implement the fix on the PR head branch and add a regression test that fails before the fix and passes after.
Why this matters
The message_complete example shows "context": { "used": 9824, "limit": 200000, "ratio": 0.049 }, but buildContextWindow emits the unrounded float: 9824 / 200000 = 0.04912. The example rounds to 3 decimals, which may mislead consumers who treat the doc as a byte-exact spec. Either note that ratio is an unrounded float, or pick an example whose used/limit is exact (e.g. used 9800 → 0.049).
"cache": { "read": 8800, "write": 1024 }
},
"context": { "used": 9824, "limit": 200000, "ratio": 0.049 },
"finish": "tool-calls"
| * | ||
| * Returns `null` (meaning "unknown") when: | ||
| * - `contextLimit` is `null` — Provider.getModel threw (unregistered model) | ||
| * - `contextLimit` is `0` — custom model without a registered limit defaults to |
There was a problem hiding this comment.
⚪ JSDoc hardcodes provider.ts:929 (line rot).
🤖 Fix with your agent
Fix this code review finding (aictrl-dev/cli PR #87, packages/cli/src/cli/cmd/run.ts:88):
Problem: JSDoc hardcodes provider.ts:929 (line rot)
Detail: The `buildContextWindow` JSDoc references `` `context: 0` (provider.ts:929) ``. A hardcoded file:line citation will silently rot the moment `provider.ts` is edited, turning the comment into misleading guidance. Reference the behavior/symbol (e.g. "the provider's default `limit.context = 0` for unregistered custom models") instead of a line number.
Implement the fix on the PR head branch and add a regression test that fails before the fix and passes after.
Why this matters
The buildContextWindow JSDoc references `context: 0` (provider.ts:929). A hardcoded file:line citation will silently rot the moment provider.ts is edited, turning the comment into misleading guidance. Reference the behavior/symbol (e.g. "the provider's default limit.context = 0 for unregistered custom models") instead of a line number.
* - `contextLimit` is `0` — custom model without a registered limit defaults to
* `context: 0` (provider.ts:929). A zero limit would yield `Infinity`/`NaN` for
* ratio, which `JSON.stringify` serialises as `null` inside the object — diverging
Code reviewVerdict: Looks good — only minor / nit comments below. · 🔴 0 · 🟠 0 · 🟡 3 · ⚪ 3 · 0/6 resolved
🤖 Fix all 6 open findings with your agent📋 Out-of-diff findings (6)
Reviewed 3 files · 0 inline · view all 6 findings ↗ aictrl · AI code review for fast-moving teams · aictrl.dev |
Summary
message_complete: expandstokensfrom an opaqueinfo.tokenspassthrough to an explicit object —input / output / reasoning / cache.read / cache.write— mirroring upstreamLLM.Usage. The data was already captured inMessageV2.Assistant.tokensviaStepFinishPartaccumulation; this change surfaces all fields explicitly.context: { used, limit, ratio }whereused = input + cache.read,limitcomes fromProvider.getModel() → model.limit.context(models.dev registry), andratio = used / limit. Emitsnullwhen the model's context limit is not known (unregistered custom endpoint).cost: 0trap avoided:info.costis kept as-is — it accumulates real per-step cost fromStepFinishPart. The new upstreamstep.endedevent emitscost: 0(reconciled later by a projector); we do not touch that path.How the cache split was already captured
MessageV2.StepFinishPart(the legacy step-finish message part) already hastokens: { input, output, reasoning, cache: { read, write } }. The assistant messageinfo.tokensis accumulated from these step parts — cache split included. No new provider-level capture was needed; we just stop dropping it in the emit call.Context limit source
Provider.getModel(providerID, modelID)returns the model record from the models.dev registry, which hasmodel.limit.context. The lookup is wrapped in a.catch(() => null)so an unknown model (custom endpoint) gracefully emitscontext: nullrather than throwing.Test plan
packages/cli/test/cli/usage-token-breakdown.test.ts(8 cases) — confirmed RED before implementationpackages/cli/src/cli/cmd/run.tsbun test test/cli/usage-token-breakdown.test.ts— 8/8 GREENbun test test/cli/— 77/77 GREEN (no regressions)bun run typecheck— cleanbun turbo typecheckacross all 5 packages — 6/6 tasks successfulCloses #86
🤖 Generated with Claude Code
https://claude.ai/code/session_0187MsfK1upr6K2BKVbmaebQ