Fix Gemini and Vibe one-shot rates by ozymandiashh · Pull Request #352 · getagentseal/codeburn

ozymandiashh · 2026-05-18T21:29:14Z

Summary

Addresses #351 by fixing how non-Claude provider calls are cached for one-shot/retry classification.

The root cause was not cache hits. CodeBurn's one-shot rate counts edit turns with zero detected retries or self-corrections. For Gemini and Mistral Vibe, each assistant call was cached as its own turn, so the classifier did not have enough structure to see retries inside a single user request.

This PR adds provider-level turn grouping so related assistant calls are cached under the same user turn. That lets the existing classifier see multi-message flows like Edit -> Bash -> Edit and count them as retries instead of reporting them as independent one-shot turns.

Closes #351.

Root Cause

The classifier already knows how to detect retries when a turn contains multiple assistant calls:

User: implement parser fix
Assistant call 1: Edit
Assistant call 2: Bash
Assistant call 3: Edit

That should be one edit turn with one retry, because the assistant edited, ran a command, then edited again.

Before this PR, provider calls emitted through ParsedProviderCall were converted into cached turns one call at a time. So Gemini assistant messages were cached like this:

Turn 1: [Edit]  -> retries = 0, one-shot
Turn 2: [Bash]  -> no edit turn
Turn 3: [Edit]  -> retries = 0, one-shot

The classifier never saw the full Edit -> Bash -> Edit sequence in one turn, so the one-shot rate could look artificially perfect.
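The difference between the two cached shapes can be sketched in a few lines. This is a simplified stand-in, not CodeBurn's actual classifier: the `Tool` type and `countRetries` function are hypothetical names, and the rule used here (an Edit that follows a Bash that followed an earlier Edit counts as one retry) is an assumption based on the Edit -> Bash -> Edit example above.

```typescript
// Hypothetical tool categories; the real classifier's categories may differ.
type Tool = "Edit" | "Bash" | "Read";

// Assumed rule: an Edit that comes after a Bash, which itself came after
// an earlier Edit in the same turn, counts as one retry.
function countRetries(tools: Tool[]): number {
  let retries = 0;
  let sawEdit = false;
  let sawBashAfterEdit = false;
  for (const tool of tools) {
    if (tool === "Edit") {
      if (sawBashAfterEdit) retries += 1;
      sawEdit = true;
      sawBashAfterEdit = false;
    } else if (tool === "Bash" && sawEdit) {
      sawBashAfterEdit = true;
    }
  }
  return retries;
}

const calls: Tool[] = ["Edit", "Bash", "Edit"];

// Before: every call is cached as its own turn, so no turn contains the sequence.
const perCallRetries = calls.map((tool) => countRetries([tool]));

// After: the three calls share one turn and the sequence becomes visible.
const groupedRetries = countRetries(calls);

console.log(perCallRetries, groupedRetries);
```

With per-call turns every retry count is 0, so both Edit turns look one-shot; with the grouped turn the same detector reports one retry.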

Example: Gemini

Gemini already exposes per-assistant-message token/tool data, so no private user logs are needed to validate the fix.

Synthetic fixture shape:

[
  { "type": "user", "content": "implement parser update in src/parser.ts" },
  { "type": "gemini", "id": "g1", "toolCalls": [{ "name": "edit_file" }] },
  { "type": "gemini", "id": "g2", "toolCalls": [{ "name": "run_command", "args": { "command": "npm test" } }] },
  { "type": "gemini", "id": "g3", "toolCalls": [{ "name": "edit_file" }] }
]

Before:

3 cached turns, each with 1 assistant call
retry detector cannot see Edit -> Bash -> Edit
oneShotTurns can be inflated

After:

1 cached turn, with assistant calls [g1, g2, g3]
retries = 1
oneShotTurns = 0

The new regression test asserts exactly that.
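As a rough illustration of how Gemini assistant messages might be assigned to the current user turn, the sketch below walks the fixture shape shown above and tags each assistant message with a turn identifier. The `GeminiMessage` and `TaggedCall` types, the `tagGeminiCalls` function, and the `turn-N` id format are all hypothetical; the PR does not show the real provider code.

```typescript
// Mirrors the synthetic fixture shape; field names beyond those are assumptions.
interface GeminiMessage {
  type: "user" | "gemini";
  id?: string;
  content?: string;
  toolCalls?: { name: string; args?: Record<string, unknown> }[];
}

interface TaggedCall {
  id: string;
  turnId: string; // hypothetical format: "turn-<n>"
  tools: string[];
}

// Each user message opens a new turn; assistant messages inherit that turn's id.
function tagGeminiCalls(messages: GeminiMessage[]): TaggedCall[] {
  const tagged: TaggedCall[] = [];
  let userTurn = 0;
  for (const msg of messages) {
    if (msg.type === "user") {
      userTurn += 1;
      continue;
    }
    tagged.push({
      id: msg.id ?? `call-${tagged.length}`,
      turnId: `turn-${userTurn}`,
      tools: (msg.toolCalls ?? []).map((call) => call.name),
    });
  }
  return tagged;
}

const fixture: GeminiMessage[] = [
  { type: "user", content: "implement parser update in src/parser.ts" },
  { type: "gemini", id: "g1", toolCalls: [{ name: "edit_file" }] },
  { type: "gemini", id: "g2", toolCalls: [{ name: "run_command", args: { command: "npm test" } }] },
  { type: "gemini", id: "g3", toolCalls: [{ name: "edit_file" }] },
];

const taggedCalls = tagGeminiCalls(fixture);
console.log(taggedCalls.map((c) => `${c.id}@${c.turnId}`));
```

All three assistant messages end up with the same turn id, which is what lets them be cached as one turn.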

Example: Mistral Vibe

Mistral Vibe's local logs differ from Gemini's:

meta.json.stats has cumulative session totals such as session_prompt_tokens, session_completion_tokens, and session_cost.
messages.jsonl has user/assistant/tool message structure and assistant tool_calls.
Current Vibe logs do not expose cache-read/cache-write token fields.

So this PR does not invent cache token counts. CodeBurn still reports Vibe cache token counts as 0 until Vibe persists those fields locally.

What this PR does instead:

Splits Vibe assistant messages into per-message provider calls so the classifier can see assistant message order.
Groups those calls under the current user turn via turnId.
Distributes cumulative session prompt/completion tokens across the assistant calls while preserving exact session totals.
Prefers meta.json.stats.session_cost when present, because that is the best local cost signal available and may already reflect Vibe-side accounting better than CodeBurn's price-derived estimate.
Preserves Vibe's provider-calculated cost through the session cache so session_cost is not lost during cached reads.
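The PR does not spell out how cumulative totals are distributed across assistant calls, so the following is only one possible approach: an even split in which the remainder goes to the earliest calls, so the shares always sum to the original total. `splitExactly` is a hypothetical helper and assumes a non-negative integer total.

```typescript
// Split an integer total into `parts` integer shares that sum exactly to `total`.
// Assumption: even distribution; the real weighting in CodeBurn may differ.
function splitExactly(total: number, parts: number): number[] {
  if (parts <= 0) return [];
  const base = Math.floor(total / parts);
  const remainder = total - base * parts;
  // The first `remainder` shares each absorb one extra token.
  return Array.from({ length: parts }, (_, i) => base + (i < remainder ? 1 : 0));
}

const sum = (values: number[]): number => values.reduce((a, b) => a + b, 0);

// Session totals from the synthetic Vibe fixture, spread over three assistant calls.
const promptShares = splitExactly(300, 3);
const completionShares = splitExactly(90, 3);

// A total that does not divide evenly still sums back exactly.
const unevenShares = splitExactly(100, 3);

console.log(promptShares, completionShares, unevenShares);
```

Working in integers avoids the floating-point drift that a proportional split could introduce when the shares are summed back into session totals.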

Synthetic Vibe fixture shape:

// meta.json
{
  "stats": {
    "session_prompt_tokens": 300,
    "session_completion_tokens": 90,
    "session_cost": 0.123456
  }
}

// messages.jsonl, simplified
{ "role": "user", "content": "implement parser update" }
{ "role": "assistant", "message_id": "a1", "tool_calls": [{ "function": { "name": "search_replace" } }] }
{ "role": "assistant", "message_id": "a2", "tool_calls": [{ "function": { "name": "bash", "arguments": "{\"command\":\"npm test\"}" } }] }
{ "role": "assistant", "message_id": "a3", "tool_calls": [{ "function": { "name": "write_file" } }] }

After parsing:

1 cached turn, with assistant calls [a1, a2, a3]
retries = 1
oneShotTurns = 0
totalInputTokens = 300
totalOutputTokens = 90
totalCostUSD = 0.123456

What Changed

Add optional ParsedProviderCall.turnId so providers can mark calls that belong to the same user turn.
Group parsed provider calls into cached turns by (sessionId, turnId) when turnId is present.
Assign Gemini assistant calls to the current user turn.
Split Mistral Vibe assistant messages into per-message calls grouped under the current user turn.
Prefer meta.json.stats.session_cost for Mistral Vibe cost when present.
Add optional cached costUSD for Vibe calls so provider-calculated Vibe cost survives cache round-trips.
Bump session cache to v2 so existing cached Gemini/Vibe entries are re-derived with the new grouping.
Document current Vibe cache limitations and the session_cost behavior.
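The grouping rule in the list above can be sketched as follows. This is a minimal illustration under assumptions, not the merged implementation: `ParsedCall`, `CachedTurn`, and `groupIntoTurns` are hypothetical names, and only the optional `turnId` field and the (sessionId, turnId) key come from the PR description.

```typescript
interface ParsedCall {
  sessionId: string;
  turnId?: string; // optional, as described for ParsedProviderCall.turnId
  callId: string;  // hypothetical identifier for illustration
}

interface CachedTurn {
  sessionId: string;
  calls: ParsedCall[];
}

// Calls with a turnId are grouped by (sessionId, turnId); calls without one
// keep the old behavior of one cached turn per call.
function groupIntoTurns(calls: ParsedCall[]): CachedTurn[] {
  const turns: CachedTurn[] = [];
  const byKey = new Map<string, CachedTurn>();
  for (const call of calls) {
    if (call.turnId === undefined) {
      turns.push({ sessionId: call.sessionId, calls: [call] });
      continue;
    }
    // JSON-encode the pair so ids containing separators cannot collide.
    const key = JSON.stringify([call.sessionId, call.turnId]);
    let turn = byKey.get(key);
    if (!turn) {
      turn = { sessionId: call.sessionId, calls: [] };
      byKey.set(key, turn);
      turns.push(turn);
    }
    turn.calls.push(call);
  }
  return turns;
}

const groupedTurns = groupIntoTurns([
  { sessionId: "s1", turnId: "t1", callId: "g1" },
  { sessionId: "s1", turnId: "t1", callId: "g2" },
  { sessionId: "s1", turnId: "t1", callId: "g3" },
  { sessionId: "s1", callId: "legacy" }, // no turnId: stays its own turn
]);

console.log(groupedTurns.map((t) => t.calls.map((c) => c.callId)));
```

Keeping `turnId` optional means providers that do not set it are unaffected, which matches the description of the field as opt-in.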

Validation

Behavior proof

The main proof is tests/provider-turn-grouping.test.ts, which exercises the exact suspicious one-shot shape rather than only checking that the suite is green.

Gemini fixture:

input:  user prompt + Gemini assistant messages [edit_file, run_command, edit_file]
output: one parsed CodeBurn session turn with assistant calls [g1, g2, g3]
asserts: retries = 1
asserts: category oneShotTurns = 0

This proves the classifier can now see the full Edit -> Bash -> Edit sequence for Gemini instead of treating each assistant message as a separate one-shot candidate.

Mistral Vibe fixture:

input:  meta.json session totals + session_cost + messages [search_replace, bash, write_file]
output: one parsed CodeBurn session turn with assistant calls [a1, a2, a3]
asserts: retries = 1
asserts: category oneShotTurns = 0
asserts: totalInputTokens = 300
asserts: totalOutputTokens = 90
asserts: totalCostUSD = 0.123456

This proves both sides of the Vibe fix: retry detection sees the multi-message edit flow, and Vibe's own session_cost survives parser/cache round-trip instead of being replaced by a price-derived estimate.

tests/providers/mistral-vibe.test.ts also includes a direct cost regression test. The fixture sets an intentionally large token price but session_cost = 0.381681, and the provider returns exactly 0.381681, showing that session_cost takes precedence over estimated pricing.
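The precedence rule exercised by that regression test might look roughly like the sketch below. The `VibeStats` fields come from the meta.json fixture shown earlier; the `Pricing` type, its field names, and `resolveSessionCost` are hypothetical and stand in for whatever pricing lookup CodeBurn actually uses.

```typescript
interface VibeStats {
  session_prompt_tokens: number;
  session_completion_tokens: number;
  session_cost?: number; // may be absent in some logs
}

// Hypothetical per-token pricing used only for the fallback estimate.
interface Pricing {
  inputPerToken: number;
  outputPerToken: number;
}

// Prefer the provider-reported cost; estimate from tokens only when it is missing.
function resolveSessionCost(stats: VibeStats, pricing: Pricing): number {
  if (typeof stats.session_cost === "number" && Number.isFinite(stats.session_cost)) {
    return stats.session_cost;
  }
  return (
    stats.session_prompt_tokens * pricing.inputPerToken +
    stats.session_completion_tokens * pricing.outputPerToken
  );
}

// Deliberately large prices, so an accidental fallback would be obvious.
const hugePricing: Pricing = { inputPerToken: 1, outputPerToken: 2 };

const reportedCost = resolveSessionCost(
  { session_prompt_tokens: 300, session_completion_tokens: 90, session_cost: 0.381681 },
  hugePricing,
);

const estimatedCost = resolveSessionCost(
  { session_prompt_tokens: 300, session_completion_tokens: 90 },
  hugePricing,
);

console.log(reportedCost, estimatedCost);
```

Using an exaggerated price in the fixture is what makes the test meaningful: if the estimate ever replaced the reported value, the result would differ by orders of magnitude.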

Command results

./node_modules/.bin/tsc --noEmit --pretty false
npx vitest run tests/provider-turn-grouping.test.ts tests/providers/gemini.test.ts tests/providers/mistral-vibe.test.ts tests/parser-gemini-cache.test.ts tests/session-cache.test.ts - 68 tests passed
npm run build
npm test -- --run - 63 files / 874 tests passed
git diff --check
Claude Opus 4.7 effort max review: PASS
Gemini 3.1 Pro Preview review: PASS
GitHub checks: check, assess, and semgrep pass

Notes

Validation uses synthetic fixtures only. No private prompts, project names, session IDs, or local user logs were used.

This PR does not claim exact Vibe cache token accounting. Current Vibe local logs do not include cache token fields, so exact cache read/write token reporting requires an upstream Vibe logging change.

iamtoruk · 2026-05-18T22:37:37Z

Superseded by #355, which combines the turnId grouping from this PR (for Gemini/Vibe) with the toolSequence approach from #353 (for Kiro/Goose).

Fix provider turn grouping for one-shot rates

ce57b01

This was referenced May 18, 2026

Not tracking sub agents (Fedora 44) #336

Open

Fix 100% one-shot rate for Gemini/Mistral/Kiro/Goose #353

Closed

iamtoruk mentioned this pull request May 18, 2026

Fix one-shot rate detection for all non-Claude providers #355

Merged

7 tasks

iamtoruk closed this May 18, 2026

ozymandiashh wants to merge 1 commit into getagentseal:main from ozymandiashh:codex/vibe-gemini-one-shot
