Fix Gemini and Vibe one-shot rates #352

Closed
ozymandiashh wants to merge 1 commit into getagentseal:main from ozymandiashh:codex/vibe-gemini-one-shot

Conversation

@ozymandiashh
Contributor

@ozymandiashh ozymandiashh commented May 18, 2026

Summary

Addresses #351 by fixing how non-Claude provider calls are cached for one-shot/retry classification.

The core issue was not cache hits. CodeBurn's one-shot rate is based on edit turns with zero detected retries/self-corrections. For Gemini and Mistral Vibe, the cached turn shape did not give the classifier enough structure to see retries inside a single user request.

This PR adds provider-level turn grouping so related assistant calls are cached under the same user turn. That lets the existing classifier see multi-message flows like Edit -> Bash -> Edit and count them as retries instead of reporting them as independent one-shot turns.

Closes #351.

Root Cause

The classifier already knows how to detect retries when a turn contains multiple assistant calls:

User: implement parser fix
Assistant call 1: Edit
Assistant call 2: Bash
Assistant call 3: Edit

That should be one edit turn with one retry, because the assistant edited, ran a command, then edited again.

Before this PR, provider calls emitted through ParsedProviderCall were converted into cached turns one call at a time. So Gemini assistant messages were cached like this:

Turn 1: [Edit]  -> retries = 0, one-shot
Turn 2: [Bash]  -> no edit turn
Turn 3: [Edit]  -> retries = 0, one-shot

The classifier never saw the full Edit -> Bash -> Edit sequence in one turn, so the one-shot rate could look artificially perfect.
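To make the difference concrete, here is a minimal sketch of a retry counter. The `ToolKind` categories and the `countRetries` rule (an edit that follows a command run after an earlier edit counts as one retry) are assumptions for illustration; CodeBurn's actual classifier is not shown in this PR and may use different rules.

```typescript
// Hypothetical tool categories; the real classifier's names and rules may differ.
type ToolKind = "edit" | "bash" | "other";

// Count one retry each time an edit follows a command run that itself
// followed an earlier edit within the same turn.
function countRetries(tools: ToolKind[]): number {
  let sawEdit = false;
  let sawBashAfterEdit = false;
  let retries = 0;
  for (const tool of tools) {
    if (tool === "edit") {
      if (sawBashAfterEdit) {
        retries += 1;
        sawBashAfterEdit = false;
      }
      sawEdit = true;
    } else if (tool === "bash" && sawEdit) {
      sawBashAfterEdit = true;
    }
  }
  return retries;
}

// Grouped: the whole Edit -> Bash -> Edit flow is visible in one turn.
const groupedRetries = countRetries(["edit", "bash", "edit"]);

// Split: each call is its own turn, so no turn ever contains a second edit.
const splitTurns: ToolKind[][] = [["edit"], ["bash"], ["edit"]];
const splitRetries = splitTurns.map((turn) => countRetries(turn));
```

Under this rule the grouped turn yields one retry, while every split turn yields zero, which is the inflated one-shot pattern described above.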

Example: Gemini

Gemini already exposes per-assistant-message token/tool data, so no private user logs are needed to validate the fix.

Synthetic fixture shape:

[
  { "type": "user", "content": "implement parser update in src/parser.ts" },
  { "type": "gemini", "id": "g1", "toolCalls": [{ "name": "edit_file" }] },
  { "type": "gemini", "id": "g2", "toolCalls": [{ "name": "run_command", "args": { "command": "npm test" } }] },
  { "type": "gemini", "id": "g3", "toolCalls": [{ "name": "edit_file" }] }
]

Before:

3 cached turns, each with 1 assistant call
retry detector cannot see Edit -> Bash -> Edit
oneShotTurns can be inflated

After:

1 cached turn, with assistant calls [g1, g2, g3]
retries = 1
oneShotTurns = 0

The new regression test asserts exactly that.
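A rough sketch of how assistant messages could be tagged with the current user turn. `GeminiMessage` is trimmed down from the fixture shape above; `TaggedCall`, `tagWithTurnId`, and the `turn-N` id format are hypothetical names for illustration, not the PR's actual code.

```typescript
// Simplified message shape based on the synthetic fixture above.
interface GeminiMessage {
  type: "user" | "gemini";
  id?: string;
  toolCalls?: { name: string }[];
}

// Hypothetical output shape; only the turnId field is named in this PR.
interface TaggedCall {
  callId: string;
  turnId: string;
  tools: string[];
}

function tagWithTurnId(messages: GeminiMessage[]): TaggedCall[] {
  const calls: TaggedCall[] = [];
  let turnIndex = -1;
  for (const message of messages) {
    if (message.type === "user") {
      turnIndex += 1; // each user message opens a new turn
      continue;
    }
    // Assumption of this sketch: assistant messages that appear before any
    // user message are placed in turn 0.
    const turnId = `turn-${Math.max(turnIndex, 0)}`;
    calls.push({
      callId: message.id ?? `call-${calls.length}`,
      turnId,
      tools: (message.toolCalls ?? []).map((t) => t.name),
    });
  }
  return calls;
}

const tagged = tagWithTurnId([
  { type: "user" },
  { type: "gemini", id: "g1", toolCalls: [{ name: "edit_file" }] },
  { type: "gemini", id: "g2", toolCalls: [{ name: "run_command" }] },
  { type: "gemini", id: "g3", toolCalls: [{ name: "edit_file" }] },
]);
```

All three calls end up with the same `turnId`, so a downstream grouping step can place g1, g2, and g3 in one cached turn.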

Example: Mistral Vibe

Mistral Vibe's local logs differ from Gemini's:

  • meta.json.stats has cumulative session totals such as session_prompt_tokens, session_completion_tokens, and session_cost.
  • messages.jsonl has user/assistant/tool message structure and assistant tool_calls.
  • Current Vibe logs do not expose cache-read/cache-write token fields.

So this PR does not invent cache token counts. CodeBurn still reports Vibe cache token counts as 0 until Vibe persists those fields locally.

What this PR does instead:

  1. Splits Vibe assistant messages into per-message provider calls so the classifier can see assistant message order.
  2. Groups those calls under the current user turn via turnId.
  3. Distributes cumulative session prompt/completion tokens across the assistant calls while preserving exact session totals.
  4. Prefers meta.json.stats.session_cost when present, because that is the best local cost signal available and may already reflect Vibe-side accounting better than CodeBurn's price-derived estimate.
  5. Preserves Vibe's provider-calculated cost through the session cache so session_cost is not lost during cached reads.
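Step 3 can be illustrated with a small sketch. The PR does not say how the tokens are weighted across calls, so the even split below, with the remainder handed to the earliest calls, is only an assumption; `splitTotal` is a hypothetical helper. The point it shows is that integer shares can always be made to sum to the exact session total.

```typescript
// Split an integer total into `parts` integer shares whose sum equals the
// total exactly. Even weighting is an assumption of this sketch.
function splitTotal(total: number, parts: number): number[] {
  if (parts <= 0) return [];
  const base = Math.floor(total / parts);
  const remainder = total - base * parts;
  const shares: number[] = [];
  for (let i = 0; i < parts; i++) {
    // The first `remainder` shares get one extra unit so nothing is lost
    // to rounding.
    shares.push(base + (i < remainder ? 1 : 0));
  }
  return shares;
}

// Totals taken from the synthetic fixture: 300 prompt and 90 completion
// tokens spread over three assistant calls.
const promptShares = splitTotal(300, 3);
const completionShares = splitTotal(90, 3);

// A total that does not divide evenly still sums back to 100.
const unevenShares = splitTotal(100, 3);
```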

Synthetic Vibe fixture shape:

// meta.json
{
  "stats": {
    "session_prompt_tokens": 300,
    "session_completion_tokens": 90,
    "session_cost": 0.123456
  }
}
// messages.jsonl, simplified
{ "role": "user", "content": "implement parser update" }
{ "role": "assistant", "message_id": "a1", "tool_calls": [{ "function": { "name": "search_replace" } }] }
{ "role": "assistant", "message_id": "a2", "tool_calls": [{ "function": { "name": "bash", "arguments": "{\"command\":\"npm test\"}" } }] }
{ "role": "assistant", "message_id": "a3", "tool_calls": [{ "function": { "name": "write_file" } }] }

After parsing:

1 cached turn, with assistant calls [a1, a2, a3]
retries = 1
oneShotTurns = 0
totalInputTokens = 300
totalOutputTokens = 90
totalCostUSD = 0.123456

What Changed

  • Add optional ParsedProviderCall.turnId so providers can mark calls that belong to the same user turn.
  • Group parsed provider calls into cached turns by (sessionId, turnId) when turnId is present.
  • Assign Gemini assistant calls to the current user turn.
  • Split Mistral Vibe assistant messages into per-message calls grouped under the current user turn.
  • Prefer meta.json.stats.session_cost for Mistral Vibe cost when present.
  • Add optional cached costUSD for Vibe calls so provider-calculated Vibe cost survives cache round-trips.
  • Bump session cache to v2 so existing cached Gemini/Vibe entries are re-derived with the new grouping.
  • Document current Vibe cache limitations and the session_cost behavior.
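The grouping rule in the second bullet can be sketched as follows. `ProviderCall`, `CachedTurn`, and `groupIntoTurns` are hypothetical, trimmed-down stand-ins rather than the real `ParsedProviderCall` type or cache code; the sketch only illustrates keying by (sessionId, turnId) and falling back to one turn per call when `turnId` is absent.

```typescript
// Hypothetical, trimmed-down call shape; only turnId is named in this PR.
interface ProviderCall {
  sessionId: string;
  callId: string;
  turnId?: string;
}

interface CachedTurn {
  sessionId: string;
  calls: ProviderCall[];
}

function groupIntoTurns(calls: ProviderCall[]): CachedTurn[] {
  const turns: CachedTurn[] = [];
  const byKey: Record<string, CachedTurn> = {};
  for (const call of calls) {
    if (call.turnId === undefined) {
      // No turnId: keep the earlier behaviour of one cached turn per call.
      turns.push({ sessionId: call.sessionId, calls: [call] });
      continue;
    }
    // Serialising the pair avoids collisions between ids that happen to
    // contain a separator character.
    const key = JSON.stringify([call.sessionId, call.turnId]);
    let turn: CachedTurn | undefined = byKey[key];
    if (!turn) {
      turn = { sessionId: call.sessionId, calls: [] };
      byKey[key] = turn;
      turns.push(turn);
    }
    turn.calls.push(call);
  }
  return turns;
}

const turns = groupIntoTurns([
  { sessionId: "s1", callId: "g1", turnId: "t1" },
  { sessionId: "s1", callId: "g2", turnId: "t1" },
  { sessionId: "s1", callId: "g3", turnId: "t1" },
  { sessionId: "s1", callId: "x1" },
  { sessionId: "s2", callId: "y1", turnId: "t1" },
]);
```

Here the three `s1`/`t1` calls share one turn, the call without a `turnId` stays on its own, and the `s2` call is not merged into the `s1` turn even though the `turnId` string matches.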

Validation

Behavior proof

The main proof is tests/provider-turn-grouping.test.ts, which exercises the exact suspicious one-shot shape rather than only checking that the suite is green.

Gemini fixture:

input:  user prompt + Gemini assistant messages [edit_file, run_command, edit_file]
output: one parsed CodeBurn session turn with assistant calls [g1, g2, g3]
asserts: retries = 1
asserts: category oneShotTurns = 0

This proves the classifier can now see the full Edit -> Bash -> Edit sequence for Gemini instead of treating each assistant message as a separate one-shot candidate.

Mistral Vibe fixture:

input:  meta.json session totals + session_cost + messages [search_replace, bash, write_file]
output: one parsed CodeBurn session turn with assistant calls [a1, a2, a3]
asserts: retries = 1
asserts: category oneShotTurns = 0
asserts: totalInputTokens = 300
asserts: totalOutputTokens = 90
asserts: totalCostUSD = 0.123456

This proves both sides of the Vibe fix: retry detection sees the multi-message edit flow, and Vibe's own session_cost survives the parser/cache round-trip instead of being replaced by a price-derived estimate.

tests/providers/mistral-vibe.test.ts also includes a direct cost regression where the fixture sets an intentionally large token price but session_cost = 0.381681; the provider returns exactly 0.381681, proving session_cost takes precedence over estimated pricing.
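The precedence rule can be sketched in a few lines. Only the `session_cost` field name comes from the fixture; `VibeStats`, `resolveSessionCost`, and the validity check (finite and non-negative) are assumptions of this sketch, not the provider's actual implementation.

```typescript
// Field name taken from meta.json.stats in the fixture; the rest is hypothetical.
interface VibeStats {
  session_cost?: number;
}

function resolveSessionCost(stats: VibeStats, estimatedCostUSD: number): number {
  const reported = stats.session_cost;
  // Assumption: accept only finite, non-negative numbers; anything else
  // falls back to the price-derived estimate.
  if (typeof reported === "number" && isFinite(reported) && reported >= 0) {
    return reported;
  }
  return estimatedCostUSD;
}

// The reported value wins even when the estimate is much larger.
const preferred = resolveSessionCost({ session_cost: 0.381681 }, 999);

// Without session_cost the estimate is used.
const fallback = resolveSessionCost({}, 1.5);
```

Note that a reported `session_cost` of 0 is treated as a real value here rather than as missing; whether the provider does the same is not stated in this PR.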

Command results

  • ./node_modules/.bin/tsc --noEmit --pretty false
  • npx vitest run tests/provider-turn-grouping.test.ts tests/providers/gemini.test.ts tests/providers/mistral-vibe.test.ts tests/parser-gemini-cache.test.ts tests/session-cache.test.ts - 68 tests passed
  • npm run build
  • npm test -- --run - 63 files / 874 tests passed
  • git diff --check
  • Claude Opus 4.7 (max effort) review: PASS
  • Gemini 3.1 Pro Preview review: PASS
  • GitHub checks: check, assess, and semgrep pass

Notes

Validation uses synthetic fixtures only. No private prompts, project names, session IDs, or local user logs were used.

This PR does not claim exact Vibe cache token accounting. Current Vibe local logs do not include cache token fields, so exact cache read/write token reporting requires an upstream Vibe logging change.

@iamtoruk
Member

Superseded by #355, which combines the turnId grouping from this PR (for Gemini/Vibe) with the toolSequence approach from #353 (for Kiro/Goose).

@iamtoruk iamtoruk closed this May 18, 2026

Development

Successfully merging this pull request may close these issues.

Validate Mistral Vibe and Gemini one-shot rates

2 participants