Walkthrough

This pull request updates the benchmark configuration and streaming implementation. The config file reorganizes the model entries array with new model definitions (including Grok-4.1, MiniMax, DeepSeek, Qwen, Kimi, GLM, GPT-5 variants, and Claude Opus entries) and adds new metadata fields such as country, provider website, and provider color specifications, while reordering entries from cheapest to most expensive. The runner file replaces `maxCompletionTokens` with `max_completion_tokens` in streaming parameter handling and introduces try/catch error handling for streaming responses that captures partial content (reasoning, tool calls, and usage data) before gracefully returning incomplete results on stream errors.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
benchmark/src/runner.ts (1)
124-153: ⚠️ Potential issue | 🔴 Critical

Handle mid-stream error chunks before treating the response as complete.

OpenRouter sends streaming errors as normal SSE chunks with a top-level `error` object and `finish_reason: "error"`, not as exceptions. The current code has no check for these error chunks in the for-await loop; they simply skip past (since they lack a delta field) and allow partial content to be returned as a normal `ApiResponse`. This lets the benchmark score truncated output as valid answers. Check both `chunk.error` and `finish_reason === "error"` inside the loop and either throw or mark the response as incomplete, then have `callModel()` handle the error state instead of scoring it as a clean completion.

Suggested direction:

```diff
+ let incomplete = false;
  for await (const chunk of stream) {
+   if (chunk.error || chunk.choices?.[0]?.finish_reason === "error") {
+     if (debug) {
+       debugLog("STREAM ERROR", chunk.error);
+     }
+     if (content || reasoning || toolCallMap.size > 0) {
+       incomplete = true;
+       break;
+     }
+     throw new Error(chunk.error?.message ?? "Stream ended with error");
+   }
    const chunkStr = JSON.stringify(chunk, null, 2) ?? String(chunk);
    // ...
  }
```

Then propagate `incomplete` in `ApiResponse` and have `callModel()`/`processQuestion()` skip or retry instead of scoring it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmark/src/runner.ts` around lines 124 - 153, The stream loop processing in the for-await over stream fails to detect mid-stream error chunks because it only checks for delta and ignores top-level error or finish_reason === "error"; update the loop that inspects each chunk (the code using chunk, delta, and chunk.usage) to check if chunk.error exists or if chunk.choices?.[0]?.finish_reason === "error" and then either throw a descriptive error or set a flag indicating the response is incomplete; propagate that state into the ApiResponse (add/set an incomplete/error field) and ensure callModel() and processQuestion() respect that flag (skip scoring, retry or surface the error) instead of treating truncated content as a successful completion.
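A minimal, runnable sketch of the chunk-level error check described above, driven by a simulated stream; the `StreamChunk` shape and the salvage-partial-content policy are assumptions taken from the review text, not the actual OpenRouter SDK types:

```typescript
// Hypothetical chunk shape modeled on the review's description of OpenRouter
// SSE error chunks; the real SDK types may differ.
interface StreamChunk {
  error?: { message?: string };
  choices?: Array<{ delta?: { content?: string }; finish_reason?: string }>;
}

// Simulated stream: one content chunk, then a mid-stream error chunk.
async function* fakeStream(): AsyncGenerator<StreamChunk> {
  yield { choices: [{ delta: { content: "partial " } }] };
  yield { error: { message: "upstream timeout" }, choices: [{ finish_reason: "error" }] };
}

async function consume(stream: AsyncIterable<StreamChunk>) {
  let content = "";
  let incomplete = false;
  for await (const chunk of stream) {
    // Detect error chunks that would otherwise skip past the delta handling.
    if (chunk.error || chunk.choices?.[0]?.finish_reason === "error") {
      if (content) {
        incomplete = true; // salvage partial output, but flag it so it isn't scored
        break;
      }
      throw new Error(chunk.error?.message ?? "Stream ended with error");
    }
    content += chunk.choices?.[0]?.delta?.content ?? "";
  }
  return { content, incomplete };
}
```

In a real fix, `callModel()` would check the `incomplete` flag and retry or skip scoring rather than treating the partial text as a clean completion.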
🧹 Nitpick comments (1)
benchmark/src/config.ts (1)
10-109: Extract provider-level metadata out of `MODELS`.

`providerWebsite`, brand/chart colors, and `country` are now duplicated on every model row, while `getProviderWebsite()`, `getModelColor()`, and `getProviderBrandColor()` still resolve by `provider` and take the first match. A single inconsistent duplicate will silently change UI metadata based on array order. Consider moving provider metadata into a dedicated map and keeping `MODELS` model-specific.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmark/src/config.ts` around lines 10 - 109, The MODELS array currently duplicates provider-level metadata (providerWebsite, providerBrandColor, providerChartColor, country) causing order-dependent overrides; extract these fields into a new PROVIDERS map (keyed by provider or provider id) containing providerWebsite, providerBrandColor, providerChartColor, country and remove them from each entry in MODELS, then update lookup functions getProviderWebsite(), getModelColor(), and getProviderBrandColor() to read from the PROVIDERS map (with sensible fallbacks if a provider key is missing) and update any consumers of MODELS to use PROVIDERS for provider metadata.
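A sketch of that refactor; the provider names, URLs, and color values below are illustrative placeholders, not the repo's actual config:

```typescript
// Provider-level metadata lives in one place; model rows keep only a key.
interface ProviderMeta {
  website: string;
  brandColor: string;
  chartColor: string;
  country: string;
}

// Illustrative entries only; real values would come from the existing config.
const PROVIDERS: Record<string, ProviderMeta> = {
  openai: { website: "https://openai.com", brandColor: "#10a37f", chartColor: "#10a37f", country: "US" },
  deepseek: { website: "https://deepseek.com", brandColor: "#4d6bfe", chartColor: "#4d6bfe", country: "CN" },
};

// Sensible fallback when a provider key is missing.
const FALLBACK: ProviderMeta = { website: "", brandColor: "#888888", chartColor: "#888888", country: "" };

// MODELS entries stay model-specific: no duplicated provider metadata.
const MODELS = [
  { id: "openai/gpt-5", provider: "openai" },
  { id: "deepseek/deepseek-chat", provider: "deepseek" },
];

function getProviderWebsite(provider: string): string {
  return (PROVIDERS[provider] ?? FALLBACK).website;
}

function getModelColor(modelId: string): string {
  const model = MODELS.find((m) => m.id === modelId);
  return model ? (PROVIDERS[model.provider] ?? FALLBACK).chartColor : FALLBACK.chartColor;
}
```

With this shape, a stray duplicate can no longer silently override provider metadata based on array order, because each provider's values exist exactly once.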
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmark/src/runner.ts`:
- Around line 93-100: The streaming-accumulated reasoning string (variable
reasoning in callModelRaw) is not returned or threaded into follow-up assistant
messages, which drops important multi-turn context; update types.ts to add an
optional reasoning?: string to ChatMessage and add reasoning?: string to the
ApiResponse returned by callModelRaw(), modify callModelRaw() to return the
accumulated reasoning alongside content and toolCalls, and when constructing
follow-up assistant messages (the assistant message assembly around
build/follow-up rounds) append/assign that returned reasoning to the assistant
ChatMessage so subsequent tool-call rounds receive it.
---
Outside diff comments:
In `@benchmark/src/runner.ts`:
- Around line 124-153: The stream loop processing in the for-await over stream
fails to detect mid-stream error chunks because it only checks for delta and
ignores top-level error or finish_reason === "error"; update the loop that
inspects each chunk (the code using chunk, delta, and chunk.usage) to check if
chunk.error exists or if chunk.choices?.[0]?.finish_reason === "error" and then
either throw a descriptive error or set a flag indicating the response is
incomplete; propagate that state into the ApiResponse (add/set an
incomplete/error field) and ensure callModel() and processQuestion() respect
that flag (skip scoring, retry or surface the error) instead of treating
truncated content as a successful completion.
---
Nitpick comments:
In `@benchmark/src/config.ts`:
- Around line 10-109: The MODELS array currently duplicates provider-level
metadata (providerWebsite, providerBrandColor, providerChartColor, country)
causing order-dependent overrides; extract these fields into a new PROVIDERS map
(keyed by provider or provider id) containing providerWebsite,
providerBrandColor, providerChartColor, country and remove them from each entry
in MODELS, then update lookup functions getProviderWebsite(), getModelColor(),
and getProviderBrandColor() to read from the PROVIDERS map (with sensible
fallbacks if a provider key is missing) and update any consumers of MODELS to
use PROVIDERS for provider metadata.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2e562ec6-ff6f-41ee-b84c-59a46b794297
📒 Files selected for processing (4)
- benchmark/src/config.ts
- benchmark/src/runner.ts
- src/data/results-with-skills.json
- src/data/results-without-skills.json
```ts
let content = "";
let reasoning = "";
const toolCallMap = new Map<number, ToolCall>();
let usage: UsageInfo | undefined;

let chunkIndex = 0;
let lastChunk: string = "";
let secondToLastChunk: string = "";
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
fd runner.ts
cat -n benchmark/src/runner.ts | head -120
cat -n benchmark/src/runner.ts | sed -n '85,110p'
cat -n benchmark/src/runner.ts | sed -n '155,175p'
cat -n benchmark/src/runner.ts | sed -n '205,230p'
rg "callModelRaw" benchmark/src/runner.ts -A 5 -B 2
rg "callModel\(" benchmark/src/runner.ts -A 3 -B 2
wc -l benchmark/src/runner.ts
cat -n benchmark/src/runner.ts | sed -n '240,330p'
cat -n benchmark/src/runner.ts | sed -n '330,420p'
rg "messages\." benchmark/src/runner.ts -B 2 -A 2 | head -80
rg "message:" benchmark/src/runner.ts -B 1 -A 1
rg "interface ApiResponse" benchmark/src/runner.ts -A 10
rg "interface ChatMessage\|type ChatMessage" benchmark/src/ -A 5
cat benchmark/src/types.ts | head -60
rg "ChatMessage" benchmark/src/types.ts -A 10
cat benchmark/src/types.ts | grep -A 30 "ChatMessage"
```

Repository: appwrite/arena
Preserve model reasoning across multi-turn tool call rounds.

The `reasoning` variable is accumulated from streaming responses (lines 162–167) but not included in the `ApiResponse` return value (lines 214–223), preventing it from being forwarded to follow-up assistant messages. Per OpenRouter and Gemini 3.1 Pro Preview documentation, multi-turn tool calls degrade when reasoning context is lost.

To fix this, extend the `ChatMessage` interface in types.ts to include an optional `reasoning` field, update the `ApiResponse` interface to return `reasoning` alongside `content` and `toolCalls`, return the accumulated reasoning from `callModelRaw()`, and append reasoning to assistant messages when building follow-up rounds (lines 394–398).

Also applies to: 162–167, 214–223
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmark/src/runner.ts` around lines 93 - 100, The streaming-accumulated
reasoning string (variable reasoning in callModelRaw) is not returned or
threaded into follow-up assistant messages, which drops important multi-turn
context; update types.ts to add an optional reasoning?: string to ChatMessage
and add reasoning?: string to the ApiResponse returned by callModelRaw(), modify
callModelRaw() to return the accumulated reasoning alongside content and
toolCalls, and when constructing follow-up assistant messages (the assistant
message assembly around build/follow-up rounds) append/assign that returned
reasoning to the assistant ChatMessage so subsequent tool-call rounds receive
it.
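A condensed sketch of those type changes; the `ChatMessage` and `ApiResponse` shapes here are simplified stand-ins for the real definitions in benchmark/src/types.ts:

```typescript
// Simplified stand-in for the real ChatMessage in benchmark/src/types.ts,
// extended with the optional reasoning field the review suggests.
interface ChatMessage {
  role: "assistant" | "user" | "tool";
  content: string;
  reasoning?: string;
}

// Simplified stand-in for the real ApiResponse, now carrying reasoning
// alongside the accumulated content.
interface ApiResponse {
  content: string;
  reasoning?: string;
}

// When assembling the follow-up assistant message for the next tool-call
// round, thread the returned reasoning through so context is not dropped.
function buildFollowUpAssistantMessage(response: ApiResponse): ChatMessage {
  const msg: ChatMessage = { role: "assistant", content: response.content };
  if (response.reasoning) {
    msg.reasoning = response.reasoning;
  }
  return msg;
}
```

The same pattern applies wherever `callModelRaw()`'s result is turned into a conversation message: reasoning is copied only when present, so providers that emit none are unaffected.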

What does this PR do?
(Provide a description of what this PR does.)
Test Plan
(Write your test plan here. If you changed any code, please provide us with clear instructions on how you verified your changes work.)
Related PRs and Issues
(If this PR is related to any other PR or resolves any issue or related to any issue link all related PR and issues here.)
Have you read the Contributing Guidelines on issues?
(Write your answer here.)
Summary by CodeRabbit
New Features
Improvements