Chore: rerun benchs #25

Merged

Meldiron merged 6 commits into main from chore-rerun-benchs on Mar 13, 2026

Conversation

Contributor

Meldiron commented Mar 13, 2026

What does this PR do?

(Provide a description of what this PR does.)

Test Plan

(Write your test plan here. If you changed any code, please provide us with clear instructions on how you verified your changes work.)

Related PRs and Issues

(If this PR is related to any other PRs or resolves or relates to any issues, link all related PRs and issues here.)

Have you read the Contributing Guidelines on issues?

(Write your answer here.)

Summary by CodeRabbit

  • New Features

    • Added support for additional AI models to the benchmark suite
    • Increased token limit for model completions to 16,384 tokens
  • Improvements

    • Organized benchmark models by pricing (cheapest to most expensive)
    • Enhanced streaming error handling to preserve partial results when stream failures occur


appwrite bot commented Mar 13, 2026

Appwrite Arena

Project ID: appwrite-arena

Sites (1)

Arena (arena) — Status: Ready

Tip

Every Git commit and branch gets its own deployment URL automatically


coderabbitai bot commented Mar 13, 2026

Walkthrough

This pull request updates the benchmark configuration and streaming implementation. The config file reorganizes the model entries array with new model definitions (including Grok-4.1, MiniMax, DeepSeek, Qwen, Kimi, GLM, GPT-5 variants, and Claude Opus entries) and adds new metadata fields such as country, provider website, and provider color specifications, while reordering entries from cheapest to most expensive. The runner file modifies streaming parameter handling by replacing maxCompletionTokens with max_completion_tokens and introduces try/catch error handling for streaming responses that captures partial content (reasoning, tool calls, and usage data) before gracefully returning incomplete results on stream errors.
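The two runner changes described above can be sketched roughly as follows. This is a minimal illustration, assuming hypothetical chunk shapes and an illustrative model id, not the repo's actual types: it shows the snake_case `max_completion_tokens` request field and a try/catch that returns whatever streamed content arrived before a failure.

```typescript
interface StreamChunk {
  choices?: { delta?: { content?: string } }[];
}

// Illustrative request parameters; `max_completion_tokens` replaces the
// previous camelCase `maxCompletionTokens` field per the walkthrough.
const requestParams = {
  model: "x-ai/grok-4.1", // placeholder model id
  stream: true,
  max_completion_tokens: 16_384,
};

// Accumulate streamed content; on a stream exception, keep the partial
// result instead of discarding everything (the "preserve partial results"
// behaviour from the PR summary).
async function collectStream(stream: AsyncIterable<StreamChunk>): Promise<string> {
  let content = "";
  try {
    for await (const chunk of stream) {
      content += chunk.choices?.[0]?.delta?.content ?? "";
    }
  } catch {
    // Swallow the stream error and return what has accumulated so far.
  }
  return content;
}
```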

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check — ❓ Inconclusive: The PR title 'Chore: rerun benchs' is vague and does not clearly communicate the actual changes. The PR involves significant modifications to model configurations and error handling in benchmark code, not just re-running benchmarks. Consider a more specific title, such as 'Update benchmark models configuration and improve streaming error handling'.

✅ Passed checks (1 passed)

  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.


Contributor Author

Meldiron left a comment


lgtm


coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmark/src/runner.ts (1)

124-153: ⚠️ Potential issue | 🔴 Critical

Handle mid-stream error chunks before treating the response as complete.

OpenRouter sends streaming errors as normal SSE chunks with a top-level error object and finish_reason: "error", not as exceptions. The current code has no check for these error chunks in the for-await loop—they simply skip past (since they lack a delta field) and allow partial content to be returned as a normal ApiResponse. This lets the benchmark score truncated output as valid answers. Check both chunk.error and finish_reason === "error" inside the loop and either throw or mark the response as incomplete, then have callModel() handle the error state instead of scoring it as a clean completion.

Suggested direction
+ let incomplete = false;
  for await (const chunk of stream) {
+   if (chunk.error || chunk.choices?.[0]?.finish_reason === "error") {
+     if (debug) {
+       debugLog("STREAM ERROR", chunk.error);
+     }
+     if (content || reasoning || toolCallMap.size > 0) {
+       incomplete = true;
+       break;
+     }
+     throw new Error(chunk.error?.message ?? "Stream ended with error");
+   }
+
    const chunkStr = JSON.stringify(chunk, null, 2) ?? String(chunk);
    // ...
  }

Then propagate incomplete in ApiResponse and have callModel() / processQuestion() skip or retry instead of scoring it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/src/runner.ts` around lines 124 - 153, The stream loop processing
in the for-await over stream fails to detect mid-stream error chunks because it
only checks for delta and ignores top-level error or finish_reason === "error";
update the loop that inspects each chunk (the code using chunk, delta, and
chunk.usage) to check if chunk.error exists or if
chunk.choices?.[0]?.finish_reason === "error" and then either throw a
descriptive error or set a flag indicating the response is incomplete; propagate
that state into the ApiResponse (add/set an incomplete/error field) and ensure
callModel() and processQuestion() respect that flag (skip scoring, retry or
surface the error) instead of treating truncated content as a successful
completion.
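The direction above can be made concrete with a self-contained sketch: detect mid-stream error chunks (a top-level `error` object or `finish_reason: "error"`) and flag any already-accumulated output as incomplete rather than scoring it. The chunk shape and function name here are illustrative, not the repo's actual code.

```typescript
interface ErrorChunk {
  error?: { message?: string };
  choices?: { delta?: { content?: string }; finish_reason?: string }[];
}

interface PartialResult {
  content: string;
  incomplete: boolean; // true when the stream errored after producing output
}

async function readWithErrorChunks(stream: AsyncIterable<ErrorChunk>): Promise<PartialResult> {
  let content = "";
  for await (const chunk of stream) {
    const isErrorChunk =
      chunk.error !== undefined || chunk.choices?.[0]?.finish_reason === "error";
    if (isErrorChunk) {
      // Preserve partial output but mark it so callers can skip or retry
      // instead of scoring truncated content as a clean completion.
      if (content) return { content, incomplete: true };
      throw new Error(chunk.error?.message ?? "Stream ended with error");
    }
    content += chunk.choices?.[0]?.delta?.content ?? "";
  }
  return { content, incomplete: false };
}
```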
🧹 Nitpick comments (1)
benchmark/src/config.ts (1)

10-109: Extract provider-level metadata out of MODELS.

providerWebsite, brand/chart colors, and country are now duplicated on every model row, while getProviderWebsite(), getModelColor(), and getProviderBrandColor() still resolve by provider and take the first match. A single inconsistent duplicate will silently change UI metadata based on array order. Consider moving provider metadata into a dedicated map and keeping MODELS model-specific.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/src/config.ts` around lines 10 - 109, The MODELS array currently
duplicates provider-level metadata (providerWebsite, providerBrandColor,
providerChartColor, country) causing order-dependent overrides; extract these
fields into a new PROVIDERS map (keyed by provider or provider id) containing
providerWebsite, providerBrandColor, providerChartColor, country and remove them
from each entry in MODELS, then update lookup functions getProviderWebsite(),
getModelColor(), and getProviderBrandColor() to read from the PROVIDERS map
(with sensible fallbacks if a provider key is missing) and update any consumers
of MODELS to use PROVIDERS for provider metadata.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2e562ec6-ff6f-41ee-b84c-59a46b794297

📥 Commits

Reviewing files that changed from the base of the PR and between b423d9a and 0666128.

📒 Files selected for processing (4)
  • benchmark/src/config.ts
  • benchmark/src/runner.ts
  • src/data/results-with-skills.json
  • src/data/results-without-skills.json

Comment on lines +93 to +100
let content = "";
let reasoning = "";
const toolCallMap = new Map<number, ToolCall>();
let usage: UsageInfo | undefined;

let chunkIndex = 0;
let lastChunk: string = "";
let secondToLastChunk: string = "";

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed against the appwrite/arena repository while verifying this finding:

fd runner.ts
cat -n benchmark/src/runner.ts | head -120
cat -n benchmark/src/runner.ts | sed -n '85,110p'
cat -n benchmark/src/runner.ts | sed -n '155,175p'
cat -n benchmark/src/runner.ts | sed -n '205,230p'
rg "callModelRaw" benchmark/src/runner.ts -A 5 -B 2
rg "callModel\(" benchmark/src/runner.ts -A 3 -B 2
wc -l benchmark/src/runner.ts
cat -n benchmark/src/runner.ts | sed -n '240,330p'
cat -n benchmark/src/runner.ts | sed -n '330,420p'
rg "messages\." benchmark/src/runner.ts -B 2 -A 2 | head -80
rg "message:" benchmark/src/runner.ts -B 1 -A 1
rg "interface ApiResponse" benchmark/src/runner.ts -A 10
rg "interface ChatMessage\|type ChatMessage" benchmark/src/ -A 5
cat benchmark/src/types.ts | head -60
rg "ChatMessage" benchmark/src/types.ts -A 10
cat benchmark/src/types.ts | grep -A 30 "ChatMessage"


Preserve model reasoning across multi-turn tool call rounds.

The reasoning variable is accumulated from streaming responses (lines 162–167) but not included in the ApiResponse return value (lines 214–223), preventing it from being forwarded to follow-up assistant messages. Per OpenRouter and Gemini 3.1 Pro Preview documentation, multi-turn tool calls degrade when reasoning context is lost.

To fix this, extend the ChatMessage interface in types.ts to include an optional reasoning field, update the ApiResponse interface to return reasoning alongside content and toolCalls, return the accumulated reasoning from callModelRaw(), and append reasoning to assistant messages when building follow-up rounds (line 394–398).

Also applies to: 162–167, 214–223

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/src/runner.ts` around lines 93 - 100, The streaming-accumulated
reasoning string (variable reasoning in callModelRaw) is not returned or
threaded into follow-up assistant messages, which drops important multi-turn
context; update types.ts to add an optional reasoning?: string to ChatMessage
and add reasoning?: string to the ApiResponse returned by callModelRaw(), modify
callModelRaw() to return the accumulated reasoning alongside content and
toolCalls, and when constructing follow-up assistant messages (the assistant
message assembly around build/follow-up rounds) append/assign that returned
reasoning to the assistant ChatMessage so subsequent tool-call rounds receive
it.
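A minimal sketch of the suggested threading, using hypothetical interface shapes (the repo's actual ChatMessage and ApiResponse in types.ts may differ): the accumulated reasoning is returned alongside content and copied onto the follow-up assistant message so later tool-call rounds keep the context.

```typescript
// Hypothetical shapes for illustration, not the repo's actual definitions.
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
  reasoning?: string; // proposed optional field
}

interface ApiResponse {
  content: string;
  reasoning?: string; // returned alongside content by callModelRaw()
}

// When assembling the assistant message for a follow-up round, carry the
// model's reasoning forward so multi-turn tool calls do not lose context.
function buildFollowUpAssistantMessage(res: ApiResponse): ChatMessage {
  const msg: ChatMessage = { role: "assistant", content: res.content };
  if (res.reasoning) msg.reasoning = res.reasoning;
  return msg;
}
```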

Meldiron merged commit 154f69e into main on Mar 13, 2026
2 checks passed