Chore: rerun benchs #25

Merged

Meldiron merged 6 commits into main from chore-rerun-benchs on Mar 13, 2026

Conversation

Contributor

Meldiron commented Mar 13, 2026

What does this PR do?

(Provide a description of what this PR does.)

Test Plan

(Write your test plan here. If you changed any code, please provide us with clear instructions on how you verified your changes work.)

Related PRs and Issues

(If this PR is related to any other PRs or resolves or relates to any issues, link all related PRs and issues here.)

Have you read the Contributing Guidelines on issues?

(Write your answer here.)

Summary by CodeRabbit

  • New Features

    • Added support for additional AI models to the benchmark suite
    • Increased token limit for model completions to 16,384 tokens
  • Improvements

    • Organized benchmark models by pricing (cheapest to most expensive)
    • Enhanced streaming error handling to preserve partial results when stream failures occur


appwrite bot commented Mar 13, 2026

Appwrite Arena

Project ID: appwrite-arena

Sites (1)

Arena (arena) — Status: Ready

Tip

Every Git commit and branch gets its own deployment URL automatically


coderabbitai bot commented Mar 13, 2026

Walkthrough

This pull request updates the benchmark configuration and streaming implementation. The config file reorganizes the model entries array with new model definitions (including Grok-4.1, MiniMax, DeepSeek, Qwen, Kimi, GLM, GPT-5 variants, and Claude Opus entries) and adds new metadata fields such as country, provider website, and provider color specifications, while reordering entries from cheapest to most expensive. The runner file modifies streaming parameter handling by replacing maxCompletionTokens with max_completion_tokens and introduces try/catch error handling for streaming responses that captures partial content (reasoning, tool calls, and usage data) before gracefully returning incomplete results on stream errors.
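The two runner changes described above can be sketched roughly as follows. This is a minimal illustration, assuming hypothetical chunk shapes and an illustrative model id, not the repo's actual types: it shows the snake_case `max_completion_tokens` request field and a try/catch that returns whatever streamed content arrived before a failure.

```typescript
interface StreamChunk {
  choices?: { delta?: { content?: string } }[];
}

// Illustrative request parameters; `max_completion_tokens` replaces the
// previous camelCase `maxCompletionTokens` field per the walkthrough.
const requestParams = {
  model: "x-ai/grok-4.1", // placeholder model id
  stream: true,
  max_completion_tokens: 16_384,
};

// Accumulate streamed content; on a stream exception, keep the partial
// result instead of discarding everything (the "preserve partial results"
// behaviour from the PR summary).
async function collectStream(stream: AsyncIterable<StreamChunk>): Promise<string> {
  let content = "";
  try {
    for await (const chunk of stream) {
      content += chunk.choices?.[0]?.delta?.content ?? "";
    }
  } catch {
    // Swallow the stream error and return what has accumulated so far.
  }
  return content;
}
```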

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check — ❓ Inconclusive: The PR title 'Chore: rerun benchs' is vague and does not clearly communicate the actual changes. The PR involves significant modifications to model configurations and error handling in benchmark code, not just re-running benchmarks. Consider a more specific title, such as 'Update benchmark models configuration and improve streaming error handling'.

✅ Passed checks (1 passed)

  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.


Contributor Author

Meldiron left a comment


lgtm


coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmark/src/runner.ts (1)

124-153: ⚠️ Potential issue | 🔴 Critical

Handle mid-stream error chunks before treating the response as complete.

OpenRouter sends streaming errors as normal SSE chunks with a top-level error object and finish_reason: "error", not as exceptions. The current code has no check for these error chunks in the for-await loop—they simply skip past (since they lack a delta field) and allow partial content to be returned as a normal ApiResponse. This lets the benchmark score truncated output as valid answers. Check both chunk.error and finish_reason === "error" inside the loop and either throw or mark the response as incomplete, then have callModel() handle the error state instead of scoring it as a clean completion.

Suggested direction
+ let incomplete = false;
  for await (const chunk of stream) {
+   if (chunk.error || chunk.choices?.[0]?.finish_reason === "error") {
+     if (debug) {
+       debugLog("STREAM ERROR", chunk.error);
+     }
+     if (content || reasoning || toolCallMap.size > 0) {
+       incomplete = true;
+       break;
+     }
+     throw new Error(chunk.error?.message ?? "Stream ended with error");
+   }
+
    const chunkStr = JSON.stringify(chunk, null, 2) ?? String(chunk);
    // ...
  }

Then propagate incomplete in ApiResponse and have callModel() / processQuestion() skip or retry instead of scoring it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/src/runner.ts` around lines 124 - 153, The stream loop processing
in the for-await over stream fails to detect mid-stream error chunks because it
only checks for delta and ignores top-level error or finish_reason === "error";
update the loop that inspects each chunk (the code using chunk, delta, and
chunk.usage) to check if chunk.error exists or if
chunk.choices?.[0]?.finish_reason === "error" and then either throw a
descriptive error or set a flag indicating the response is incomplete; propagate
that state into the ApiResponse (add/set an incomplete/error field) and ensure
callModel() and processQuestion() respect that flag (skip scoring, retry or
surface the error) instead of treating truncated content as a successful
completion.
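The direction above can be made concrete with a self-contained sketch: detect mid-stream error chunks (a top-level `error` object or `finish_reason: "error"`) and flag any already-accumulated output as incomplete rather than scoring it. The chunk shape and function name here are illustrative, not the repo's actual code.

```typescript
interface ErrorChunk {
  error?: { message?: string };
  choices?: { delta?: { content?: string }; finish_reason?: string }[];
}

interface PartialResult {
  content: string;
  incomplete: boolean; // true when the stream errored after producing output
}

async function readWithErrorChunks(stream: AsyncIterable<ErrorChunk>): Promise<PartialResult> {
  let content = "";
  for await (const chunk of stream) {
    const isErrorChunk =
      chunk.error !== undefined || chunk.choices?.[0]?.finish_reason === "error";
    if (isErrorChunk) {
      // Preserve partial output but mark it so callers can skip or retry
      // instead of scoring truncated content as a clean completion.
      if (content) return { content, incomplete: true };
      throw new Error(chunk.error?.message ?? "Stream ended with error");
    }
    content += chunk.choices?.[0]?.delta?.content ?? "";
  }
  return { content, incomplete: false };
}
```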
🧹 Nitpick comments (1)
benchmark/src/config.ts (1)

10-109: Extract provider-level metadata out of MODELS.

providerWebsite, brand/chart colors, and country are now duplicated on every model row, while getProviderWebsite(), getModelColor(), and getProviderBrandColor() still resolve by provider and take the first match. A single inconsistent duplicate will silently change UI metadata based on array order. Consider moving provider metadata into a dedicated map and keeping MODELS model-specific.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/src/config.ts` around lines 10 - 109, The MODELS array currently
duplicates provider-level metadata (providerWebsite, providerBrandColor,
providerChartColor, country) causing order-dependent overrides; extract these
fields into a new PROVIDERS map (keyed by provider or provider id) containing
providerWebsite, providerBrandColor, providerChartColor, country and remove them
from each entry in MODELS, then update lookup functions getProviderWebsite(),
getModelColor(), and getProviderBrandColor() to read from the PROVIDERS map
(with sensible fallbacks if a provider key is missing) and update any consumers
of MODELS to use PROVIDERS for provider metadata.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2e562ec6-ff6f-41ee-b84c-59a46b794297

📥 Commits

Reviewing files that changed from the base of the PR and between b423d9a and 0666128.

📒 Files selected for processing (4)
  • benchmark/src/config.ts
  • benchmark/src/runner.ts
  • src/data/results-with-skills.json
  • src/data/results-without-skills.json

Comment on lines +93 to +100
let content = "";
let reasoning = "";
const toolCallMap = new Map<number, ToolCall>();
let usage: UsageInfo | undefined;

let chunkIndex = 0;
let lastChunk: string = "";
let secondToLastChunk: string = "";

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed against the appwrite/arena repository while verifying this finding:

fd runner.ts
cat -n benchmark/src/runner.ts | head -120
cat -n benchmark/src/runner.ts | sed -n '85,110p'
cat -n benchmark/src/runner.ts | sed -n '155,175p'
cat -n benchmark/src/runner.ts | sed -n '205,230p'
rg "callModelRaw" benchmark/src/runner.ts -A 5 -B 2
rg "callModel\(" benchmark/src/runner.ts -A 3 -B 2
wc -l benchmark/src/runner.ts
cat -n benchmark/src/runner.ts | sed -n '240,330p'
cat -n benchmark/src/runner.ts | sed -n '330,420p'
rg "messages\." benchmark/src/runner.ts -B 2 -A 2 | head -80
rg "message:" benchmark/src/runner.ts -B 1 -A 1
rg "interface ApiResponse" benchmark/src/runner.ts -A 10
rg "interface ChatMessage\|type ChatMessage" benchmark/src/ -A 5
cat benchmark/src/types.ts | head -60
rg "ChatMessage" benchmark/src/types.ts -A 10
cat benchmark/src/types.ts | grep -A 30 "ChatMessage"


Preserve model reasoning across multi-turn tool call rounds.

The reasoning variable is accumulated from streaming responses (lines 162–167) but not included in the ApiResponse return value (lines 214–223), preventing it from being forwarded to follow-up assistant messages. Per OpenRouter and Gemini 3.1 Pro Preview documentation, multi-turn tool calls degrade when reasoning context is lost.

To fix this, extend the ChatMessage interface in types.ts to include an optional reasoning field, update the ApiResponse interface to return reasoning alongside content and toolCalls, return the accumulated reasoning from callModelRaw(), and append reasoning to assistant messages when building follow-up rounds (line 394–398).

Also applies to: 162–167, 214–223

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/src/runner.ts` around lines 93 - 100, The streaming-accumulated
reasoning string (variable reasoning in callModelRaw) is not returned or
threaded into follow-up assistant messages, which drops important multi-turn
context; update types.ts to add an optional reasoning?: string to ChatMessage
and add reasoning?: string to the ApiResponse returned by callModelRaw(), modify
callModelRaw() to return the accumulated reasoning alongside content and
toolCalls, and when constructing follow-up assistant messages (the assistant
message assembly around build/follow-up rounds) append/assign that returned
reasoning to the assistant ChatMessage so subsequent tool-call rounds receive
it.
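A minimal sketch of the suggested threading, using hypothetical interface shapes (the repo's actual ChatMessage and ApiResponse in types.ts may differ): the accumulated reasoning is returned alongside content and copied onto the follow-up assistant message so later tool-call rounds keep the context.

```typescript
// Hypothetical shapes for illustration, not the repo's actual definitions.
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
  reasoning?: string; // proposed optional field
}

interface ApiResponse {
  content: string;
  reasoning?: string; // returned alongside content by callModelRaw()
}

// When assembling the assistant message for a follow-up round, carry the
// model's reasoning forward so multi-turn tool calls do not lose context.
function buildFollowUpAssistantMessage(res: ApiResponse): ChatMessage {
  const msg: ChatMessage = { role: "assistant", content: res.content };
  if (res.reasoning) msg.reasoning = res.reasoning;
  return msg;
}
```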

Meldiron merged commit 154f69e into main on Mar 13, 2026
2 checks passed