Merged
3 changes: 3 additions & 0 deletions .optimize-cache.json
@@ -343,6 +343,9 @@
"static/images/blog/april-product-update-mongodb-support-appwrite-190-realtime-upgrades-and-ai-tooling/realtime_subscriptions_final.png": "26f3c24f5184967256bdf6f85d0c56b50eb106b6d1aa106588dd4941a24d4857",
"static/images/blog/april-product-update-mongodb-support-appwrite-190-realtime-upgrades-and-ai-tooling/ttl_caching0.75x.png": "45102679260700d191c2c1827d248e6a6eb82840c9a9eddf4e359bdf03f7c6e8",
"static/images/blog/april-product-update-mongodb-support-appwrite-190-realtime-upgrades-and-ai-tooling/Twitter_LinkedIn_Facebook.png": "1ff4ea62e7e51e03f86fa7aeb84787cf56046fd3d68f3745095794a6809e12bb",
"static/images/blog/arena-june-2026-update/arena-leaderboard-without-skills.png": "f164ccde7cad0a8316104fea77d841b3b08d453b31489e00b383c1275b25e885",
"static/images/blog/arena-june-2026-update/arena-opus-4-8-detail.png": "4008cb53a904cdf919f0fe7bf8820f6c9b6f46892c6fdda3ea7157633eb89b85",
"static/images/blog/arena-june-2026-update/cover.png": "e6f5d1d1f405a7bf42499cec7a8044ef80ac7d5dc83ae81a3cdfaa5bd5913023",
"static/images/blog/avif-in-storage/cover.png": "23c26ec1a8f23f5bf6c55b19407d0738aa41cdc502dc3eef14a78f430a14447b",
"static/images/blog/avoid-backend-overengineering/cover.png": "c586c235dd6d3f992980748ec7b15cd3411edefe2e71dffc080840540f6d3ba3",
"static/images/blog/baa-explained/cover.png": "a7b144c7549498760cc2bfddda186b8182766ef72e308abc637dc4cbb5a2c853",
116 changes: 116 additions & 0 deletions src/routes/blog/post/arena-june-2026-update/+page.markdoc
@@ -0,0 +1,116 @@
---
layout: post
title: "Claude Opus 4.8 tops Appwrite Arena: the June 2026 leaderboard update"
description: "Claude Opus 4.8 takes #1 on Appwrite Arena's without-skills board at 97.4%, the first model to beat Claude Opus 4.7, in a June update that adds four new frontier models."
date: 2026-06-01
cover: /images/blog/arena-june-2026-update/cover.avif
timeToRead: 6
author: atharva
category: ai
featured: false
faqs:
- question: "Which AI model knows Appwrite best in June 2026?"
answer: "It depends on the mode. With Appwrite documentation in the prompt, GPT 5.5 leads at 97.7% overall. Without any documentation, relying on training knowledge alone, Claude Opus 4.8 leads at 97.4%, the first model to pass 97% in that mode and the first to beat Claude Opus 4.7."
- question: "What new models were added to Appwrite Arena in June 2026?"
answer: "Four: Claude Opus 4.8 from Anthropic, Grok Build 0.1 from xAI, Gemini 3.5 Flash from Google, and MiniMax M3 from MiniMax. That brings the board from 11 models to 15, with the benchmark itself unchanged from May."
- question: "Why does Claude Opus 4.8 score higher without skills than with skills?"
answer: "Claude Opus 4.8 scores 97.4% without skills and 97.1% with skills. The model already knows Appwrite well from training, so adding documentation to the prompt does not raise its accuracy, and the extra input tokens raise the cost of a run from $1.56 without skills to $6.86 with them. It is the first model on the board you would run without skills for both score and cost."
- question: "What is the cheapest AI model that knows Appwrite well?"
answer: "MiniMax M3 offers the strongest cost-to-score ratio. It scores 95.7% with skills at $0.49 per run and 91.0% without skills at $0.09 per run. DeepSeek V4 Flash is similarly inexpensive at $0.37 with skills, scoring 96.1%."
---

[Appwrite Arena](https://arena.appwrite.io) is an open-source benchmark that measures how well AI models understand Appwrite. It scores each model on 191 questions spanning every Appwrite service, run twice: once with the relevant [Appwrite Skill](/docs/tooling/ai/skills) loaded into context, and once on the model's training knowledge alone. The gap between those two runs is what tells you how well a model already knows the platform. The June update adds four new frontier models, taking the board from **11** to **15**, and one of them, Claude Opus 4.8, takes first place on the without-skills leaderboard.

# Claude Opus 4.8 leads the without-skills leaderboard

On the without-skills board, where models answer from training knowledge alone with no Appwrite documentation in the prompt, [Claude Opus 4.8](/blog/post/anthropic-just-launched-claude-opus-48-with-fast-mode-and-dynamic-workflows) scores 97.4% overall and takes first place. It is the first model to clear **97%** in that mode, and the first to rank above Claude Opus 4.7.

| Mode | Rank | Overall | MCQ | Free-form | Cost | Correct |
| --- | --- | --- | --- | --- | --- | --- |
| With skills | 3 of 15 | 97.1% | 97.6% | 94.4% | $6.86 | 186 / 191 |
| Without skills | 1 of 15 | 97.4% | 98.2% | 92.1% | $1.56 | 187 / 191 |

For almost every model on the board, adding Appwrite documentation to the prompt raises the score, because the documentation closes a knowledge gap. Claude Opus 4.8 is the first model where that does not hold: it scores higher without skills (97.4%) than with them (97.1%). The model already knows Appwrite well enough from training that adding documentation to the prompt does not improve its accuracy.

The same pattern appears in cost. At **$5** per million input tokens, including the skills documentation in every prompt raises the with-skills run to $6.86, more than **four times** the $1.56 without-skills run. For Claude Opus 4.8, skills add cost and slightly lower the score, making it the first model on the board better run without them.
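
The cost gap can be checked with quick arithmetic. The sketch below uses only the two run costs, the $5 input price, and the 191-question count quoted in this post. The token estimate assumes the entire cost difference comes from extra input tokens, which is a simplification for illustration and not a figure Arena reports.

```python
# Figures quoted in this post for Claude Opus 4.8.
WITH_SKILLS_COST = 6.86          # USD per full benchmark run
WITHOUT_SKILLS_COST = 1.56       # USD per full benchmark run
INPUT_PRICE_PER_MILLION = 5.00   # USD per 1M input tokens
QUESTIONS = 191

ratio = WITH_SKILLS_COST / WITHOUT_SKILLS_COST
extra_cost = WITH_SKILLS_COST - WITHOUT_SKILLS_COST

# Assumption: the whole difference is input tokens (skills docs in the prompt).
extra_input_tokens = extra_cost / INPUT_PRICE_PER_MILLION * 1_000_000
extra_tokens_per_question = extra_input_tokens / QUESTIONS

print(f"cost ratio: {ratio:.2f}x")
print(f"extra input tokens per run: {extra_input_tokens:,.0f}")
print(f"extra input tokens per question: {extra_tokens_per_question:,.0f}")
```

Under that assumption, the with-skills run costs about 4.4 times as much and carries roughly a million extra input tokens across the benchmark.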

![Claude Opus 4.8 model detail page on Appwrite Arena showing 97.1 percent overall with the category breakdown](/images/blog/arena-june-2026-update/arena-opus-4-8-detail.avif)

# New models added in June 2026

Claude Opus 4.8 is not the only addition. Three other frontier models also joined since May, each with a different balance of speed and cost.

| Model | Provider | Overall (with skills) | Rank | Cost / run | Speed | Price (in / out per 1M) |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.8 | Anthropic | 97.1% | 3 of 15 | $6.86 | 40 tok/s | $5.00 / $25.00 |
| Grok Build 0.1 | xAI | 96.7% | 4 of 15 | $2.28 | 138 tok/s | $1.00 / $2.00 |
| Gemini 3.5 Flash | Google | 96.2% | 7 of 15 | $3.78 | 118 tok/s | $1.50 / $9.00 |
| MiniMax M3 | MiniMax | 95.7% | 10 of 15 | $0.49 | 25 tok/s | $0.30 / $1.20 |
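
One rough way to compare the four additions on value is to divide the with-skills score by the cost of a run. The sketch below does that with the figures from the table above. "Points per dollar" is a hypothetical metric chosen for illustration, not one Arena publishes.

```python
# (model, with-skills overall %, cost per run in USD), copied from the table above.
NEW_MODELS = [
    ("Claude Opus 4.8", 97.1, 6.86),
    ("Grok Build 0.1", 96.7, 2.28),
    ("Gemini 3.5 Flash", 96.2, 3.78),
    ("MiniMax M3", 95.7, 0.49),
]


def points_per_dollar(score: float, cost: float) -> float:
    """Hypothetical metric: overall score divided by run cost."""
    return score / cost


# Highest value first.
ranked = sorted(NEW_MODELS, key=lambda m: points_per_dollar(m[1], m[2]), reverse=True)

for name, score, cost in ranked:
    print(f"{name:<18} {points_per_dollar(score, cost):6.1f} points per dollar")
```

Because the scores sit within 1.4 points of each other, the ordering is driven almost entirely by cost.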

## Grok Build 0.1

- Ranks fourth with skills at **96.7%**, running at **138 tok/s**, roughly eight times Kimi K2.6's **17 tok/s**.
- Its free-form score gains **7.5 points** with skills, from **83.7%** to **91.2%**.
- Priced at **$1.00 / $2.00** per million tokens, or **$2.28** per with-skills run.

## Gemini 3.5 Flash

- Ranks seventh with skills at **96.2%** and runs at **118 tok/s**.
- Depends most on documentation of the new models: overall falls from **96.2%** with skills to **90.7%** without, and free-form falls **14.4 points**, from **91.9%** to **77.5%**.
- At **$9.00** per million output tokens, a with-skills run costs **$3.78**, among the higher figures on the board.

## MiniMax M3

- Offers the strongest cost-to-score ratio: **$0.49** per with-skills run (**95.7%**) and **$0.09** without skills (**91.0%**).
- Its **95.2%** free-form is the highest of the four new models.
- A clear improvement over MiniMax M2.7: **93.2%** to **95.7%** with skills, and **85.2%** to **91.0%** without.
- Its **$0.30 / $1.20** per-million pricing reflects a **50% discount** on OpenRouter running until June 7, 2026, so the cost figures above will rise once it ends.
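
To get a feel for what the MiniMax M3 figures might look like once the discount ends, the sketch below reverses the 50% discount. It assumes token usage stays the same and that run cost scales linearly with per-token price. The results are estimates, not published Arena or OpenRouter figures.

```python
DISCOUNT = 0.50  # 50% discount stated above

# Discounted figures quoted in this post (USD).
discounted = {
    "input price per 1M": 0.30,
    "output price per 1M": 1.20,
    "with-skills run": 0.49,
    "without-skills run": 0.09,
}


def undiscounted(amount: float, discount: float = DISCOUNT) -> float:
    """Reverse a percentage discount: list = discounted / (1 - discount)."""
    return amount / (1.0 - discount)


# Assumption: same token usage, cost linear in per-token price.
estimates = {label: round(undiscounted(value), 2) for label, value in discounted.items()}

for label, value in estimates.items():
    print(f"{label}: ${value:.2f}")
```

Under those assumptions every figure simply doubles, which would still leave the with-skills run under a dollar.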

# Without-skills leaderboard rankings

Adding Claude Opus 4.8 reorders the top of the without-skills rankings, where the spread between models is widest.

![Appwrite Arena without-skills leaderboard with Claude Opus 4.8 in first place](/images/blog/arena-june-2026-update/arena-leaderboard-without-skills.avif)
Comment on lines +72 to +73

P2 MiniMax M3 pricing discount expires six days after publish

The post notes the 50% OpenRouter discount ends June 7, 2026 — only six days after the publish date of June 1. Any reader arriving a week or more after publication will see cost figures ($0.49 with skills, $0.09 without) that no longer reflect actual pricing. Consider either updating the figures once the discount lapses or making the published cost clearly conditional (e.g., noting the post-discount price alongside the discounted one).


The top of the without-skills board now reads:

| # | Model | Overall | MCQ | Free-form | Cost |
| --- | --- | --- | --- | --- | --- |
| 1 | **Claude Opus 4.8** | **97.4%** | **98.2%** | **92.1%** | **$1.56** |
| 2 | Claude Opus 4.7 | 96.2% | 96.4% | 94.8% | $1.89 |
| 3 | GPT 5.5 | 94.0% | 94.5% | 90.6% | $3.97 |
| 4 | Kimi K2.6 | 93.6% | 95.2% | 83.5% | $0.48 |
| 5 | Grok Build 0.1 | 91.5% | 92.7% | 83.7% | $0.47 |

Two Anthropic models now hold the top two positions without any documentation, with GPT 5.5 third at 94.0%, 2.2 points behind Claude Opus 4.7. The free-form column shows the expected pattern: the models that drop the most without skills are those that rely on documentation to answer open-ended questions, and the gap between multiple-choice and free-form is widest in the lower rows, at 11.7 points for Kimi K2.6 and 9.0 points for Grok Build 0.1.

# With-skills leaderboard rankings

With Appwrite documentation in the prompt, the board compresses toward the top. **Ten** of the **fifteen** models score **95.7%** or higher, and the top six sit within **1.4 points** of each other.

| # | Model | Overall | MCQ | Free-form | Cost |
| --- | --------------- | ------- | ----- | --------- | ----- |
| 1 | GPT 5.5 | 97.7% | 98.2% | 94.8% | $4.51 |
| 2 | Claude Opus 4.7 | 97.1% | 97.6% | 94.2% | $3.07 |
| 3 | Claude Opus 4.8 | 97.1% | 97.6% | 94.4% | $6.86 |
| 4 | Grok Build 0.1 | 96.7% | 97.6% | 91.2% | $2.28 |
| 5 | Qwen 3.6 Plus | 96.5% | 97.6% | 89.8% | $0.58 |
| 6 | Kimi K2.6 | 96.3% | 97.0% | 91.9% | $1.64 |

- **GPT 5.5 holds first place at 97.7%**, the only model above **97.5%** with skills, on the strength of a board-leading **98.2%** on multiple-choice.
- **The two Anthropic models trade places from the without-skills board.** With skills, Claude Opus 4.7 ranks **#2** and Claude Opus 4.8 ranks **#3**, both at **97.1%** with identical multiple-choice scores (**97.6%**) and **186 of 191** correct. Without skills the order is reversed, with Opus 4.8 at **97.4%** ahead of Opus 4.7 at **96.2%**. Documentation lifts Opus 4.7 by **0.9 points** (**96.2%** to **97.1%**) but does not help Opus 4.8 (**97.4%** to **97.1%**), so the two converge once the docs are in the prompt.
- **The field stays tight below the top.** Grok Build 0.1 (**96.7%**), Qwen 3.6 Plus (**96.5%**), and Kimi K2.6 (**96.3%**) are separated by fractions of a point, so cost and speed, rather than accuracy, decide between them.
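
The effect of documentation on each model can be read off as the difference between the two boards. The sketch below computes that uplift for the models that appear in both tables in this post. It covers only the rows shown here, not the full 15-model board.

```python
# Overall scores (%) copied from the two leaderboard tables in this post.
WITHOUT_SKILLS = {
    "Claude Opus 4.8": 97.4,
    "Claude Opus 4.7": 96.2,
    "GPT 5.5": 94.0,
    "Kimi K2.6": 93.6,
    "Grok Build 0.1": 91.5,
}
WITH_SKILLS = {
    "GPT 5.5": 97.7,
    "Claude Opus 4.7": 97.1,
    "Claude Opus 4.8": 97.1,
    "Grok Build 0.1": 96.7,
    "Qwen 3.6 Plus": 96.5,
    "Kimi K2.6": 96.3,
}

# Uplift = with-skills score minus without-skills score, for models in both tables.
uplift = {
    model: round(WITH_SKILLS[model] - WITHOUT_SKILLS[model], 1)
    for model in WITHOUT_SKILLS
    if model in WITH_SKILLS
}

for model, delta in sorted(uplift.items(), key=lambda item: item[1], reverse=True):
    print(f"{model:<16} {delta:+.1f} points")
```

Claude Opus 4.8 is the only model in this set with a negative uplift, which matches the observation that documentation does not help it.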

# Resources

The Arena UI lets you filter by category, switch between with and without skills, sort by any column, and click through to a per-model breakdown with per-question reasoning and tool call counts. The repo is open source, so you can re-run the benchmark locally against your own OpenRouter key.

- [Appwrite Arena leaderboard](https://arena.appwrite.io)
- [Claude Opus 4.8 on Arena](https://arena.appwrite.io/model/claude-opus-4-8)
- [Grok Build 0.1 on Arena](https://arena.appwrite.io/model/grok-build-0-1)
- [Gemini 3.5 Flash on Arena](https://arena.appwrite.io/model/gemini-3-5-flash)
- [MiniMax M3 on Arena](https://arena.appwrite.io/model/minimax-m3)
- [Arena on GitHub](https://github.com/appwrite/arena)
- [Arena documentation](/docs/tooling/arena)
- [Appwrite Skills](/docs/tooling/ai/skills)
- [Discord community](https://appwrite.io/discord)
Binary file not shown.