From 27167f9d5f2ee7db40f88042bec5fe00c7c71712 Mon Sep 17 00:00:00 2001
From: Atharva Deosthale <atharva.deosthale17@gmail.com>
Date: Wed, 20 May 2026 16:20:22 +0530
Subject: [PATCH 1/7] blog(gemini-3-5-flash): add deep-dive benchmark and
 capability review

---
 .optimize-cache.json                          |   1 +
 .../gemini-3-5-flash-deep-dive/+page.markdoc  | 177 ++++++++++++++++++
 .../gemini-3-5-flash-deep-dive/cover.avif     | Bin 0 -> 6921 bytes
 3 files changed, 178 insertions(+)
 create mode 100644 src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
 create mode 100644 static/images/blog/gemini-3-5-flash-deep-dive/cover.avif

diff --git a/.optimize-cache.json b/.optimize-cache.json
index ecaeb30e0c..578e19c11f 100644
--- a/.optimize-cache.json
+++ b/.optimize-cache.json
@@ -653,6 +653,7 @@
   "static/images/blog/gdpr-mobile-apps-guide/1.png": "d3521c227ad9fa7fce40e66caa3e3f5fc982cf95086c590cc0a326031f6646d5",
   "static/images/blog/gdpr-mobile-apps-guide/cover.png": "11d53b8884d5ca45e7d9ba8fb904633795886d30cff6cc06a6af6b9fb7d1225f",
   "static/images/blog/gdpr.png": "e253390207e4d3e0ff28d3a4b94bee549aa6c8dc040bce604f5c6ff746dd9a1b",
+  "static/images/blog/gemini-3-5-flash-deep-dive/cover.png": "6b9257a7ba879bc37e7f81ca7d5c014e8760260dfd0d53d089016ed5c2f27f39",
   "static/images/blog/get-inspired-for-hackathon/1.png": "bdb21244945f4c483d23f84e5c429f548a45047a34d24d9c7f263cfca951ec3e",
   "static/images/blog/get-inspired-for-hackathon/2.png": "cec920ba9aa9996041e2b9134c52fdb09f91db307035334d458e47f6f116146c",
   "static/images/blog/get-inspired-for-hackathon/3.png": "87d6484adbe6049ab39bead992ffb57ab13bf1e1b3157b736f0bb5ad3ef1dde4",
diff --git a/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc b/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
new file mode 100644
index 0000000000..1b1c4ff4a8
--- /dev/null
+++ b/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
@@ -0,0 +1,177 @@
+---
+layout: post
+title: "Gemini 3.5 Flash: a detailed benchmark and capability review"
+description: "A detailed look at Gemini 3.5 Flash: what shipped at Google I/O 2026, pricing, Google's own benchmark table, Artificial Analysis numbers, and how it scores on the Appwrite Arena benchmark."
+date: 2026-05-20
+cover: /images/blog/gemini-3-5-flash-deep-dive/cover.avif
+timeToRead: 11
+author: atharva
+category: ai
+unlisted: true
+faqs:
+  - question: "What is Gemini 3.5 Flash?"
+    answer: "Gemini 3.5 Flash is Google DeepMind's mid-2026 Flash-tier reasoning model, built on the Gemini 3 Flash foundation with explicit thinking levels that trade quality for cost and latency. It accepts text, images, audio, video, and PDF input, outputs up to 64K text tokens, and has a 1M token context window."
+  - question: "Is Gemini 3.5 Flash better than Gemini 3.1 Pro?"
+    answer: "On the benchmarks Google publishes, 3.5 Flash beats 3.1 Pro on Terminal-Bench 2.1 (76.2% vs 70.3%), MCP Atlas (83.6% vs 78.2%), Finance Agent v2 (57.9% vs 43.0%), and GDPval-AA Elo (1656 vs 1314). It still trails 3.1 Pro on Humanity's Last Exam, ARC-AGI-2, and the 128K MRCR v2 long-context test, so it is not a clean replacement for the Pro tier."
+  - question: "How much does Gemini 3.5 Flash cost?"
+    answer: "API pricing is $1.50 per million input tokens and $9.00 per million output tokens, with a 90% discount on cached input ($0.15 per million tokens). It is free to use in the Gemini app and inside AI Mode in Google Search."
+  - question: "What is the context window for Gemini 3.5 Flash?"
+    answer: "1 million input tokens, with a 64K token output cap. The knowledge cutoff is January 2025."
+  - question: "Is Gemini 3.5 Flash multimodal?"
+    answer: "Yes. It accepts text, images, audio, video, and PDFs as input. Output is text only. Function calling, structured output, code execution, and search-as-a-tool are all supported."
+  - question: "Where can I use Gemini 3.5 Flash?"
+    answer: "Through the Gemini app, the Gemini API, Google AI Studio, Gemini Enterprise, the Gemini Enterprise Agent Platform, Google AI Mode in Search, Google Antigravity, and Android Studio."
+---
+
+Gemini 3.5 Flash shipped on May 19, 2026 at Google I/O. Google positions it as "Pro-level reasoning at Flash-class latency," with the claim that a mid-tier model can carry agentic and coding workloads previously handled by the Pro tier.
+
+This post evaluates that claim against three data sources: Google's published model card, [Artificial Analysis](https://artificialanalysis.ai/models/gemini-3-5-flash), and [Appwrite Arena](https://arena.appwrite.io), an open-source benchmark covering 191 questions across nine Appwrite service categories.
+
+# Model overview
+
+Gemini 3.5 Flash is built on the Gemini 3 Flash reasoning foundation with explicit thinking levels that control quality, cost, and latency. The variant on the [Artificial Analysis leaderboard](https://artificialanalysis.ai/models/gemini-3-5-flash) and in most of Google's published numbers is the "high" thinking configuration.
+
+Model specifications:
+
+- **Inputs.** Text, images, audio, video, and PDFs, up to a 1M token context window.
+- **Output.** Text only, with a 64K token output cap.
+- **Knowledge cutoff.** January 2025.
+- **Tooling.** Function calling, structured output, code execution, and search-as-a-tool are all first-party.
+- **Distribution.** Gemini app, Gemini API, Google AI Studio, Gemini Enterprise, the Gemini Enterprise Agent Platform, Google Search AI Mode, Google Antigravity, and Android Studio.
+- **Status.** Public preview at launch, free in the consumer Gemini app and Search AI Mode.
+
+# Pricing
+
+API pricing per million tokens:
+
+- **Input:** $1.50
+- **Output:** $9.00
+- **Cached input:** $0.15 (90% discount)
+
+How it compares:
+
+- **vs Gemini 3 Flash** ($0.50 / $3.00): 3x more on both input and output.
+- **vs Gemini 3.1 Pro** ($2.00 / $12.00): 25% cheaper on both input and output at the list rates used in the Artificial Analysis comparison.
+- **Within the Flash tier:** the most expensive Flash-tier model Google has released.
+
+# Google's published benchmark table
+
+The model card lists head-to-head numbers against Gemini 3 Flash, Gemini 3.1 Pro, Claude Sonnet 4.6, Claude Opus 4.7, and GPT-5.5. The full table:
+
+| Category | Benchmark | Gemini 3.5 Flash | Gemini 3 Flash | Gemini 3.1 Pro | Claude Sonnet 4.6 | Claude Opus 4.7 | GPT-5.5 |
+| -------- | --------- | ---------------- | -------------- | -------------- | ----------------- | --------------- | ------- |
+| Coding | Terminal-Bench 2.1 (Terminus-2 harness) | 76.2% | 58.0% | 70.3% | n/a | 66.1% | **78.2%** |
+| Coding | SWE-Bench Pro (Public, single attempt) | 55.1% | 49.6% | 54.2% | n/a | **64.3%** | 58.6% |
+| Agentic | MCP Atlas (multi-step MCP workflows) | **83.6%** | 62.0% | 78.2% | 69.5% | 79.1% | 75.3% |
+| Agentic | Toolathlon (real-world tool use) | **56.5%** | 49.4% | n/a | n/a | n/a | 55.6% |
+| UI Control | OSWorld-Verified | 78.4% | 65.1% | 76.2% | 72.5% | 78.0% | **78.7%** |
+| Expert tasks | Finance Agent v2 | **57.9%** | 42.6% | 43.0% | 51.0% | 51.5% | 51.8% |
+| Expert tasks | GDPval-AA (Elo) | 1656 | 1204 | 1314 | 1676 | 1753 | **1769** |
+| Multimodal | CharXiv Reasoning (no tools) | **84.2%** | 80.3% | 83.3% | 72.4% | 82.1% | 84.1% |
+| Multimodal | MMMU-Pro (no tools) | **83.6%** | 81.2% | 80.5% | 74.5% | 75.2% | 81.2% |
+| Multimodal | Blueprint-Bench 2 (normalized) | 33.6% | 0.0% | 26.5% | 6.7% | 24.5% | **36.2%** |
+| Long context | MRCR v2 (8-needle, 128k average) | 77.3% | 67.2% | 84.9% | 84.9% | 59.3% | **94.8%** |
+| Long context | MRCR v2 (1M, pointwise) | **26.6%** | 22.1% | 26.3% | n/a | n/a | n/a |
+| Reasoning | Humanity's Last Exam (full set) | 40.2% | 33.7% | 44.4% | 33.2% | **46.9%** | 41.4% |
+| Reasoning | ARC-AGI-2 | 72.1% | 33.6% | 77.1% | 58.3% | 75.8% | **84.6%** |
+
+Gemini 3.5 Flash leads Pro-class models on agentic tasks (MCP Atlas, Toolathlon, Finance Agent v2) and on multimodal reasoning (CharXiv, MMMU-Pro). It trails on academic reasoning (Humanity's Last Exam, ARC-AGI-2). For coding, results sit between 3.1 Pro and GPT-5.5 depending on the benchmark.
+
+The largest gain is MCP Atlas: a 21.6 point increase over Gemini 3 Flash and 4.5 points over 3.1 Pro. On MCP tool-call workloads, 3.5 Flash is Google's strongest model in the Gemini 3 series.
+
+# Artificial Analysis
+
+[Artificial Analysis](https://artificialanalysis.ai/models/gemini-3-5-flash) runs an independent evaluation suite and ranks models by Intelligence Index, a composite of 10 evaluations: GDPval-AA, 𝜏²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, and CritPt.
+
+Gemini 3.5 Flash on Artificial Analysis:
+
+- **Intelligence Index: 55** (rank #7 of 147). Top three: GPT-5.5 (xhigh) at 60.2, GPT-5.5 (high) at 58.9, Claude Opus 4.7 (max) at 57.3.
+- **Speed: 278 output tokens per second** (rank #2 of 147). Faster: gpt-oss-120b (high) at 246. Slower frontier-class models: Gemini 3.1 Pro Preview at 123, GPT-5.5 (xhigh) at 64, Claude Opus 4.7 (max) at 50.
+- **Verbosity: 73M tokens** generated across the Intelligence Index suite, against a leaderboard average of 36M. Verbosity counts how many output tokens the model produced to complete the eval suite. Higher means the model spent more reasoning tokens per answer, which raises latency and bill size even when the per-token price is low.
+- **Cost to evaluate the Intelligence Index: $1,551.60.** That is 5.5x Gemini 3 Flash and 75% more than Gemini 3.1 Pro despite the lower per-token rate. This is the total dollar cost to run the full Intelligence Index once, combining per-token pricing and token volume. It serves as a proxy for what the model costs on heavy reasoning workloads in production.
+- **Hallucination rate: 61%** on the AA hallucination measure, 31 points lower than Gemini 3 Flash. The hallucination measure is the share of responses on a fabrication-probing prompt set where the model produces incorrect or invented content. Lower is better, and a 31-point drop versus the predecessor indicates a material gain in factual reliability.
+
+On the intelligence-versus-speed axis, Artificial Analysis ranks Gemini 3.5 Flash as the Pareto leader. No model in the same intelligence bracket runs near 278 tokens per second.
+
+## Intelligence per token against SOTA peers
+
+Per-model summaries from Artificial Analysis:
+
+| Model | AA Intelligence Index | Output tokens (full Index) | Total eval cost | Speed (tok/s) | Input $/Mtok | Output $/Mtok |
+| ----- | --------------------: | -------------------------: | --------------: | ------------: | -----------: | ------------: |
+| GPT-5.5 (xhigh) | 60.2 | 75M | $3,357 | 65 | $5.00 | $30.00 |
+| Claude Opus 4.7 (max) | 57.3 | 110M | $5,117 | 50 | $6.25 | $25.00 |
+| Gemini 3.1 Pro Preview | 57.2 | 57M | $892 | 123 | $2.00 | $12.00 |
+| **Gemini 3.5 Flash (high)** | **55.3** | **73M** | **$1,552** | **278** | **$1.50** | **$9.00** |
+| Kimi K2.6 | 53.9 | 170M | $948 | 98 | $0.95 | $4.00 |
+
+Two points are worth calling out.
+
+**GPT-5.5 is more intelligent on a similar token budget.** GPT-5.5 (xhigh) generates 75M tokens for the full Intelligence Index against 3.5 Flash's 73M, a 3% difference. For roughly the same output token count, GPT-5.5 scores 60.2 versus 55.3. The reason GPT-5.5's eval cost lands at $3,357 against 3.5 Flash's $1,552 is per-token pricing ($5/$30 vs $1.50/$9), not token efficiency. On quality per token, GPT-5.5 leads.
+
+**Gemini 3.1 Pro is the sharper internal comparison.** 3.1 Pro Preview generates 57M tokens, 22% fewer than 3.5 Flash, and scores 57.2 on the Intelligence Index, 1.9 points higher. Total eval cost is $892, 42% lower than 3.5 Flash. The only axis where 3.5 Flash leads is speed: 278 tokens per second versus 3.1 Pro's 123. Google's "Pro-level reasoning at Flash-class latency" claim holds on latency. On the Intelligence Index itself, 3.5 Flash is the second-best Gemini and uses more tokens than 3.1 Pro to reach a lower score.
+
+# Appwrite Arena: backend SDK and API performance
+
+Public leaderboards measure general capability, not whether a model can drive an SDK without hallucinating method names. [Appwrite Arena](https://arena.appwrite.io) is an open-source benchmark covering 191 questions across nine Appwrite service categories: Foundation, Auth, Databases, Functions, Storage, Sites, Messaging, Realtime, and CLI. Each model is evaluated twice: once with the relevant [Appwrite Skill](/docs/tooling/ai/skills) loaded into context, and once without. Results are published on [GitHub](https://github.com/appwrite/arena).
+
+Top finishers on the May 20, 2026 run:
+
+**With Skills loaded (Skill files in context, 191 questions):**
+
+| Model | Overall | MCQ | Freeform | Cost (USD) | Duration |
+| ----- | ------: | --: | -------: | ---------: | -------: |
+| GPT 5.5 | 97.70 | 98.20 | 94.80 | $4.51 | 33m |
+| Claude Opus 4.7 | 97.10 | 97.60 | 94.20 | $3.07 | 53m |
+| Qwen 3.6 Plus | 96.50 | 97.60 | 89.80 | $0.58 | 54m |
+| Kimi K2.6 | 96.30 | 97.00 | 91.90 | $1.64 | 135m |
+| **Gemini 3.5 Flash** | **96.20** | **96.90** | **91.90** | **$3.78** | **20m** |
+| DeepSeek V4 Flash | 96.10 | 96.40 | 94.20 | $0.37 | 125m |
+| Gemini 3.1 Pro (Preview) | 92.70 | 93.30 | 88.80 | $4.44 | 45m |
+| Gemini 3.1 Flash Lite (Preview) | 88.30 | 89.70 | 79.40 | $0.59 | 19m |
+
+**Without Skills (model's built-in knowledge only):**
+
+| Model | Overall | MCQ | Freeform | Cost (USD) | Duration |
+| ----- | ------: | --: | -------: | ---------: | -------: |
+| Claude Opus 4.7 | 96.20 | 96.40 | 94.80 | $1.89 | 25m |
+| GPT 5.5 | 94.20 | 94.50 | 90.00 | $2.19 | 27m |
+| Kimi K2.6 | 93.60 | 95.20 | 83.50 | $0.48 | 103m |
+| Gemini 3.1 Pro (Preview) | 92.50 | 95.30 | 76.90 | $1.34 | 26m |
+| **Gemini 3.5 Flash** | **90.70** | **92.90** | **77.50** | **$1.14** | **13m** |
+| GLM 5.1 | 90.20 | 91.50 | 81.90 | $0.30 | 45m |
+
+Three observations from the Arena data.
+
+**It is the fastest model in the top tier.** 20 minutes with Skills and 13 minutes without is faster than every model scoring above 90. The next-fastest top finisher is Gemini 3.1 Flash Lite at 19 minutes with Skills, but it scores 7.9 points lower.
+
+**Skills materially improve the freeform score.** Without Skills, freeform scores 77.5%. With Skills, freeform reaches 91.9%, a 14.4-point increase. The same delta for GPT 5.5 is 4.8 points (94.8 to 90.0), and for Claude Opus 4.7 is 0.6 points (94.8 to 94.2). 3.5 Flash relies more on in-context documentation than its frontier peers, consistent with the January 2025 knowledge cutoff.
+
+**Category profile.** With Skills, 3.5 Flash scores 100% on Messaging, MCQ Foundation, MCQ Auth, MCQ Functions, and MCQ Sites, and 94.1% on Realtime. The weakest categories are TablesDB (89.1% with Skills, 77.8% without) and CLI (95.0% with Skills, 73.3% without). Both require the most current API surface, which the knowledge cutoff does not cover.
+
+# Workloads where 3.5 Flash is the right choice
+
+- **MCP-driven agents.** MCP Atlas at 83.6% is the highest result Google has published on the benchmark. For agents driving an MCP server such as [Appwrite's API MCP](/docs/tooling/ai/mcp-servers/api), 3.5 Flash is the most cost-efficient frontier option.
+- **Throughput-bound multimodal pipelines.** CharXiv at 84.2% and MMMU-Pro at 83.6% at 278 tokens per second is a combination no other top-ten Intelligence Index model provides. Document ingestion with charts, audio and video reasoning, and pipelines with many small multimodal calls benefit directly.
+- **Iterative coding agents on bounded scope.** Terminal-Bench 2.1 at 76.2%, a 1M context window, and the highest throughput in the top ten allow more iterations per wall-clock minute than any frontier alternative. The reasoning gap to Opus 4.7 and GPT-5.5 only becomes a constraint on research-grade tasks.
+
+# Model selection for Appwrite projects
+
+[Appwrite](https://cloud.appwrite.io) provides the primitives an agent needs to operate on a project: typed tables, scoped API keys, an [API MCP server](/docs/tooling/ai/mcp-servers/api), a [Docs MCP server](/docs/tooling/ai/mcp-servers/docs), and [Agent Skills](/docs/tooling/ai/skills) for every major SDK. The Arena results above show how each model performs against this surface.
+
+Speed is the column where Gemini 3.5 Flash dominates, but speed is not coding intelligence. On the Arena freeform scores and the SOTA Intelligence Index comparison above, GPT 5.5 and Claude Opus 4.7 lead 3.5 Flash by a meaningful margin on the same Appwrite coding tasks.
+
+Two recommended defaults:
+
+1. For interactive workloads where a developer waits on the response, **Gemini 3.5 Flash with the Appwrite Skill loaded** is the fastest top-tier option. Use it when iteration speed beats per-response correctness.
+2. For coding work where correctness matters more than wall-clock latency, **GPT 5.5 or Claude Opus 4.7** lead. Both produce higher quality code on the same Appwrite tasks, with or without Skills loaded.
+
+For other cases, optimize on the price-to-throughput frontier, where 3.5 Flash sits.
+
+# Next steps
+
+Select Gemini 3.5 Flash inside a tool that supports it: Cursor, Google AI Studio, Google Antigravity, or the Gemini API directly. To connect Appwrite to the model, follow the [Cursor plugin docs](/docs/tooling/ai/ai-dev-tools/cursor) for Cursor, or the [Antigravity MCP setup docs](/docs/tooling/ai/ai-dev-tools/antigravity) for Antigravity. Both walk through adding the Appwrite API MCP and Docs MCP servers so the model can act on your project.
+
+- [Appwrite Arena](https://arena.appwrite.io)
+- [Gemini 3.5 Flash model card](https://deepmind.google/models/model-cards/gemini-3-5-flash/)
+- [Artificial Analysis: Gemini 3.5 Flash](https://artificialanalysis.ai/models/gemini-3-5-flash)
diff --git a/static/images/blog/gemini-3-5-flash-deep-dive/cover.avif b/static/images/blog/gemini-3-5-flash-deep-dive/cover.avif
new file mode 100644
index 0000000000000000000000000000000000000000..d54a1b14f17cdcc541bcac93d92a72bc25e5da12
GIT binary patch
literal 6921
zcmZu$V{~5K){Si^jh)75j3>4l+cq0Ec7w*Y)u?eAJ9%Q;w)uSN``$b5pKpzE7Uo=Y
z?6daywa*3v10ytd^>j3Hw=xI&Fgq)A*1x%(mC;8H?r3M`YV;TXFfvP1TjzgMFfd0e
z6W9Od|BWb)R<3sc6yP78!^+md<ZmSg0D}Pgi-3VA{u{x-@Nqt{4<ji5F9K%pQF2>Z
z+5htz|FGB}h1oxCA8I3a7S_Lt|A8OOf5Bf?j?RuBWTcg&iNi-kGID1X4WRl*i)H2F
zX!chL1_u4{Kmb1+nxmD6)n5b>4i4@k{M!#r_!s(^;$IH@FT-u*<{|nQc5}ApwR1GH
z_{WLxnpl}Q^167sxSH9y@P71I*;*Mnd-59DI$Hj-Uqn`pcK;+F(vRsO10W#5AfSRF
zphG_Z0xM^;f5-e-?n7+*&jx)AiSKJ?YUGLmhJe6hZSj`T13`dH$`-;VDb4_fCWfX2
z7687!MkLs0K#|cA*wy(>U~!yfHK+4i70Iw-q|PvSEy@_gfM};D!O)D^(&QnjU&TU7
zNF2V_=FRZP>{?O&XNowtxOse;V`G?JM!N`)zNGo1bM>kpA*<`R%w3lB5V%HH`|4k*
zIYTvf*KV?lP-gBZxOlX+nar{MyRyBgXawpbE{1VoP3(!q1s*zve1PdVAX{wWZq9Uv
zCJuOM))s(3B0JLCak%;une<P%3Y2uzVLB8wU8NPks<T@wi)S<7q<wv;<lF@F7|(Kn
zdK?3PwNh!=exdiF0pe^IQ}H5$h?C}_pjP?DfE^&5viW*`(7tqHY5dE45GS!~>DW-{
zgt9T{GkKmavjNvWyvkABqc1m=*HAp&5KhB*UB`F!S6GYYRI4HlC<Kmu-1!P9;=w6-
z+#!6nYf-Q9eZ&~67hbe|?qMH**|GJY_5jGWov!szC6t&*t$Z6IW6db4wx{u&^_jI1
zHUg4Fiyq;AsNW7I+_++h07ZkKc&fp!D;w@V-ynPC2l3I>QZD9nSrJFd^TZ~Nz23wc
z`rsWVTGS{0Yjv5Gi9?7Mx-UNBggjV2E~7c8Ix#LPdNzUb1=3y2p?jew)xp$k9^!Bs
zch^xH2Bh-VZlondi!lXt(q)kMt^+;nK8!em++YJn+GcFKIP4xFqHcbAx{{a1bJL~Q
zKNZItYe-`#?@7C-3=IKd>6$k`w1S*=KFd!U6_EO}tFq-|Wb>L4Y`_39?L7|c`PjNh
z8sHrrKAYs-<ts%CIvHhmOuh=%u=>S1B6XE{o26DI-=EW|EaRy^iwZFbt2o8xH>Wbr
zoFf2XWSo?zx7PQV<HCj%zsoQ-?wRvVlHUkwG)E0j8UfM>2xoPYUJ+5!jr|txZKQWl
z5Hub&{DAgl2SC<A*|9jHR;in}LQ<yRm@CzLDlx&dr&>zKK-jW)Qk%gTf2L0kf)2dr
zBH$q{<lhzF)shRjxAz2$9u#aQMv4vcGi`$@pH!{Lkc?6kGk0POa`=9w7&knS-a@w2
zKL<2mnQWuUYG`kl<|@a?bn}Hj)K<5^C0w>cKv9CXVt;*uC+F(%-Xv+Pjkl&EKP1I<
z(jF=NZFMY76xQtY8U=IQ2RHGOR`&vxp}*)UI{j60%M$uwXrRU>iS|py3#y#UtBL;8
z;i9Z6eT4`$Lw$hXZ=PzjKjNF_L)YsHU0n=D;x;iZCNI@z%lTJKeqS=qi=+{g<c4w=
z`N7NAGxmAr`AB_2R7}GyEPW<g7+nAj>ca>J_wl}rh{&78jn~Nx%6T#b-nnA0WsJz?
z1P(FhFdu7>c_vsVKIk_3MHqRiulB37Aad&8;AsW>_3G>Bua`+r6K}J+f9<npM=3bP
z^C|Ir$XUVuE&th}uz8pBiJ}bj(uXshcuz7hjY6bMCm5Qv)4kM|EQcfVT_IWEXSw3M
z4!~n+mpGNZ%n-F*50QG8NP$qLCKyYGav_&kT(=&2Zoh82w|{yA!<-*u|Fpq87A%!x
zv?(wOW@#eAh87J$$fS6rXlL%oCIm7(4B41{H8mLpS2wB(qQq~aL`ym!>cEQ`g`?2L
z+~-QRl138|)v35=*27=@xFSpuC|EVtHFd7*93%>*D_UE|uH9A0mYUFE+k=Os#}I3b
z>T8C>9osyEwb7y+1UqP%VHZ%AH^L<c+T#7?fFGyOw8|l0<S8=pu&hIKB`aKVpDdIa
zczKy<w%xeeu*|{sb2vou<>2R7rPS%}=OnJp`iR~xsItZKL-pFWN5OKPB!|+8T35nO
zRzMw*-V}sXxz|ppo^Ow2U4~eIkP0b-*Sf3y%Pns}roXxI>tH$4q0kWS)Ohu`>@@Fe
zsG9(^IXFD(aSFvB_O7gYbfdUB-oqnmV+L;S?MB|)0bPMbyw^kJTmmqiF6p)-3`yv2
z3+8^e5^#+ys*WsfU;L8%2$*?(wcZNW=6caA-Dy}tN7nM?hg#m0%!QFPI36bwOK!@_
z@DC~|HDbSi*MZOjaoMmA*JY!qrMexPt=zWiMWoj!-{fGf8b@V_%ckkVx!BakdN@M$
z8w&V*(pF8!>PwIEUKXG=*~y3MOAB<8I_eb?o?DJ~N;Ww+Mh;pwRg|IhYWAra1*`1?
z((vy$$c8GX?$b7zR?FT6t#Mc#3-4#x{0gEsCxx~&6(wAOeLg)9_qru#rEeB~?ljfK
z4xvAk0a<Mox1;i{dhH82Ym1kO1w-kw7j+E;PB|Rq5GnKXU`#!lwJ9W4o){BWR#ilG
z#wz6K^CJ3%CZ_r-byy@o>>`<wvzZvJXd~GZ{n(*lu~~j9D^damO%gFJ2*W+Gz`3<1
z>aLGcv{X$$^>s)d`%G&5P#wgx(4N%{81V64GwU+zj7??fYIOaQzDFMN%un5T@q3Dk
z)OF}=zZt`_N~d#_<GV`=Q^RifpNe>sGLMDTO6(1qfccpcD+9HSb(ed9NQ2?yJsBKO
zQX85mQ_66<o|K^9`wI-}B=R*z<=im|_R`2NcvFLJ0Bkf55_B<C>xDa}!YBJ=^_X_#
zQjq4N=jvBU*xIM=v&+VKP7&j$>u*JecG+PxUCC?~+zgb>>&!vyVy!Z#H3-z9G(ROs
zzcBJj4DXFn>5okDWRjw5nP6;+6|=)HNqvgyi+rz=nxKisA(~5JNgg09-}|C5L<67%
zQcjCxX@-=eX!0S8#{FE4G0jE^3S?tm6;9Bd_Nl*f*jraF`jSmP;H~>$-5h4sGFggw
zl9e(shw#&$D<|<<?ND?q#-YzbKi=0zcHeT;HEBBAx2Ut7@!mg!8NNrKdhRlxS=3uP
z;bpcQf$V&REcP-1h(8e#-YmcXO;?!}*a+Rn=TuUBRBEICVtUq06O}7BlKcg5#}PV9
zVQXB&Spozg&E*s1e-ozhPEBAdI7S)QM&?k)Lg#~3UeR?!)Uttb(LuH~Ba6>_llp>2
zKQWP*<?6d&KurXQ_oT6_0WjL}D3t<{>WS6oveUORdPUYdHRFd`);2?1ex~6Dg=d$s
z%WR)%{&}qLMJMRwcNOzW-vOX1IRt@<n{Xbi#V}tHbME8OHI>p9Ldj+Ge(;W3Zl=8W
z*{)C+*<&&H9B^{7P&TyV3&y%p)rh?0#|>DRwnYR$m?B7kttQ|Lh}Ghk4kbQEAM7^I
zMy22Xh9-W<a+oV3?1%hpg*J`D%`QGLWrSCbclz^iOVOBbj{?TwHv3(%Bz$wG@O5{X
z@QDK)fiWfraF(BfN}p3x{T-Xp9V}bteHfpCDQ4Vxh}8RHg4_esjACFB{IXz4+uJ*B
z*L^kp6*k0-C>;^bnGupyAoc*<&DFiTkaHw4NS%J&h~dJVv>zjqBsYG0fY$r9Zhq10
zJ+{rT@?2|0x-0F1K8M@!@f0Y&L9YqyszVR#PEJ`z>IjqW@Xy3TKfuln!EFe1V1T-9
zwUjYBL-i9!N<i%l6YCXQ8>!EsQJv_%P!m8hZz=wLpl~?w*|B!^dO7$!yv0r`0CDz8
zw4m79V9>O4YLX*)b?e?J*)ulMJ%h(sCnxEv2E0G*Egg{W=fiO+*fcyjeN(r>MHqQU
zR@Qs^j^j!?icS|th&V}|$euVcV;N@U?ppsR^s@=YGULyq5go+dPqr`Api?bKX>^gb
zjbrD+_cHO-MLtYMph)Zen2JaUYfQ?fif6>wls(RnIsn54$zmJp^uR`&I;~P}fS8&+
z+%)N@F?Ft<hQL#9@S2<dqx`bUcJ}XN#Y2@*{DW?pV&x-I4q85V81F6oCW>;s09r&;
z$l4ksbFpO2I6=G~N~v1+OB?8YUB+<kk`%EsPj30HdLJ8kW7^R{y_p~D9Sf_Y8CMBO
z<Cp9`nVe!Y&WMi4?Apu}(h(X4O~!*c8b$e0+>d)VyVFX@FZ?k245*c8VtT4uaJ?si
zv2f@DtA5ZwPFqMhXQ$HmCwW7<DZe@e!!X`xzn`?Pn9aLhKM?Lonjyq4Fl6aOqm@vp
zY{V`9@n0^}2hoqK0?lWR1;7L7zxA`_LE@lNO<w)NWvi#b;+(?2)p^CNIlzCRWWr3<
zw-mjUOrq))5u)u_BRD!3;UdUSDPXBr7LT!m&ndaS(<E2lSr_sqK#lG>q;)7UP&WhV
zJV2=%DcS5_5m*X2o;38iqhrZ4S!)vVA`p>-VQpQ%?_f6DUQgEM&=XM|m5HwaaP{gd
z;yvR-w906^MoHWxV(zx;7+_}_Dh$#FN%(;VR7cz)uf#VH#d$b0TovcyIA@G(ceI~`
z7tRtp!lFTDkzsSsEG%?Od9SNB5J^7YkuvdV3~9ETv29Ma!|R$Qe<EiEz+)!vC%c!B
zVe7MfRiiZLGCr3&0oI_8@fs5^2w#FM%I8b0tmPE@xlTu6Dg($Z(IXUOx*>#I4{<e9
zCWD@4)K^^y?}KNqPu4T)rjf7Ju&PBE*0N&;2_T(3T(VWlIh?JPPE`kJLxa+BXHm19
z3S-is{)oS<?aA>AtmljeM#FE^ky=ZBRjMXqeC@?Iv95_LZKMy$sya8sYFLCLSX=DP
z3hBgndZm8-^zCpZq$F!XC*y%&j~>}r(%6-j5p(C0as#A&G`u<6neAx*O?^}P*eAh^
z2ZSq`v}}lxHYwr(ISXH3zvjJ9YXQud!cy)kWBMqn5GHZ_2@?ztyih8zin&k)-`T`h
z5h=4-h%e$yCT|73O}2wVnB*48_Mxam7p}lAs)S~e=*p?Z-6r^hM!<7T-2f3XC{Y$C
z`LzeXtrgf^jfVn`U*fDdT#dqTa%SC^{xlnLrttdNz>M^_rAK(zq?`}-rEPtU4TDrZ
zhdHWwBpr`U&L1Xe4-vO+OY~1Q*V6T=!L~G`^ydbD%(GuYA;Z5qacd`cD0&Ep53XK!
zKUuv$U~WCbxKQxe-e6#UD}Ex1_Odsu$~Uqkbi}54j)Zb@UXmbI9@SfQe_<V9v!#Qp
zXet>JB4}+w)s-XB;Y@_3$^!`-a9+=Vf*tP@RgjA!)E%nSN3D^UAtb0`kHwEIy@CYn
zzgF80!Lmf@)9E-9c|gQ8#F8Ftml+674YiiwxErrRu8pI0`W8+iT{NPltuGJSylXsE
z;TV3Wd95=Bkok#JZVs#E<}$5suqv`&Gc*ouH$zF%@v)%p2=Fyvw0a1;hsj|LO3=Ei
z6b?6wyeyn031@W;B4RFa+J*eGh)N;X>|uqn)N|oWoiwqBc{)R{T*6?Q1ALy5l9_zw
zm`EgrXY>@7;z;12^puc40^jCFh<Gx<uTa^%Rlc%V{Pw@&coSqUuOr@Q;J9_PuCVin
zI?l^VScHgfmK*C5ef3X-%PUYs19dRgOF?eg<wo}>?BNJle}TX!0FuyaeUgQ+A@aDb
zAX(K7HXV9A=V0^JjI1_R&j21Wia6}9#E&`iC*UsTFh=C6j%S@clQ-}je?sQU7wBgj
zuEl;{G(?jl+Q;s@CX%UK=+aC7Kr@D6$sLFW`i*;ix4^tTTF0@m02`<rawShWepBh5
zQfvQ6c0Bqg#t)xaugNw}fUw<t+zqw(Y=YnN;txlTQV&s|6V6E$F8>{YXQ7IIF>%?(
zBR?M=Qi^W`qFBQ*9k)wck3KGk{lXZ~I;Zmpvw}?>hgyv58B-u(yg>twUdqjnRdLtI
z;f0ZPEP$${<dugkbP;Q@1}XBXB;vJv_dQU*LK;y8@|u*Qsx!tWANzqin;1?y-Nzhr
z<{&C=w3yp+9?FPN;zrs4B=bf{4_umnvx9iY=srYY?s92N(W_PhK>hbeNV}3sf!P-b
z$8V9RYsb$)#kpIV5v!tO;@~_W;?fd{D8}b$*-YI_)q?cnTTZZzt5i^fMQ@{0F_Y5s
zcBU?|Za*Nq)bU4{(5{bMCqcn*N^VqMxMAX`I<8YmVQ}Af2@y_(rg6K7`z@GR9^s|;
zmhap?PcrjEzjo#VqZKO@!VxPj&0pHG)Dfbg&Bx{{t=d9?GX0Uz$7@V|Zz>&dYuAhf
zp-ho&y1@w(1;5BzKwE1I$(058M|S9-YY~|3_sl*PApC1PIljP*mai%&B9(+ugMax*
zb&@|#sM;gERy8`dp8Ec9t2-eY#>0V}9p+mwz7#pCzr;hG4~CG!wf-6h;Ee!OCM@K9
zR_yXdjLq(G6HeB2biq!fQ$*9K?sp1m%M*^<YLKF#G9aM<3W_o@mx6>zaGxWGv2F*!
zjMpq3+j)H^&Mtc{@JgqGB5$~LJ-W<CFdLW7D4O~M@k++upDWSs<ii(nO8$JK2%{A(
z>3N!6Zp8IpzAlwlI=K>?rJCD<MV_Gq;!djfX(cGX8aviv(#Aj&C$3U2*%E*irC@UD
zUhMBDI~t3yp?jtYa=wc|Lt?QiHu?8tItFaze!{@?+>Nzt`G(AFzsahH&10I|t#(Uz
z#rn3T*)i>*y881NIln3GC766ouHaGkD-Sf!9FeNKtwNl2+vWDN>}q(oMvLlio*6#@
z8~F+3u}Z^m)1P<OnNq@fjH%<@0Hk6W(U4tQN$gsl@oRm5;>q&hR>=l?Yw;dDU=e~M
z2n6cWDHo#HHa2mZm&jT?8DJ}{!hvsKNZJ<yfpE@0OHFls3c-)Hm+Mh`W1w1}Op_6?
z&e&d2z<*bxdJt%kH|#z4<Mk`1`y6i5^GW+p5iFWd$^juu0SrX}8-x90O5I)Y#HQhm
z^ut0`)t!@x^wn>}uf^gzw(prEPP=?b`tZeSddiY5&(W)M2NW|r@^TVvWmmQlM^}PT
zU+Zg%>G$D_$_K)>tiGB<f5)3(^4Wtk*J2delJcr9ci0vfO2Kq?RIk;Z?73xO)Q#|B
zN=Cxx8D>NB1*9*NnoR>lk-Wx9o>dj<>U+RFx3d`x-vY!eyS$zyLWgmNpa%u$4OEiM
zV?$|Oa;`deEha=W`F<qG^59B+F59YoW>&D#Nu+FC0zAqp>_O^Hpse-5YxSFNC2r>X
z%)lkFM~_P9!jl#)%5t#`Tx{qf1$&Y_I{w)ZO$BCjVANAFcgEZuCo-p6a9L4O_Zzy6
z2AT}@k*b3AMGt<KMMClDb+)TDG)R)z8_5u-vi8cDWk6rf6BGPK!#jc|8`hW%v!^fA
zauk3{J8RZkosQtlUZB8x1|nS6|5dhIAa&Mg?OSvhwvpg+w)d7=CcQQ3`+7D)T|BhM
zu-VQ09PGDTbn))d4c5Zd@XLda`*Hg$>A~|Um|0CL=L;YM;yDwiI0I7@8YF&<w_U2#
zG6pfG<P#9Rv+PjDb<cc!YTyIKrPygi+pZaY_3SP_KTiN+UkUwVOT)X>W;i5oE}s~e
zaLUs*##CJjp-mtRF6?dIlvuocg(Xnxz?jkcFi4s(n3S1xc&wO<PfyR7MY?3=y%KjD
zZjS(gO~>Jnld;@w7b35g{tZ5xldmw|ZZX?o#KU=VcMJt1SO=1ZU$8NY99jSY;Q|{<
zMu$T)iWEo>X;Xchr<@O;T1AVyn;tRyC99$QQwXa;knCphqe1UaqxlC12x%MZ7Kv>f
zZwrw1wc&OzyGp+8mS_i?t}cbgj~}rRZ;Evq1HVsoBMVXaxID(63*oPb+~!#i{rQ<W
zFpHYx)33tu9+v7-9=2PLa+{t7R{S+0_al-H^y41qX6`;C+ex)#wEe=%$xgNHFd5&d
z^{Ci&>{j`XArRwxml3vGL^ChS)IUpR3kscva5;-rcXDHV<4o3JCO{GYO|*;NrPpw#
z`29+(u|>9~LYI+Bo9y9yi({vK&Sie5T0i_`p*L+rv2UFtRysoHOD|<V0DzwiQ5v(%
zF7~Vpx7=b8xR(!mC?x|^=b+OjzMP|0fv~x|DF+qq{!n6&*$ZS3K(A~5?Ja(sSGU3#
z;Y*v7vxJAT1SRXI5VjT!;lQB$6_&7>jA>o@TN*e);@B@-6BqWBFKQY{MZUo2{PcN{
znD|s7nHZHZ4R$YmBhc*Dhh#ZrM+gJr3IV1%5xYkibCK@~<e^C<>gbhhby58LNvvvt
zZ9*QObwyAMH}X9%R%~9=WsB!zHd-G#^t=1Ir74~|adtrDbUF}&gO2lg!ndzN+O6Xo
z#ly|hpX;{yKdBh(w!ixkVBkw#qdndx2(DzAI}INz>@Z%8IyD2S#_A_%TWs=4f*SV&
zwP5Vf-2yTPy%U7qQ8+Z^=eYFkScEim6BVI;^M%<g3QojuEuZF5ryK-Y7d_hW92ORh
zh4)$vm<~c44AgDC0~&K;X?0LTlhf=-d{8tcUy|g|N%qc_60ga%R=)DwT`n>3>aF|n
zaGb|;tTy|TT{>OIyI|XYe$XZXqN;|_s0f=FIViQi^{X(_a+@6i#CjJ5^IEg4<&d?j
zl`{%Ko)w1t#Is<l(U!4_j2-5*nj(@|`PFWj0czh%Ik+uZ*S<uw1@ASz9kvtIRUWYu
zyb->b`QD_~LD}qZ7mRH4TBJnAv6D{IT>nXYbdz2TX8Dy86`BtI>q4rp>*H1g7XP1H
nkzBT~L|`>wU`k+c7(AF$H;B_${{LSF^Dn#k|8f7r{?qh-aE|rt

literal 0
HcmV?d00001


From 615c8bef14b9015681145b75b56af5eda732cd9c Mon Sep 17 00:00:00 2001
From: Atharva Deosthale <atharva.deosthale17@gmail.com>
Date: Wed, 20 May 2026 16:27:02 +0530
Subject: [PATCH 2/7] blog(gemini-3-5-flash): fix contradictory speed
 comparison

Gemini 3.5 Flash is the fastest frontier-class model at 278 tok/s. gpt-oss-120b (high) at 246 is the next closest, not faster, so the old "Faster:" label contradicted the numbers.
---
 src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc b/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
index 1b1c4ff4a8..b3180599b1 100644
--- a/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
+++ b/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
@@ -86,7 +86,7 @@ The largest gain is MCP Atlas: a 21.6 point increase over Gemini 3 Flash and 4.5
 Gemini 3.5 Flash on Artificial Analysis:
 
 - **Intelligence Index: 55** (rank #7 of 147). Top three: GPT-5.5 (xhigh) at 60.2, GPT-5.5 (high) at 58.9, Claude Opus 4.7 (max) at 57.3.
-- **Speed: 278 output tokens per second** (rank #2 of 147). Faster: gpt-oss-120b (high) at 246. Slower frontier-class models: Gemini 3.1 Pro Preview at 123, GPT-5.5 (xhigh) at 64, Claude Opus 4.7 (max) at 50.
+- **Speed: 278 output tokens per second** (rank #2 of 147 in its AA price class). The closest frontier peer is gpt-oss-120b (high) at 246. Other frontier-class models are well behind: Gemini 3.1 Pro Preview at 123, GPT-5.5 (xhigh) at 64, Claude Opus 4.7 (max) at 50.
 - **Verbosity: 73M tokens** generated across the Intelligence Index suite, against a leaderboard average of 36M. Verbosity counts how many output tokens the model produced to complete the eval suite. Higher means the model spent more reasoning tokens per answer, which raises latency and bill size even when the per-token price is low.
 - **Cost to evaluate the Intelligence Index: $1,551.60.** That is 5.5x Gemini 3 Flash and 75% more than Gemini 3.1 Pro despite the lower per-token rate. This is the total dollar cost to run the full Intelligence Index once, combining per-token pricing and token volume. It serves as a proxy for what the model costs on heavy reasoning workloads in production.
 - **Hallucination rate: 61%** on the AA hallucination measure, 31 points lower than Gemini 3 Flash. The hallucination measure is the share of responses on a fabrication-probing prompt set where the model produces incorrect or invented content. Lower is better, and a 31-point drop versus the predecessor indicates a material gain in factual reliability.

From 2528c533fc1196d2592a28928ccd8a2a3689b324 Mon Sep 17 00:00:00 2001
From: Atharva Deosthale <atharva.deosthale17@gmail.com>
Date: Wed, 20 May 2026 16:38:09 +0530
Subject: [PATCH 3/7] blog(gemini-3-5-flash): fix Skills delta direction and
 sign

GPT 5.5 parenthetical reversed (90.0 to 94.8 = +4.8). Claude Opus 4.7 delta is -0.6, not +0.6: Skills reduced its freeform score from 94.8 to 94.2.
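
A quick way to re-check the signs, as a sketch only: the scores are the freeform values from the two Arena tables in the post, and the names `freeform` and `skills_delta` are illustrative, not part of the Arena tooling.

```python
# Freeform scores from the Arena tables: (without Skills, with Skills).
freeform = {
    "Gemini 3.5 Flash": (77.5, 91.9),
    "GPT 5.5": (90.0, 94.8),
    "Claude Opus 4.7": (94.8, 94.2),
}


def skills_delta(without_skills, with_skills):
    """Signed change in freeform score once Skills are loaded."""
    return round(with_skills - without_skills, 1)


deltas = {model: skills_delta(*scores) for model, scores in freeform.items()}

for model, delta in deltas.items():
    # Explicit sign so a drop (Opus 4.7) is not mistaken for a gain.
    print(f"{model}: {delta:+.1f}")
```

Computing the delta as "with minus without" in one place keeps the direction of each parenthetical and the sign of each figure consistent.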
---
 src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc b/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
index b3180599b1..5be0524f5d 100644
--- a/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
+++ b/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
@@ -145,7 +145,7 @@ Three observations from the Arena data.
 
 **It is the fastest model in the top tier.** 20 minutes with Skills and 13 minutes without is faster than every model scoring above 90. The next-fastest top finisher is Gemini 3.1 Flash Lite at 19 minutes with Skills, but it scores 7.9 points lower.
 
-**Skills materially improve the freeform score.** Without Skills, freeform scores 77.5%. With Skills, freeform reaches 91.9%, a 14.4-point increase. The same delta for GPT 5.5 is 4.8 points (94.8 to 90.0), and for Claude Opus 4.7 is 0.6 points (94.8 to 94.2). 3.5 Flash relies more on in-context documentation than its frontier peers, consistent with the January 2025 knowledge cutoff.
+**Skills materially improve the freeform score.** Without Skills, freeform scores 77.5%. With Skills, freeform reaches 91.9%, a 14.4-point increase. The same delta for GPT 5.5 is +4.8 points (90.0 to 94.8), and for Claude Opus 4.7 is −0.6 points (94.8 to 94.2), where Skills slightly lowered the score because the model's built-in Appwrite knowledge is already near the ceiling. 3.5 Flash relies more on in-context documentation than its frontier peers, consistent with the January 2025 knowledge cutoff.
 
 **Category profile.** With Skills, 3.5 Flash scores 100% on Messaging, MCQ Foundation, MCQ Auth, MCQ Functions, and MCQ Sites, and 94.1% on Realtime. The weakest categories are TablesDB (89.1% with Skills, 77.8% without) and CLI (95.0% with Skills, 73.3% without). Both require the most current API surface, which the knowledge cutoff does not cover.
 

From 78cd3a72b9038bcec9a0fab45e2212d84643b2a6 Mon Sep 17 00:00:00 2001
From: Atharva Deosthale <atharva.deosthale17@gmail.com>
Date: Wed, 20 May 2026 16:45:26 +0530
Subject: [PATCH 4/7] blog(gemini-3-5-flash): fix arithmetic and consistency
 errors

- MCP Atlas margin over 3.1 Pro: 5.4 points, not 4.5 (83.6 - 78.2).
- GPT-5.5 (xhigh) speed: 65 tok/s, matching the SOTA table and AA summary.
- Realtime score qualified as MCQ Realtime (94.1%); overall is 94.0.
- Reframed Flash Lite reference: it scores 88.3, below the 90-point top tier.
---
 .../blog/post/gemini-3-5-flash-deep-dive/+page.markdoc    | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc b/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
index 5be0524f5d..e5dd34b66c 100644
--- a/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
+++ b/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
@@ -77,7 +77,7 @@ The model card lists head-to-head numbers against Gemini 3 Flash, Gemini 3.1 Pro
 
 Gemini 3.5 Flash leads Pro-class models on agentic tasks (MCP Atlas, Toolathlon, Finance Agent v2) and on multimodal reasoning (CharXiv, MMMU-Pro). It trails on academic reasoning (Humanity's Last Exam, ARC-AGI-2). For coding, results sit between 3.1 Pro and GPT-5.5 depending on the benchmark.
 
-The largest gain is MCP Atlas: a 21.6 point increase over Gemini 3 Flash and 4.5 points over 3.1 Pro. On MCP tool-call workloads, 3.5 Flash is Google's strongest model in the Gemini 3 series.
+The largest gain is MCP Atlas: a 21.6 point increase over Gemini 3 Flash and 5.4 points over 3.1 Pro. On MCP tool-call workloads, 3.5 Flash is Google's strongest model in the Gemini 3 series.
 
 # Artificial Analysis
 
@@ -86,7 +86,7 @@ The largest gain is MCP Atlas: a 21.6 point increase over Gemini 3 Flash and 4.5
 Gemini 3.5 Flash on Artificial Analysis:
 
 - **Intelligence Index: 55** (rank #7 of 147). Top three: GPT-5.5 (xhigh) at 60.2, GPT-5.5 (high) at 58.9, Claude Opus 4.7 (max) at 57.3.
-- **Speed: 278 output tokens per second** (rank #2 of 147 in its AA price class). The closest frontier peer is gpt-oss-120b (high) at 246. Other frontier-class models are well behind: Gemini 3.1 Pro Preview at 123, GPT-5.5 (xhigh) at 64, Claude Opus 4.7 (max) at 50.
+- **Speed: 278 output tokens per second** (rank #2 of 147 in its AA price class). The closest frontier peer is gpt-oss-120b (high) at 246. Other frontier-class models are well behind: Gemini 3.1 Pro Preview at 123, GPT-5.5 (xhigh) at 65, Claude Opus 4.7 (max) at 50.
 - **Verbosity: 73M tokens** generated across the Intelligence Index suite, against a leaderboard average of 36M. Verbosity counts how many output tokens the model produced to complete the eval suite. Higher means the model spent more reasoning tokens per answer, which raises latency and bill size even when the per-token price is low.
 - **Cost to evaluate the Intelligence Index: $1,551.60.** That is 5.5x Gemini 3 Flash and 75% more than Gemini 3.1 Pro despite the lower per-token rate. This is the total dollar cost to run the full Intelligence Index once, combining per-token pricing and token volume. It serves as a proxy for what the model costs on heavy reasoning workloads in production.
 - **Hallucination rate: 61%** on the AA hallucination measure, 31 points lower than Gemini 3 Flash. The hallucination measure is the share of responses on a fabrication-probing prompt set where the model produces incorrect or invented content. Lower is better, and a 31-point drop versus the predecessor indicates a material gain in factual reliability.
@@ -143,11 +143,11 @@ Top finishers on the May 20, 2026 run:
 
 Three observations from the Arena data.
 
-**It is the fastest model in the top tier.** 20 minutes with Skills and 13 minutes without is faster than every model scoring above 90. The next-fastest top finisher is Gemini 3.1 Flash Lite at 19 minutes with Skills, but it scores 7.9 points lower.
+**It is the fastest model in the top tier.** 20 minutes with Skills and 13 minutes without is faster than every other model scoring above 90. The only model in the with-Skills table with a shorter run is Gemini 3.1 Flash Lite at 19 minutes, but it scores 88.3, below the 90-point top tier.
 
 **Skills materially improve the freeform score.** Without Skills, freeform scores 77.5%. With Skills, freeform reaches 91.9%, a 14.4-point increase. The same delta for GPT 5.5 is +4.8 points (90.0 to 94.8), and for Claude Opus 4.7 is −0.6 points (94.8 to 94.2), where Skills slightly lowered the score because the model's built-in Appwrite knowledge is already near the ceiling. 3.5 Flash relies more on in-context documentation than its frontier peers, consistent with the January 2025 knowledge cutoff.
 
-**Category profile.** With Skills, 3.5 Flash scores 100% on Messaging, MCQ Foundation, MCQ Auth, MCQ Functions, and MCQ Sites, and 94.1% on Realtime. The weakest categories are TablesDB (89.1% with Skills, 77.8% without) and CLI (95.0% with Skills, 73.3% without). Both require the most current API surface, which the knowledge cutoff does not cover.
+**Category profile.** With Skills, 3.5 Flash scores 100% on Messaging, MCQ Foundation, MCQ Auth, MCQ Functions, and MCQ Sites, and 94.1% on MCQ Realtime. The weakest categories are TablesDB (89.1% with Skills, 77.8% without) and CLI (95.0% with Skills, 73.3% without). Both require the most current API surface, which the knowledge cutoff does not cover.
 
 # Workloads where 3.5 Flash is the right choice
 

From c237f18204975f135ddeb99f37f04e7c029f1eae Mon Sep 17 00:00:00 2001
From: Atharva Deosthale <atharva.deosthale17@gmail.com>
Date: Wed, 20 May 2026 16:51:05 +0530
Subject: [PATCH 5/7] blog(gemini-3-5-flash): unify precision for Intelligence
 Index and eval cost

Intelligence Index bullet uses 55.3 to match the SOTA table.
Eval cost bullet uses $1,552 to match the table and downstream prose.
---
 src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc b/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
index e5dd34b66c..286a6f3c76 100644
--- a/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
+++ b/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
@@ -85,10 +85,10 @@ The largest gain is MCP Atlas: a 21.6 point increase over Gemini 3 Flash and 5.4
 
 Gemini 3.5 Flash on Artificial Analysis:
 
-- **Intelligence Index: 55** (rank #7 of 147). Top three: GPT-5.5 (xhigh) at 60.2, GPT-5.5 (high) at 58.9, Claude Opus 4.7 (max) at 57.3.
+- **Intelligence Index: 55.3** (rank #7 of 147). Top three: GPT-5.5 (xhigh) at 60.2, GPT-5.5 (high) at 58.9, Claude Opus 4.7 (max) at 57.3.
 - **Speed: 278 output tokens per second** (rank #2 of 147 in its AA price class). The closest frontier peer is gpt-oss-120b (high) at 246. Other frontier-class models are well behind: Gemini 3.1 Pro Preview at 123, GPT-5.5 (xhigh) at 65, Claude Opus 4.7 (max) at 50.
 - **Verbosity: 73M tokens** generated across the Intelligence Index suite, against a leaderboard average of 36M. Verbosity counts how many output tokens the model produced to complete the eval suite. Higher means the model spent more reasoning tokens per answer, which raises latency and bill size even when the per-token price is low.
-- **Cost to evaluate the Intelligence Index: $1,551.60.** That is 5.5x Gemini 3 Flash and 75% more than Gemini 3.1 Pro despite the lower per-token rate. This is the total dollar cost to run the full Intelligence Index once, combining per-token pricing and token volume. It serves as a proxy for what the model costs on heavy reasoning workloads in production.
+- **Cost to evaluate the Intelligence Index: $1,552.** That is 5.5x Gemini 3 Flash and 75% more than Gemini 3.1 Pro despite the lower per-token rate. This is the total dollar cost to run the full Intelligence Index once, combining per-token pricing and token volume. It serves as a proxy for what the model costs on heavy reasoning workloads in production.
 - **Hallucination rate: 61%** on the AA hallucination measure, 31 points lower than Gemini 3 Flash. The hallucination measure is the share of responses on a fabrication-probing prompt set where the model produces incorrect or invented content. Lower is better, and a 31-point drop versus the predecessor indicates a material gain in factual reliability.
 
 On the intelligence-versus-speed axis, Artificial Analysis ranks Gemini 3.5 Flash as the Pareto leader. No model in the same intelligence bracket runs near 278 tokens per second.

From 0875798f5e6ae62dfc4144b9e9508cdf21ea29ba Mon Sep 17 00:00:00 2001
From: Atharva Deosthale <atharva.deosthale17@gmail.com>
Date: Wed, 20 May 2026 17:03:27 +0530
Subject: [PATCH 6/7] blog(gemini-3-5-flash): remove unlisted flag

The frontmatter flag `unlisted: true` was copied from a style-reference post without authorization. Removing it so the post appears on the blog index.
---
 src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc b/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
index 286a6f3c76..71e2d60dbe 100644
--- a/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
+++ b/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
@@ -7,7 +7,6 @@ cover: /images/blog/gemini-3-5-flash-deep-dive/cover.avif
 timeToRead: 11
 author: atharva
 category: ai
-unlisted: true
 faqs:
   - question: "What is Gemini 3.5 Flash?"
     answer: "Gemini 3.5 Flash is Google DeepMind's mid-2026 Flash-tier reasoning model, built on the Gemini 3 Flash foundation with explicit thinking levels that trade quality for cost and latency. It accepts text, images, audio, video, and PDF input, outputs up to 64K text tokens, and has a 1M token context window."

From b6f195f8cadb482ee60698bdcf3944c0648365ca Mon Sep 17 00:00:00 2001
From: Atharva Deosthale <atharva.deosthale17@gmail.com>
Date: Wed, 20 May 2026 17:08:40 +0530
Subject: [PATCH 7/7] blog(gemini-3-5-flash): correct vs 3.1 Pro pricing
 comparison

3.5 Flash is 25% cheaper per token than 3.1 Pro on both input and output ($1.50 vs $2.00 input, $9.00 vs $12.00 output; both ratios are 0.75), not approximately 40%.
---
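Reviewer note (not part of the commit message): a minimal sketch for re-checking the 25% figure, assuming the per-million-token prices quoted in the commit message above are correct. The names `FLASH_35`, `PRO_31`, and `percent_cheaper` are made up for this illustration and do not exist in the repository.

```python
from fractions import Fraction

# Per-million-token prices in USD, taken from the commit message above.
# Fractions avoid floating-point rounding noise in the ratios.
FLASH_35 = {"input": Fraction("1.50"), "output": Fraction("9.00")}
PRO_31 = {"input": Fraction("2.00"), "output": Fraction("12.00")}


def percent_cheaper(price, baseline):
    """Return how much cheaper `price` is than `baseline`, in percent."""
    return (1 - price / baseline) * 100


# Hypothetical helper result: one discount per price kind.
discounts = {
    kind: percent_cheaper(FLASH_35[kind], PRO_31[kind]) for kind in FLASH_35
}

for kind, pct in discounts.items():
    print(f"{kind}: {float(pct):.0f}% cheaper")
```

Both ratios come out to 0.75, so the per-token discount is the same on input and output. A blended figure would only differ from 25% if the two ratios differed.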
 src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc b/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
index 71e2d60dbe..1ffb49595b 100644
--- a/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
+++ b/src/routes/blog/post/gemini-3-5-flash-deep-dive/+page.markdoc
@@ -50,7 +50,7 @@ API pricing per million tokens:
 How it compares:
 
 - **vs Gemini 3 Flash** ($0.50 / $3.00): 3x more on both input and output.
-- **vs Gemini 3.1 Pro** (blended): approximately 40% cheaper.
+- **vs Gemini 3.1 Pro** ($2.00 / $12.00): 25% cheaper per token on both input and output.
 - **Within the Flash tier:** the most expensive Flash-tier model Google has released.
 
 # Google's published benchmark table