# AI Models Benchmark 2026: A Data-Driven Analysis of Performance, Price, and Tradeoffs

---

## Executive Summary

**What if you could identify the best AI model for your use case in minutes, not days?**

This notebook analyzes 187 AI models across 37 providers to uncover pricing patterns, performance correlations, and market dynamics that can inform your model selection strategy.

### Key Insights

- **Intelligence correlates moderately with price (ρ=0.590), but value champions exist** - You don't always need to pay premium prices for high intelligence. Budget-friendly models like DeepSeek V3.2 (IQ=41, $0.32) and GLM-4.7 (IQ=42, $0.94) offer competitive intelligence at a fraction of the cost.

- **Only 8 models (4.4%) are Pareto-efficient for price-performance** - These models represent the "efficient frontier" where no other model offers higher intelligence at a lower price. The rest are either dominated by more capable models or overpriced for their performance.

- **Market has split into Budget (24 providers) vs Premium (12 providers) segments** - K-means clustering reveals two distinct market segments: Budget providers ($0.35, IQ=17.9) and Premium providers ($1.53, IQ=29.0), a 4.4x price difference and a 62% intelligence gap.

- **All 10 pairwise correlations significant - no null findings in dataset** - Every relationship between variables (Intelligence-Price, Speed-Latency, etc.) is statistically significant after FDR correction, suggesting a highly interconnected market with consistent pricing patterns.

- **Regional differences: Europe fastest (142 token/s), US most expensive ($1.53)** - European models prioritize speed (142.3 tokens/s) while US commands premium pricing ($1.53). Chinese models offer middle-ground pricing ($0.93) with competitive intelligence (22.2).

### Methodology

Analysis of 187 AI models across 37 providers using non-parametric statistics (Spearman correlation, Mann-Whitney U tests, bootstrap CIs). All visualizations are pre-generated for fast loading. Analysis code is modularized in `src/` for reproducibility.

### Navigation

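Jump to: [Data Quality](#1-data-quality-assessment) | [Correlations](#2-correlation-analysis-whats-related-to-what) | [Pareto Frontiers](#3-pareto-frontier-analysis-the-efficient-frontier) | [Provider Analysis](#4-provider-clustering-market-segmentation) | [Speed-Intelligence Tradeoff](#5-speed-intelligence-tradeoff-choose-your-zone) | [2027 Predictions](#6-2027-trend-predictions-whats-next) | [Conclusions](#7-conclusions-and-recommendations)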

## Setup

This notebook imports pre-computed visualizations and analysis from our modular pipeline. No duplicate code - everything comes from `src/` modules and pre-generated reports.

**Notebook structure:** All analysis logic lives in [src/](src/) modules for reproducibility. Visualizations are pre-generated for fast loading - see [reports/figures/](reports/figures/).

In [None]:
# Core dependencies
import polars as pl
import numpy as np
import sys
sys.path.append('.')

# Import analysis modules (script-as-module pattern)
from src.load import load_data
from src.analyze import analyze_distribution
from src.statistics import compute_correlation_matrix, compute_spearman_correlation
from src.pareto import get_pareto_efficient_models
from src.clustering import aggregate_by_provider
from src.bootstrap import bootstrap_mean_ci

# Visualization helpers
from IPython.display import IFrame, HTML, Markdown

# Load data (use Polars directly for parquet files)
df = pl.read_parquet("data/processed/ai_models_deduped.parquet")

print(f"✓ Loaded {df.height} models")
print(f"✓ {df['Creator'].n_unique()} providers represented")
print("✓ Analysis modules ready")

## 1. Data Quality Assessment

Before diving into insights, let's establish trust in our data foundation. A robust analysis requires a clean, validated dataset.

### Quality Score Overview

**Overall Quality Score: 75%** (3 of 4 applicable dimensions passed)

Our data quality assessment covers 6 dimensions: Accuracy, Completeness, Consistency, Validity, Integrity, and Timeliness. Only 4 of them were applicable to this dataset, and it passed 3 of those 4.

### Key Findings

| Metric | Value | Status |
|--------|-------|--------|
| **Models after deduplication** | 187 | 34 duplicate names resolved |
| **Data completeness** | 96.81% | Only 6 null values in intelligence_index |
| **Providers represented** | 37 | Across US, China, Europe, and other regions |
| **Distribution characteristics** | Right-skewed | Validated non-parametric approach |
| **Models flagged as outliers** | 10 (5.32%) | Preserved for analysis, not removed |

### What This Means

Our analysis rests on a clean, validated dataset with few missing values. The 6 null intelligence scores belong to models that have not been assessed (they are not data entry errors), so intelligence-specific analyses filter to 181 models. The 34 duplicate model names were resolved by disambiguating on context window, so group-by operations treat each model as a distinct entry.

**Data quality note:** 10 models (5.32%) were flagged as outliers via Isolation Forest detection but preserved for analysis. These may represent legitimate high-performance models rather than data errors.

See [full quality assessment](reports/quality_2026-01-18.md) for complete analysis including distribution statistics and outlier details.

### Methodology Note

**Why we use Spearman correlation (non-parametric):** All numerical variables have positive skewness, and most depart clearly from normality, which undermines the normality assumptions behind parametric inference on Pearson correlation. Spearman rank correlation depends only on ranks, so it is robust to skewed distributions and outliers, making it the appropriate choice for this dataset.

**Data distribution characteristics:**
- Intelligence Index: Mildly right-skewed, close to normal (skewness=0.67)
- Price, Speed, Latency: Moderately to highly right-skewed (1.73 - 7.11)
- Context Window: Extremely right-skewed (skewness=9.63, kurtosis=114.20)

This non-normality justifies our non-parametric approach throughout the analysis.
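
To make the rank-based idea concrete, here is a minimal, dependency-free sketch of Spearman correlation as "Pearson on ranks". The helper names and the toy numbers are made up for illustration and are not taken from `src/statistics.py`; in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: price rises monotonically with score, with one extreme outlier
scores = [10, 15, 20, 25, 30, 35, 40, 50]
prices = [0.05, 0.06, 0.08, 0.10, 0.15, 0.30, 1.00, 20.00]

rho = spearman(scores, prices)  # monotonic relationship, so rho is 1 (up to float error)
r = pearson(scores, prices)     # pulled well below 1 by the skewed price scale
print(f"Spearman rho={rho:.3f}, Pearson r={r:.3f}")
```

On this toy data the relationship is perfectly monotonic, so Spearman reports it as such, while Pearson is dragged down by the single very expensive model. That is the behavior that motivates the rank-based choice for skewed columns like price and context window.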

## 2. Correlation Analysis: What's Related to What?

Smart models cost more - but how much more? And what else drives price?

### Key Finding: Intelligence and Price

**Intelligence and Price show moderate correlation (ρ=0.590)** - but this means there's room for value plays. Not all smart models are expensive, and not all expensive models are smart.

| Variables | Correlation | Interpretation |
|-----------|-------------|----------------|
| Intelligence ↔ Price | ρ=0.590 | Moderate positive |
| Intelligence ↔ Context Window | ρ=0.542 | Moderate positive |
| Intelligence ↔ Speed | ρ=0.261 | Weak positive |
| Intelligence ↔ Latency | ρ=0.444 | Moderate positive |

**So what?** Smarter models tend to cost more and have larger context windows, but speed isn't strongly tied to intelligence. This means you **CAN** find fast models that are also smart.

### Methodology

We used **Spearman correlation** (non-parametric) because our data is right-skewed. All 10 pairwise correlations were statistically significant after FDR correction (False Discovery Rate).

**Null findings:** None - every correlation we tested was significant, suggesting this market has consistent patterns.

See [full correlation analysis](reports/correlation_analysis_2026-01-18.md) for complete statistical details.
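
For readers unfamiliar with FDR correction, the sketch below shows the Benjamini-Hochberg procedure. This is an assumption: the notebook only says "FDR correction" and does not name the method, and the p-values here are hypothetical placeholders rather than the actual test results.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose p-value lies under the line (k / m) * alpha
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected

# Hypothetical p-values for 10 pairwise correlation tests (not the notebook's values)
p_values = [0.001, 0.002, 0.004, 0.008, 0.010, 0.015, 0.020, 0.030, 0.040, 0.045]
significant = benjamini_hochberg(p_values)
print(f"{sum(significant)} of {len(p_values)} tests significant after FDR correction")
```

The procedure compares each sorted p-value against a threshold that grows with its rank, which makes it less conservative than a Bonferroni correction while still controlling the expected share of false discoveries.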

In [None]:
# Compute Spearman correlations between intelligence and the other metrics
corr_matrix = compute_spearman_correlation(df, 'intelligence_index', ['price_usd', 'Speed(median token/s)', 'Latency (First Answer Chunk /s)', 'context_window'])

print("Correlation matrix loaded from analysis pipeline")
print(f"\nIntelligence-Price correlation: {corr_matrix['intelligence_index']['price_usd']:.3f}")
print(f"Intelligence-Speed correlation: {corr_matrix['intelligence_index']['Speed(median token/s)']:.3f}")
print(f"Intelligence-Context Window correlation: {corr_matrix['intelligence_index']['context_window']:.3f}")

### Correlation Heatmap

See how variables cluster together:

**What the heatmap shows:** Red = positive correlation, Blue = negative (we have none)

**Hierarchical clustering groups:** {Intelligence, Price, Context Window} form one cluster, {Speed, Latency} form another.

**Practical implication:** If you need high intelligence, expect higher prices and larger context windows. But speed is separable - you can find fast OR slow smart models.

In [None]:
# Display pre-generated correlation heatmap
IFrame(src='reports/figures/interactive_correlation_heatmap.html', width=800, height=600)

---

## 3. Pareto Frontier Analysis: The Efficient Frontier

Which models offer the best tradeoffs? Pareto analysis identifies "undominated" choices - models that no other model matches or beats on every dimension while being strictly better on at least one.

### What is Pareto Efficiency?

A model is **Pareto-efficient** if improving one objective (e.g., intelligence) would require sacrificing another (e.g., price). These models form the "efficient frontier" of optimal choices.
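
The definition can be sketched in a few lines. This is an illustrative brute-force version, not the implementation behind `get_pareto_efficient_models`; it uses the three price-performance champions reported in this notebook plus two hypothetical dominated models, and it assumes higher intelligence and lower price are both better.

```python
def dominates(a, b):
    """a dominates b if it is at least as smart and at least as cheap,
    and strictly better on at least one of the two."""
    _, iq_a, price_a = a
    _, iq_b, price_b = b
    return iq_a >= iq_b and price_a <= price_b and (iq_a > iq_b or price_a < price_b)

def pareto_frontier(models):
    """Keep every model that no other model dominates (O(n^2) comparison)."""
    return [m for m in models if not any(dominates(other, m) for other in models)]

# (name, intelligence index, price in USD)
models = [
    ("GPT-5.2", 51, 4.81),
    ("Gemini 3 Flash", 46, 1.13),
    ("GLM-4.7", 42, 0.94),
    ("Hypothetical A", 40, 2.00),  # made up: dominated by Gemini 3 Flash
    ("Hypothetical B", 30, 0.95),  # made up: dominated by GLM-4.7
]

frontier = pareto_frontier(models)
frontier_names = [name for name, _, _ in frontier]
print(frontier_names)
```

With a few hundred models the quadratic comparison is fast enough; a sort-based sweep would be the usual choice for much larger sets.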

### Key Findings

| Frontier | Efficient Models | What It Means |
|----------|------------------|---------------|
| Intelligence vs Price | 8 (4.4%) | Best value per dollar |
| Speed vs Intelligence | 6 (3.3%) | Performance leaders |
| Multi-objective | 41 (22.7%) | Balanced excellence |

### Price-Performance Champions

- **GPT-5.2:** IQ=51, $4.81 (premium leader)
- **Gemini 3 Flash:** IQ=46, $1.13 (value leader)
- **GLM-4.7:** IQ=42, $0.94 (budget champion)

**So what?** Only 8 models (4.4%) are truly optimal for price-performance. If your chosen model isn't on this frontier, there's a better option at your price point.

**Provider dominance:** OpenAI and Google each have 2 models on the price-performance frontier (25% each).

In [None]:
# Load Pareto-efficient models for each frontier
price_perf = get_pareto_efficient_models(df, 'intelligence_index', 'price_usd')
speed_intel = get_pareto_efficient_models(df, 'Speed(median token/s)', 'intelligence_index')

print(f"Price-performance frontier: {len(price_perf)} models")
print(f"Speed-intelligence frontier: {len(speed_intel)} models")
print(f"\nTop 3 price-performance champions:")
for model in price_perf.head(3).iter_rows(named=True):
    print(f"  - {model['Model']}: IQ={model['intelligence_index']}, ${model['price_usd']:.2f}")

### Price-Performance Frontier (Intelligence vs Price)

In [None]:
IFrame(src='reports/figures/interactive_pareto_intelligence_price.html', width=800, height=500)

### Speed-Intelligence Frontier

In [None]:
IFrame(src='reports/figures/interactive_pareto_speed_intelligence.html', width=800, height=500)

### Interpreting the Frontiers

- **Red markers = Pareto-efficient models** (best tradeoffs)
- **What this means:** Choose models on the red frontier line. Everything below/right is dominated - there's a better option.

### Model Selection Guide

| Use Case | Which Frontier | Strategy |
|----------|----------------|----------|
| Budget apps | Intelligence vs Price | Pick from price-performance frontier |
| Real-time apps | Speed vs Intelligence | Pick from speed-intelligence frontier |
| Balanced needs | Multi-objective | Pick from multi-objective frontier |

See [full Pareto analysis](reports/pareto_analysis_2026-01-18.md) for complete frontier details and model listings.

---

## 4. Provider Clustering: Market Segmentation

The AI provider market has split into two distinct segments. Which one fits your needs?

### Key Finding: Two Market Segments

**KMeans clustering (K=2, silhouette=0.390)** reveals Budget-Friendly vs Premium Performance segments:

| Segment | Providers | Mean IQ | Mean Price | Mean Speed |
|---------|-----------|---------|------------|------------|
| **Budget-Friendly** | 24 (67%) | 17.9 | $0.35 | 34 token/s |
| **Premium Performance** | 12 (33%) | 29.0 | $1.53 | 117 token/s |

**Budget providers:** Alibaba, DeepSeek, Meta, Microsoft, Baidu, IBM, NVIDIA, Perplexity (+16 others)

**Premium providers:** OpenAI, Anthropic, Google, Amazon, Mistral, Cohere, xAI (+5 others)

**So what?** Premium providers offer 62% higher intelligence (29.0 vs 17.9) but cost 4.4x more. Budget segment is better for cost-sensitive applications.
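
As a rough illustration of how a K=2 segmentation and its silhouette score come about, here is a dependency-free sketch of Lloyd's k-means on standardized features. The provider values are hypothetical, the function names are made up, and the real pipeline in `src/clustering.py` may differ in features, initialization, and library (for example scikit-learn).

```python
import math

def standardize(rows):
    """Z-score each column so price and IQ contribute on a comparable scale."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(r, means, stds)) for r in rows]

def kmeans(points, k, iters=50):
    """Lloyd's algorithm; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

def silhouette(points, labels):
    """Mean of (b - a) / max(a, b), where a is the mean distance to the own
    cluster and b is the mean distance to the nearest other cluster."""
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(same) / len(same)
        b = min(
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == c) / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical provider-level aggregates: (mean price in USD, mean intelligence)
providers_toy = [(0.30, 16), (0.35, 18), (0.40, 19), (0.32, 17),
                 (1.40, 28), (1.60, 30), (1.55, 29)]
points = standardize(providers_toy)
labels, _ = kmeans(points, k=2)
score = silhouette(points, labels)
print(labels, round(score, 3))
```

A silhouette near 1 indicates tight, well-separated clusters, while values near 0 indicate overlapping ones. The toy data is deliberately clean, so it scores far higher than the 0.390 reported above for the real providers.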

In [None]:
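# Load provider-level aggregates and regional comparisons
# NOTE: compare_regions is not imported in Setup; src.clustering is an assumed location
from src.clustering import compare_regions

providers = aggregate_by_provider(df)
regional = compare_regions(df)

print(f"Providers analyzed: {len(providers)}")
print("Regions: US, China, Europe, Other")
print("\nRegional Intelligence (mean):")
# Polars uses group_by with expression-based aggregation (no dict syntax)
regional_means = regional.group_by('region').agg(pl.col('intelligence_index').mean())
for row in regional_means.iter_rows(named=True):
    print(f"  {row['region']}: {row['intelligence_index']:.1f}")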

### Regional Analysis

How do regions compare across key metrics?

| Region | Intelligence | Price | Speed |
|--------|--------------|-------|-------|
| **China** | 22.2 | $0.93 | 66 token/s |
| **US** | 22.6 | $1.53 | 118 token/s |
| **Europe** | 18.8 | $0.55 | 142 token/s |

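**Regional insight:** European models are the fastest (142 token/s) and the cheapest ($0.55), but score lowest on intelligence (18.8). US models are the most expensive ($1.53). Chinese models are mid-priced ($0.93) and nearly match US intelligence (22.2 vs 22.6), but are the slowest (66 token/s).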

### Provider Comparison Dashboard

In [None]:
IFrame(src='reports/figures/interactive_provider_dashboard.html', width=900, height=600)

### Interpreting the Dashboard

**Scatter panels show:** Intelligence-Price, Intelligence-Speed, Price-Speed relationships with cluster color-coding.

**Gold stars = Cluster centroids** (typical provider for each segment)

**What this means:**
- **Cluster 0 (blue):** Budget-friendly options - good for high-volume, cost-sensitive apps
- **Cluster 1 (red):** Premium performance - good for intelligence-critical applications

**Strategic implication:** Choose your segment first (budget vs premium), then pick the best model within that segment using Pareto analysis.

See [full provider clustering analysis](reports/provider_clustering_2026-01-18.md) for complete regional and segment details.

---

## 5. Speed-Intelligence Tradeoff: Choose Your Zone

**Real-time chatbot or batch analysis? The speed-intelligence tradeoff defines your use case.**

We identified **4 use case zones** based on speed and intelligence thresholds:

| Zone | Speed | Intelligence | Best For |
|------|-------|--------------|----------|
| **Real-time** | >100 token/s | Any | Live chat, interactive apps |
| **High-IQ** | Any | >40 | Complex reasoning, code gen |
| **Balanced** | 50-100 | 20-40 | General-purpose apps |
| **Budget** | <50 | <20 | Cost-sensitive, simple tasks |

**Pareto-efficient speed-intelligence models:**
- **Gemini 2.5 Flash-Lite:** 550 token/s (throughput leader)
- **gpt-oss-120B:** 366 token/s, IQ=33
- **o3:** 264 token/s, IQ=41
- **GPT-5.2:** 100 token/s, IQ=51 (intelligence leader)

**So what?** The speed-intelligence frontier has 6 Pareto-efficient models. For real-time responses (>100 token/s), start there, noting that GPT-5.2 at 100 token/s sits right at the threshold. If you need high intelligence (>40), expect to sacrifice speed or pay more.

**Tradeoff insight:** Speed and intelligence have weak correlation (ρ=0.261). This means you **CAN** find fast models with good intelligence - they're rare but exist.

In [None]:
# Define use case zones
real_time = df.filter(pl.col('Speed(median token/s)') > 100)
high_iq = df.filter(pl.col('intelligence_index') > 40)
balanced = df.filter(
    (pl.col('Speed(median token/s)') >= 50) & (pl.col('Speed(median token/s)') <= 100) &
    (pl.col('intelligence_index') >= 20) & (pl.col('intelligence_index') <= 40)
)
budget = df.filter(
    (pl.col('Speed(median token/s)') < 50) & (pl.col('intelligence_index') < 20)
)

print(f"Real-time zone: {len(real_time)} models")
print(f"High-IQ zone: {len(high_iq)} models")
print(f"Balanced zone: {len(balanced)} models")
print(f"Budget zone: {len(budget)} models")

### Speed-Intelligence Tradeoff with Use Case Zones

**What the visualization shows:**
- **Colored zones** show use case categories (semi-transparent overlays)
- **Star markers** = Pareto-efficient models on the speed-intelligence frontier
- **Each point** = one AI model positioned by speed and intelligence

In [None]:
IFrame(src='reports/figures/interactive_tradeoff_analysis.html', width=900, height=600)

### Interpreting the Tradeoff Analysis

**Zone meanings:**
- **Bottom-left (Budget zone):** Slow (<50 token/s) and low intelligence (<20), but typically cheap
- **Top (High-IQ zone):** Smart (>40) at any speed, typically more expensive
- **Right (Real-time zone):** Fast (>100 token/s) at any intelligence level
- **Center (Balanced zone):** Mid-range speed (50-100 token/s) and intelligence (20-40), a reasonable default for most applications

**Model selection by zone:**
- **Chatbots:** Choose from Real-time zone (Gemini 3 Flash, o3)
- **Code generation:** Choose from High-IQ zone (GPT-5.2, Claude Opus)
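- **General apps:** Choose from Balanced zone (GPT-5 mini); GLM-4.7 (IQ=42) sits just above this zone's intelligence ceiling of 40
- **Cost-sensitive:** Choose from Budget zone (MiMo-V2-Flash); DeepSeek V3.2 (IQ=41, $0.32) is also low-priced but scores above this zone's intelligence range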

---

## 6. 2027 Trend Predictions: What's Next?

**If current trends continue, what will 2027 look like? Here are our scenario-based projections.**

**Important:** These are simplified extrapolations from 2026 cross-sectional data, **NOT** sophisticated forecasts. We lack time series data, so we project using scenario analysis (optimistic/baseline/pessimistic).

### Key Predictions by Scenario

| Metric | 2026 | 2027 Optimistic | 2027 Baseline | 2027 Pessimistic |
|--------|------|-----------------|---------------|------------------|
| **Mean Intelligence** | 21.81 | 23.99 (+10%) | 22.90 (+5%) | 22.24 (+2%) |
| **Mean Price** | $1.00 | -20% | -10% | -5% |
| **Mean Speed** | 90.7 token/s | +20% | +10% | +5% |

**Directional trends (direction is more certain than magnitude):**
- **Intelligence:** modest increase (2-10%)
- **Price:** decreasing (-5% to -20%), depending on scenario
- **Speed:** improving (+5% to +20%)
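
The scenario figures can be reproduced with simple arithmetic. The sketch below assumes each scenario is a multiplicative adjustment to the rounded 2026 means, which is what the percentages in the table suggest but is not confirmed by the source; because it starts from rounded means, the last digit can differ slightly from the report.

```python
# 2026 means as reported in the table above (rounded)
baseline_2026 = {"intelligence": 21.81, "price_usd": 1.00, "speed_tok_s": 90.7}

# Fractional change per scenario, taken from the scenario table
scenarios = {
    "optimistic": {"intelligence": 0.10, "price_usd": -0.20, "speed_tok_s": 0.20},
    "baseline": {"intelligence": 0.05, "price_usd": -0.10, "speed_tok_s": 0.10},
    "pessimistic": {"intelligence": 0.02, "price_usd": -0.05, "speed_tok_s": 0.05},
}

def project(values, changes):
    """Apply a fractional change to each metric: new = old * (1 + change)."""
    return {metric: values[metric] * (1 + changes[metric]) for metric in values}

projections = {name: project(baseline_2026, changes) for name, changes in scenarios.items()}

for name, result in projections.items():
    print(
        f"{name:>11}: IQ={result['intelligence']:.2f}, "
        f"price=${result['price_usd']:.2f}, speed={result['speed_tok_s']:.1f} token/s"
    )
```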

**Uncertainty discussion:**
- **Wide prediction intervals** reflect high uncertainty (e.g., 95% PI for intelligence: [1.2, 45.0])
- **Cross-sectional limitation:** We have a single 2026 snapshot, not historical data from 2015-2026
- **Black swan risk:** GPT-4-level breakthroughs are unpredictable and could accelerate trends
- **Not for betting:** Use these for exploratory analysis, NOT investment decisions

**So what?** Expect gradual improvement in intelligence and speed, with falling prices. But disruption is likely - these trends assume linear progress, which rarely holds in AI.
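
The 2026 means used as the starting point carry sampling uncertainty of their own. A percentile bootstrap is one common way to quantify it. The sketch below is a generic version with a made-up function name and hypothetical prices; it is not the code in `src/bootstrap.py`, whose `bootstrap_mean_ci` may use a different interval method.

```python
import random
from statistics import mean

def percentile_bootstrap_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Resample with replacement, recompute the mean each time, and read the
    interval off the percentiles of the resampled means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    resampled_means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int((1 - confidence) / 2 * n_resamples)
    hi_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return mean(values), (resampled_means[lo_idx], resampled_means[hi_idx])

# Hypothetical right-skewed prices in USD (not the notebook's data)
prices = [0.05, 0.10, 0.20, 0.30, 0.35, 0.50, 0.90, 1.10, 1.50, 4.80]
price_mean, price_ci = percentile_bootstrap_ci(prices)
print(f"Mean price: ${price_mean:.2f} [{price_ci[0]:.2f}, {price_ci[1]:.2f}]")
```

Because the method makes no normality assumption, it suits skewed columns such as price, although with very small samples the percentile interval can be too narrow.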

In [None]:
# Trend predictions are in the report, not recomputed here
# See reports/trend_predictions_2026-01-18.md for full analysis

from src.bootstrap import bootstrap_mean_ci

# Show current 2026 statistics with 95% CIs
iq_mean, iq_ci = bootstrap_mean_ci(df, 'intelligence_index', n_resamples=9999)
price_mean, price_ci = bootstrap_mean_ci(df, 'price_usd', n_resamples=9999)
speed_mean, speed_ci = bootstrap_mean_ci(df, 'Speed(median token/s)', n_resamples=9999)

print(f"2026 Intelligence: {iq_mean:.2f} [{iq_ci[0]:.2f}, {iq_ci[1]:.2f}]")
print(f"2026 Price: ${price_mean:.2f} [${price_ci[0]:.2f}, ${price_ci[1]:.2f}]")
print(f"2026 Speed: {speed_mean:.1f} token/s [{speed_ci[0]:.1f}, {speed_ci[1]:.1f}]")

### Full 2027 Predictions Report

See [trend_predictions_2026-01-18.md](reports/trend_predictions_2026-01-18.md) for:
- Detailed scenario assumptions
- Prediction intervals by intelligence tier
- Sources of uncertainty
- Recommendations for using these predictions

### Why These Predictions Are Unreliable

**Methodological caveats:**
- **No time series data** (single 2026 snapshot)
- **Linear extrapolation** (technology follows S-curves, not lines)
- **No disruption modeling** (breakthroughs change everything)
- **Competitive dynamics** (new entrants like DeepSeek can disrupt quickly)

---

## 7. Conclusions and Recommendations

### 7.1 Key Takeaways

This analysis of 187 AI models across 37 providers reveals several critical insights:

- **The AI market has split into two segments:** Budget-friendly (67% of providers) vs Premium Performance (33%)
- **Only 4.4% of models** are Pareto-efficient for price-performance - most models are dominated
- **All 10 correlations** we tested were statistically significant - this market has consistent patterns
- **Intelligence costs more** (ρ=0.590) but value champions exist (Gemini 3 Flash: IQ=46, $1.13)
- **Speed ≠ Intelligence** (ρ=0.261) - you can find fast models that are also smart
- **Regional differences:** Europe fastest and cheapest, US most expensive, China mid-priced with near-US intelligence

### 7.2 Practical Recommendations

**For budget-constrained apps:** Choose Budget segment providers (DeepSeek, Meta, Microsoft) and Pareto-efficient models

**For intelligence-critical apps:** Choose Premium segment (OpenAI, Anthropic, Google) and accept higher costs

**For real-time applications:** Choose from Real-time zone (Gemini 3 Flash, o3) with >100 token/s

**For general-purpose apps:** Choose from Balanced zone (GPT-5 mini) or just above it (GLM-4.7, IQ=42) for best tradeoffs

**For 2027 planning:** Expect modest improvements (+2-10% intelligence, +5-20% speed, -5 to -20% prices) but plan for disruption

### 7.3 Novel Insights

**Market bifurcation:** The provider market has split into Budget vs Premium segments (not a continuous spectrum)

**Pareto sparsity:** Only 8/181 models (4.4%) are price-performance efficient - most models are dominated choices

**Speed-intelligence decoupling:** Weak correlation (ρ=0.261) means speed is separable from intelligence - enabling use case specialization

**Regional asymmetry:** Europe is fastest and cheapest, the US most expensive - suggests different regional priorities

### 7.4 Limitations

- **Cross-sectional data:** Single 2026 snapshot, not time series
- **Selection bias:** Dataset may not represent all models
- **Correlation ≠ causation:** All reported relationships are associations, not causal effects
- **External validity:** Scores may vary by task and implementation
- **Missing factors:** We don't measure ecosystem, documentation, ease of use

### 7.5 Future Work

- **Collect temporal data:** Track model releases over time for proper trend analysis
- **Capability-specific benchmarks:** Separate scores for reasoning, coding, math, creative writing
- **Cost optimization analysis:** Total cost of ownership (API + compute + latency)
- **User satisfaction surveys:** Real-world usage vs benchmark scores
- **Regulatory impact analysis:** How EU AI Act, US executive orders affect development

### 7.6 Final Thoughts

The AI model market is maturing. Clear segments have emerged, Pareto frontiers define optimal choices, and tradeoffs are well-understood. Use this analysis to narrow your search, then test models for your specific use case. The best model is the one that works for YOUR application, not the one with the highest benchmark score.

**Question for you:** What use case are you optimizing for? Let us know in the comments!

---

**Thanks for reading!** If you found this analysis helpful, please upvote.

Questions or feedback? Let me know in the comments!