# AI Models Benchmark 2026: A Data-Driven Analysis of Performance, Price, and Tradeoffs

---

## Executive Summary

**What if you could identify the best AI model for your use case in minutes, not days?**

This notebook analyzes 187 AI models across 37 providers to uncover pricing patterns, performance correlations, and market dynamics that can inform your model selection strategy.

### Key Insights

- **Intelligence correlates moderately with price (ρ=0.590), but value champions exist** - You don't always need to pay premium prices for high intelligence. Budget-friendly models like DeepSeek V3.2 (IQ=41, $0.32) and GLM-4.7 (IQ=42, $0.94) offer competitive intelligence at a fraction of the cost.

- **Only 8 models (4.4%) are Pareto-efficient for price-performance** - These models form the "efficient frontier": no other model matches or beats their intelligence at a lower price. Every other model is dominated, meaning some alternative is at least as capable for less money.

- **Market has split into Budget (24 providers) vs Premium (12 providers) segments** - K-means clustering reveals two distinct market segments: Budget providers ($0.35, IQ=17.9) and Premium providers ($1.53, IQ=29.0), a roughly 4x price difference and a 62% intelligence gap.

- **All 10 pairwise correlations significant - no null findings in dataset** - Every relationship between variables (Intelligence-Price, Speed-Latency, etc.) is statistically significant after FDR correction, suggesting a highly interconnected market with consistent pricing patterns.

- **Regional differences: Europe fastest (142 tokens/s), US most expensive ($1.53)** - European models prioritize speed (142.3 tokens/s) while the US commands premium pricing ($1.53). Chinese models offer middle-ground pricing ($0.93) with competitive intelligence (22.2).
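The Pareto-efficiency criterion from the second insight can be sketched in a few lines. This is a minimal illustration, not the implementation in `src/pareto.py`: the `Model` type and `pareto_frontier` function are hypothetical names, and the two "Hypothetical" entries are invented data points added only to show one dominated and one non-dominated case. The DeepSeek V3.2 and GLM-4.7 figures are the ones quoted above.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    price: float         # USD as reported in this notebook; lower is better
    intelligence: float  # intelligence index; higher is better


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Return the models that no other model beats on both price and intelligence."""
    frontier = []
    best_iq = float("-inf")
    # Sweep from cheapest to most expensive. On equal price, visit the
    # smarter model first so the weaker one is correctly discarded.
    for m in sorted(models, key=lambda m: (m.price, -m.intelligence)):
        # A model survives only if it is smarter than everything cheaper.
        # Exact duplicates (same price and score) keep the first one seen.
        if m.intelligence > best_iq:
            frontier.append(m)
            best_iq = m.intelligence
    return frontier


models = [
    Model("DeepSeek V3.2", 0.32, 41),   # values quoted in Key Insights
    Model("GLM-4.7", 0.94, 42),         # values quoted in Key Insights
    Model("Hypothetical A", 1.50, 40),  # invented: dominated by GLM-4.7
    Model("Hypothetical B", 3.00, 50),  # invented: pricier but smartest
]
frontier = pareto_frontier(models)
print([m.name for m in frontier])
```

Sorting first keeps the sweep at O(n log n); a naive all-pairs dominance check would give the same frontier in O(n²), which is also fine at this dataset's size.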

### Methodology

Analysis of 187 AI models across 37 providers using non-parametric statistics (Spearman correlation, Mann-Whitney U tests, bootstrap CIs). All visualizations are pre-generated for fast loading. Analysis code is modularized in `src/` for reproducibility.
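As a rough illustration of the bootstrap confidence intervals mentioned above, the sketch below uses the percentile method on the mean. It is an assumption about the general technique, not a copy of `src/bootstrap.py`: the function name `percentile_bootstrap_ci`, its parameters, and the sample prices are all hypothetical.

```python
import random
import statistics


def percentile_bootstrap_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    # Cut the same number of resamples from each tail.
    tail = int(n_resamples * (1.0 - confidence) / 2)
    return means[tail], means[n_resamples - 1 - tail]


# Hypothetical right-skewed prices, for illustration only
sample_prices = [0.10, 0.20, 0.30, 0.35, 0.50, 0.90, 1.50, 4.00]
ci_low, ci_high = percentile_bootstrap_ci(sample_prices)
print(f"mean={statistics.fmean(sample_prices):.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

The percentile method makes no normality assumption, which is why it pairs naturally with the rank-based tests used elsewhere in this analysis.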

### Navigation

Jump to: [Data Quality](#1-data-quality-assessment) | [Correlations](#2-correlation-analysis) | [Pareto Frontiers](#3-pareto-frontier-analysis) | [Provider Analysis](#4-provider-clustering) | [Conclusions](#5-conclusions)

## Setup

This notebook imports pre-computed visualizations and analysis from our modular pipeline instead of duplicating code. All analysis logic lives in [src/](src/) modules for reproducibility, and visualizations are pre-generated for fast loading - see [reports/figures/](reports/figures/).

```python
# Core dependencies
import polars as pl
import numpy as np
import sys
sys.path.append('.')

# Import analysis modules (script-as-module pattern)
from src.load import load_data
from src.analyze import analyze_distribution
from src.statistics import compute_correlation_matrix, compute_spearman_correlation
from src.pareto import get_pareto_efficient_models
from src.clustering import aggregate_by_provider
from src.bootstrap import bootstrap_mean_ci

# Visualization helpers
from IPython.display import IFrame, HTML, Markdown

# Load data (use Polars directly for parquet files)
df = pl.read_parquet("data/processed/ai_models_deduped.parquet")

print(f"✓ Loaded {df.height} models")
print(f"✓ {df['Creator'].n_unique()} providers represented")
print("✓ Analysis modules ready")
```

## 1. Data Quality Assessment

Before diving into insights, let's establish trust in our data foundation. A robust analysis requires a clean, validated dataset.

### Quality Score Overview

**Overall Quality Score: 75%** (3 of 4 applicable dimensions passed)

Our data quality assessment covers six dimensions: Accuracy, Completeness, Consistency, Validity, Integrity, and Timeliness. Four of these were applicable to this dataset, and it passed three of them, which gives the 75% score.

### Key Findings

| Metric | Value | Notes |
|--------|-------|--------|
| **Models after deduplication** | 187 | 34 duplicate names resolved |
| **Data completeness** | 96.81% | Only 6 null values in intelligence_index |
| **Providers represented** | 37 | Across US, China, Europe, and other regions |
| **Distribution characteristics** | Right-skewed | Validated non-parametric approach |
| **Models flagged as outliers** | 10 (5.32%) | Preserved for analysis, not removed |

### What This Means

Our analysis is based on a clean, validated dataset with minimal missing values. The 6 models lacking intelligence scores were never assessed (these are not data entry errors), so intelligence-specific analyses use the remaining 181 models. The 34 duplicate model names were resolved using context window disambiguation, ensuring accurate group-by operations.
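The notebook does not show how the context window disambiguation was implemented. One plausible approach, sketched here purely as an assumption, is to append the context window to any model name that occurs more than once. The function `disambiguate_names`, the record keys `name` and `context_window`, and the sample records are all hypothetical.

```python
from collections import Counter


def disambiguate_names(records):
    """Append the context window to any model name that appears more than once."""
    counts = Counter(r["name"] for r in records)
    result = []
    for r in records:
        name = r["name"]
        if counts[name] > 1:
            # Only repeated names are rewritten; unique names stay untouched.
            name = f"{name} ({r['context_window']:,} ctx)"
        result.append({**r, "name": name})
    return result


# Hypothetical records, for illustration only
records = [
    {"name": "Model X", "context_window": 32_000},
    {"name": "Model X", "context_window": 128_000},
    {"name": "Model Y", "context_window": 8_000},
]
deduped = disambiguate_names(records)
print([r["name"] for r in deduped])
```

This only yields unique names if repeated entries actually differ in context window; rows that match on both fields would need a further tie-breaker.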

**Data quality note:** 10 models (5.32%) were flagged as outliers via Isolation Forest detection but preserved for analysis. These may represent legitimate high-performance models rather than data errors.

See [full quality assessment](reports/quality_2026-01-18.md) for complete analysis including distribution statistics and outlier details.

### Methodology Note

**Why we use Spearman correlation (non-parametric):** All numerical variables are right-skewed (skewness > 0), most of them strongly, which violates the normality assumption behind parametric methods such as significance tests on Pearson correlation. Spearman rank correlation depends only on rank order, so it is robust to non-normal distributions and outliers, making it the appropriate choice for this dataset.

**Data distribution characteristics:**
- Intelligence Index: Mildly right-skewed, close to normal (skewness=0.67)
- Price, Speed, Latency: Moderately to highly right-skewed (skewness 1.73 to 7.11)
- Context Window: Extremely right-skewed (skewness=9.63, kurtosis=114.20)

This non-normality justifies our non-parametric approach throughout the analysis.
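To make the robustness argument concrete, the sketch below computes Spearman's ρ as Pearson's r on average ranks and compares the two on a deliberately skewed toy series. It is a from-scratch illustration under stated assumptions, not the code in `src/statistics.py`; the helper names `average_ranks`, `pearson`, and `spearman` and the toy data are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: y rises monotonically with x but is heavily right-skewed.
x = [1, 2, 3, 4, 5, 6]
y = [math.exp(v) for v in x]
print(f"Pearson r = {pearson(x, y):.3f}, Spearman rho = {spearman(x, y):.3f}")
```

Because ranks ignore magnitudes, Spearman reports a perfect monotone relationship here, while Pearson is pulled down by the long right tail.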