which-llm

A Claude Code skill that resolves "which model should I use?" to a real, current answer. Joins the Artificial Analysis leaderboard (520+ models, intelligence/cost/benchmarks) with the OpenRouter catalog (slug availability, :free tier reality) into a single queryable dataset your agent can reason over. Refreshed daily.

Install

/plugin marketplace add ariobarin/which-llm
/plugin install which-llm@which-llm

Auto-updates when this repo ships a new version. Requires Python 3.10+ and uv.

Direct install without the plugin system

git clone https://github.com/ariobarin/which-llm /tmp/which-llm
cp -r /tmp/which-llm/plugins/which-llm/skills/which-llm ~/.claude/skills/which-llm

Example output

$ uv run python query.py models --intel-min 50 --max-cost 500 --modality text,image --top 5

slug                  name                                     creator   intel  idx-run$  ctx      free  openrouter
--------------------  ---------------------------------------  --------  -----  --------  -------  ----  --------------------------
deepseek-v4-pro       DeepSeek V4 Pro (Reasoning, Max Effort)  DeepSeek  51.5   $267.82   1000000        deepseek/deepseek-v4-pro
grok-4-3              Grok 4.3 (high)                          xAI       53.2   $395.17   1000000        x-ai/grok-4.3
mimo-v2-5-pro         MiMo-V2.5-Pro                            Xiaomi    53.8   $461.59   1000000        xiaomi/mimo-v2.5-pro

idx-run$ = USD to run AA's full benchmark suite once on the model — a relative inference-cost proxy, not a per-call price. For actual API pricing, use price_1m_input_tokens / price_1m_output_tokens.
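To turn the per-token prices into a rough per-call figure, the arithmetic is a weighted sum. The sketch below assumes `price_1m_input_tokens` and `price_1m_output_tokens` are USD per one million tokens; the prices and token counts are made-up illustration values, not rows from the snapshot.

```python
def call_cost_usd(input_tokens, output_tokens,
                  price_1m_input_tokens, price_1m_output_tokens):
    """Estimate the USD cost of one API call from per-million-token prices."""
    input_cost = input_tokens / 1_000_000 * price_1m_input_tokens
    output_cost = output_tokens / 1_000_000 * price_1m_output_tokens
    return input_cost + output_cost


# Hypothetical prices: $2.00 per 1M input tokens, $8.00 per 1M output tokens.
cost = call_cost_usd(input_tokens=4_000, output_tokens=1_000,
                     price_1m_input_tokens=2.00, price_1m_output_tokens=8.00)
print(f"${cost:.4f}")
```

Provider-side extras such as caching discounts or reasoning-token billing are not modeled here, so treat the result as an estimate.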

⚠ About :free OpenRouter slugs: These aren't "the free version of the model" — they're community / promotional endpoints (often via Chutes or similar) with aggressive rate limits, daily caps, and sometimes different quantization than the paid listing. Great for prototyping; don't wire them into production without testing throughput against your real load.

What your agent will do with it

Trigger phrases that activate the skill:

"I need a vision model under $500 with reasoning. What are my options?"
"Is there a free version of DeepSeek V4 Flash on OpenRouter?"
"Cheapest model with intelligence > 50?"
"Compare GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro."

Under the hood the agent runs short query.py commands and reasons over the output.

Commands

Three verbs, one consistent table schema.

Command	Use
`query.py models [<pattern>] [filters]`	Filter / rank / list models. Default: top 20 by intel.
`query.py show <slug>`	Full per-model profile (benchmarks, pricing, OR slugs, modalities). Accepts fuzzy slug if unambiguous.
`query.py data status`	Data freshness, model count, OpenRouter enrichment status
`query.py data refresh`	Re-scrape AA + cross-reference OR (~10s)

models flags: --top N, --sort intel|cost|ctx, --pareto, --free, --intel-min N, --max-cost N, --min-cost N, --context-min N, --modality text,image,audio,video, --reasoning/--no-reasoning, --open-weights/--no-open-weights, --json.
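If you want to drive the CLI from your own script rather than through the agent, one possible approach is to assemble the argument list programmatically. This is a minimal sketch: `build_models_command` is a hypothetical helper, not part of the repo, and it only uses flags listed above.

```python
import shlex


def build_models_command(intel_min=None, max_cost=None, modality=None,
                         top=None, as_json=True):
    """Assemble a `query.py models` invocation from optional filters."""
    cmd = ["uv", "run", "python", "query.py", "models"]
    if intel_min is not None:
        cmd += ["--intel-min", str(intel_min)]
    if max_cost is not None:
        cmd += ["--max-cost", str(max_cost)]
    if modality:
        # The CLI takes modalities as one comma-separated value.
        cmd += ["--modality", ",".join(modality)]
    if top is not None:
        cmd += ["--top", str(top)]
    if as_json:
        cmd.append("--json")
    return cmd


cmd = build_models_command(intel_min=50, max_cost=500,
                           modality=["text", "image"], top=5)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` from the skill directory. The shape of the `--json` output is not documented here, so inspect it before parsing.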

plot_pareto.py renders the Intelligence-vs-Cost Pareto chart as a PNG for visual exploration.
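For readers unfamiliar with the term, a model is Pareto-optimal here when no other model is both at least as cheap and at least as intelligent. The sketch below illustrates that definition on the three rows from the example output plus one made-up row; it is an assumption about what `--pareto` and the chart mean, not the repo's actual implementation.

```python
def pareto_frontier(models):
    """Return models not dominated on (lower cost, higher intel).

    `models` is an iterable of (slug, intel, cost) tuples.
    """
    # Cheapest first; among equal cost, most intelligent first.
    ordered = sorted(models, key=lambda m: (m[2], -m[1]))
    frontier = []
    best_intel = float("-inf")
    for slug, intel, cost in ordered:
        # Every cheaper-or-equal model has already been seen, so a model
        # survives only if it beats all of them on intelligence.
        if intel > best_intel:
            frontier.append((slug, intel, cost))
            best_intel = intel
    return frontier


rows = [
    ("deepseek-v4-pro", 51.5, 267.82),
    ("grok-4-3", 53.2, 395.17),
    ("mimo-v2-5-pro", 53.8, 461.59),
    ("hypothetical-model", 50.0, 400.00),  # made up: pricier and weaker than deepseek-v4-pro
]
for slug, intel, cost in pareto_frontier(rows):
    print(slug, intel, cost)
```

The made-up row is dropped because `deepseek-v4-pro` is both cheaper and scores higher; the other three each buy more intelligence for more cost, so all stay on the frontier.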

How it works

1. scrape.py fetches artificialanalysis.ai/models (an 8 MB HTML page) and parses the Next.js RSC payload, extracting every model object with its full schema: 60+ fields, including individual benchmarks, pricing tiers, modality flags, context window, and reasoning capability.
2. enrich.py fetches the OpenRouter catalog and matches each AA model against it by name, with a token-multiset fallback for word-order differences. The current match rate is about 51%; the rest are mostly models OpenRouter doesn't carry.
3. query.py reads the merged CSV and answers structured questions.
4. A daily GitHub Action re-runs steps 1-2 and commits any changes, so the shipped snapshot is rarely more than 24h stale.

No API keys, no auth, no rate-limited services — just public pages.
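The name matching that enrich.py performs can be pictured roughly as follows. This is a sketch of the general technique (exact match first, then comparing names as multisets of tokens so word order stops mattering), not the code in enrich.py; the function names and the catalog entries are invented for illustration.

```python
import re
from collections import Counter


def name_tokens(name):
    """Lowercase a model name and split it into a multiset of alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", name.lower()))


def match_name(aa_name, or_names):
    """Return the catalog name matching `aa_name`, or None if nothing matches."""
    # Pass 1: exact match after case and whitespace normalisation.
    normalized = " ".join(aa_name.lower().split())
    for candidate in or_names:
        if " ".join(candidate.lower().split()) == normalized:
            return candidate
    # Pass 2: same tokens with the same counts, in any order.
    target = name_tokens(aa_name)
    for candidate in or_names:
        if name_tokens(candidate) == target:
            return candidate
    return None


# Made-up catalog names, chosen to exercise both passes.
catalog = ["Grok 4.3", "V4 Pro DeepSeek"]
print(match_name("DeepSeek V4 Pro", catalog))   # matched by the token-multiset pass
print(match_name("MiMo-V2.5-Pro", catalog))     # no match
```

Comparing multisets rather than sets keeps repeated tokens significant, so a name that repeats a token is not confused with one that uses it once.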

Data files

File	Contents
`artifacts/models_enriched.csv`	The full merged dataset (60+ columns per row)
`artifacts/models.json`	Original AA fields, preserved exactly
`artifacts/openrouter.json`	Raw OpenRouter catalog
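The merged CSV can also be read directly with the standard library. The sketch below filters rows on a numeric column and skips blank cells; it runs against an in-memory stand-in, and the `slug` column name and sample values are assumptions, so check the real header row of `artifacts/models_enriched.csv` before relying on them.

```python
import csv
import io


def rows_under(fileobj, column, limit):
    """Yield CSV rows whose numeric `column` is at most `limit`, skipping blank cells."""
    for row in csv.DictReader(fileobj):
        cell = (row.get(column) or "").strip()
        if not cell:
            continue  # no value recorded for this model
        if float(cell) <= limit:
            yield row


# In-memory stand-in for artifacts/models_enriched.csv (values are made up).
sample = io.StringIO(
    "slug,price_1m_input_tokens\n"
    "model-a,0.50\n"
    "model-b,\n"
    "model-c,3.00\n"
)
cheap = [row["slug"] for row in rows_under(sample, "price_1m_input_tokens", 1.0)]
print(cheap)
```

Against the real file, replace `sample` with `open("artifacts/models_enriched.csv", newline="")`.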

When NOT to use

Benchmarks AA doesn't track (domain-specific evals).
Models too new for AA to have indexed (sometimes up to a week after release).
Authoritative per-API-call pricing on a non-OR provider; verify directly with that provider.

License

MIT. See LICENSE.

Credits

Data from Artificial Analysis and OpenRouter. Scrapes only public pages, no credentials required.
