Open-source, stateless AI model selection engine.
Select the cheapest model that can successfully complete the task.
ModelDirector doesn't execute prompts and isn't a gateway. It scores a user prompt against a configurable set of candidate models and returns the best pick — with full reasoning, a per-model cost estimate, and an auditable decision trace.

    $ echo "Translate 'hello' to Spanish." | modeldirector select -c config.yaml
    {
      "selected_model": "gpt5mini",
      "policy": "cheapest_capable",
      "scores": {
        "gpt5mini": { "overall": 92, "reasoning": 80, "coding": 85, "context": 90, "explanation": "..." },
        "sonnet": { "overall": 95, "reasoning": 90, "coding": 92, "context": 95, "explanation": "..." },
        "opus": { "overall": 99, "reasoning": 99, "coding": 97, "context": 99, "explanation": "..." }
      },
      "estimated_cost_usd": {
        "gpt5mini": 0.0001,
        "sonnet": 0.0020,
        "opus": 0.0100
      },
      "input_tokens": 7,
      "reason": "'gpt5mini' is the first model exceeding the threshold of 80 (overall=92, priority=1, cost=$0.15/1M in, $0.60/1M out)."
    }

ModelDirector is a model selection engine. Not a router. Not a gateway. Not a model. It does one thing:
- You define your candidate models (id, real cost, capabilities, strengths, description).
- ModelDirector scores a prompt against every candidate.
- Your configured policy picks the winner.
- The decision, scores, and per-model USD cost estimate are returned.
The engine is fully independent of how your agent actually calls the LLM. It works with any provider LiteLLM supports: OpenAI, Anthropic, OpenRouter, Ollama, vLLM, Bedrock, Azure. The same engine powers every adapter:

                            ModelDirector Engine
                    ┌────────────────────────────────────┐
                    │  Config loader  ·  Selector        │
                    │  Policy engine  ·  Cost estimator  │
                    └────────────────────────────────────┘
                                      │
          ┌──────────────┬────────────┼────────────┬──────────────┐
          │              │            │            │              │
     Python SDK         CLI          REST         MCP       Your adapter

Pick the surface that fits. Same engine, same policies, same result.
Most agents ship with a single hardcoded model. Two problems:
- Cost — using Opus for a typo-fix is wasteful.
- Lock-in — switching providers means rewriting call sites.
ModelDirector sits before the model call. The same director.select(prompt)
call works for OpenAI, Anthropic, local Ollama, or a self-hosted mix.
The selection is auditable (reason), deterministic (configurable seed),
explainable (full per-model scores), and transparent — every output
includes a per-model USD cost estimate so the trade-off is visible to the
caller.

    # Core (SDK + CLI)
    pip install modeldirector
    # With REST adapter
    pip install "modeldirector[rest]"
    # With MCP adapter (for Claude Desktop, Continue, etc.)
    pip install "modeldirector[mcp]"
    # Everything
    pip install "modeldirector[all]"

Or from this repo:

    git clone https://github.com/aniketkarne-com/ModelDirector
    cd ModelDirector
    uv sync --all-extras

- Write a config (`config.yaml`):

      selector:
        provider: openrouter
        model: anthropic/claude-3.5-haiku  # small, fast, JSON-reliable
        temperature: 0.0

      policy:
        type: cheapest_capable
        threshold: 80

      models:
        - id: gpt5mini  # model id = model name, no tier labels
          name: openai/gpt-4o-mini
          display_name: GPT-4o mini
          description: |
            OpenAI's small, fast, low-cost model. Good for classification,
            routing, short summarisation, and basic Q&A. Weaker at
            long-horizon reasoning and complex code generation.
          strengths:  # structured tags the selector can rely on
            - classification
            - short_summarisation
            - simple_qa
          capabilities: { reasoning: 65, coding: 70, context: 70 }
          priority: 1
          cost:  # USD per 1M tokens (input / output)
            input: 0.15
            output: 0.60

        - id: sonnet
          name: anthropic/claude-3.5-sonnet
          display_name: Claude 3.5 Sonnet
          description: |
            Anthropic's mid-tier model. Strong at coding, architecture, and
            long-context reasoning. 200k context window.
          strengths:
            - coding
            - architecture
            - refactoring
            - long_context
          capabilities: { reasoning: 88, coding: 92, context: 95 }
          priority: 2
          cost:
            input: 3.00
            output: 15.00

        - id: opus
          name: anthropic/claude-opus-4
          display_name: Claude Opus 4
          description: |
            Anthropic's frontier model. Best reasoning, complex multi-step
            planning, nuanced code generation.
          strengths:
            - hard_reasoning
            - complex_coding
            - architecture_design
          capabilities: { reasoning: 95, coding: 95, context: 99 }
          priority: 3
          cost:
            input: 15.00
            output: 75.00

- Select a model for any prompt:

      # CLI
      modeldirector select -c config.yaml -p "Summarise this article."
      # From a file
      modeldirector select -c config.yaml -f prompt.txt
      # From stdin
      echo "Write a haiku" | modeldirector select -c config.yaml

- Or use the Python SDK:

      from modeldirector import ModelDirector, load_config

      director = ModelDirector(load_config("config.yaml"))
      result = director.select("Explain quantum entanglement in one paragraph.")
      print(result.selected_model, "-", result.reason)
      print(" cost:", result.estimated_cost_usd[result.selected_model], "USD")

- Or call the REST API:

      MODELDIRECTOR_CONFIG=config.yaml uvicorn modeldirector.adapters.rest:app

      curl -X POST http://localhost:8000/select \
        -H 'content-type: application/json' \
        -d '{"prompt": "Fix the typo in this sentence."}'

- Or expose it as an MCP tool:

      modeldirector-mcp config.yaml  # talks MCP stdio to Claude Desktop etc.

The selector is a small, fast LLM (e.g. claude-3.5-haiku, gpt-4o-mini,
qwen3-32b) that gets a structured JSON prompt with your task, every
candidate's profile (description, capabilities, strengths, and real
cost per 1M tokens), and a strict scoring rubric. It returns scores only —
the decision logic is yours.
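
To make the shape of that selector input concrete, here is a minimal sketch of assembling such a JSON prompt. It is not ModelDirector's actual prompt: the function name `build_selector_payload`, the key names, and the rubric wording are all assumptions made for illustration, since this README does not show the real format.

```python
import json


def build_selector_payload(task, candidates):
    """Assemble a JSON string holding the task, candidate profiles, and a rubric.

    Hypothetical layout: key names here are illustrative, not the library's.
    """
    payload = {
        "task": task,
        "candidates": [
            {
                "id": c["id"],
                "description": c.get("description", ""),
                "strengths": c.get("strengths", []),
                "capabilities": c["capabilities"],
                "cost_usd_per_1m": c["cost"],
            }
            for c in candidates
        ],
        # Assumed rubric wording; the real rubric is not shown in this README.
        "rubric": (
            "Score each candidate 0-100 for reasoning, coding, context, "
            "and overall. Return JSON only."
        ),
    }
    return json.dumps(payload, indent=2)


candidates = [
    {
        "id": "gpt5mini",
        "strengths": ["classification", "simple_qa"],
        "capabilities": {"reasoning": 65, "coding": 70, "context": 70},
        "cost": {"input": 0.15, "output": 0.60},
    },
    {
        "id": "sonnet",
        "capabilities": {"reasoning": 88, "coding": 92, "context": 95},
        "cost": {"input": 3.00, "output": 15.00},
    },
]

print(build_selector_payload("Translate 'hello' to Spanish.", candidates))
```

Passing descriptions and strengths alongside the raw capabilities is what lets a small selector score a candidate it may not recognise by name.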
Three built-in policies:
| Policy | Picks |
|---|---|
| `cheapest_capable` | First model (by priority, then by input cost) whose overall >= threshold |
| `highest_confidence` | The model with the highest overall score |
| `best_value` | Best overall / cost.input ratio (confidence per dollar) |
Custom policies subclass Policy and override apply(scores, models).
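
The three built-in policies can be sketched in plain Python. This is an illustrative reimplementation of the rules in the table, not ModelDirector's own code: the `Candidate` dataclass and the three function names are hypothetical, and returning `None` when nothing clears the threshold is an assumption. The sample scores and prices are the ones from the example output at the top of this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for one scored model (not the library's type)."""
    id: str
    overall: int       # selector's overall score, 0-100
    priority: int      # lower = preferred
    cost_input: float  # USD per 1M input tokens


def cheapest_capable(candidates, threshold):
    """First candidate by (priority, input cost) whose overall >= threshold."""
    ordered = sorted(candidates, key=lambda c: (c.priority, c.cost_input))
    for candidate in ordered:
        if candidate.overall >= threshold:
            return candidate.id
    return None  # assumption: the real engine may handle this case differently


def highest_confidence(candidates):
    """Candidate with the highest overall score."""
    return max(candidates, key=lambda c: c.overall).id


def best_value(candidates):
    """Candidate with the best overall / input-cost ratio."""
    return max(candidates, key=lambda c: c.overall / c.cost_input).id


candidates = [
    Candidate("gpt5mini", overall=92, priority=1, cost_input=0.15),
    Candidate("sonnet", overall=95, priority=2, cost_input=3.00),
    Candidate("opus", overall=99, priority=3, cost_input=15.00),
]

print(cheapest_capable(candidates, threshold=80))  # gpt5mini
print(highest_confidence(candidates))              # opus
print(best_value(candidates))                      # gpt5mini
```

With the same scores, a threshold of 96 would skip gpt5mini and sonnet and return opus. Note that this `best_value` sketch divides by the input price, so a free local model (cost 0) would need special handling.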
| Field | Required | Default | Notes |
|---|---|---|---|
| `provider` | yes | - | LiteLLM provider, e.g. openrouter, openai, ollama |
| `model` | yes | - | Model name (LiteLLM format) |
| `api_key` | no | env | Falls back to ${PROVIDER}_API_KEY |
| `api_base` | no | - | Override the API base URL |
| `temperature` | no | `0.0` | Sampling temperature |
| `max_tokens` | no | - | Cap on the selector's response |
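
The `api_key` fallback in the table above can be pictured with a few lines of Python. This is a sketch under assumptions, not the loader's real code: the helper name `resolve_api_key` is hypothetical, and it assumes the provider name is upper-cased to build the variable name (so `openrouter` maps to `OPENROUTER_API_KEY`).

```python
import os


def resolve_api_key(provider, explicit_key=None, environ=None):
    """Return the configured key if present, else the <PROVIDER>_API_KEY value."""
    if explicit_key:
        return explicit_key
    env = os.environ if environ is None else environ
    # Assumption: the variable name is the upper-cased provider plus "_API_KEY".
    return env.get(f"{provider.upper()}_API_KEY")


# Passing a dict instead of os.environ keeps the example free of side effects.
fake_env = {"OPENROUTER_API_KEY": "sk-example"}
print(resolve_api_key("openrouter", environ=fake_env))
print(resolve_api_key("openai", explicit_key="sk-from-config", environ=fake_env))
```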
| Field | Type | Default | Notes |
|---|---|---|---|
| `type` | enum | `cheapest_capable` | One of cheapest_capable, highest_confidence, best_value |
| `threshold` | int | `85` | Confidence threshold for cheapest_capable |
| `cost_per_million` | bool | `true` | Hint that costs are USD per 1M tokens (affects reports only) |
| Field | Required | Default | Notes |
|---|---|---|---|
| `id` | yes | - | Stable identifier used in output. Use model names (e.g. gpt5mini, sonnet), not tier labels like cheap/premium. The engine doesn't know what tiers mean. |
| `name` | yes | - | LiteLLM-format model name (the actual call) |
| `display_name` | no | `name` | Human-friendly label |
| `description` | no | `""` | Strongly recommended. Free-form context for the selector; small LLMs may not recognise model names. |
| `strengths` | no | `[]` | Strongly recommended. Structured task-type tags (e.g. coding, architecture, long_context). More reliable for small selectors than relying on the model name. |
| `capabilities` | yes | - | `reasoning`, `coding`, `context` required; `creativity` optional (each 0-100) |
| `priority` | no | `1` | Lower = preferred. Tie-breaker for cheapest_capable. |
| `cost` | yes | - | Per-1M-token cost in USD. Use `cost: {input: 0.15, output: 0.60}` for new configs. A bare number is accepted for back-compat and treated as input=output=x. |
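
The two accepted shapes of the `cost` field can be illustrated with a small normaliser. This is only a sketch of the rule stated in the table: the `Cost` dataclass here is a stand-in for illustration (the project's own `Cost` is a Pydantic model), and `normalize_cost` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    """USD per 1M tokens. Illustrative stand-in, not the library's class."""
    input: float
    output: float


def normalize_cost(raw):
    """Accept {input, output} or a bare number (treated as input = output)."""
    if isinstance(raw, bool):
        # bool is a subclass of int in Python, so reject it explicitly.
        raise TypeError("cost must be a number or an {input, output} mapping")
    if isinstance(raw, (int, float)):
        return Cost(input=float(raw), output=float(raw))
    return Cost(input=float(raw["input"]), output=float(raw["output"]))


print(normalize_cost({"input": 0.15, "output": 0.60}))
print(normalize_cost(3))  # legacy bare number
```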
| Interface | Use it for | Entry point |
|---|---|---|
| Python SDK | Embedding in an app | from modeldirector import ModelDirector |
| CLI | Shell scripts, cron jobs | modeldirector select -c config.yaml -p "..." |
| REST | Microservices, polyglot stacks | uvicorn modeldirector.adapters.rest:app |
| MCP | Claude Desktop, Continue, Roo Code | modeldirector-mcp config.yaml |
The same engine powers all four. Adapters are thin translation layers.

    $ pytest tests/ -q -m "not integration"
    47 passed in 0.06s

Includes:
- 43 unit tests for the policy engine, config loader, cost estimator, and selector (no network): `pytest tests/`
- 4 integration tests that hit OpenRouter with a real selector LLM: `pytest tests/test_integration.py` (skipped automatically if `OPENROUTER_API_KEY` isn't set)
CI on every commit runs the unit suite. The integration suite is opt-in
(pytest -m integration).
benchmarks/run_benchmark.py runs 30 real tasks (translation, Q&A, code
generation, architecture design, prose) through the full ModelDirector
pipeline and reports:
- which model was selected per task
- per-task USD cost
- per-category cost savings vs an "always-opus" baseline
- selector latency (p50 / p95 / max)
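
For readers unfamiliar with the latency figures, here is one way p50 / p95 / max could be derived from raw timings. How `run_benchmark.py` actually computes them is not stated here; this sketch assumes the nearest-rank method, and the sample latencies are made up for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_s):
    """Summarise selector latencies (seconds) as p50 / p95 / max."""
    return {
        "p50": percentile(latencies_s, 50),
        "p95": percentile(latencies_s, 95),
        "max": max(latencies_s),
    }


# Made-up timings in seconds, not benchmark data.
samples = [3.2, 3.9, 4.1, 3.7, 5.1, 4.5, 3.8, 4.0, 3.6, 4.2]
print(latency_summary(samples))
```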
Run it:

    OPENROUTER_API_KEY=*** python -m benchmarks.run_benchmark

Latest results are committed under benchmarks/output/results.json
and re-rendered below.

Captured 2026-06-07 with `selector: anthropic/claude-3.5-haiku` and a 3-model candidate set (`gpt5mini = gpt-4o-mini`, `sonnet = claude-3.5-sonnet`, `opus = claude-opus-4`) with real per-1M-token USD costs. 30 tasks, 8 categories.

Headline numbers
| Metric | Value |
|---|---|
| Tasks completed | 30 / 30 |
| Errors | 0 |
| Mean cost savings vs always-opus | 100.0 % |
| Tasks that picked `gpt5mini` | 11 / 30 |
| Tasks that picked `sonnet` | 19 / 30 |
| Tasks that picked `opus` | 0 / 30 |
| Selector p50 latency | 3.9 s |
| Selector p95 latency | 4.5 s |
| Selector max latency | 5.1 s |
Per-category model picks
| Category | n | Mean savings | Picked |
|---|---|---|---|
| trivial | 5 | 100.0 % | 4× gpt5mini, 1× sonnet |
| simple_qa | 5 | 100.0 % | 4× gpt5mini, 1× sonnet |
| summarisation | 3 | 100.0 % | 2× gpt5mini, 1× sonnet |
| light_coding | 4 | 100.0 % | 4× sonnet |
| medium_coding | 4 | 100.0 % | 4× sonnet |
| reasoning | 3 | 100.0 % | 1× gpt5mini, 2× sonnet |
| hard_coding | 4 | 100.0 % | 4× sonnet |
| writing | 2 | 100.0 % | 2× sonnet |
What this means in real USD, per 1M tokens:
| Model | $/1M in | $/1M out | vs opus (input) |
|---|---|---|---|
| gpt5mini | $0.15 | $0.60 | 100× cheaper |
| sonnet | $3.00 | $15.00 | 5× cheaper |
| opus | $15.00 | $75.00 | baseline |
For a typical 200-token prompt and a 200-token response:
- `gpt5mini` ≈ $0.00003 input / $0.00012 output
- `sonnet` ≈ $0.0006 input / $0.003 output
- `opus` ≈ $0.003 input / $0.015 output
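
The arithmetic behind these per-prompt figures is a single multiplication. The sketch below uses the prices from the table above; the helper name `estimate_cost_usd` is hypothetical, and because the output side depends on response length, the example assumes a 200-token response as well.

```python
def estimate_cost_usd(tokens, usd_per_million):
    """Cost in USD of `tokens` tokens at a per-1M-token price."""
    return tokens * usd_per_million / 1_000_000


# (input price, output price) in USD per 1M tokens, from the table above.
PRICES = {
    "gpt5mini": (0.15, 0.60),
    "sonnet": (3.00, 15.00),
    "opus": (15.00, 75.00),
}

INPUT_TOKENS = 200
OUTPUT_TOKENS = 200  # assumption: response as long as the prompt

for model, (price_in, price_out) in PRICES.items():
    cost_in = estimate_cost_usd(INPUT_TOKENS, price_in)
    cost_out = estimate_cost_usd(OUTPUT_TOKENS, price_out)
    print(f"{model}: ${cost_in:.5f} input / ${cost_out:.5f} output")
```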
Why no opus picks? Under cheapest_capable, a model is only reached when
every higher-priority candidate scores below the threshold. Sonnet's
overall score cleared the configured threshold (80) for every task in the
battery, so the policy never fell through to the frontier model and its
extra cost. Raise the threshold above the scores sonnet receives on the
hardest tasks, or tighten the candidates' description/strengths so the
cheaper models score lower there, and opus will start to be picked for
those tasks.
Raw data: benchmarks/output/results.json — every score, every reason, every latency, every USD estimate.
modeldirector/
├── modeldirector/
│ ├── config.py # Pydantic config models (Cost, ModelProfile, PolicyConfig, ...)
│ ├── loader.py # YAML / dict loader, ${ENV} expansion
│ ├── models.py # Public types: ModelScore, SelectionResult (with estimated_cost_usd)
│ ├── policy.py # Policy engine (3 built-ins + custom)
│ ├── selector.py # Selector engine (LiteLLM, dynamic prompt) + token estimation
│ └── adapters/
│ ├── cli.py # click-based CLI
│ ├── rest.py # FastAPI server
│ └── mcp.py # FastMCP tool
├── benchmarks/
│ └── run_benchmark.py # 30-task real-LLM benchmark
├── examples/
│ └── config.yaml # ready-to-use config
├── tests/ # 43 unit + 4 integration tests
├── docs/
│ ├── hero.jpg # README hero (top)
│ └── how-it-works.jpg # "How it works" section diagram
├── prd.md
├── pyproject.toml
└── README.md
The engine is the product. Adapters are derived. Adding a new interface
(Discord bot, Slack command, custom agent, etc.) is a matter of
constructing a ModelDirector and translating the input/output to the
adapter's protocol.
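
As a rough illustration of how thin such an adapter can be, the sketch below translates a dict-shaped request into a dict-shaped reply. Only `select()`, `selected_model`, `reason`, and `estimated_cost_usd` come from this README; the `handle_request` function, the request/reply layout, and the `FakeDirector` stub (used so the example runs without the library or a network call) are assumptions.

```python
from dataclasses import dataclass
from typing import Protocol


class Director(Protocol):
    """Anything exposing select(prompt), such as a ModelDirector instance."""

    def select(self, prompt: str): ...


def handle_request(director: Director, payload: dict) -> dict:
    """Translate a hypothetical {'prompt': ...} request into a plain dict reply."""
    prompt = str(payload.get("prompt") or "").strip()
    if not prompt:
        return {"ok": False, "error": "missing 'prompt'"}
    result = director.select(prompt)
    return {
        "ok": True,
        "model": result.selected_model,
        "reason": result.reason,
        "cost_usd": result.estimated_cost_usd[result.selected_model],
    }


# Stand-ins so the sketch runs on its own; swap in a real ModelDirector.
@dataclass
class FakeResult:
    selected_model: str
    reason: str
    estimated_cost_usd: dict


class FakeDirector:
    def select(self, prompt: str) -> FakeResult:
        return FakeResult("gpt5mini", "stubbed decision", {"gpt5mini": 0.0001})


print(handle_request(FakeDirector(), {"prompt": "Fix the typo."}))
```

In a real adapter, `FakeDirector()` would be replaced by `ModelDirector(load_config("config.yaml"))`, and `handle_request` by whatever the target protocol (Slack, Discord, and so on) expects.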
- Stateless: no database, no storage, no learning, no telemetry. Every request is independent.
- Model-agnostic: never hardcodes model names. You define the candidate set; ModelDirector works with any provider LiteLLM supports (OpenAI, Anthropic, Ollama, vLLM, Bedrock, Azure, etc.).
- Cost-transparent: `cost` is a first-class field, `estimated_cost_usd` is in every response. The trade-off is visible, not hidden.
- Configuration first: every threshold, every policy, every model is config. No magic defaults that hide the cost trade-off.
- Auditable: the response includes per-model scores, per-model cost, the reason, and the policy that was applied. You always know why a model was picked.
MIT. See LICENSE.
PRs welcome. Bug reports and feature requests go in GitHub Issues.
For substantial changes, open an issue first to discuss the design — this project prizes a small, stable API surface.

