ModelDirector

Open-source, stateless AI model selection engine.

Select the cheapest model that can successfully complete the task.

ModelDirector doesn't execute prompts and isn't a gateway. It scores a user prompt against a configurable set of candidate models and returns the best pick — with full reasoning, a per-model cost estimate, and an auditable decision trace.

$ echo "Translate 'hello' to Spanish." | modeldirector select -c config.yaml
{
  "selected_model": "gpt5mini",
  "policy": "cheapest_capable",
  "scores": {
    "gpt5mini": { "overall": 92, "reasoning": 80, "coding": 85, "context": 90, "explanation": "..." },
    "sonnet":   { "overall": 95, "reasoning": 90, "coding": 92, "context": 95, "explanation": "..." },
    "opus":     { "overall": 99, "reasoning": 99, "coding": 97, "context": 99, "explanation": "..." }
  },
  "estimated_cost_usd": {
    "gpt5mini": 0.0001,
    "sonnet":   0.0020,
    "opus":     0.0100
  },
  "input_tokens": 7,
  "reason": "'gpt5mini' is the first model exceeding the threshold of 80 (overall=92, priority=1, cost=$0.15/1M in, $0.60/1M out)."
}

The product: a model selection engine

ModelDirector is a model selection engine. Not a router. Not a gateway. Not a model. It does one thing, in four steps:

You define your candidate models (id, real cost, capabilities, strengths, description).
ModelDirector scores a prompt against every candidate.
Your configured policy picks the winner.
The decision, scores, and per-model USD cost estimate are returned.

The engine is independent of how your agent actually calls the LLM. The selector itself runs through LiteLLM, so it works with any provider LiteLLM supports: OpenAI, Anthropic, OpenRouter, Ollama, vLLM, Bedrock, Azure. The same engine powers every adapter:

                          ModelDirector Engine
                  ┌────────────────────────────────────┐
                  │  Config loader  ·  Selector        │
                  │  Policy engine  ·  Cost estimator  │
                  └────────────────────────────────────┘
                                    │
        ┌──────────────┬────────────┼────────────┬──────────────┐
        │              │            │            │              │
   Python SDK         CLI         REST         MCP      Your adapter

Pick the surface that fits. Same engine, same policies, same result.

Why

Most agents ship with a single hardcoded model. Two problems:

Cost — using Opus for a typo-fix is wasteful.
Lock-in — switching providers means rewriting call sites.

ModelDirector sits before the model call. The same director.select(prompt) call works for OpenAI, Anthropic, local Ollama, or a self-hosted mix. The selection is auditable (reason), deterministic (configurable seed), explainable (full per-model scores), and transparent — every output includes a per-model USD cost estimate so the trade-off is visible to the caller.

Install

# Core (SDK + CLI)
pip install modeldirector

# With REST adapter
pip install "modeldirector[rest]"

# With MCP adapter (for Claude Desktop, Continue, etc.)
pip install "modeldirector[mcp]"

# Everything
pip install "modeldirector[all]"

Or from this repo:

git clone https://github.com/aniketkarne-com/ModelDirector
cd ModelDirector
uv sync --all-extras

Quickstart

Write a config (config.yaml):

selector:
  provider: openrouter
  model: anthropic/claude-3.5-haiku  # small, fast, JSON-reliable
  temperature: 0.0

policy:
  type: cheapest_capable
  threshold: 80

models:
  - id: gpt5mini                 # model id = model name, no tier labels
    name: openai/gpt-4o-mini
    display_name: GPT-4o mini
    description: |
      OpenAI's small, fast, low-cost model. Good for classification,
      routing, short summarisation, and basic Q&A. Weaker at long-
      horizon reasoning and complex code generation.
    strengths:                    # structured tags the selector can rely on
      - classification
      - short_summarisation
      - simple_qa
    capabilities: { reasoning: 65, coding: 70, context: 70 }
    priority: 1
    cost:                          # USD per 1M tokens (input / output)
      input: 0.15
      output: 0.60

  - id: sonnet
    name: anthropic/claude-3.5-sonnet
    display_name: Claude 3.5 Sonnet
    description: |
      Anthropic's mid-tier model. Strong at coding, architecture,
      and long-context reasoning. 200k context window.
    strengths:
      - coding
      - architecture
      - refactoring
      - long_context
    capabilities: { reasoning: 88, coding: 92, context: 95 }
    priority: 2
    cost:
      input: 3.00
      output: 15.00

  - id: opus
    name: anthropic/claude-opus-4
    display_name: Claude Opus 4
    description: |
      Anthropic's frontier model. Best reasoning, complex multi-step
      planning, nuanced code generation.
    strengths:
      - hard_reasoning
      - complex_coding
      - architecture_design
    capabilities: { reasoning: 95, coding: 95, context: 99 }
    priority: 3
    cost:
      input: 15.00
      output: 75.00

Select a model for any prompt:

# CLI
modeldirector select -c config.yaml -p "Summarise this article."

# From a file
modeldirector select -c config.yaml -f prompt.txt

# From stdin
echo "Write a haiku" | modeldirector select -c config.yaml

Or use the Python SDK:

from modeldirector import ModelDirector, load_config

director = ModelDirector(load_config("config.yaml"))
result = director.select("Explain quantum entanglement in one paragraph.")
print(result.selected_model, "-", result.reason)
print("  cost:", result.estimated_cost_usd[result.selected_model], "USD")

Or call the REST API:

MODELDIRECTOR_CONFIG=config.yaml uvicorn modeldirector.adapters.rest:app
curl -X POST http://localhost:8000/select \
     -H 'content-type: application/json' \
     -d '{"prompt": "Fix the typo in this sentence."}'
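The same request can be made from Python. This is a minimal sketch using only the standard library; it assumes the server from the command above is listening on localhost:8000, and the helper names `build_select_request` and `select_over_rest` are illustrative, not part of ModelDirector. The reply is decoded as generic JSON because its exact fields are not documented here.

```python
import json
import urllib.request


def build_select_request(prompt: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST /select request shown in the curl example."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/select",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def select_over_rest(prompt: str, base_url: str = "http://localhost:8000") -> dict:
    """Send the request and decode the JSON reply (needs the server running)."""
    request = build_select_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting request construction from sending keeps the first half testable without a running server.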

Or expose it as an MCP tool:

modeldirector-mcp config.yaml   # talks MCP stdio to Claude Desktop etc.

How it works

The selector is a small, fast LLM (e.g. claude-3.5-haiku, gpt-4o-mini, qwen3-32b) that gets a structured JSON prompt with your task, every candidate's profile (description, capabilities, strengths, and real cost per 1M tokens), and a strict scoring rubric. It returns scores only — the decision logic is yours.
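Because the selector is itself an LLM, its reply is worth checking before a policy runs on it. The sketch below is an assumption about how such a check could look, not ModelDirector's actual parser: it supposes the reply is a JSON object keyed by model id, shaped like the `scores` object in the example output, and the name `parse_scores` is hypothetical.

```python
import json

# Numeric fields seen in the example output's "scores" object.
SCORE_FIELDS = ("overall", "reasoning", "coding", "context")


def parse_scores(raw_reply, candidate_ids):
    """Decode a selector reply and check every candidate has 0-100 scores."""
    scores = json.loads(raw_reply)
    missing = [cid for cid in candidate_ids if cid not in scores]
    if missing:
        raise ValueError(f"selector did not score: {missing}")
    for cid in candidate_ids:
        for name in SCORE_FIELDS:
            value = scores[cid].get(name)
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or not 0 <= value <= 100:
                raise ValueError(f"{cid}.{name} must be a number in 0-100, got {value!r}")
    # Return only the configured candidates, in the configured order.
    return {cid: scores[cid] for cid in candidate_ids}
```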

Three built-in policies:

Policy	Picks
`cheapest_capable`	First model (by priority, then by input cost) whose `overall >= threshold`
`highest_confidence`	The model with the highest overall score
`best_value`	Best `overall / cost.input` ratio (confidence per dollar)

Custom policies subclass Policy and override apply(scores, models).
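The three rules in the table can be written out as plain functions. This is a sketch of the selection rules as described above, not the library's `Policy` classes: `Candidate` is an illustrative stand-in type, the scores are taken from the example output at the top of this README, and returning `None` when nothing clears the threshold is an assumption about behaviour the table does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Illustrative stand-in for one configured model plus its selector score."""
    id: str
    overall: int        # selector's overall score, 0-100
    priority: int       # lower = preferred
    cost_input: float   # USD per 1M input tokens, assumed > 0


def cheapest_capable(candidates, threshold):
    """First candidate, ordered by (priority, input cost), whose overall >= threshold."""
    ordered = sorted(candidates, key=lambda c: (c.priority, c.cost_input))
    for candidate in ordered:
        if candidate.overall >= threshold:
            return candidate.id
    return None  # assumption: the real engine's no-match behaviour may differ


def highest_confidence(candidates):
    """Candidate with the highest overall score."""
    return max(candidates, key=lambda c: c.overall).id


def best_value(candidates):
    """Candidate with the best overall / input-cost ratio."""
    return max(candidates, key=lambda c: c.overall / c.cost_input).id


CANDIDATES = [
    Candidate("gpt5mini", overall=92, priority=1, cost_input=0.15),
    Candidate("sonnet", overall=95, priority=2, cost_input=3.00),
    Candidate("opus", overall=99, priority=3, cost_input=15.00),
]

print(cheapest_capable(CANDIDATES, threshold=80))
print(highest_confidence(CANDIDATES))
print(best_value(CANDIDATES))
```

Note that a higher threshold pushes `cheapest_capable` toward more expensive models, because fewer cheap candidates clear it.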

Configuration reference

`selector`

Field	Required	Default	Notes
`provider`	yes	—	LiteLLM provider, e.g. `openrouter`, `openai`, `ollama`
`model`	yes	—	Model name (LiteLLM format)
`api_key`	no	env	Falls back to `${PROVIDER}_API_KEY`
`api_base`	no	—	Override the API base URL
`temperature`	no	`0.0`	Sampling temperature
`max_tokens`	no	—	Cap on the selector's response
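The `api_key` fallback can be pictured as follows. This is a sketch of the rule in the table, assuming the provider name is upper-cased to form the variable name (so `openrouter` maps to `OPENROUTER_API_KEY`); `resolve_api_key` is a hypothetical helper, not a ModelDirector function.

```python
import os


def resolve_api_key(provider, explicit_key=None, env=None):
    """Return the configured key if set, else the value of <PROVIDER>_API_KEY."""
    env = os.environ if env is None else env
    if explicit_key:
        return explicit_key
    # Assumption: the provider name is upper-cased to build the variable name.
    return env.get(f"{provider.upper()}_API_KEY")
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.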

`policy`

Field	Type	Default	Notes
`type`	enum	`cheapest_capable`	One of `cheapest_capable`, `highest_confidence`, `best_value`
`threshold`	int	`85`	Confidence threshold for `cheapest_capable`
`cost_per_million`	bool	`true`	Hint that costs are USD per 1M tokens (affects reports only)

`models[]`

Field	Required	Default	Notes
`id`	yes	—	Stable identifier used in output. Use model names (e.g. `gpt5mini`, `sonnet`) — not tier labels like `cheap`/`premium`. The engine doesn't know what tiers mean.
`name`	yes	—	LiteLLM-format model name (the actual call)
`display_name`	no	`name`	Human-friendly label
`description`	no	`""`	Strongly recommended. Free-form context for the selector — small LLMs may not recognise model names.
`strengths`	no	`[]`	Strongly recommended. Structured task-type tags (e.g. `coding`, `architecture`, `long_context`). More reliable for small selectors than relying on the model name.
`capabilities`	yes	—	`reasoning`, `coding`, `context` required, `creativity` optional (0-100)
`priority`	no	`1`	Lower = preferred. Tie-breaker for `cheapest_capable`.
`cost`	yes	—	Per-1M-token cost in USD. `cost: {input: 0.15, output: 0.60}` for new configs. A bare number is accepted for back-compat and treated as `input=output=x`.
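The two accepted forms of `cost` can be normalised to one shape. The sketch below uses a plain dataclass as a stand-in for the project's own config model, and `normalise_cost` is a hypothetical name; it only illustrates the rule that a bare number means `input=output=x`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    """Stand-in cost record, USD per 1M tokens."""
    input: float
    output: float


def normalise_cost(raw):
    """Accept either {input, output} or a bare number (treated as input=output)."""
    if isinstance(raw, bool):
        raise TypeError("cost must be a number or a mapping, not a bool")
    if isinstance(raw, (int, float)):
        return Cost(input=float(raw), output=float(raw))
    return Cost(input=float(raw["input"]), output=float(raw["output"]))
```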

Interfaces

Interface	Use it for	Entry point
Python SDK	Embedding in an app	`from modeldirector import ModelDirector`
CLI	Shell scripts, cron jobs	`modeldirector select -c config.yaml -p "..."`
REST	Microservices, polyglot stacks	`uvicorn modeldirector.adapters.rest:app`
MCP	Claude Desktop, Continue, Roo Code	`modeldirector-mcp config.yaml`

The same engine powers all four. Adapters are thin translation layers.

Tests

$ pytest tests/ -q -m "not integration"
47 passed in 0.06s

Includes:

43 unit tests for the policy engine, config loader, cost estimator, and selector (no network) — pytest tests/
4 integration tests that hit OpenRouter with a real selector LLM — pytest tests/test_integration.py (skipped automatically if OPENROUTER_API_KEY isn't set)

CI on every commit runs the unit suite. The integration suite is opt-in (pytest -m integration).

Benchmark

benchmarks/run_benchmark.py runs 30 real tasks (translation, Q&A, code generation, architecture design, prose) through the full ModelDirector pipeline and reports:

which model was selected per task
per-task USD cost
per-category cost savings vs an "always-opus" baseline
selector latency (p50 / p95 / max)
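For readers who want to reproduce the latency summary from raw timings, here is one way to compute p50 / p95 / max. It uses the nearest-rank percentile definition, which is an assumption; the benchmark script may use a different method, so small differences from its reported figures are possible.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_s):
    """Summarise a list of per-task selector latencies, in seconds."""
    return {
        "p50": percentile(latencies_s, 50),
        "p95": percentile(latencies_s, 95),
        "max": max(latencies_s),
    }
```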

Run it:

OPENROUTER_API_KEY=*** python -m benchmarks.run_benchmark

Latest results are committed under benchmarks/output/results.json and re-rendered below.

Latest run

Captured 2026-06-07 with selector: anthropic/claude-3.5-haiku and a 3-model candidate set (gpt5mini = gpt-4o-mini, sonnet = claude-3.5-sonnet, opus = claude-opus-4) with real per-1M-token USD costs. 30 tasks, 8 categories.

Headline numbers

Metric	Value
Tasks completed	30 / 30
Errors	0
Mean cost savings vs always-opus	100.0 %
Tasks that picked `gpt5mini`	11 / 30
Tasks that picked `sonnet`	19 / 30
Tasks that picked `opus`	0 / 30
Selector p50 latency	3.9 s
Selector p95 latency	4.5 s
Selector max latency	5.1 s

Per-task model picks

Category	n	Mean savings	Picked
trivial	5	100.0 %	4× gpt5mini, 1× sonnet
simple_qa	5	100.0 %	4× gpt5mini, 1× sonnet
summarisation	3	100.0 %	2× gpt5mini, 1× sonnet
light_coding	4	100.0 %	4× sonnet
medium_coding	4	100.0 %	4× sonnet
reasoning	3	100.0 %	1× gpt5mini, 2× sonnet
hard_coding	4	100.0 %	4× sonnet
writing	2	100.0 %	2× sonnet

What this means in real USD. Prices per 1M tokens:

Model	$/1M in	$/1M out	vs opus (input)
gpt5mini	$0.15	$0.60	100× cheaper
sonnet	$3.00	$15.00	5× cheaper
opus	$15.00	$75.00	baseline

For a typical request with 200 input tokens and 200 output tokens:

gpt5mini ≈ $0.00003 input / $0.00012 output
sonnet ≈ $0.0006 input / $0.003 output
opus ≈ $0.003 input / $0.015 output
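The per-request figures follow directly from the price table: tokens multiplied by the per-1M price, divided by one million. The sketch below redoes that arithmetic for 200 input and 200 output tokens; the function names are illustrative, and the prices are copied from the table above.

```python
# USD per 1M tokens, copied from the price table.
PRICES_PER_MILLION = {
    "gpt5mini": {"input": 0.15, "output": 0.60},
    "sonnet": {"input": 3.00, "output": 15.00},
    "opus": {"input": 15.00, "output": 75.00},
}


def request_cost_usd(model, input_tokens, output_tokens):
    """Cost of one request, split into input and output parts."""
    price = PRICES_PER_MILLION[model]
    return {
        "input": input_tokens * price["input"] / 1_000_000,
        "output": output_tokens * price["output"] / 1_000_000,
    }


def input_price_ratio(baseline, model):
    """How many times cheaper `model` is than `baseline` on input price."""
    return PRICES_PER_MILLION[baseline]["input"] / PRICES_PER_MILLION[model]["input"]


for model in PRICES_PER_MILLION:
    cost = request_cost_usd(model, input_tokens=200, output_tokens=200)
    ratio = input_price_ratio("opus", model)
    print(f"{model}: input ${cost['input']:.5f}, output ${cost['output']:.5f}, {ratio:.0f}x vs opus")
```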

Why no opus picks? Sonnet's overall score cleared the configured threshold (80) for every task in the battery, so a cheapest_capable policy never had to fall through to opus. For this task mix, the extra cost of the frontier model wasn't justified. Raise the threshold above the scores sonnet receives on the hardest tasks, or tighten the description/strengths in the config, and the policy will start to pick opus for those tasks.

Raw data: benchmarks/output/results.json — every score, every reason, every latency, every USD estimate.

Architecture

modeldirector/
├── modeldirector/
│   ├── config.py            # Pydantic config models (Cost, ModelProfile, PolicyConfig, ...)
│   ├── loader.py            # YAML / dict loader, ${ENV} expansion
│   ├── models.py            # Public types: ModelScore, SelectionResult (with estimated_cost_usd)
│   ├── policy.py            # Policy engine (3 built-ins + custom)
│   ├── selector.py          # Selector engine (LiteLLM, dynamic prompt) + token estimation
│   └── adapters/
│       ├── cli.py           # click-based CLI
│       ├── rest.py          # FastAPI server
│       └── mcp.py           # FastMCP tool
├── benchmarks/
│   └── run_benchmark.py     # 30-task real-LLM benchmark
├── examples/
│   └── config.yaml          # ready-to-use config
├── tests/                   # 43 unit + 4 integration tests
├── docs/
│   ├── hero.jpg             # README hero (top)
│   └── how-it-works.jpg     # "How it works" section diagram
├── prd.md
├── pyproject.toml
└── README.md

The engine is the product. Adapters are derived. Adding a new interface (Discord bot, Slack command, custom agent, etc.) is a matter of constructing a ModelDirector and translating the input/output to the adapter's protocol.
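As an illustration of how thin an adapter can be, here is a sketch of a chat-command handler. It relies only on what the SDK example shows: a `select(prompt)` method whose result has `selected_model`, `reason`, and `estimated_cost_usd`. The `/pick` command, the `handle_chat_command` function, and the stub classes are hypothetical and exist only so the example runs without a selector LLM.

```python
from dataclasses import dataclass, field
from typing import Protocol


class SelectionLike(Protocol):
    selected_model: str
    reason: str
    estimated_cost_usd: dict


class DirectorLike(Protocol):
    def select(self, prompt: str) -> SelectionLike: ...


def handle_chat_command(director: DirectorLike, message: str) -> str:
    """Translate a '/pick <prompt>' message into a select() call and a text reply."""
    prefix = "/pick "
    if not message.startswith(prefix):
        return "usage: /pick <prompt>"
    prompt = message[len(prefix):].strip()
    if not prompt:
        return "usage: /pick <prompt>"
    result = director.select(prompt)
    cost = result.estimated_cost_usd[result.selected_model]
    return f"{result.selected_model} (~${cost:.4f}): {result.reason}"


# Stubs used only to exercise the adapter; a real adapter would pass a ModelDirector.
@dataclass
class FakeResult:
    selected_model: str = "gpt5mini"
    reason: str = "stubbed decision"
    estimated_cost_usd: dict = field(default_factory=lambda: {"gpt5mini": 0.0001})


class FakeDirector:
    def select(self, prompt: str) -> FakeResult:
        return FakeResult()
```

The adapter owns only input parsing and output formatting; the decision itself stays in the engine.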

Design principles

Stateless — no database, no storage, no learning, no telemetry. Every request is independent.
Model-agnostic — never hardcodes model names. You define the candidate set; ModelDirector works with any provider LiteLLM supports (OpenAI, Anthropic, Ollama, vLLM, Bedrock, Azure, etc.).
Cost-transparent — cost is a first-class field, estimated_cost_usd is in every response. The trade-off is visible, not hidden.
Configuration first — every threshold, every policy, every model is config. No magic defaults that hide the cost trade-off.
Auditable — the response includes per-model scores, per-model cost, the reason, and the policy that was applied. You always know why a model was picked.

License

MIT. See LICENSE.

Contributing

PRs welcome. Bug reports and feature requests go in GitHub Issues.

For substantial changes, open an issue first to discuss the design — this project prizes a small, stable API surface.
