release: v0.10.0 — per-stage model overrides (plan / synth / critic) by askalf · Pull Request #31 · askalf/deepdive

askalf · 2026-05-18T13:59:13Z

What does this PR do?

Lets you pick a different model for each pipeline stage — typically cheap (haiku) for plan + critic, expensive (sonnet/opus) for synth. The pipeline has been tagging LLM calls with `phase: "plan" | "synth" | "critique"` since the v0.6.0 cost telemetry landed; v0.10.0 lets you actually act on that tag.

Cost win

A `--deep` run typically fires 1× plan + 3× synth + 3× critique = 7 LLM calls. Plan and critique are structurally simple (decompose / review); synth is where quality matters. Moving plan + critique to haiku while keeping synth on sonnet/opus gets ~60-70% cost cut at synth-quality parity.

```bash
deepdive "what changed in the python 3.13 GIL?" --deep
--model=claude-sonnet-4-6
--plan-model=claude-haiku-4-5
--critic-model=claude-haiku-4-5

cost · ~$0.034 · 12.1k in / 4.2k out · 7 LLM calls · multi-model

· ~$0.003 · 800 in / 200 out · 1 LLM call · claude-haiku-4-5

· ~$0.030 · 10.7k in / 3.85k out · 3 LLM calls · claude-sonnet-4-6

· ~$0.001 · 600 in / 150 out · 3 LLM calls · claude-haiku-4-5

```

Surface

Flag	Env	Fallback
`--plan-model`	`DEEPDIVE_PLAN_MODEL`	`--model` / `DEEPDIVE_MODEL` / `claude-sonnet-4-6`
`--synth-model`	`DEEPDIVE_SYNTH_MODEL`	same
`--critic-model`	`DEEPDIVE_CRITIC_MODEL`	same

Resolution precedence: per-stage flag > per-stage env > base flag > base env > default.

Internal

`src/llm.ts`: new pure `withModel(base, model)` helper
`src/pricing.ts`: `estimateCostMultiModel(usageByModel, env)` + `MultiModelCostEstimate` (strict superset of `CostEstimate`)
`src/agent.ts`: per-stage `LLMConfig` map built at run start; per-model usage accumulator; `llm.call` event union gains a `model: string` field
`src/cli.ts`: `renderMultiModelCostSummary` — single-model output byte-identical to pre-v0.10.0; multi-model adds an aggregate line + per-model breakdown lines
Library consumers reading `AgentResult.cost.amountUsd` etc. keep working unchanged; the new `byModel` field is additive

Tests

+18 new across:

`test/config.test.mjs` (7): per-stage defaults, individual flag overrides, env vars, precedence ladder, all-three combo, base-env fallthrough
`test/pricing.test.mjs` (7): empty input, single-model equivalence to `estimateCost`, two-known sum, stable ordering, unknown-model fall-through, zero-bucket skip, env override semantics (`knownModel` stays false for env-priced unknowns — correct CLI semantic for `~$X` vs `$?` display)
`test/parse-args.test.mjs` (4): each flag captured, all-three plus base coexist

396/396 default suite green (378 baseline + 18 new).

How to test

```bash
git fetch origin feat/v0.10.0-per-stage-models
git checkout feat/v0.10.0-per-stage-models
npm run build && npm test # 396/396

Smoke test against dario (or any anthropic-compat endpoint):

deepdive "what is rust's borrow checker"
--model=claude-sonnet-4-6
--plan-model=claude-haiku-4-5
--critic-model=claude-haiku-4-5
--deep=2

verify the cost line lists per-model breakdown

```

The pipeline has three LLM stages with very different cost/quality trade-offs: - plan: structurally simple (decomposes question → sub-queries) - synth: where quality actually matters (writes the cited answer) - critic: structurally simple (--deep only — reviews drafts) Pre-v0.10.0 used one model for all three, so a --deep run paid opus prices on plan and critic calls that would be quality- neutral on haiku. Typical ~60-70% cost cut at synth-quality parity. CLI: --plan-model, --synth-model, --critic-model flags Env: DEEPDIVE_PLAN_MODEL, DEEPDIVE_SYNTH_MODEL, DEEPDIVE_CRITIC_MODEL Each falls back to --model / DEEPDIVE_MODEL / default. Internal: - src/llm.ts gains pure helper withModel(base, model) - src/pricing.ts gains estimateCostMultiModel(usageByModel, env) + MultiModelCostEstimate (strict superset of CostEstimate) - src/agent.ts replaces single llmTotals with llmTotalsByModel; builds per-stage LLMConfig at run start; threads the right one to planQueries / synthesize / critique - src/cli.ts gains renderMultiModelCostSummary; single-model output is byte-identical to pre-v0.10.0 - llm.call event union gains a `model: string` field Tests: +18 covering precedence ladder (per-stage flag > per- stage env > base flag > base env > default), multi-model cost sum, stable ordering, env-priced unknown models (knownModel correctly stays false — semantic preserved). 396/396 green (378 baseline + 18 new).

askalf merged commit ee83840 into master May 18, 2026
5 checks passed

askalf deleted the feat/v0.10.0-per-stage-models branch May 18, 2026 14:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

release: v0.10.0 — per-stage model overrides (plan / synth / critic)#31

release: v0.10.0 — per-stage model overrides (plan / synth / critic)#31
askalf merged 1 commit into
masterfrom
feat/v0.10.0-per-stage-models

askalf commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

askalf commented May 18, 2026

What does this PR do?

Cost win

cost · ~$0.034 · 12.1k in / 4.2k out · 7 LLM calls · multi-model

· ~$0.003 · 800 in / 200 out · 1 LLM call · claude-haiku-4-5

· ~$0.030 · 10.7k in / 3.85k out · 3 LLM calls · claude-sonnet-4-6

· ~$0.001 · 600 in / 150 out · 3 LLM calls · claude-haiku-4-5

Surface

Internal

Tests

How to test

Smoke test against dario (or any anthropic-compat endpoint):

verify the cost line lists per-model breakdown

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant