release: v0.10.0 — per-stage model overrides (plan / synth / critic)#31
Merged
Conversation
The pipeline has three LLM stages with very different cost/quality trade-offs: - plan: structurally simple (decomposes question → sub-queries) - synth: where quality actually matters (writes the cited answer) - critic: structurally simple (--deep only — reviews drafts) Pre-v0.10.0 used one model for all three, so a --deep run paid opus prices on plan and critic calls that would be quality- neutral on haiku. Typical ~60-70% cost cut at synth-quality parity. CLI: --plan-model, --synth-model, --critic-model flags Env: DEEPDIVE_PLAN_MODEL, DEEPDIVE_SYNTH_MODEL, DEEPDIVE_CRITIC_MODEL Each falls back to --model / DEEPDIVE_MODEL / default. Internal: - src/llm.ts gains pure helper withModel(base, model) - src/pricing.ts gains estimateCostMultiModel(usageByModel, env) + MultiModelCostEstimate (strict superset of CostEstimate) - src/agent.ts replaces single llmTotals with llmTotalsByModel; builds per-stage LLMConfig at run start; threads the right one to planQueries / synthesize / critique - src/cli.ts gains renderMultiModelCostSummary; single-model output is byte-identical to pre-v0.10.0 - llm.call event union gains a `model: string` field Tests: +18 covering precedence ladder (per-stage flag > per- stage env > base flag > base env > default), multi-model cost sum, stable ordering, env-priced unknown models (knownModel correctly stays false — semantic preserved). 396/396 green (378 baseline + 18 new).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Lets you pick a different model for each pipeline stage — typically cheap (haiku) for plan + critic, expensive (sonnet/opus) for synth. The pipeline has been tagging LLM calls with `phase: "plan" | "synth" | "critique"` since the v0.6.0 cost telemetry landed; v0.10.0 lets you actually act on that tag.
Cost win
A `--deep` run typically fires 1× plan + 3× synth + 3× critique = 7 LLM calls. Plan and critique are structurally simple (decompose / review); synth is where quality matters. Moving plan + critique to haiku while keeping synth on sonnet/opus gets ~60-70% cost cut at synth-quality parity.
```bash
deepdive "what changed in the python 3.13 GIL?" --deep
--model=claude-sonnet-4-6
--plan-model=claude-haiku-4-5
--critic-model=claude-haiku-4-5
cost · ~$0.034 · 12.1k in / 4.2k out · 7 LLM calls · multi-model
· ~$0.003 · 800 in / 200 out · 1 LLM call · claude-haiku-4-5
· ~$0.030 · 10.7k in / 3.85k out · 3 LLM calls · claude-sonnet-4-6
· ~$0.001 · 600 in / 150 out · 3 LLM calls · claude-haiku-4-5
```
Surface
Resolution precedence: per-stage flag > per-stage env > base flag > base env > default.
Internal
Tests
+18 new across:
396/396 default suite green (378 baseline + 18 new).
How to test
```bash
git fetch origin feat/v0.10.0-per-stage-models
git checkout feat/v0.10.0-per-stage-models
npm run build && npm test # 396/396
Smoke test against dario (or any anthropic-compat endpoint):
deepdive "what is rust's borrow checker"
--model=claude-sonnet-4-6
--plan-model=claude-haiku-4-5
--critic-model=claude-haiku-4-5
--deep=2
verify the cost line lists per-model breakdown
```