release: v0.10.0 — per-stage model overrides (plan / synth / critic)#31

Merged: askalf merged 1 commit into master from feat/v0.10.0-per-stage-models on May 18, 2026

@askalf askalf commented May 18, 2026

What does this PR do?

Lets you pick a different model for each pipeline stage — typically cheap (haiku) for plan + critic, expensive (sonnet/opus) for synth. The pipeline has been tagging LLM calls with `phase: "plan" | "synth" | "critique"` since the v0.6.0 cost telemetry landed; v0.10.0 lets you actually act on that tag.

Cost win

A `--deep` run typically fires 1× plan + 3× synth + 3× critique = 7 LLM calls. Plan and critique are structurally simple (decompose / review); synth is where quality matters. Moving plan + critique to haiku while keeping synth on sonnet/opus typically cuts cost by ~60-70% at synth-quality parity.

```bash
deepdive "what changed in the python 3.13 GIL?" --deep \
  --model=claude-sonnet-4-6 \
  --plan-model=claude-haiku-4-5 \
  --critic-model=claude-haiku-4-5

# cost · ~$0.034 · 12.1k in / 4.2k out · 7 LLM calls · multi-model
#   · ~$0.003 · 800 in / 200 out · 1 LLM call · claude-haiku-4-5
#   · ~$0.030 · 10.7k in / 3.85k out · 3 LLM calls · claude-sonnet-4-6
#   · ~$0.001 · 600 in / 150 out · 3 LLM calls · claude-haiku-4-5
```

Surface

| Flag | Env | Fallback |
| --- | --- | --- |
| `--plan-model` | `DEEPDIVE_PLAN_MODEL` | `--model` / `DEEPDIVE_MODEL` / `claude-sonnet-4-6` |
| `--synth-model` | `DEEPDIVE_SYNTH_MODEL` | same |
| `--critic-model` | `DEEPDIVE_CRITIC_MODEL` | same |

Resolution precedence: per-stage flag > per-stage env > base flag > base env > default.
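
As a rough illustration of that precedence ladder, here is a minimal sketch. The names `ModelFlags`, `firstSet`, and `resolveStageModels` are hypothetical and are not taken from the PR; only the flag names, env var names, and the default model come from the table above. Treating an empty string as "unset" is also an assumption.

```typescript
// Hypothetical sketch of the precedence ladder; not the project's actual code.
type Stage = "plan" | "synth" | "critic";
type Env = Record<string, string | undefined>;

// Parsed CLI flags (hypothetical shape): --model, --plan-model, etc.
interface ModelFlags {
  model?: string;
  planModel?: string;
  synthModel?: string;
  criticModel?: string;
}

const DEFAULT_MODEL = "claude-sonnet-4-6";

// Returns the first candidate that is defined and non-empty.
// Assumption: an empty string counts as "not set".
function firstSet(...candidates: Array<string | undefined>): string | undefined {
  return candidates.find((c) => c !== undefined && c !== "");
}

function resolveStageModels(flags: ModelFlags, env: Env): Record<Stage, string> {
  // base flag > base env > default
  const base = firstSet(flags.model, env["DEEPDIVE_MODEL"]) ?? DEFAULT_MODEL;
  // per-stage flag > per-stage env > base
  return {
    plan: firstSet(flags.planModel, env["DEEPDIVE_PLAN_MODEL"]) ?? base,
    synth: firstSet(flags.synthModel, env["DEEPDIVE_SYNTH_MODEL"]) ?? base,
    critic: firstSet(flags.criticModel, env["DEEPDIVE_CRITIC_MODEL"]) ?? base,
  };
}
```

Resolving the base model once and falling back to it per stage keeps the ladder to two levels per lookup, which is one way to make the five-step precedence easy to test.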

Internal

  • `src/llm.ts`: new pure `withModel(base, model)` helper
  • `src/pricing.ts`: `estimateCostMultiModel(usageByModel, env)` + `MultiModelCostEstimate` (strict superset of `CostEstimate`)
  • `src/agent.ts`: per-stage `LLMConfig` map built at run start; per-model usage accumulator; `llm.call` event union gains a `model: string` field
  • `src/cli.ts`: `renderMultiModelCostSummary` — single-model output byte-identical to pre-v0.10.0; multi-model adds an aggregate line + per-model breakdown lines
  • Library consumers reading `AgentResult.cost.amountUsd` etc. keep working unchanged; the new `byModel` field is additive
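
To make the `withModel` helper and the per-model usage accumulator concrete, here is a small sketch under stated assumptions. The `LLMConfig` fields other than `model`, the `Usage` shape, and `recordUsage` are hypothetical; the PR only states that `withModel(base, model)` is pure and that usage is accumulated per model.

```typescript
// Hypothetical config shape; only the `model` field is implied by the PR.
interface LLMConfig {
  baseUrl: string;
  apiKey: string;
  model: string;
}

// Pure helper: returns a copy of `base` with a different model, leaving `base` untouched.
function withModel(base: LLMConfig, model: string): LLMConfig {
  return { ...base, model };
}

// Hypothetical per-model usage bucket.
interface Usage {
  inputTokens: number;
  outputTokens: number;
  calls: number;
}

// Adds one LLM call's token counts to the bucket for `model`, creating it on first use.
function recordUsage(
  totals: Map<string, Usage>,
  model: string,
  inputTokens: number,
  outputTokens: number,
): void {
  const bucket = totals.get(model) ?? { inputTokens: 0, outputTokens: 0, calls: 0 };
  bucket.inputTokens += inputTokens;
  bucket.outputTokens += outputTokens;
  bucket.calls += 1;
  totals.set(model, bucket);
}
```

Keying the accumulator by model name rather than by stage means two stages that share a model land in the same bucket, which matches the idea of a per-model breakdown.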

Tests

+18 new across:

  • `test/config.test.mjs` (7): per-stage defaults, individual flag overrides, env vars, precedence ladder, all-three combo, base-env fallthrough
  • `test/pricing.test.mjs` (7): empty input, single-model equivalence to `estimateCost`, two-known sum, stable ordering, unknown-model fall-through, zero-bucket skip, env override semantics (`knownModel` stays false for env-priced unknowns — correct CLI semantic for `~$X` vs `$?` display)
  • `test/parse-args.test.mjs` (4): each flag captured, all-three plus base coexist

396/396 default suite green (378 baseline + 18 new).
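
The pricing behaviours those tests describe (per-model sum, stable ordering, zero-bucket skip, `knownModel` staying false for env-priced unknowns) could look roughly like the sketch below. Everything here is an assumption for illustration: the function name `estimateCostMultiModelSketch`, the `Price` shape, the `fallbackPrice` parameter standing in for the env override, and the placeholder model names and prices, which are not real rates. Only `amountUsd`, `knownModel`, and `byModel` are names taken from the PR.

```typescript
interface Price { inputPerMTok: number; outputPerMTok: number; }
interface Usage { inputTokens: number; outputTokens: number; }
interface CostEstimate { amountUsd: number; knownModel: boolean; }
interface MultiModelCostEstimate extends CostEstimate {
  byModel: Array<CostEstimate & { model: string }>;
}

// Placeholder price table: made-up models and made-up prices, for illustration only.
const PRICE_TABLE = new Map<string, Price>([
  ["model-a", { inputPerMTok: 1, outputPerMTok: 5 }],
  ["model-b", { inputPerMTok: 3, outputPerMTok: 15 }],
]);

function estimateCostMultiModelSketch(
  usageByModel: Record<string, Usage>,
  fallbackPrice?: Price, // stands in for an env-supplied price (assumption)
): MultiModelCostEstimate {
  const byModel = Object.entries(usageByModel)
    // Stable ordering: sort by model name regardless of insertion order.
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    // Zero-bucket skip: drop models that used no tokens.
    .filter(([, u]) => u.inputTokens + u.outputTokens > 0)
    .map(([model, u]) => {
      const tablePrice = PRICE_TABLE.get(model);
      const price = tablePrice ?? fallbackPrice;
      const amountUsd = price
        ? (u.inputTokens * price.inputPerMTok + u.outputTokens * price.outputPerMTok) / 1e6
        : 0;
      // knownModel reflects the built-in table only, so a fallback-priced model stays false.
      return { model, amountUsd, knownModel: tablePrice !== undefined };
    });
  return {
    amountUsd: byModel.reduce((sum, m) => sum + m.amountUsd, 0),
    // Vacuously true for empty input in this sketch; the real behaviour may differ.
    knownModel: byModel.every((m) => m.knownModel),
    byModel,
  };
}
```

Having the multi-model result extend the single-model shape is one way to get the "strict superset" property: code that only reads `amountUsd` keeps working without knowing `byModel` exists.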

How to test

```bash
git fetch origin feat/v0.10.0-per-stage-models
git checkout feat/v0.10.0-per-stage-models
npm run build && npm test # 396/396

# Smoke test against dario (or any Anthropic-compatible endpoint):
deepdive "what is rust's borrow checker" \
  --model=claude-sonnet-4-6 \
  --plan-model=claude-haiku-4-5 \
  --critic-model=claude-haiku-4-5 \
  --deep=2

# Verify that the cost line lists a per-model breakdown
```

The pipeline has three LLM stages with very different cost/quality
trade-offs:
- plan: structurally simple (decomposes question → sub-queries)
- synth: where quality actually matters (writes the cited answer)
- critic: structurally simple (--deep only — reviews drafts)

Pre-v0.10.0 used one model for all three, so a --deep run paid
opus prices on plan and critic calls that would be
quality-neutral on haiku. Splitting them typically cuts cost by
~60-70% at synth-quality parity.

CLI: --plan-model, --synth-model, --critic-model flags
Env:  DEEPDIVE_PLAN_MODEL, DEEPDIVE_SYNTH_MODEL, DEEPDIVE_CRITIC_MODEL
Each falls back to --model / DEEPDIVE_MODEL / default.

Internal:
- src/llm.ts gains pure helper withModel(base, model)
- src/pricing.ts gains estimateCostMultiModel(usageByModel, env)
  + MultiModelCostEstimate (strict superset of CostEstimate)
- src/agent.ts replaces single llmTotals with llmTotalsByModel;
  builds per-stage LLMConfig at run start; threads the right one
  to planQueries / synthesize / critique
- src/cli.ts gains renderMultiModelCostSummary; single-model
  output is byte-identical to pre-v0.10.0
- llm.call event union gains a `model: string` field

Tests: +18 covering precedence ladder (per-stage flag > per-
stage env > base flag > base env > default), multi-model cost
sum, stable ordering, env-priced unknown models (knownModel
correctly stays false — semantic preserved). 396/396 green
(378 baseline + 18 new).
@askalf askalf merged commit ee83840 into master May 18, 2026
5 checks passed
@askalf askalf deleted the feat/v0.10.0-per-stage-models branch May 18, 2026 14:05