Reasoning replay is now a measured, bounded policy. Reasoning-capable backends return hidden reasoning alongside tool calls, and forge previously re-serialized all of it into backend-facing history on every later turn. The new reasoning_replay knob bounds that — and after a full re-sweep of the published eval grid showed that dropping replayed reasoning is quality-free and token-cheaper, the default is none. The release also re-baselines the Claude eval tier with extended thinking enabled and adds Anthropic prompt caching with cache-aware cost accounting.
Added
reasoning_replay {full, keep-last, none}onWorkflowRunner(reasoning_replay=…)and the proxy (--reasoning-replay).fullreplays every captured reasoning block (the historical behavior),keep-lastonly the most recent,nonekeeps reasoning out of backend-facing history entirely. Serialization-only: reasoning is still captured and still surfaces inon_messageand internal history. In OpenAI-compatible proxy responses,keep-lastexposes current reasoning asreasoning_contentrather than assistantcontent, so clients that preserve reasoning fields can replay just the latest block. See ADR-017.- Reasoning-replay eval grid (
eval_results_v0.7.5.jsonl, a new eval generation): the full 8–14B lineup re-swept across all three policies × both ablations × native/prompt — ~170k runs. The policy is part of the eval resume key and a first-class report/dashboard dimension: row labels carry:keep-last/:fulltags (untagged =none), the dashboard gains a Reasoning Replay filter, the report a--reasoning-replayfilter, and a dedicated reasoning-replay view compares policies per config. A wire-level counter (reasoning_wire) validates each policy's on-wire behavior (none→ exactly 0 replayed reasoning across every run). - Anthropic extended thinking —
AnthropicClient(thinking=…)— request-side extended-thinking config (e.g.{"type": "adaptive"}). When set, a forcedtool_choiceis suppressed (the API requiresautowith thinking on) andmax_tokensis raised to fit the thinking budget. The Claude eval baseline now runs Sonnet and Opus with adaptive thinking — all prior Claude rows had thinking off, the wrong baseline for a reasoning-flavored suite; Haiku does not support adaptive thinking and stays non-thinking. - Anthropic prompt caching —
AnthropicClient(prompt_caching=True)— marks a static ephemeral cache breakpoint over the tool definitions + system prompt (byte-identical every turn, so it read-hits from turn 2 onward instead of re-billing the re-sent schema).TokenUsagegains genericcache_creation_input_tokens/cache_read_input_tokenscounters, and eval cost accounting prices cache writes (1.25×) and reads (0.1×) at their actual rates.
Changed
- Captured reasoning is no longer replayed to the backend by default. Pre-0.7.5 behavior replayed every captured reasoning block (equivalent to
reasoning_replay="full"); the default is now"none". On the published eval suite,noneis statistically indistinguishable from replay-all in aggregate while saving the replayed tokens every turn; no per-config regression survives multiple-comparison correction (closest: a mild raw drop on Ministral-3 14B Reasoning Q4, wherenoneandkeep-lastare indistinguishable from each other). The knob is inert for models that emit no reasoning. Migration:--reasoning-replay full(proxy) orWorkflowRunner(reasoning_replay="full")restores the historical behavior. Anthropic-protocol proxy responses emit reasoning text only underfull— forge does not synthesize signed Anthropic thinking blocks.