Skip to content

v0.7.5: reasoning replay is a bounded policy, default none

Latest

Choose a tag to compare

@antoinezambelli antoinezambelli released this 12 Jun 01:47
· 1 commit to main since this release
5d01dfd

Reasoning replay is now a measured, bounded policy. Reasoning-capable backends return hidden reasoning alongside tool calls, and forge previously re-serialized all of it into backend-facing history on every later turn. The new reasoning_replay knob bounds that — and after a full re-sweep of the published eval grid showed that dropping replayed reasoning is quality-free and token-cheaper, the default is none. The release also re-baselines the Claude eval tier with extended thinking enabled and adds Anthropic prompt caching with cache-aware cost accounting.

Added

  • reasoning_replay {full, keep-last, none} on WorkflowRunner(reasoning_replay=…) and the proxy (--reasoning-replay). full replays every captured reasoning block (the historical behavior), keep-last only the most recent, none keeps reasoning out of backend-facing history entirely. Serialization-only: reasoning is still captured and still surfaces in on_message and internal history. In OpenAI-compatible proxy responses, keep-last exposes current reasoning as reasoning_content rather than assistant content, so clients that preserve reasoning fields can replay just the latest block. See ADR-017.
  • Reasoning-replay eval grid (eval_results_v0.7.5.jsonl, a new eval generation): the full 8–14B lineup re-swept across all three policies × both ablations × native/prompt — ~170k runs. The policy is part of the eval resume key and a first-class report/dashboard dimension: row labels carry :keep-last / :full tags (untagged = none), the dashboard gains a Reasoning Replay filter, the report a --reasoning-replay filter, and a dedicated reasoning-replay view compares policies per config. A wire-level counter (reasoning_wire) validates each policy's on-wire behavior (none → exactly 0 replayed reasoning across every run).
  • Anthropic extended thinking — AnthropicClient(thinking=…) — request-side extended-thinking config (e.g. {"type": "adaptive"}). When set, a forced tool_choice is suppressed (the API requires auto with thinking on) and max_tokens is raised to fit the thinking budget. The Claude eval baseline now runs Sonnet and Opus with adaptive thinking — all prior Claude rows had thinking off, the wrong baseline for a reasoning-flavored suite; Haiku does not support adaptive thinking and stays non-thinking.
  • Anthropic prompt caching — AnthropicClient(prompt_caching=True) — marks a static ephemeral cache breakpoint over the tool definitions + system prompt (byte-identical every turn, so it read-hits from turn 2 onward instead of re-billing the re-sent schema). TokenUsage gains generic cache_creation_input_tokens / cache_read_input_tokens counters, and eval cost accounting prices cache writes (1.25×) and reads (0.1×) at their actual rates.

Changed

  • Captured reasoning is no longer replayed to the backend by default. Pre-0.7.5 behavior replayed every captured reasoning block (equivalent to reasoning_replay="full"); the default is now "none". On the published eval suite, none is statistically indistinguishable from replay-all in aggregate while saving the replayed tokens every turn; no per-config regression survives multiple-comparison correction (closest: a mild raw drop on Ministral-3 14B Reasoning Q4, where none and keep-last are indistinguishable from each other). The knob is inert for models that emit no reasoning. Migration: --reasoning-replay full (proxy) or WorkflowRunner(reasoning_replay="full") restores the historical behavior. Anthropic-protocol proxy responses emit reasoning text only under full — forge does not synthesize signed Anthropic thinking blocks.