Expose extended thinking as a per-request flag #18
Adds `useThinking: bool = false` to `ExplainRequest`. When true the explainer enables adaptive extended thinking and bumps `max_tokens` to the documented floor; otherwise behaviour is identical to today.

Why opt-in rather than default-on: thinking measurably improves correctness (errors 5 -> 0 across the medium-sized cases in our 21-case suite, eliminates the recurring imul-eax-edi-edi invention) but pushes the largest queries past the 30s API Gateway v2 / Lambda timeout — about 3/16 of our reasonably-sized cases would 504. Per-request opt-in lets the CE frontend choose where the quality win is worth the latency, with no risk of timing out users who don't ask.

Implementation:

- `ExplainRequest` gains a `useThinking` field (camelCase to match the rest of the API; default false; sketched below).
- New `Prompt.generate_messages_for_request()` is the single entry point that applies per-request overrides on top of the YAML-loaded defaults. Currently the only override is thinking, but it's the natural seam for future per-request knobs.
- Both `_call_anthropic_api` and `generate_cache_key` route through the new method, so cache keys split on `useThinking` and the API call always sees the same payload that was hashed.
- CLAUDE.md updated to document the opt-in pattern, the 30s timeout ceiling, and the longer-term architectural fixes if we ever want default-on (Lambda Function URL response streaming or async pattern).
- README documents the new request field.

Tests: 96 -> 98 passing. New tests cover (a) `useThinking=True` overrides prompt config and bumps `max_tokens`, (b) cache keys split on the flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
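A minimal sketch of the request-model change, assuming `ExplainRequest` is a Pydantic model; the `asm` field is a placeholder for whatever fields the request already carries:

```python
from pydantic import BaseModel, Field


class ExplainRequest(BaseModel):
    asm: str  # placeholder for the existing request fields
    useThinking: bool = Field(
        default=False,
        description=(
            "Opt in to adaptive extended thinking for this request: "
            "higher accuracy at the cost of latency."
        ),
    )
```

Because the default is `False`, clients that never send the field get today's behaviour unchanged.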
Findings from a follow-up review:

- Rename `Prompt.generate_messages_for_request` -> `build_api_payload`. The old name read as "generate messages, given a request", which is what `generate_messages` already does. The actual purpose is to apply per-request runtime overrides on top of the YAML defaults.
- Always return a fresh dict from `build_api_payload` so callers can mutate without affecting prompt internals or each other. The previous shape returned the original dict in the no-override path and a shallow copy in the override path — a foot-gun in waiting.
- Move `test_use_thinking_changes_cache_key` from `test_explain.py` to `test_cache.py`, next to its sibling `test_generate_cache_key_bypass_cache_ignored`. Both tests answer "what affects the cache key" and belong together.
- Add `test_generate_cache_key_use_thinking_default_matches_omitted` (sketched below) to lock in the invariant the PR depends on: a request with `useThinking` absent (Pydantic default) must produce the same key as one with `useThinking=False` explicit. Without this, a future refactor could silently invalidate the existing S3 cache.
- Use the `MIN_MAX_TOKENS_WITH_THINKING` constant in the test rather than hard-coding `4096`, so a future bump to the floor doesn't silently leave the test behind.

Tests: 98 -> 99 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
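A sketch of that invariant test; the import paths follow the per-file list in the review below, while the `asm` field name and the `generate_cache_key` signature are assumptions:

```python
from app.cache import generate_cache_key
from app.explain_api import ExplainRequest


def test_generate_cache_key_use_thinking_default_matches_omitted():
    omitted = ExplainRequest(asm="imul eax, edi, edi")
    explicit = ExplainRequest(asm="imul eax, edi, edi", useThinking=False)
    # The Pydantic default must hash identically to an explicit False;
    # otherwise deploying this change would invalidate the existing S3 cache.
    assert generate_cache_key(omitted) == generate_cache_key(explicit)
```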
Pull request overview
Adds a per-request opt-in flag to enable extended “thinking” for higher accuracy while keeping the default behavior/latency unchanged for normal interactive requests.
Changes:
- Extend `ExplainRequest` with `useThinking: bool = false` and document the new field in README and CLAUDE.md.
- Introduce `Prompt.build_api_payload()` to apply per-request overrides (enable adaptive thinking + enforce a `max_tokens` floor) and use it in both the Anthropic API call path and cache-key generation.
- Add/extend tests to ensure `useThinking` changes API kwargs appropriately and splits cache keys.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| README.md | Documents `useThinking` in the request example. |
| CLAUDE.md | Documents the opt-in thinking pattern and the 30s timeout constraint. |
| app/explain_api.py | Adds `useThinking` to `ExplainRequest` with a detailed description. |
| app/prompt.py | Adds `build_api_payload()` to apply per-request thinking overrides and the `max_tokens` floor. |
| app/explain.py | Switches API call prep to `build_api_payload()` and logs whether thinking is enabled. |
| app/cache.py | Generates cache keys from `build_api_payload()` so thinking on/off requests don't collide. |
| app/test_explain.py | Adds test ensuring `useThinking=True` enables thinking, omits `temperature`, and bumps `max_tokens`. |
| app/test_cache.py | Adds tests validating the cache key splits on `useThinking`, and default-vs-omitted equivalence. |
| """Build the API call payload, applying per-request runtime overrides. | ||
|
|
||
| This is the entry point used by both the explain handler and the | ||
| cache-key generator so they stay in sync. Currently the only | ||
| override is ``useThinking``: when set, enable adaptive extended | ||
| thinking and bump ``max_tokens`` to satisfy the floor. | ||
|
|
||
| Always returns a fresh dict so callers can mutate without affecting | ||
| prompt internals or each other. | ||
| """ | ||
| prompt_data = dict(self.generate_messages(request)) | ||
| if request.useThinking: | ||
| prompt_data["thinking"] = {"type": "adaptive"} | ||
| prompt_data["max_tokens"] = max(prompt_data["max_tokens"], MIN_MAX_TOKENS_WITH_THINKING) |
Sharp catch — fixed in 98154cf. Refactored `build_api_payload` to return exactly the kwargs `messages.create` will receive (thinking-vs-temperature exclusivity resolved inside, not by callers). Both consumers now become trivial: `_call_anthropic_api` does `await client.messages.create(**payload)` and `generate_cache_key` does `{**payload, "prompt_version": ...}`. Net -7 lines and the drift trap is gone for any future per-request overrides.
Per Copilot review: previously `build_api_payload()` returned a dict
that included both `thinking` and `temperature` when useThinking was
true, and `_call_anthropic_api()` had to re-derive the actual API
kwargs (dropping `temperature` because the API rejects it alongside
thinking). The cache key meanwhile hashed the un-derived dict — so
the cache key included `temperature` for requests that wouldn't
actually send it. Today this happened to be deterministic and
internally consistent, but it's drift-in-waiting: any future
per-request override with exclusivity rules would face the same trap.
Refactor so `build_api_payload()` returns exactly the kwargs
`messages.create` will receive, including the thinking-vs-temperature
exclusivity. Then both call sites become trivial:
- `_call_anthropic_api`: `await client.messages.create(**payload)`
- `generate_cache_key`: `{**payload, "prompt_version": ...}`
Net diff: -7 lines, fewer code paths to keep aligned.
Cache key shape changes (no `temperature` field when thinking is on,
no null `thinking` field when off), which invalidates existing S3
cache entries — acceptable per maintainer's stated tolerance for
cache flushes when the alternative is uglier code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
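A sketch of the post-refactor shape: the `{"type": "adaptive"}` payload and the floor constant come from the hunk quoted above, while the method context and the shape returned by `generate_messages` (including a `temperature` key) are assumptions inferred from the commit message:

```python
def build_api_payload(self, request: ExplainRequest) -> dict:
    """Return exactly the kwargs that messages.create will receive."""
    payload = dict(self.generate_messages(request))  # always a fresh dict
    if request.useThinking:
        # The API rejects temperature alongside thinking, so the
        # exclusivity is resolved here instead of at each call site.
        payload.pop("temperature", None)
        payload["thinking"] = {"type": "adaptive"}
        payload["max_tokens"] = max(
            payload["max_tokens"], MIN_MAX_TOKENS_WITH_THINKING
        )
    return payload
```

With this shape, `_call_anthropic_api` reduces to `await client.messages.create(**payload)` and `generate_cache_key` to hashing `{**payload, "prompt_version": ...}`, exactly as the commit describes.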
Summary
Adds `useThinking: bool = false` to `ExplainRequest`. When `true`, the explainer enables adaptive extended thinking and bumps `max_tokens` to the floor (`MIN_MAX_TOKENS_WITH_THINKING = 4096`); when `false` (or absent) the behaviour is identical to today.
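For illustration, a hypothetical client call; the endpoint URL and the `asm` field name are assumptions, since only `useThinking` itself is specified by this PR:

```python
import requests

resp = requests.post(
    "https://api.example.com/explain",  # placeholder endpoint
    json={
        "asm": "square(int):\n  imul eax, edi, edi\n  ret",  # assumed field
        "useThinking": True,  # opt in to extended thinking for this request
    },
    timeout=30,  # matches the API Gateway hard ceiling discussed below
)
resp.raise_for_status()
print(resp.json())
```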
Why opt-in rather than default-on

Earlier eval data: thinking eliminates the recurring factual errors (errors 5 → 0 across the medium-sized cases in the 21-case suite, including the persistent `imul eax, edi, edi` invention). But the 30s API Gateway v2 / Lambda timeout (infra/terraform/lambda_explain.tf:25) is a hard ceiling — HTTP API can't go beyond 30s. With thinking on:

- `complex_branch_001` (1.3K)
- `edge_long_asm_001` (3K)
- `complex_vectorization_001` (5.6K)
- `loop_experienced_assembly` (8K)

About 3/16 of "reasonably-sized" cases would hard-fail. Per-request opt-in lets the CE frontend pick where the quality win is worth the latency, without timing out users who don't ask.
Implementation
- `ExplainRequest.useThinking: bool = False` (camelCase, matches existing API).
- `Prompt.generate_messages_for_request()` is the single seam that applies per-request overrides on top of YAML-loaded defaults. Currently only handles `useThinking`, but is the natural place for future per-request knobs.
- `_call_anthropic_api` and `generate_cache_key` go through this method, so cache keys split on `useThinking` — on/off requests don't collide (see the cache-key sketch below).
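A hypothetical sketch of the cache-key side of that seam, using the post-rename `build_api_payload`; the hash scheme and function signature are assumptions, and only the payload-plus-`prompt_version` shape comes from the follow-up commit:

```python
import hashlib
import json


def generate_cache_key(prompt, request, prompt_version: str) -> str:
    # Hash the exact kwargs the API call will send, so thinking on/off
    # requests land in different cache entries by construction.
    payload = prompt.build_api_payload(request)
    hashable = {**payload, "prompt_version": prompt_version}
    return hashlib.sha256(
        json.dumps(hashable, sort_keys=True, default=str).encode()
    ).hexdigest()
```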
Test plan

- `uv run pytest` — 98 tests passing (96 → 98)
- `uv run pre-commit run --all-files`
- `test_use_thinking_overrides_prompt` (sketched below): `useThinking=True` enables adaptive thinking, drops `temperature`, bumps `max_tokens` ≥ 4096.
- `test_use_thinking_changes_cache_key`: on/off requests produce different cache keys.
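A sketch of the first test; `Prompt` construction and import locations are assumptions, while the assertions follow the override rules stated in the commits above:

```python
from app.explain_api import ExplainRequest
from app.prompt import MIN_MAX_TOKENS_WITH_THINKING, Prompt


def test_use_thinking_overrides_prompt():
    prompt = Prompt()  # assumed to load the YAML defaults
    payload = prompt.build_api_payload(
        ExplainRequest(asm="ret", useThinking=True)
    )

    assert payload["thinking"] == {"type": "adaptive"}
    assert "temperature" not in payload  # mutually exclusive with thinking
    # Compare against the constant, not a hard-coded 4096, so a future
    # floor bump keeps this test meaningful.
    assert payload["max_tokens"] >= MIN_MAX_TOKENS_WITH_THINKING
```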
Follow-ups (not in this PR)

🤖 Generated with Claude Code