Expose extended thinking as a per-request flag #18
Adds `useThinking: bool = false` to `ExplainRequest`. When true the explainer enables adaptive extended thinking and bumps `max_tokens` to the documented floor; otherwise behaviour is identical to today.

Why opt-in rather than default-on: thinking measurably improves correctness (errors 5 -> 0 across the medium-sized cases in our 21-case suite, eliminates the recurring imul-eax-edi-edi invention) but pushes the largest queries past the 30s API Gateway v2 / Lambda timeout — about 3/16 of our reasonably-sized cases would 504. Per-request opt-in lets the CE frontend choose where the quality win is worth the latency, with no risk of timing out users who don't ask.

Implementation:

- `ExplainRequest` gains a `useThinking` field (camelCase to match the rest of the API; default false; sketched below).
- New `Prompt.generate_messages_for_request()` is the single entry point that applies per-request overrides on top of the YAML-loaded defaults. Currently the only override is thinking, but it's the natural seam for future per-request knobs.
- Both `_call_anthropic_api` and `generate_cache_key` route through the new method, so cache keys split on `useThinking` and the API call always sees the same payload that was hashed.
- CLAUDE.md updated to document the opt-in pattern, the 30s timeout ceiling, and the longer-term architectural fixes if we ever want default-on (Lambda Function URL response streaming or async pattern).
- README documents the new request field.

Tests: 96 -> 98 passing. New tests cover (a) `useThinking=True` overrides prompt config and bumps `max_tokens`, (b) cache keys split on the flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
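A minimal sketch of the request-model change, assuming `ExplainRequest` is a Pydantic model; the `asm` field is a placeholder for whatever fields the request already carries:

```python
from pydantic import BaseModel, Field


class ExplainRequest(BaseModel):
    asm: str  # placeholder for the existing request fields
    useThinking: bool = Field(
        default=False,
        description=(
            "Opt in to adaptive extended thinking for this request: "
            "higher accuracy at the cost of latency."
        ),
    )
```

Because the default is `False`, clients that never send the field get today's behaviour unchanged.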
Findings from a follow-up review:

- Rename `Prompt.generate_messages_for_request` -> `build_api_payload`. The old name read as "generate messages, given a request", which is what `generate_messages` already does. The actual purpose is to apply per-request runtime overrides on top of the YAML defaults.
- Always return a fresh dict from `build_api_payload` so callers can mutate without affecting prompt internals or each other. The previous shape returned the original dict in the no-override path and a shallow copy in the override path — a foot-gun in waiting.
- Move `test_use_thinking_changes_cache_key` from `test_explain.py` to `test_cache.py`, next to its sibling `test_generate_cache_key_bypass_cache_ignored`. Both tests answer "what affects the cache key" and belong together.
- Add `test_generate_cache_key_use_thinking_default_matches_omitted` (sketched below) to lock in the invariant the PR depends on: a request with `useThinking` absent (Pydantic default) must produce the same key as one with `useThinking=False` explicit. Without this, a future refactor could silently invalidate the existing S3 cache.
- Use the `MIN_MAX_TOKENS_WITH_THINKING` constant in the test rather than hard-coding `4096`, so a future bump to the floor doesn't silently leave the test behind.

Tests: 98 -> 99 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
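A sketch of that invariant test; the import paths follow the per-file list in the review below, while the `asm` field name and the `generate_cache_key` signature are assumptions:

```python
from app.cache import generate_cache_key
from app.explain_api import ExplainRequest


def test_generate_cache_key_use_thinking_default_matches_omitted():
    omitted = ExplainRequest(asm="imul eax, edi, edi")
    explicit = ExplainRequest(asm="imul eax, edi, edi", useThinking=False)
    # The Pydantic default must hash identically to an explicit False;
    # otherwise deploying this change would invalidate the existing S3 cache.
    assert generate_cache_key(omitted) == generate_cache_key(explicit)
```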
Pull request overview
Adds a per-request opt-in flag to enable extended “thinking” for higher accuracy while keeping the default behavior/latency unchanged for normal interactive requests.
Changes:
- Extend `ExplainRequest` with `useThinking: bool = false` and document the new field in README and CLAUDE.md.
- Introduce `Prompt.build_api_payload()` to apply per-request overrides (enable adaptive thinking + enforce a `max_tokens` floor) and use it in both the Anthropic API call path and cache-key generation.
- Add/extend tests to ensure `useThinking` changes API kwargs appropriately and splits cache keys.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| README.md | Documents `useThinking` in the request example. |
| CLAUDE.md | Documents the opt-in thinking pattern and the 30s timeout constraint. |
| app/explain_api.py | Adds `useThinking` to `ExplainRequest` with a detailed description. |
| app/prompt.py | Adds `build_api_payload()` to apply per-request thinking overrides and the `max_tokens` floor. |
| app/explain.py | Switches API call prep to `build_api_payload()` and logs whether thinking is enabled. |
| app/cache.py | Generates cache keys from `build_api_payload()` so thinking on/off requests don't collide. |
| app/test_explain.py | Adds test ensuring `useThinking=True` enables thinking, omits `temperature`, and bumps `max_tokens`. |
| app/test_cache.py | Adds tests validating the cache key splits on `useThinking`, and default-vs-omitted equivalence. |
| """Build the API call payload, applying per-request runtime overrides. | ||
|
|
||
| This is the entry point used by both the explain handler and the | ||
| cache-key generator so they stay in sync. Currently the only | ||
| override is ``useThinking``: when set, enable adaptive extended | ||
| thinking and bump ``max_tokens`` to satisfy the floor. | ||
|
|
||
| Always returns a fresh dict so callers can mutate without affecting | ||
| prompt internals or each other. | ||
| """ | ||
| prompt_data = dict(self.generate_messages(request)) | ||
| if request.useThinking: | ||
| prompt_data["thinking"] = {"type": "adaptive"} | ||
| prompt_data["max_tokens"] = max(prompt_data["max_tokens"], MIN_MAX_TOKENS_WITH_THINKING) |
Sharp catch — fixed in 98154cf. Refactored `build_api_payload` to return exactly the kwargs `messages.create` will receive (thinking-vs-temperature exclusivity resolved inside, not by callers). Both consumers now become trivial: `_call_anthropic_api` does `await client.messages.create(**payload)` and `generate_cache_key` does `{**payload, "prompt_version": ...}`. Net -7 lines and the drift trap is gone for any future per-request overrides.
Per Copilot review: previously `build_api_payload()` returned a dict
that included both `thinking` and `temperature` when useThinking was
true, and `_call_anthropic_api()` had to re-derive the actual API
kwargs (dropping `temperature` because the API rejects it alongside
thinking). The cache key meanwhile hashed the un-derived dict — so
the cache key included `temperature` for requests that wouldn't
actually send it. Today this happened to be deterministic and
internally consistent, but it's drift-in-waiting: any future
per-request override with exclusivity rules would face the same trap.
Refactor so `build_api_payload()` returns exactly the kwargs
`messages.create` will receive, including the thinking-vs-temperature
exclusivity. Then both call sites become trivial:
- `_call_anthropic_api`: `await client.messages.create(**payload)`
- `generate_cache_key`: `{**payload, "prompt_version": ...}`
Net diff: -7 lines, fewer code paths to keep aligned.
Cache key shape changes (no `temperature` field when thinking is on,
no null `thinking` field when off), which invalidates existing S3
cache entries — acceptable per maintainer's stated tolerance for
cache flushes when the alternative is uglier code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
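A sketch of the post-refactor shape: the `{"type": "adaptive"}` payload and the floor constant come from the hunk quoted above, while the method context and the shape returned by `generate_messages` (including a `temperature` key) are assumptions inferred from the commit message:

```python
def build_api_payload(self, request: ExplainRequest) -> dict:
    """Return exactly the kwargs that messages.create will receive."""
    payload = dict(self.generate_messages(request))  # always a fresh dict
    if request.useThinking:
        # The API rejects temperature alongside thinking, so the
        # exclusivity is resolved here instead of at each call site.
        payload.pop("temperature", None)
        payload["thinking"] = {"type": "adaptive"}
        payload["max_tokens"] = max(
            payload["max_tokens"], MIN_MAX_TOKENS_WITH_THINKING
        )
    return payload
```

With this shape, `_call_anthropic_api` reduces to `await client.messages.create(**payload)` and `generate_cache_key` to hashing `{**payload, "prompt_version": ...}`, exactly as the commit describes.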
Summary
Adds `useThinking: bool = false` to `ExplainRequest`. When `true`, the explainer enables adaptive extended thinking and bumps `max_tokens` to the floor (`MIN_MAX_TOKENS_WITH_THINKING = 4096`); when `false` (or absent) the behaviour is identical to today.
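For illustration, a hypothetical client call; the endpoint URL and the `asm` field name are assumptions, since only `useThinking` itself is specified by this PR:

```python
import requests

resp = requests.post(
    "https://api.example.com/explain",  # placeholder endpoint
    json={
        "asm": "square(int):\n  imul eax, edi, edi\n  ret",  # assumed field
        "useThinking": True,  # opt in to extended thinking for this request
    },
    timeout=30,  # matches the API Gateway hard ceiling discussed below
)
resp.raise_for_status()
print(resp.json())
```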
Why opt-in rather than default-on

Earlier eval data: thinking eliminates the recurring factual errors (errors 5 → 0 across the medium-sized cases in the 21-case suite, including the persistent `imul eax, edi, edi` invention). But the 30s API Gateway v2 / Lambda timeout (infra/terraform/lambda_explain.tf:25) is a hard ceiling — HTTP API can't go beyond 30s. With thinking on:

- `complex_branch_001` (1.3K)
- `edge_long_asm_001` (3K)
- `complex_vectorization_001` (5.6K)
- `loop_experienced_assembly` (8K)

About 3/16 of "reasonably-sized" cases would hard-fail. Per-request opt-in lets the CE frontend pick where the quality win is worth the latency, without timing out users who don't ask.
Implementation
- `ExplainRequest.useThinking: bool = False` (camelCase, matches existing API).
- `Prompt.generate_messages_for_request()` is the single seam that applies per-request overrides on top of YAML-loaded defaults. Currently only handles `useThinking`, but is the natural place for future per-request knobs.
- `_call_anthropic_api` and `generate_cache_key` go through this method, so cache keys split on `useThinking` — on/off requests don't collide (see the cache-key sketch below).
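A hypothetical sketch of the cache-key side of that seam, using the post-rename `build_api_payload`; the hash scheme and function signature are assumptions, and only the payload-plus-`prompt_version` shape comes from the follow-up commit:

```python
import hashlib
import json


def generate_cache_key(prompt, request, prompt_version: str) -> str:
    # Hash the exact kwargs the API call will send, so thinking on/off
    # requests land in different cache entries by construction.
    payload = prompt.build_api_payload(request)
    hashable = {**payload, "prompt_version": prompt_version}
    return hashlib.sha256(
        json.dumps(hashable, sort_keys=True, default=str).encode()
    ).hexdigest()
```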
Test plan

- `uv run pytest` — 98 tests passing (96 → 98)
- `uv run pre-commit run --all-files`
- `test_use_thinking_overrides_prompt` (sketched below): `useThinking=True` enables adaptive thinking, drops `temperature`, bumps `max_tokens` ≥ 4096.
- `test_use_thinking_changes_cache_key`: on/off requests produce different cache keys.
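A sketch of the first test; `Prompt` construction and import locations are assumptions, while the assertions follow the override rules stated in the commits above:

```python
from app.explain_api import ExplainRequest
from app.prompt import MIN_MAX_TOKENS_WITH_THINKING, Prompt


def test_use_thinking_overrides_prompt():
    prompt = Prompt()  # assumed to load the YAML defaults
    payload = prompt.build_api_payload(
        ExplainRequest(asm="ret", useThinking=True)
    )

    assert payload["thinking"] == {"type": "adaptive"}
    assert "temperature" not in payload  # mutually exclusive with thinking
    # Compare against the constant, not a hard-coded 4096, so a future
    # floor bump keeps this test meaningful.
    assert payload["max_tokens"] >= MIN_MAX_TOKENS_WITH_THINKING
```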
Follow-ups (not in this PR)

🤖 Generated with Claude Code