
Expose extended thinking as a per-request flag#18

Merged
mattgodbolt merged 3 commits into main from expose-thinking-flag on May 6, 2026

Conversation

@mattgodbolt-molty
Contributor

Summary

Adds useThinking: bool = false to ExplainRequest. When true, the explainer enables adaptive extended thinking and bumps max_tokens to the floor (MIN_MAX_TOKENS_WITH_THINKING = 4096); when false (or absent) the behaviour is identical to today.
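The request shape could look roughly like this; a minimal sketch with a dataclass standing in for the real Pydantic model, and every field other than `useThinking` assumed for illustration:

```python
from dataclasses import dataclass


@dataclass
class ExplainRequest:
    code: str                  # assumed field; the real model has more
    useThinking: bool = False  # new opt-in flag; camelCase to match the API

# Omitted and explicit-False are indistinguishable, so default behaviour is unchanged.
assert ExplainRequest(code="x") == ExplainRequest(code="x", useThinking=False)
```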

Why opt-in rather than default-on

Earlier eval data: thinking eliminates the recurring factual errors (5 → 0 across the medium-sized cases in the 21-case suite, including the persistent imul eax, edi, edi invention). But the 30s API Gateway v2 / Lambda timeout (infra/terraform/lambda_explain.tf:25) is a hard ceiling — HTTP API can't go beyond 30s. With thinking on:

| Case | Latency | Verdict |
| --- | --- | --- |
| Small/medium (≤1.5K input) | +5-12s | fits |
| complex_branch_001 (1.3K) | 25s | barely fits |
| edge_long_asm_001 (3K) | 42s | 504 |
| complex_vectorization_001 (5.6K) | 66s | 504 |
| loop_experienced_assembly (8K) | 124s | 504 |

About 3/16 of "reasonably-sized" cases would hard-fail. Per-request opt-in lets the CE frontend pick where the quality win is worth the latency, without timing out users who don't ask.

Implementation

  • ExplainRequest.useThinking: bool = False (camelCase, matches existing API).
  • New Prompt.generate_messages_for_request() is the single seam that applies per-request overrides on top of YAML-loaded defaults. Currently only handles useThinking, but is the natural place for future per-request knobs.
  • Both _call_anthropic_api and generate_cache_key go through this method, so cache keys split on useThinking — on/off requests don't collide.
  • CLAUDE.md updated with the opt-in pattern, the 30s ceiling, and the architectural paths needed for default-on (Lambda Function URL response streaming, or async submit-and-poll).
  • README documents the new request field.
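The single-seam idea can be sketched as below. The names `generate_messages_for_request`, `generate_cache_key`, and `MIN_MAX_TOKENS_WITH_THINKING` come from the PR; the bodies and the toy `generate_messages` defaults are illustrative stand-ins, not the real implementation:

```python
import hashlib
import json

MIN_MAX_TOKENS_WITH_THINKING = 4096


def generate_messages(request: dict) -> dict:
    """Stand-in for the YAML-loaded prompt defaults."""
    return {
        "max_tokens": 1024,
        "temperature": 0.0,
        "messages": [{"role": "user", "content": request["code"]}],
    }


def generate_messages_for_request(request: dict) -> dict:
    """The single seam: apply per-request overrides on top of the defaults."""
    payload = dict(generate_messages(request))
    if request.get("useThinking"):
        payload["thinking"] = {"type": "adaptive"}
        payload["max_tokens"] = max(payload["max_tokens"], MIN_MAX_TOKENS_WITH_THINKING)
    return payload


def generate_cache_key(request: dict) -> str:
    # Hash the same payload the API call sees, so on/off requests can't collide.
    return hashlib.sha256(
        json.dumps(generate_messages_for_request(request), sort_keys=True).encode()
    ).hexdigest()


req = {"code": "imul eax, edi, edi"}
assert generate_cache_key(req) != generate_cache_key({**req, "useThinking": True})
```

Because both the API call and the cache key flow through the same function, any future per-request knob added at this seam is automatically reflected in the cache key as well.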

Test plan

  • uv run pytest — 98 tests passing (96 → 98)
  • uv run pre-commit run --all-files
  • New tests:
    • test_use_thinking_overrides_prompt: useThinking=True enables adaptive thinking, drops temperature, bumps max_tokens ≥ 4096.
    • test_use_thinking_changes_cache_key: on/off requests produce different cache keys.

Follow-ups (not in this PR)

  • Future: bump Lambda + API Gateway pattern to support thinking-by-default. Cleanest path is Lambda Function URL with response streaming (no API GW 30s cap).

🤖 Generated with Claude Code

mattgodbolt-molty and others added 2 commits May 6, 2026 18:07
Adds `useThinking: bool = false` to ExplainRequest. When true the
explainer enables adaptive extended thinking and bumps max_tokens to
the documented floor; otherwise behaviour is identical to today.

Why opt-in rather than default-on: thinking measurably improves
correctness (errors 5 -> 0 across the medium-sized cases in our
21-case suite, eliminates the recurring imul-eax-edi-edi invention)
but pushes the largest queries past the 30s API Gateway v2 / Lambda
timeout — about 3/16 of our reasonably-sized cases would 504. Per-
request opt-in lets the CE frontend choose where the quality win is
worth the latency, with no risk of timing out users who don't ask.

Implementation:

- ExplainRequest gains a `useThinking` field (camelCase to match the
  rest of the API; default false).
- New `Prompt.generate_messages_for_request()` is the single entry
  point that applies per-request overrides on top of the YAML-loaded
  defaults. Currently the only override is thinking, but it's the
  natural seam for future per-request knobs.
- Both `_call_anthropic_api` and `generate_cache_key` route through
  the new method, so cache keys split on `useThinking` and the API
  call always sees the same payload that was hashed.
- CLAUDE.md updated to document the opt-in pattern, the 30s timeout
  ceiling, and the longer-term architectural fixes if we ever want
  default-on (Lambda Function URL response streaming or async pattern).
- README documents the new request field.

Tests: 96 -> 98 passing. New tests cover (a) `useThinking=True`
overrides prompt config and bumps max_tokens, (b) cache keys split on
the flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Findings from a follow-up review:

- Rename `Prompt.generate_messages_for_request` -> `build_api_payload`.
  The old name read as "generate messages, given a request" which is
  what `generate_messages` already does. The actual purpose is to
  apply per-request runtime overrides on top of the YAML defaults.
- Always return a fresh dict from `build_api_payload` so callers can
  mutate without affecting prompt internals or each other. The
  previous shape returned the original dict in the no-override path
  and a shallow copy in the override path — a foot-gun in waiting.
- Move `test_use_thinking_changes_cache_key` from `test_explain.py`
  to `test_cache.py` next to its sibling
  `test_generate_cache_key_bypass_cache_ignored`. Both tests answer
  "what affects the cache key" and belong together.
- Add `test_generate_cache_key_use_thinking_default_matches_omitted`
  to lock in the invariant the PR depends on: a request with
  `useThinking` absent (Pydantic default) must produce the same key
  as one with `useThinking=False` explicit. Without this, a future
  refactor could silently invalidate the existing S3 cache.
- Use `MIN_MAX_TOKENS_WITH_THINKING` constant in the test rather
  than hard-coding `4096`, so a future bump to the floor doesn't
  silently miss the test.

Tests: 98 -> 99 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
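The default-vs-omitted invariant from that review can be sketched like so; toy stand-ins, not the repo's actual helpers:

```python
import hashlib
import json

MIN_MAX_TOKENS_WITH_THINKING = 4096


def generate_cache_key(payload: dict) -> str:
    """Toy stand-in: hash a canonical JSON form of the payload."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


def build_payload(use_thinking: bool = False) -> dict:
    payload = {"max_tokens": 1024}
    if use_thinking:
        payload["thinking"] = {"type": "adaptive"}
        payload["max_tokens"] = max(payload["max_tokens"], MIN_MAX_TOKENS_WITH_THINKING)
    return payload


# A request with the flag omitted must hash the same as explicit False;
# otherwise a refactor silently invalidates the existing S3 cache.
assert generate_cache_key(build_payload()) == generate_cache_key(build_payload(use_thinking=False))
```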

Copilot AI left a comment


Pull request overview

Adds a per-request opt-in flag to enable extended “thinking” for higher accuracy while keeping the default behavior/latency unchanged for normal interactive requests.

Changes:

  • Extend ExplainRequest with useThinking: bool = false and document the new field in README and CLAUDE.md.
  • Introduce Prompt.build_api_payload() to apply per-request overrides (enable adaptive thinking + enforce a max_tokens floor) and use it in both the Anthropic API call path and cache-key generation.
  • Add/extend tests to ensure useThinking changes API kwargs appropriately and splits cache keys.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| README.md | Documents useThinking in the request example. |
| CLAUDE.md | Documents the opt-in thinking pattern and the 30s timeout constraint. |
| app/explain_api.py | Adds useThinking to ExplainRequest with a detailed description. |
| app/prompt.py | Adds build_api_payload() to apply per-request thinking overrides and the max_tokens floor. |
| app/explain.py | Switches API call prep to build_api_payload() and logs whether thinking is enabled. |
| app/cache.py | Generates cache keys from build_api_payload() so thinking on/off requests don't collide. |
| app/test_explain.py | Adds a test ensuring useThinking=True enables thinking, omits temperature, and bumps max_tokens. |
| app/test_cache.py | Adds tests validating that the cache key splits on useThinking, and default-vs-omitted equivalence. |


Comment thread: app/prompt.py (Outdated)
Comment on lines +239 to +252
"""Build the API call payload, applying per-request runtime overrides.

This is the entry point used by both the explain handler and the
cache-key generator so they stay in sync. Currently the only
override is ``useThinking``: when set, enable adaptive extended
thinking and bump ``max_tokens`` to satisfy the floor.

Always returns a fresh dict so callers can mutate without affecting
prompt internals or each other.
"""
prompt_data = dict(self.generate_messages(request))
if request.useThinking:
prompt_data["thinking"] = {"type": "adaptive"}
prompt_data["max_tokens"] = max(prompt_data["max_tokens"], MIN_MAX_TOKENS_WITH_THINKING)
Contributor Author


Sharp catch — fixed in 98154cf. Refactored build_api_payload to return exactly the kwargs messages.create will receive (thinking-vs-temperature exclusivity resolved inside, not by callers). Both consumers now become trivial: _call_anthropic_api does await client.messages.create(**payload) and generate_cache_key does {**payload, "prompt_version": ...}. Net -7 lines and the drift trap is gone for any future per-request overrides.
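The shape of that refactor can be sketched as follows. Only the names build_api_payload and MIN_MAX_TOKENS_WITH_THINKING come from the PR; the signature and body here are illustrative stand-ins:

```python
MIN_MAX_TOKENS_WITH_THINKING = 4096


def build_api_payload(defaults: dict, use_thinking: bool) -> dict:
    """Return exactly the kwargs messages.create will receive."""
    payload = dict(defaults)  # always a fresh dict; callers may mutate freely
    if use_thinking:
        payload["thinking"] = {"type": "adaptive"}
        payload["max_tokens"] = max(payload["max_tokens"], MIN_MAX_TOKENS_WITH_THINKING)
        payload.pop("temperature", None)  # API rejects temperature alongside thinking
    return payload


defaults = {"max_tokens": 1024, "temperature": 0.0}
assert "temperature" not in build_api_payload(defaults, True)
assert build_api_payload(defaults, False) == defaults
assert build_api_payload(defaults, False) is not defaults  # fresh copy either way
```

Resolving the exclusivity inside the builder means neither caller re-derives kwargs, so the hashed payload and the sent payload cannot drift apart.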

Per Copilot review: previously `build_api_payload()` returned a dict
that included both `thinking` and `temperature` when useThinking was
true, and `_call_anthropic_api()` had to re-derive the actual API
kwargs (dropping `temperature` because the API rejects it alongside
thinking). The cache key meanwhile hashed the un-derived dict — so
the cache key included `temperature` for requests that wouldn't
actually send it. Today this happened to be deterministic and
internally consistent, but it's drift-in-waiting: any future
per-request override with exclusivity rules would face the same trap.

Refactor so `build_api_payload()` returns exactly the kwargs
`messages.create` will receive, including the thinking-vs-temperature
exclusivity. Then both call sites become trivial:
- `_call_anthropic_api`: `await client.messages.create(**payload)`
- `generate_cache_key`: `{**payload, "prompt_version": ...}`

Net diff: -7 lines, fewer code paths to keep aligned.

Cache key shape changes (no `temperature` field when thinking is on,
no null `thinking` field when off), which invalidates existing S3
cache entries — acceptable per maintainer's stated tolerance for
cache flushes when the alternative is uglier code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mattgodbolt merged commit 2c53854 into main on May 6, 2026
2 checks passed
mattgodbolt deleted the expose-thinking-flag branch on May 6, 2026 at 23:49