Thinking-Budget Enforcement (closes T5–T8 at temp=0)
New ThinkingBudgetProcessor: production-grade thinking-budget enforcement for reasoning models at temperature=0. When models like Qwen3.6 greedy-decode complex prompts, they can get stuck in the chain-of-thought phase and exhaust max_tokens without producing any response content. The processor counts generated tokens and forces emission of the close marker once the budget is exceeded (same primitive as vLLM/SGLang/llama.cpp max_thinking_tokens).
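The count-then-force behavior described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the class name ThinkingBudgetSketch, the step method, and the assumption that the close marker is a single token id are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThinkingBudgetSketch:
    """Hypothetical stand-in for ThinkingBudgetProcessor (illustration only)."""
    budget: int            # max thinking tokens; 0 means enforcement is off
    close_id: int          # assumed: the close marker is one token id
    thinking: bool = True  # assumed: generation starts inside the think block
    used: int = 0          # thinking tokens counted so far

    def step(self, sampled_id: int) -> int:
        """Return the token to emit, overriding the sampler once over budget."""
        if not self.thinking or self.budget <= 0:
            return sampled_id          # nothing to enforce
        if sampled_id == self.close_id:
            self.thinking = False      # model closed its reasoning on its own
            return sampled_id
        if self.used >= self.budget:
            self.thinking = False      # budget spent: force the close marker
            return self.close_id
        self.used += 1
        return sampled_id


proc = ThinkingBudgetSketch(budget=3, close_id=99)
emitted = [proc.step(t) for t in [1, 2, 3, 4, 5, 6]]
print(emitted)  # [1, 2, 3, 99, 5, 6]
```

In this sketch the fourth sampled token is replaced by the close marker, after which sampling passes through unchanged so the model can produce response content.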
Smart default: min(1024, max(256, maxTokens/2)) at temp=0 for thinking models. Opt-out via thinking_budget=0, custom via thinking_budget=N.
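As a worked example of the default formula, the sketch below resolves a budget from the request parameters. The function name resolve_thinking_budget and its signature are hypothetical, integer division is assumed for maxTokens/2, and returning 0 (no enforcement) outside the temp=0 thinking-model case is an assumption, since the notes only state the default for that case.

```python
from typing import Optional


def resolve_thinking_budget(
    max_tokens: int,
    temperature: float,
    is_thinking_model: bool,
    thinking_budget: Optional[int] = None,
) -> int:
    """Return the thinking budget in tokens; 0 means no enforcement."""
    if thinking_budget is not None:
        # Explicit setting wins: 0 opts out, N sets a custom budget.
        return max(0, thinking_budget)
    if not is_thinking_model or temperature != 0:
        # Assumption: the smart default applies only at temp=0 for thinking models.
        return 0
    # Smart default from the notes: min(1024, max(256, maxTokens/2)).
    return min(1024, max(256, max_tokens // 2))


print(resolve_thinking_budget(4096, 0.0, True))     # 1024 (capped)
print(resolve_thinking_budget(1000, 0.0, True))     # 500  (half of max_tokens)
print(resolve_thinking_budget(200, 0.0, True))      # 256  (floor)
print(resolve_thinking_budget(4096, 0.0, True, 0))  # 0    (opt-out)
```

Note that for small max_tokens values the 256 floor can exceed max_tokens itself, in which case the budget would never be reached before generation stops.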
isImplicitThinkingModel rewrite: fixes the misclassification of Qwen3.6 as explicit-thinking when it is actually implicit (its chat template injects <think>\n). The 4-step decision tree now correctly handles Qwen3.6, DeepSeek-R1, and non-thinking models.