runtime: switch to Qwen 3 1.7B with thinking mode #3
Merged
User requested most-intelligent mode (thinking ON) + hide <think>
content to avoid bloating KV cache across the tool-call loop's
multi-turn flow. Sibling fula-ota PR ships the matching model file
(qwen3-1.7b-rk3588-w8a8.rkllm via GitHub release) + download_model.sh
URL/SHA pinning.
Changes:
- DEFAULT_MODEL_FILENAME flips to qwen3-1.7b-rk3588-w8a8.rkllm.
- New _is_qwen3_model(path) filename detector (matches qwen3 / qwen-3
case-insensitive). Used by try_load() to wire the new
_enable_thinking flag on the backend. Rollout safety: devices that
still have an old Qwen 2.5 cached (the new file not yet downloaded)
keep thinking OFF so the model does not get a <think> prefix it
cannot parse.
- _build_chat_prompt gains enable_thinking parameter. When ON, the
assistant prefix gets `<think>\n` injected so the model starts
inside the think block. Matches apply_chat_template(enable_thinking=True)
from the HF tokenizer config.
- New _strip_think(text) drops the chain-of-thought portion:
* normal case: splits on first </think>, keeps the tail
* self-wrapped pair after main close: defensive sub
* trailing unclosed <think>: cut to end
* truncated mid-think (no </think> anywhere): returns empty so
caller treats as a prose-only turn and force-verdicts
- RKLLMBackend.run_troubleshoot in Qwen 3 mode:
* pre-strips think for the history rewrite (KV cache stays
bounded across the tool-call loop per Qwen 3 model-card
guidance: "historical output should not include the thinking")
* pre-strips think before parsing tool_call / verdict / recommendation
so stray XML mentions inside reasoning prose cannot pollute
the parse
* strips think from the SSE thought event payload too (user
preference: hide CoT from UI). When post-think prose is empty,
emits a synthetic "Analyzing diagnostics..." marker so BLE
transports do not show silent stretches.
- max_new_tokens bumped from 2048 to 3072 in init_model() because
thinking blocks empirically run 500-1500 tokens; structured response
adds 200-500 more. The prior 2048 was tight enough to truncate
mid-verdict on hard prompts, manifesting as missing </think> in the
output and an empty post-strip result.
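The four `_strip_think` shapes listed above can be illustrated with a minimal sketch. This is not the PR's implementation: the names `strip_think_sketch` and `visible_thought_sketch` are hypothetical, and the sketch assumes the completion starts inside the think block (because the prompt already ended with `<think>\n`), so the first `</think>` closes the main reasoning span.

```python
import re

_THINK_OPEN = "<think>"
_THINK_CLOSE = "</think>"
# A complete <think>...</think> pair; DOTALL so the reasoning may span lines.
_THINK_PAIR_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think_sketch(text: str) -> str:
    """Return only the post-reasoning part of a thinking-mode completion."""
    # Truncated mid-think: no close tag anywhere, so nothing is safe to keep.
    if _THINK_CLOSE not in text:
        return ""
    # Normal case: keep everything after the first close tag.
    tail = text.split(_THINK_CLOSE, 1)[1]
    # Defensive: drop any complete pair re-opened after the main close.
    tail = _THINK_PAIR_RE.sub("", tail)
    # Trailing unclosed open tag: cut from the tag to the end of the text.
    open_at = tail.find(_THINK_OPEN)
    if open_at != -1:
        tail = tail[:open_at]
    return tail.strip()


def visible_thought_sketch(raw: str, marker: str = "Analyzing diagnostics...") -> str:
    """Text for a UI-facing thought event: stripped prose, or the marker if empty."""
    return strip_think_sketch(raw) or marker
```

Under the same assumptions, the history rewrite would store `strip_think_sketch(raw)` rather than `raw` as the assistant turn, so reasoning tokens are not replayed on the next iteration of the tool-call loop.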
Tests added (10 new in tests/test_rkllm_runtime.py):
- _is_qwen3_model: canonical filename, hyphen variant, case-insensitive,
rejects Qwen 2.5 and Deepseek (rollout-safety regression guard).
- _strip_think: full block, truncated-mid-think, self-wrapped pair,
trailing unclosed open.
- _build_chat_prompt with enable_thinking=True injects the prefix;
default leaves legacy Qwen 2.5 path alone.
- try_load() sets _enable_thinking based on resolved model path.
- run_troubleshoot strips <think> from history before the next turn
(KV bloat regression guard) AND from SSE thought events (UI
preference). When post-think prose is empty, the synthetic marker
fills the gap.
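As an illustration of the filename gate these tests exercise, a detector along the following lines would satisfy the listed cases. It is a sketch under assumptions, not the code in this PR: `is_qwen3_model_sketch` is a hypothetical name, matching on the basename only is an assumption, and the Qwen 2.5 and Deepseek filenames mentioned in the note below are made up for the example.

```python
import os
import re

# "qwen3" or "qwen-3" anywhere in the file name, ignoring case.
_QWEN3_RE = re.compile(r"qwen-?3", re.IGNORECASE)


def is_qwen3_model_sketch(path: str) -> bool:
    """True when the model file name looks like a Qwen 3 build."""
    # Assumption: only the basename is inspected, so directory names cannot
    # trigger a false positive.
    return bool(_QWEN3_RE.search(os.path.basename(path)))
```

With this sketch, `qwen3-1.7b-rk3588-w8a8.rkllm` and a `Qwen-3` spelling are accepted, while names such as `qwen2.5-1.5b-rk3588-w8a8.rkllm` or `deepseek-r1-distill-qwen-1.5b.rkllm` are rejected, which is the behavior the rollout-safety guard depends on.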
244/244 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Switch the on-device runtime from Qwen 2.5 1.5B to Qwen 3 1.7B with thinking mode. The user requested the most-intelligent mode plus hidden `<think>` content, so that chain-of-thought doesn't bloat the KV cache across the tool-call loop's multi-turn flow.
- `DEFAULT_MODEL_FILENAME` → `qwen3-1.7b-rk3588-w8a8.rkllm`.
- `_is_qwen3_model(path)` filename gate so devices on old cached Qwen 2.5 keep thinking OFF (rollout safety).
- `_build_chat_prompt(enable_thinking=True)` injects `<think>\n` into the assistant prefix.
- `_strip_think()` drops chain-of-thought content and keeps the post-`</think>` tail.
- `RKLLMBackend.run_troubleshoot` in Qwen 3 mode:
  * strips `<think>` from history before the next turn (KV cache stays bounded, per Qwen 3 model-card guidance).
  * strips `<think>` from the SSE thought event (user preference: hide CoT from UI).
  * emits a synthetic "Analyzing diagnostics..." marker when the post-think prose is empty so BLE transports don't show silent stretches.
- `max_new_tokens` bumped 2048 → 3072 (thinking blocks empirically run 500–1500 tokens; structured response adds another 200–500).
Test plan
- 244/244 tests pass (`pytest tests/`).
- `_is_qwen3_model` filename detection (canonical, hyphen variant, case-insensitive, rejects Qwen 2.5 + Deepseek).
- `_strip_think` all four shapes (full block, truncated mid-think, self-wrapped pair, trailing unclosed open).
- `_build_chat_prompt(enable_thinking=True)` injects prefix; default leaves legacy Qwen 2.5 path alone.
- `try_load()` wires `_enable_thinking` from the resolved model path.
- `run_troubleshoot` strips `<think>` from history AND from SSE thought events.
- Model file `/uniondrive/blox-ai/model/qwen3-1.7b-rk3588-w8a8.rkllm` on lab device.
- `blox-ai.service`; expect `thinking=True` in init log.
- `/troubleshoot` session; confirm SSE stream has no `<think>` content but does have post-think reasoning + structured events.
Sibling work
Sibling fula-ota commit (held locally until the publisher uploads the .rkllm + provides the SHA) bumps `download_model.sh` URL/SHA, `info.json` version + model name, `.env` `BLOX_AI_MODEL_PATH`, `start.sh` `SIZE_LIMIT`, and adds Qwen 2.5 1.5B cleanup logic (mirroring the existing 3B cleanup pattern).
🤖 Generated with Claude Code