runtime: switch to Qwen 3 1.7B with thinking mode by ehsan6sha · Pull Request #3 · functionland/blox-ai

ehsan6sha · 2026-05-27T02:04:32Z

Summary

Switch the on-device runtime from Qwen 2.5 1.5B to Qwen 3 1.7B with thinking mode. User requested most-intelligent mode + hide <think> content so chain-of-thought doesn't bloat the KV cache across the tool-call loop's multi-turn flow.

DEFAULT_MODEL_FILENAME → qwen3-1.7b-rk3588-w8a8.rkllm.
New _is_qwen3_model(path) filename gate so devices on old cached Qwen 2.5 keep thinking OFF (rollout safety).
_build_chat_prompt(enable_thinking=True) injects <think>\n into the assistant prefix.
New _strip_think() drops chain-of-thought content from the post-</think> tail.
RKLLMBackend.run_troubleshoot in Qwen 3 mode:
- Strips <think> from history before the next turn (KV cache stays bounded — Qwen 3 model-card guidance).
- Strips <think> from the SSE thought event (user preference: hide CoT from UI).
- Emits a synthetic "Analyzing diagnostics..." marker when the post-think prose is empty so BLE transports don't show silent stretches.
- Pre-strips think before parsing tool_call / verdict / recommendation so stray XML mentions inside reasoning prose can't pollute the parse.
max_new_tokens bumped 2048 → 3072 (thinking blocks empirically run 500–1500 tokens; structured response adds another 200–500).

Test plan

244/244 existing tests pass (pytest tests/).
10 new tests cover the Qwen 3 swap:
- _is_qwen3_model filename detection (canonical, hyphen variant, case-insensitive, rejects Qwen 2.5 + Deepseek).
- _strip_think all four shapes (full block, truncated mid-think, self-wrapped pair, trailing unclosed open).
- _build_chat_prompt(enable_thinking=True) injects prefix; default leaves legacy Qwen 2.5 path alone.
- try_load() wires _enable_thinking from the resolved model path.
- run_troubleshoot strips <think> from history AND from SSE thought events.
- Synthetic marker fills empty post-think turns.
Lab verification (deferred until .rkllm exists):
- Convert Qwen 3 1.7B to W8A8 RKLLM format on build host.
- Place at /uniondrive/blox-ai/model/qwen3-1.7b-rk3588-w8a8.rkllm on lab device.
- Restart blox-ai.service; expect thinking=True in init log.
- Run a /troubleshoot session; confirm SSE stream has no <think> content but does have post-think reasoning + structured events.
- Verify next turn's prompt (via debug log) doesn't contain prior-turn CoT.

Sibling work

Sibling fula-ota commit (held locally until the publisher uploads the .rkllm + provides the SHA) bumps download_model.sh URL/SHA, info.json version + model name, .env BLOX_AI_MODEL_PATH, start.sh SIZE_LIMIT, and adds Qwen 2.5 1.5B cleanup logic (mirroring the existing 3B cleanup pattern).

🤖 Generated with Claude Code

User requested most-intelligent mode (thinking ON) + hide <think> content to avoid bloating KV cache across the tool-call loop's multi-turn flow. Sibling fula-ota PR ships the matching model file (qwen3-1.7b-rk3588-w8a8.rkllm via GitHub release) + download_model.sh URL/SHA pinning. Changes: - DEFAULT_MODEL_FILENAME flips to qwen3-1.7b-rk3588-w8a8.rkllm. - New _is_qwen3_model(path) filename detector (matches qwen3 / qwen-3 case-insensitive). Used by try_load() to wire the new _enable_thinking flag on the backend. Rollout safety: devices that still have an old Qwen 2.5 cached (the new file not yet downloaded) keep thinking OFF so the model does not get a <think> prefix it cannot parse. - _build_chat_prompt gains enable_thinking parameter. When ON, the assistant prefix gets `<think>\n` injected so the model starts inside the think block. Matches apply_chat_template(enable_thinking=True) from the HF tokenizer config. - New _strip_think(text) drops the chain-of-thought portion: * normal case: splits on first </think>, keeps the tail * self-wrapped pair after main close: defensive sub * trailing unclosed <think>: cut to end * truncated mid-think (no </think> anywhere): returns empty so caller treats as a prose-only turn and force-verdicts - RKLLMBackend.run_troubleshoot in Qwen 3 mode: * pre-strips think for the history rewrite (KV cache stays bounded across the tool-call loop per Qwen 3 model-card guidance: "historical output should not include the thinking") * pre-strips think before parsing tool_call / verdict / recommendation so stray XML mentions inside reasoning prose cannot pollute the parse * strips think from the SSE thought event payload too (user preference: hide CoT from UI). When post-think prose is empty, emits a synthetic "Analyzing diagnostics..." marker so BLE transports do not show silent stretches. - max_new_tokens bumped from 2048 to 3072 in init_model() because thinking blocks empirically run 500-1500 tokens; structured response adds 200-500 more. The prior 2048 was tight enough to truncate mid-verdict on hard prompts, manifesting as missing </think> in the output and an empty post-strip result. Tests added (10 new in tests/test_rkllm_runtime.py): - _is_qwen3_model: canonical filename, hyphen variant, case-insensitive, rejects Qwen 2.5 and Deepseek (rollout-safety regression guard). - _strip_think: full block, truncated-mid-think, self-wrapped pair, trailing unclosed open. - _build_chat_prompt with enable_thinking=True injects the prefix; default leaves legacy Qwen 2.5 path alone. - try_load() sets _enable_thinking based on resolved model path. - run_troubleshoot strips <think> from history before the next turn (KV bloat regression guard) AND from SSE thought events (UI preference). When post-think prose is empty, the synthetic marker fills the gap. 244/244 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

ehsan6sha merged commit 8218617 into main May 27, 2026
2 checks passed

ehsan6sha deleted the qwen3-1.7b-thinking-mode branch May 27, 2026 02:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: switch to Qwen 3 1.7B with thinking mode#3

runtime: switch to Qwen 3 1.7B with thinking mode#3
ehsan6sha merged 1 commit into
mainfrom
qwen3-1.7b-thinking-mode

ehsan6sha commented May 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ehsan6sha commented May 27, 2026

Summary

Test plan

Sibling work

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant