
Eval bug: Thinking mode broken with Qwen3-30B-A3B-Thinking in new webui #17430

@woheller69

Description

Name and Version

version: 7124 (f1ffbba)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CPU

Hardware

CPU only / 36 GB Ram, CPU i5-8250U

Models

Qwen_Qwen3-30B-A3B-Thinking-2507-Q6_K

Problem description & steps to reproduce

llama-server -m GPT4ALL/Qwen_Qwen3-30B-A3B-Thinking-2507-Q6_K.gguf -c 32768 --reasoning-format none --host 192.168.2.218 --port 16060

When I interrupt the model while it is thinking and then enter a new user prompt, the tokens generated during the previous thinking are discarded; the model does not remember what it thought before. This happens no matter what I set for --reasoning-format: neither none nor deepseek nor legacy-deepseek works. It can also be seen from the context size in the generation statistics: when the new thought process starts, the context size drops below the level reached after the previous thinking.
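One way to check whether the thinking is lost in the webui or in the server's prompt handling would be to replay the conversation directly against the API and watch n_tokens in the server log. This is only a sketch: the messages are shortened from the conversation dump below, and it assumes the partial thinking is sent back inline in the assistant content.

# Hypothetical direct check, bypassing the webui; messages shortened from the dump below
curl http://192.168.2.218:16060/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "what is 3 + 3?"},
      {"role": "assistant", "content": "<think>Okay, the user asked what 3 + 3 is. That seems really straightforward ... I should keep it</think>"},
      {"role": "user", "content": "No I do not want to test you. Just provide the answer."}
    ]
  }'

If the previous thinking reached the prompt, the second request should reuse considerably more than the 16-token prefix seen in the log below.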

It worked perfectly this summer with:
build: 5985 (3f4fc97) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
and this command line:
llama-server -m GPT4ALL/Qwen_Qwen3-30B-A3B-Thinking-2507-Q6_K.gguf -c 32768 --host 192.168.2.218 --port 6061
and with the webui setting "Exclude thought process when sending requests to API" turned OFF

First Bad Commit

No response

Relevant log output

main: model loaded
main: server is listening on http://192.168.2.218:16060
main: starting the main loop...
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /props 192.168.2.218 200
srv  log_server_r: request: GET /props 192.168.2.218 200
srv  log_server_r: request: GET /props 192.168.2.218 200
srv  log_server_r: request: GET /props 192.168.2.218 200
srv  params_from_: Chat format: Content-only
slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
slot launch_slot_: id  3 | task 0 | processing task
slot update_slots: id  3 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 16
slot update_slots: id  3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 16, batch.n_tokens = 16, progress = 1.000000
slot update_slots: id  3 | task 0 | prompt done, n_tokens = 16, batch.n_tokens = 16
srv  log_server_r: request: GET /props 192.168.2.218 200
srv  log_server_r: request: GET /slots 192.168.2.218 200
srv  log_server_r: request: GET /slots 192.168.2.218 200
srv  log_server_r: request: POST /v1/chat/completions 192.168.2.218 200
srv          stop: cancel task, id_task = 0
slot      release: id  3 | task 0 | stop processing: n_tokens = 103, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.400 (> 0.100 thold), f_keep = 0.155
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 103, total state size = 9.659 MiB
srv          load:  - looking for better prompt, base f_keep = 0.155, sim = 0.400
srv        update:  - cache state: 1 prompts, 9.659 MiB (limits: 8192.000 MiB, 32768 tokens, 87360 est)
srv        update:    - prompt 0x557356dcf790:     103 tokens, checkpoints:  0,     9.659 MiB
srv  get_availabl: prompt cache update took 8.09 ms
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
slot launch_slot_: id  3 | task 92 | processing task
slot update_slots: id  3 | task 92 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 40
slot update_slots: id  3 | task 92 | n_tokens = 16, memory_seq_rm [16, end)
slot update_slots: id  3 | task 92 | prompt processing progress, n_tokens = 40, batch.n_tokens = 24, progress = 1.000000
slot update_slots: id  3 | task 92 | prompt done, n_tokens = 40, batch.n_tokens = 24
srv  log_server_r: request: GET /props 192.168.2.218 200
srv  log_server_r: request: GET /slots 192.168.2.218 200
srv  log_server_r: request: GET /slots 192.168.2.218 200
srv  log_server_r: request: POST /v1/chat/completions 192.168.2.218 200
srv          stop: cancel task, id_task = 92
slot      release: id  3 | task 92 | stop processing: n_tokens = 91, truncated = 0
srv  update_slots: all slots are idle

{
  "conv": {
    "id": "05c34f99-b6f1-47b4-bff3-a9f0009fd6f9",
    "name": "what is 3 + 3?",
    "lastModified": 1763734870803,
    "currNode": "71c251c7-e00e-4e3d-8a13-46833e6e946a"
  },
  "messages": [
    {
      "convId": "05c34f99-b6f1-47b4-bff3-a9f0009fd6f9",
      "role": "user",
      "content": "what is 3 + 3?",
      "type": "text",
      "timestamp": 1763734870909,
      "thinking": "",
      "toolCalls": "",
      "children": [],
      "extra": [],
      "id": "515e601e-561f-4fce-a86e-e2366ad19ede",
      "parent": "7b514752-3ecc-4bf2-b9fb-f9a56b47ec19"
    },
    {
      "convId": "05c34f99-b6f1-47b4-bff3-a9f0009fd6f9",
      "type": "text",
      "role": "assistant",
      "content": "",
      "timestamp": 1763734870942,
      "thinking": "Okay, the user asked, \"what is 3 + 3?\" That seems really straightforward. Let me think... 3 plus 3 equals 6. It's basic arithmetic. \n\nHmm, why would someone ask this? Maybe they're testing if I can handle simple math. Or perhaps they're a young child just learning addition. Could also be someone checking if I'm functioning properly. \n\nI should keep it",
      "toolCalls": "",
      "children": [],
      "model": "Qwen_Qwen3-30B-A3B-Thinking-2507-Q6_K.gguf",
      "id": "26a674f0-27e1-44b3-9b90-9ebdb88fe552",
      "parent": "515e601e-561f-4fce-a86e-e2366ad19ede"
    },
    {
      "convId": "05c34f99-b6f1-47b4-bff3-a9f0009fd6f9",
      "role": "user",
      "content": "No I do not want to test you. Just provide the answer.",
      "type": "text",
      "timestamp": 1763734904570,
      "thinking": "",
      "toolCalls": "",
      "children": [],
      "extra": [],
      "id": "71c251c7-e00e-4e3d-8a13-46833e6e946a",
      "parent": "26a674f0-27e1-44b3-9b90-9ebdb88fe552"
    },
    {
      "convId": "05c34f99-b6f1-47b4-bff3-a9f0009fd6f9",
      "type": "text",
      "role": "assistant",
      "content": "",
      "timestamp": 1763734904662,
      "thinking": "Okay, the user is asking for a straightforward answer to \"what is 3 + 3?\" but with a note that they don't want to test me. Hmm, they seem a bit impatient or maybe frustrated, given the \"No",
      "toolCalls": "",
      "children": [],
      "model": "Qwen_Qwen3-30B-A3B-Thinking-2507-Q6_K.gguf",
      "id": "82bc0b9e-6121-4ee0-ac1d-1d71c3852299",
      "parent": "71c251c7-e00e-4e3d-8a13-46833e6e946a"
    }
  ]
}

After the first turn the slot holds 103 tokens; the next turn reuses only 16 of them and ends at 91 tokens. So the previous thinking is not taken into account.
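My reading of the token counts in the log (an interpretation of the numbers above, not a separate measurement):

first turn:      16 prompt tokens + 87 generated (thinking) = 103 tokens in the slot
second request:  task.n_tokens = 40 = 16 reused prefix + 24 new tokens
                 -> the ~87 thinking tokens from the first turn are never resent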
