Name and Version
version: 7124 (f1ffbba)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CPU
Hardware
CPU only / 36 GB RAM, Intel Core i5-8250U
Models
Qwen_Qwen3-30B-A3B-Thinking-2507-Q6_K
Problem description & steps to reproduce
llama-server -m GPT4ALL/Qwen_Qwen3-30B-A3B-Thinking-2507-Q6_K.gguf -c 32768 --reasoning-format none --host 192.168.2.218 --port 16060
When I interrupt the model while it is thinking and then enter a new user prompt, the tokens generated during the previous thinking are discarded: the model does not remember what it thought before. This happens no matter what I set for --reasoning-format; neither none nor deepseek nor legacy-deepseek works. It can also be seen from the context size in the generation statistics: when the new thought process starts, the context size drops below the level it had reached after the previous thinking.
It worked perfectly this summer with:
build: 5985 (3f4fc97) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
and this command line:
llama-server -m GPT4ALL/Qwen_Qwen3-30B-A3B-Thinking-2507-Q6_K.gguf -c 32768 --host 192.168.2.218 --port 6061
and with the "Exclude thought process when sending requests to API" setting turned OFF in the webui.
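For reference, the behavior can also be checked outside the webui by calling the OpenAI-compatible endpoint directly. The request below is only a minimal sketch (not the exact request the webui sends), and it assumes that with --reasoning-format none the client resends the interrupted thinking inline inside <think> tags in the assistant message. If that thinking were taken into account, the prompt size reported in the server log ("prompt processing progress, n_tokens = ...") for the follow-up turn would have to grow past the size reached at the end of the first turn.

curl http://192.168.2.218:16060/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "what is 3 + 3?"},
      {"role": "assistant", "content": "<think>Okay, the user asked what 3 + 3 is. It is basic arithmetic, 3 plus 3 equals 6 ...</think>"},
      {"role": "user", "content": "No I do not want to test you. Just provide the answer."}
    ],
    "stream": false
  }'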
First Bad Commit
No response
Relevant log output
main: model loaded
main: server is listening on http://192.168.2.218:16060
main: starting the main loop...
srv update_slots: all slots are idle
srv log_server_r: request: GET /props 192.168.2.218 200
srv log_server_r: request: GET /props 192.168.2.218 200
srv log_server_r: request: GET /props 192.168.2.218 200
srv log_server_r: request: GET /props 192.168.2.218 200
srv params_from_: Chat format: Content-only
slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 0 | processing task
slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 16
slot update_slots: id 3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 16, batch.n_tokens = 16, progress = 1.000000
slot update_slots: id 3 | task 0 | prompt done, n_tokens = 16, batch.n_tokens = 16
srv log_server_r: request: GET /props 192.168.2.218 200
srv log_server_r: request: GET /slots 192.168.2.218 200
srv log_server_r: request: GET /slots 192.168.2.218 200
srv log_server_r: request: POST /v1/chat/completions 192.168.2.218 200
srv stop: cancel task, id_task = 0
slot release: id 3 | task 0 | stop processing: n_tokens = 103, truncated = 0
srv update_slots: all slots are idle
srv params_from_: Chat format: Content-only
slot get_availabl: id 3 | task -1 | selected slot by LCP similarity, sim_best = 0.400 (> 0.100 thold), f_keep = 0.155
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 103, total state size = 9.659 MiB
srv load: - looking for better prompt, base f_keep = 0.155, sim = 0.400
srv update: - cache state: 1 prompts, 9.659 MiB (limits: 8192.000 MiB, 32768 tokens, 87360 est)
srv update: - prompt 0x557356dcf790: 103 tokens, checkpoints: 0, 9.659 MiB
srv get_availabl: prompt cache update took 8.09 ms
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 92 | processing task
slot update_slots: id 3 | task 92 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 40
slot update_slots: id 3 | task 92 | n_tokens = 16, memory_seq_rm [16, end)
slot update_slots: id 3 | task 92 | prompt processing progress, n_tokens = 40, batch.n_tokens = 24, progress = 1.000000
slot update_slots: id 3 | task 92 | prompt done, n_tokens = 40, batch.n_tokens = 24
srv log_server_r: request: GET /props 192.168.2.218 200
srv log_server_r: request: GET /slots 192.168.2.218 200
srv log_server_r: request: GET /slots 192.168.2.218 200
srv log_server_r: request: POST /v1/chat/completions 192.168.2.218 200
srv stop: cancel task, id_task = 92
slot release: id 3 | task 92 | stop processing: n_tokens = 91, truncated = 0
srv update_slots: all slots are idle
{
"conv": {
"id": "05c34f99-b6f1-47b4-bff3-a9f0009fd6f9",
"name": "what is 3 + 3?",
"lastModified": 1763734870803,
"currNode": "71c251c7-e00e-4e3d-8a13-46833e6e946a"
},
"messages": [
{
"convId": "05c34f99-b6f1-47b4-bff3-a9f0009fd6f9",
"role": "user",
"content": "what is 3 + 3?",
"type": "text",
"timestamp": 1763734870909,
"thinking": "",
"toolCalls": "",
"children": [],
"extra": [],
"id": "515e601e-561f-4fce-a86e-e2366ad19ede",
"parent": "7b514752-3ecc-4bf2-b9fb-f9a56b47ec19"
},
{
"convId": "05c34f99-b6f1-47b4-bff3-a9f0009fd6f9",
"type": "text",
"role": "assistant",
"content": "",
"timestamp": 1763734870942,
"thinking": "Okay, the user asked, \"what is 3 + 3?\" That seems really straightforward. Let me think... 3 plus 3 equals 6. It's basic arithmetic. \n\nHmm, why would someone ask this? Maybe they're testing if I can handle simple math. Or perhaps they're a young child just learning addition. Could also be someone checking if I'm functioning properly. \n\nI should keep it",
"toolCalls": "",
"children": [],
"model": "Qwen_Qwen3-30B-A3B-Thinking-2507-Q6_K.gguf",
"id": "26a674f0-27e1-44b3-9b90-9ebdb88fe552",
"parent": "515e601e-561f-4fce-a86e-e2366ad19ede"
},
{
"convId": "05c34f99-b6f1-47b4-bff3-a9f0009fd6f9",
"role": "user",
"content": "No I do not want to test you. Just provide the answer.",
"type": "text",
"timestamp": 1763734904570,
"thinking": "",
"toolCalls": "",
"children": [],
"extra": [],
"id": "71c251c7-e00e-4e3d-8a13-46833e6e946a",
"parent": "26a674f0-27e1-44b3-9b90-9ebdb88fe552"
},
{
"convId": "05c34f99-b6f1-47b4-bff3-a9f0009fd6f9",
"type": "text",
"role": "assistant",
"content": "",
"timestamp": 1763734904662,
"thinking": "Okay, the user is asking for a straightforward answer to \"what is 3 + 3?\" but with a note that they don't want to test me. Hmm, they seem a bit impatient or maybe frustrated, given the \"No",
"toolCalls": "",
"children": [],
"model": "Qwen_Qwen3-30B-A3B-Thinking-2507-Q6_K.gguf",
"id": "82bc0b9e-6121-4ee0-ac1d-1d71c3852299",
"parent": "71c251c7-e00e-4e3d-8a13-46833e6e946a"
}
]
}
After the first turn the slot holds 103 tokens (a 16-token prompt plus the generated thinking); the second turn reuses only the first 16 tokens, processes a 40-token prompt, and stops at 91 tokens. If the interrupted thinking were being resent, the second-turn prompt would have to be well above 103 tokens, so the previous thinking is not taken into account.
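The prompt sizes can also be watched live while reproducing this, for example by polling the slots endpoint that the webui already queries in the log above (just a sketch; the exact fields returned may differ between builds):

curl -s http://192.168.2.218:16060/slots

This is optional; the same token counts are already visible in the server log lines above.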