Description
Name and Version
llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: no
Device 1: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: no
Device 2: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: no
version: 1 (e60f01d)
built with MSVC 19.44.35217.0 for x64
Operating systems
Windows
GGML backends
CUDA
Hardware
E5-2620v3 x 2
Tesla P100 x 3
Models
DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf
Problem description & steps to reproduce
First of all, thank you for the amazing project!
However, I’ve run into a strange problem that looks like a serious bug.
Hardware:
X99 motherboard
E5-2620v3 x 2 CPU
3 × Tesla P100 compute cards
Windows 10
Latest llama.cpp compiled from source (2025-10-10 release b6730)
Problem description:
When I run the model DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf (downloaded from ModelScope) with llama-cli.exe and --split-mode row, multi-turn conversations work perfectly.
But if I start the same model with the same flags through llama-server.exe, only the first chat turn is correct.
From the second turn onwards the model immediately returns garbled text, e.g.:
trained(chain;rAstAstAstAstAstAstAstAstAstAstAstAstAst;r;r;r;r;r;r;r;r;r;r;r;r…
or
jorn;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r…
or
ipAddress(chain ipAddressAstAstAstAstAstAstAstAstAstAstAstAstAstAstAstAst ipAddress ipAddress ipAddress ipAddress …
The exact garbled sequence differs each time, but it always starts with nonsense tokens followed by endless repetitions of a few words.
Experiments done:
Removed --split-mode row ➜ garbling disappears, but generation speed drops dramatically.
Tested with text-generation-webui (versions 3.13 & 3.6) ➜ identical behaviour.
Tried LM Studio ➜ does not detect the P100 cards at all.
Repeating the test many times gives the same result: row-split tensor parallelism works fine in llama-cli, but breaks multi-turn context in llama-server.
The question I asked is always the same: "1000 word summary of world history".
I have tried this:
set CUDA_VISIBLE_DEVICES=0,1,2
cd /d D:\llama.cpp\build\bin\Release
llama-server.exe ^
-m G:\AiModels\001\002\DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf ^
--threads 12 ^
-ngl 999 ^
--split-mode row ^
--tensor-split 14,14,14 ^
-c 8192 ^
--batch-size 2560 ^
--no-mmap ^
--no-warmup ^
--repeat-penalty 1.0 ^
--port 8080
And this:
set CUDA_VISIBLE_DEVICES=0,1,2
cd /d D:\llama.cpp\build\bin\Release
llama-server.exe ^
-m G:\AiModels\001\002\DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf ^
--threads 12 ^
-ngl 999 ^
--split-mode row ^
--tensor-split 14,14,14 ^
-c 8192 ^
--batch-size 2560 ^
--no-mmap ^
--no-warmup ^
--repeat-penalty 1.0 ^
--port 8080 ^
--parallel 1 ^
--cache-type-k f16 ^
--cache-type-v f16 ^
--no-context-shift
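To make this easier to reproduce outside the web UIs, the same failure can be triggered by talking to the OpenAI-compatible /v1/chat/completions endpoint directly. Below is a minimal Python sketch of what my clients effectively do (the script is only an illustration, not the exact client I used; it assumes the server was started with one of the commands above and listens on port 8080). The first turn returns a coherent answer; the second turn of the same conversation comes back garbled.

# Minimal repro sketch (assumption: llama-server is listening on
# http://127.0.0.1:8080, started with one of the commands above).
import json
import urllib.request

URL = "http://127.0.0.1:8080/v1/chat/completions"

def chat(messages):
    # POST an OpenAI-style chat request and return the assistant reply text.
    body = json.dumps({"messages": messages}).encode("utf-8")
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

history = [{"role": "user", "content": "1000 word summary of world history"}]
first = chat(history)                      # turn 1: coherent text
print("turn 1:", first[:120])

history += [{"role": "assistant", "content": first},
            {"role": "user", "content": "1000 word summary of world history"}]
second = chat(history)                     # turn 2: garbled ("AstAst...", ";r;r;r...")
print("turn 2:", second[:120])

The same two-turn exchange is fine when the model runs under llama-cli with the identical flags.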
Could you please have a look?
Let me know if any additional logs or reproduction steps are needed.
First Bad Commit
No response
Relevant log output
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv params_from_: Chat format: Content-only
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 12
slot update_slots: id 0 | task 0 | n_past = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 12, n_tokens = 12, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 12, n_tokens = 12
srv log_server_r: request: GET /slots 127.0.0.1 200
slot print_timing: id 0 | task 0 |
prompt eval time = 757.37 ms / 12 tokens ( 63.11 ms per token, 15.84 tokens per second)
eval time = 107779.86 ms / 1501 tokens ( 71.81 ms per token, 13.93 tokens per second)
total time = 108537.23 ms / 1513 tokens
slot release: id 0 | task 0 | stop processing: n_past = 1512, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv params_from_: Chat format: Content-only
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.008
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 1512, total state size = 378.019 MiB
srv load: - looking for better prompt, base f_keep = 0.008, sim = 1.000
srv update: - cache state: 1 prompts, 378.019 MiB (limits: 8192.000 MiB, 8192 tokens)
srv update: - prompt 00000196C68E9980: 1512 tokens, checkpoints: 0, 378.019 MiB
srv get_availabl: prompt cache update took 209.31 ms
slot launch_slot_: id 0 | task 1503 | processing task
slot update_slots: id 0 | task 1503 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 12
slot update_slots: id 0 | task 1503 | old: ... of world history<|Assistant|>
slot update_slots: id 0 | task 1503 | new: ... of world history<|Assistant|>
slot update_slots: id 0 | task 1503 | 315 1879 3840 151645
slot update_slots: id 0 | task 1503 | 315 1879 3840 151645
slot update_slots: id 0 | task 1503 | need to evaluate at least 1 token for each active slot (n_past = 12, n_prompt_tokens = 12)
slot update_slots: id 0 | task 1503 | n_past was set to 11
slot update_slots: id 0 | task 1503 | n_past = 11, memory_seq_rm [11, end)
slot update_slots: id 0 | task 1503 | prompt processing progress, n_past = 12, n_tokens = 1, progress = 1.000000
slot update_slots: id 0 | task 1503 | prompt done, n_past = 12, n_tokens = 1
srv log_server_r: request: GET /slots 127.0.0.1 200
slot print_timing: id 0 | task 1503 |
prompt eval time = 78.19 ms / 1 tokens ( 78.19 ms per token, 12.79 tokens per second)
eval time = 143811.56 ms / 2008 tokens ( 71.62 ms per token, 13.96 tokens per second)
total time = 143889.75 ms / 2009 tokens
slot release: id 0 | task 1503 | stop processing: n_past = 2019, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: Content-only
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 1945513492
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 2019, total state size = 504.775 MiB
srv load: - looking for better prompt, base f_keep = 0.006, sim = 0.016
srv update: - cache state: 2 prompts, 882.793 MiB (limits: 8192.000 MiB, 8192 tokens)
srv update: - prompt 00000196C68E9980: 1512 tokens, checkpoints: 0, 378.019 MiB
srv update: - prompt 00000196C68E97A0: 2019 tokens, checkpoints: 0, 504.775 MiB
srv get_availabl: prompt cache update took 284.21 ms
slot launch_slot_: id 0 | task 3513 | processing task
slot update_slots: id 0 | task 3513 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 743
slot update_slots: id 0 | task 3513 | old: ... of world history<|Assistant|> | <think>
Okay, so I
slot update_slots: id 0 | task 3513 | new: ... of world history<|Assistant|> | **World History Summary**
**
slot update_slots: id 0 | task 3513 | 315 1879 3840 151645 151648 198 32313 11 773 358
slot update_slots: id 0 | task 3513 | 315 1879 3840 151645 334 10134 11099 21517 56177 334
slot update_slots: id 0 | task 3513 | n_past = 12, memory_seq_rm [12, end)
slot update_slots: id 0 | task 3513 | prompt processing progress, n_past = 743, n_tokens = 731, progress = 1.000000
slot update_slots: id 0 | task 3513 | prompt done, n_past = 743, n_tokens = 731
srv cancel_tasks: cancel task, id_task = 3513
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot release: id 0 | task 3513 | stop processing: n_past = 843, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv params_from_: Chat format: Content-only
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.986 (> 0.100 thold), f_keep = 0.999
slot launch_slot_: id 0 | task 3616 | processing task
slot update_slots: id 0 | task 3616 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 854
slot update_slots: id 0 | task 3616 | old: ... ipAddress ipAddress ipAddress ipAddress | ipAddress
slot update_slots: id 0 | task 3616 | new: ... ipAddress ipAddress ipAddress ipAddress | <|end▁of▁sentence|>
slot update_slots: id 0 | task 3616 | 91715 91715 91715 91715 91715
slot update_slots: id 0 | task 3616 | 91715 91715 91715 91715 151643
slot update_slots: id 0 | task 3616 | n_past = 842, memory_seq_rm [842, end)
slot update_slots: id 0 | task 3616 | prompt processing progress, n_past = 854, n_tokens = 12, progress = 1.000000
slot update_slots: id 0 | task 3616 | prompt done, n_past = 854, n_tokens = 12
srv cancel_tasks: cancel task, id_task = 3616
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot release: id 0 | task 3616 | stop processing: n_past = 1877, truncated = 0
srv update_slots: all slots are idle