Eval bug: Garbled output when using llama-server.exe with --split-mode row #16517

Name and Version

llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: no
Device 1: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: no
Device 2: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: no
version: 1 (e60f01d)
built with MSVC 19.44.35217.0 for x64

Operating systems

Windows

GGML backends

CUDA

Hardware

E5-2620v3 x 2
Tesla P100 x 3

Models

DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf

Problem description & steps to reproduce

First of all, thank you for the amazing project!
However, I’ve run into a strange problem that looks like a serious bug.
Hardware:
X99 motherboard
E5-2620v3 x 2 CPU
3 × Tesla P100 compute cards
Windows 10
Latest llama.cpp compiled from source (2025-10-10 release b6730)
Problem description:
When I run the model DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf (downloaded from ModelScope) with llama-cli.exe and --split-mode row, multi-turn conversations work perfectly.
But if I start the same model with the same flags through llama-server.exe, only the first chat turn is correct.
From the second turn onwards the model immediately returns garbled text, e.g.:
trained(chain;rAstAstAstAstAstAstAstAstAstAstAstAstAst;r;r;r;r;r;r;r;r;r;r;r;r…
or
jorn;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r;r…
or
ipAddress(chain ipAddressAstAstAstAstAstAstAstAstAstAstAstAstAstAstAstAst ipAddress ipAddress ipAddress ipAddress …

The exact garbled sequence differs each time, but it always starts with nonsense tokens followed by endless repetition of a few words.
Experiments done:
Removed --split-mode row ➜ garbling disappears, but generation speed drops dramatically.
Tested with text-generation-webui (versions 3.13 & 3.6) ➜ identical behaviour.
Tried LM Studio ➜ does not detect the P100 cards at all.
Repeating the test many times gives the same result: row-split tensor parallelism works fine in llama-cli, but breaks multi-turn context in llama-server.
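For comparison, the llama-cli run that stays coherent across turns was launched roughly like this (reconstructed to mirror the server flags further below, not copied verbatim; -cnv is the interactive conversation mode I used for the multi-turn test):
llama-cli.exe ^
-m G:\AiModels\001\002\DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf ^
--threads 12 ^
-ngl 999 ^
--split-mode row ^
--tensor-split 14,14,14 ^
-c 8192 ^
--no-mmap ^
-cnv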

The question I asked is the same each time: "1000 word summary of world history".

I have tried this:
set CUDA_VISIBLE_DEVICES=0,1,2
cd /d D:\llama.cpp\build\bin\Release
llama-server.exe ^
-m G:\AiModels\001\002\DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf ^
--threads 12 ^
-ngl 999 ^
--split-mode row ^
--tensor-split 14,14,14 ^
-c 8192 ^
--batch-size 2560 ^
--no-mmap ^
--no-warmup ^
--repeat-penalty 1.0 ^
--port 8080

And this:
set CUDA_VISIBLE_DEVICES=0,1,2
cd /d D:\llama.cpp\build\bin\Release
llama-server.exe ^
-m G:\AiModels\001\002\DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf ^
--threads 12 ^
-ngl 999 ^
--split-mode row ^
--tensor-split 14,14,14 ^
-c 8192 ^
--batch-size 2560 ^
--no-mmap ^
--no-warmup ^
--repeat-penalty 1.0 ^
--port 8080 ^
--parallel 1 ^
--cache-type-k f16 ^
--cache-type-v f16 ^
--no-context-shift
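
To reproduce without any web UI, two consecutive requests against the running server should be enough; the following is an approximation of the traffic (my real requests came from the web UI, and <first reply> is a placeholder for whatever the assistant returned in turn one):
curl http://127.0.0.1:8080/v1/chat/completions ^
-H "Content-Type: application/json" ^
-d "{\"messages\": [{\"role\": \"user\", \"content\": \"1000 word summary of world history\"}]}"
rem The first turn above answers normally. Resending the same question with the
rem first answer included as history is what returns the garbled text:
curl http://127.0.0.1:8080/v1/chat/completions ^
-H "Content-Type: application/json" ^
-d "{\"messages\": [{\"role\": \"user\", \"content\": \"1000 word summary of world history\"}, {\"role\": \"assistant\", \"content\": \"<first reply>\"}, {\"role\": \"user\", \"content\": \"1000 word summary of world history\"}]}"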

Could you please have a look?
Let me know if any additional logs or reproduction steps are needed.

First Bad Commit

No response

Relevant log output

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 12
slot update_slots: id  0 | task 0 | n_past = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 12, n_tokens = 12, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 12, n_tokens = 12
srv  log_server_r: request: GET /slots 127.0.0.1 200
slot print_timing: id  0 | task 0 |
prompt eval time =     757.37 ms /    12 tokens (   63.11 ms per token,    15.84 tokens per second)
       eval time =  107779.86 ms /  1501 tokens (   71.81 ms per token,    13.93 tokens per second)
      total time =  108537.23 ms /  1513 tokens
slot      release: id  0 | task 0 | stop processing: n_past = 1512, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.008
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 1512, total state size = 378.019 MiB
srv          load:  - looking for better prompt, base f_keep = 0.008, sim = 1.000
srv        update:  - cache state: 1 prompts, 378.019 MiB (limits: 8192.000 MiB, 8192 tokens)
srv        update:    - prompt 00000196C68E9980:    1512 tokens, checkpoints:  0,   378.019 MiB
srv  get_availabl: prompt cache update took 209.31 ms
slot launch_slot_: id  0 | task 1503 | processing task
slot update_slots: id  0 | task 1503 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 12
slot update_slots: id  0 | task 1503 | old: ...  of world history<|Assistant|>
slot update_slots: id  0 | task 1503 | new: ...  of world history<|Assistant|>
slot update_slots: id  0 | task 1503 |      315    1879    3840  151645
slot update_slots: id  0 | task 1503 |      315    1879    3840  151645
slot update_slots: id  0 | task 1503 | need to evaluate at least 1 token for each active slot (n_past = 12, n_prompt_tokens = 12)
slot update_slots: id  0 | task 1503 | n_past was set to 11
slot update_slots: id  0 | task 1503 | n_past = 11, memory_seq_rm [11, end)
slot update_slots: id  0 | task 1503 | prompt processing progress, n_past = 12, n_tokens = 1, progress = 1.000000
slot update_slots: id  0 | task 1503 | prompt done, n_past = 12, n_tokens = 1
srv  log_server_r: request: GET /slots 127.0.0.1 200
slot print_timing: id  0 | task 1503 |
prompt eval time =      78.19 ms /     1 tokens (   78.19 ms per token,    12.79 tokens per second)
       eval time =  143811.56 ms /  2008 tokens (   71.62 ms per token,    13.96 tokens per second)
      total time =  143889.75 ms /  2009 tokens
slot      release: id  0 | task 1503 | stop processing: n_past = 2019, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 1945513492
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 2019, total state size = 504.775 MiB
srv          load:  - looking for better prompt, base f_keep = 0.006, sim = 0.016
srv        update:  - cache state: 2 prompts, 882.793 MiB (limits: 8192.000 MiB, 8192 tokens)
srv        update:    - prompt 00000196C68E9980:    1512 tokens, checkpoints:  0,   378.019 MiB
srv        update:    - prompt 00000196C68E97A0:    2019 tokens, checkpoints:  0,   504.775 MiB
srv  get_availabl: prompt cache update took 284.21 ms
slot launch_slot_: id  0 | task 3513 | processing task
slot update_slots: id  0 | task 3513 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 743
slot update_slots: id  0 | task 3513 | old: ...  of world history<|Assistant|> | <think>
Okay, so I
slot update_slots: id  0 | task 3513 | new: ...  of world history<|Assistant|> | **World History Summary**

**
slot update_slots: id  0 | task 3513 |      315    1879    3840  151645  151648     198   32313      11     773     358
slot update_slots: id  0 | task 3513 |      315    1879    3840  151645     334   10134   11099   21517   56177     334
slot update_slots: id  0 | task 3513 | n_past = 12, memory_seq_rm [12, end)
slot update_slots: id  0 | task 3513 | prompt processing progress, n_past = 743, n_tokens = 731, progress = 1.000000
slot update_slots: id  0 | task 3513 | prompt done, n_past = 743, n_tokens = 731
srv  cancel_tasks: cancel task, id_task = 3513
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot      release: id  0 | task 3513 | stop processing: n_past = 843, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.986 (> 0.100 thold), f_keep = 0.999
slot launch_slot_: id  0 | task 3616 | processing task
slot update_slots: id  0 | task 3616 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 854
slot update_slots: id  0 | task 3616 | old: ...  ipAddress ipAddress ipAddress ipAddress |  ipAddress
slot update_slots: id  0 | task 3616 | new: ...  ipAddress ipAddress ipAddress ipAddress | <|end▁of▁sentence|>
slot update_slots: id  0 | task 3616 |    91715   91715   91715   91715   91715
slot update_slots: id  0 | task 3616 |    91715   91715   91715   91715  151643
slot update_slots: id  0 | task 3616 | n_past = 842, memory_seq_rm [842, end)
slot update_slots: id  0 | task 3616 | prompt processing progress, n_past = 854, n_tokens = 12, progress = 1.000000
slot update_slots: id  0 | task 3616 | prompt done, n_past = 854, n_tokens = 12
srv  cancel_tasks: cancel task, id_task = 3616
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot      release: id  0 | task 3616 | stop processing: n_past = 1877, truncated = 0
srv  update_slots: all slots are idle
