Eval bug: llama-server generating single letter in a loop and halting (ROCm/Windows) #11421

@SteelPh0enix

Description

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XT, compute capability 11.0, VMM: yes
version: 4548 (5f0db95)
built with clang version 19.0.0git (git@github.amd.com:Compute-Mirrors/llvm-project 5353ca3e0e5ae54a31eeebe223da212fa405567a) for x86_64-pc-windows-msvc

Operating systems

Windows

GGML backends

HIP

Hardware

Ryzen 9 5900X w/ RX 7900XT

Models

DeepSeek-R1 Llama3.1 8B quant (q6_k)
Hermes Llama3.2 3B quant (q8)
Both were quantized from raw weights using llama-quantize (see the sketch right after this section).
However, the model probably doesn't matter in this case.
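
For reference, a minimal sketch of that quantization flow, assuming a local Hugging Face checkout of the model; all paths and output names below are placeholders:

# Convert the raw HF weights to an F16 GGUF first (paths are placeholders).
python convert_hf_to_gguf.py "C:/models/DeepSeek-R1-Distill-Llama-8B" `
    --outfile "C:/models/deepseek-r1-llama-8b-f16.gguf" --outtype f16

# Then quantize the F16 GGUF to Q6_K (Q8_0 for the 3B Hermes model).
llama-quantize "C:/models/deepseek-r1-llama-8b-f16.gguf" `
    "C:/models/deepseek-r1-llama-8b-q6_k.gguf" Q6_K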

Problem description & steps to reproduce

The model generates a single letter in a loop. After I try to stop it, the server halts indefinitely and stops responding; stopping the generation via the web UI does not stop it (even though the "stop" event is logged) and the GPU keeps working. It is also impossible to kill the server with Ctrl+C; killing the parent process is required (and in some cases even that doesn't help, and I have to kill it from Task Manager).

UPDATE: the halting issue has already been resolved thanks to @ngxson.
However, the main generation issue still persists.
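
Triggering generation doesn't require the web UI; a single chat completion request from PowerShell (host and port taken from the log below, prompt is arbitrary) should exercise the same code path:

$body = @{
    messages = @(@{ role = 'user'; content = 'Hello' })
    stream   = $false
} | ConvertTo-Json -Depth 5

# Send one chat completion request to the running llama-server instance.
Invoke-RestMethod -Uri 'http://steelph0enix.pc:51536/v1/chat/completions' `
    -Method Post -ContentType 'application/json' -Body $body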

This is how I build llama.cpp:

Function llama-cpp-build-rocm {
    llm-venv-activate
    Write-Host "Building llama.cpp for ROCm..."

    Push-Location $env:LLAMA_CPP_PATH
    cmake -S . -B build -G Ninja `
        -DCMAKE_BUILD_TYPE=Release `
        -DCMAKE_CXX_COMPILER=clang++ `
        -DCMAKE_C_COMPILER=clang `
        -DCMAKE_INSTALL_PREFIX="C:/Users/phoen/AppData/Local/llama-cpp" `
        -DLLAMA_BUILD_TESTS=OFF `
        -DLLAMA_BUILD_EXAMPLES=ON `
        -DLLAMA_BUILD_SERVER=ON `
        -DLLAMA_STANDALONE=ON `
        -DLLAMA_CURL=OFF `
        -DGGML_CCACHE=ON `
        -DGGML_NATIVE=ON `
        -DGGML_OPENMP=ON `
        -DGGML_AVX=ON `
        -DGGML_AVX2=ON `
        -DGGML_FMA=ON `
        -DGGML_HIP=ON `
        -DAMDGPU_TARGETS=gfx1100 `
        -DGGML_CUDA_FA_ALL_QUANTS=ON 

    cmake --build build --config Release --parallel 24
    cmake --install build --config Release
    Pop-Location
    Write-Host "llama.cpp build completed!"
}
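
The server itself is then launched roughly like this; the flags are a sketch reconstructed from the log below (context size, ubatch size, flash attention), and the model path and layer count are placeholders:

# Sketch of the server invocation; adjust the model path and -ngl as needed.
llama-server -m "C:/models/deepseek-r1-llama-8b-q6_k.gguf" `
    -c 65536 -ub 256 -fa -ngl 99 `
    --host steelph0enix.pc --port 51536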

First Bad Commit

I've pinpointed it to the b4548 release; the previous one works fine.
5f0db95
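
For anyone retracing the regression, a rough sketch of the bisection between the two releases (the good tag is whichever previous release still works; rebuild with the function above and re-test the server at each step):

git bisect start
# b4548 (5f0db95) shows the looping/halting behaviour
git bisect bad 5f0db95
# mark the last release tag that still worked
git bisect good <previous-release-tag>
# after each rebuild + test, mark the checked-out commit:
#   git bisect good    or    git bisect bad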

Relevant log output

llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 65536
llama_init_from_model: n_ctx_per_seq = 65536
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 256
llama_init_from_model: flash_attn    = 1
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (65536) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 65536, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init:      ROCm0 KV buffer size =  7168.00 MiB
llama_init_from_model: KV self size  = 7168.00 MiB, K (f16): 3584.00 MiB, V (f16): 3584.00 MiB
llama_init_from_model:  ROCm_Host  output buffer size =     0.49 MiB
llama_init_from_model:      ROCm0 compute buffer size =   128.25 MiB
llama_init_from_model:  ROCm_Host compute buffer size =    67.00 MiB
llama_init_from_model: graph nodes  = 791
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 65536
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 65536
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://steelph0enix.pc:51536 - starting the main loop
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 22
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 22, n_tokens = 22, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 22, n_tokens = 22
srv  cancel_tasks: cancel task, id_task = 0
request: POST /v1/chat/completions 192.168.0.150 200
srv  cancel_tasks: cancel task, id_task = 597
request: POST /v1/chat/completions 192.168.0.150 200
