Eval bug: llama-server generating single letter in a loop and halting (ROCm/Windows) #11421

@SteelPh0enix

Description

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XT, compute capability 11.0, VMM: yes
version: 4548 (5f0db95)
built with clang version 19.0.0git (git@github.amd.com:Compute-Mirrors/llvm-project 5353ca3e0e5ae54a31eeebe223da212fa405567a) for x86_64-pc-windows-msvc

Operating systems

Windows

GGML backends

HIP

Hardware

Ryzen 9 5900X w/ RX 7900XT

Models

DeepSeek-R1 Llama3.1 8B quant (q6_k)
Hermes Llama3.2 3B quant (q8)
Both were quantized from raw weights using llama-quantize (see the sketch right after this section).
However, the model probably doesn't matter in this case.
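
For reference, a minimal sketch of that quantization flow, assuming a local Hugging Face checkout of the model; all paths and output names below are placeholders:

# Convert the raw HF weights to an F16 GGUF first (paths are placeholders).
python convert_hf_to_gguf.py "C:/models/DeepSeek-R1-Distill-Llama-8B" `
    --outfile "C:/models/deepseek-r1-llama-8b-f16.gguf" --outtype f16

# Then quantize the F16 GGUF to Q6_K (Q8_0 for the 3B Hermes model).
llama-quantize "C:/models/deepseek-r1-llama-8b-f16.gguf" `
    "C:/models/deepseek-r1-llama-8b-q6_k.gguf" Q6_K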

Problem description & steps to reproduce

The model generates a single letter in a loop. After I try to stop it, the server halts indefinitely and stops responding; stopping the generation via the web UI does not stop it (even though the "stop" event is logged) and the GPU keeps working. It is also impossible to kill the server with Ctrl+C; killing the parent process is required (and in some cases even that doesn't help, and I have to kill it from Task Manager).

UPDATE: the halting issue has already been resolved thanks to @ngxson.
However, the main generation issue still persists.
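
Triggering generation doesn't require the web UI; a single chat completion request from PowerShell (host and port taken from the log below, prompt is arbitrary) should exercise the same code path:

$body = @{
    messages = @(@{ role = 'user'; content = 'Hello' })
    stream   = $false
} | ConvertTo-Json -Depth 5

# Send one chat completion request to the running llama-server instance.
Invoke-RestMethod -Uri 'http://steelph0enix.pc:51536/v1/chat/completions' `
    -Method Post -ContentType 'application/json' -Body $body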

This is how I build llama.cpp:

Function llama-cpp-build-rocm {
    llm-venv-activate
    Write-Host "Building llama.cpp for ROCm..."

    Push-Location $env:LLAMA_CPP_PATH
    cmake -S . -B build -G Ninja `
        -DCMAKE_BUILD_TYPE=Release `
        -DCMAKE_CXX_COMPILER=clang++ `
        -DCMAKE_C_COMPILER=clang `
        -DCMAKE_INSTALL_PREFIX="C:/Users/phoen/AppData/Local/llama-cpp" `
        -DLLAMA_BUILD_TESTS=OFF `
        -DLLAMA_BUILD_EXAMPLES=ON `
        -DLLAMA_BUILD_SERVER=ON `
        -DLLAMA_STANDALONE=ON `
        -DLLAMA_CURL=OFF `
        -DGGML_CCACHE=ON `
        -DGGML_NATIVE=ON `
        -DGGML_OPENMP=ON `
        -DGGML_AVX=ON `
        -DGGML_AVX2=ON `
        -DGGML_FMA=ON `
        -DGGML_HIP=ON `
        -DAMDGPU_TARGETS=gfx1100 `
        -DGGML_CUDA_FA_ALL_QUANTS=ON 

    cmake --build build --config Release --parallel 24
    cmake --install build --config Release
    Pop-Location
    Write-Host "llama.cpp build completed!"
}
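
The server itself is then launched roughly like this; the flags are a sketch reconstructed from the log below (context size, ubatch size, flash attention), and the model path and layer count are placeholders:

# Sketch of the server invocation; adjust the model path and -ngl as needed.
llama-server -m "C:/models/deepseek-r1-llama-8b-q6_k.gguf" `
    -c 65536 -ub 256 -fa -ngl 99 `
    --host steelph0enix.pc --port 51536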

First Bad Commit

I've pinpointed it to the b4548 release; the previous one works fine.
5f0db95
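
For anyone retracing the regression, a rough sketch of the bisection between the two releases (the good tag is whichever previous release still works; rebuild with the function above and re-test the server at each step):

git bisect start
# b4548 (5f0db95) shows the looping/halting behaviour
git bisect bad 5f0db95
# mark the last release tag that still worked
git bisect good <previous-release-tag>
# after each rebuild + test, mark the checked-out commit:
#   git bisect good    or    git bisect bad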

Relevant log output

llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 65536
llama_init_from_model: n_ctx_per_seq = 65536
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 256
llama_init_from_model: flash_attn    = 1
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (65536) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 65536, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init:      ROCm0 KV buffer size =  7168.00 MiB
llama_init_from_model: KV self size  = 7168.00 MiB, K (f16): 3584.00 MiB, V (f16): 3584.00 MiB
llama_init_from_model:  ROCm_Host  output buffer size =     0.49 MiB
llama_init_from_model:      ROCm0 compute buffer size =   128.25 MiB
llama_init_from_model:  ROCm_Host compute buffer size =    67.00 MiB
llama_init_from_model: graph nodes  = 791
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 65536
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 65536
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://steelph0enix.pc:51536 - starting the main loop
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 22
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 22, n_tokens = 22, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 22, n_tokens = 22
srv  cancel_tasks: cancel task, id_task = 0
request: POST /v1/chat/completions 192.168.0.150 200
srv  cancel_tasks: cancel task, id_task = 597
request: POST /v1/chat/completions 192.168.0.150 200
