Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XT, compute capability 11.0, VMM: yes
version: 4548 (5f0db95)
built with clang version 19.0.0git (git@github.amd.com:Compute-Mirrors/llvm-project 5353ca3e0e5ae54a31eeebe223da212fa405567a) for x86_64-pc-windows-msvc
Operating systems
Windows
GGML backends
HIP
Hardware
Ryzen 9 5900X w/ RX 7900XT
Models
DeepSeek-R1 Llama3.1 8B quant (q6_k)
Hermes Llama3.2 3B quant (q8)
Both quantized with llama-quantize from the raw weights; however, the model probably doesn't matter in this case.
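For reference, the quantization step was roughly the following (a sketch; the GGUF file names are placeholders for my local files):
# Quantize the converted f16 GGUF to Q6_K (file names are placeholders).
& "$env:LLAMA_CPP_PATH/build/bin/llama-quantize.exe" `
    .\DeepSeek-R1-Distill-Llama-8B-f16.gguf `
    .\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf `
    Q6_K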
Problem description & steps to reproduce
The model generates a single letter in a loop. After trying to stop it, the server halts indefinitely and stops responding; stopping the generation via the web UI does not work (even though the "stop" event is logged) and the GPU keeps working. It's also impossible to kill via Ctrl+C; killing the parent process is required (in some cases even that doesn't help and I have to kill it from Task Manager).
UPDATE: The halting issue is already resolved thanks to @ngxson
However, the main generation issue still persists.
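The trigger is an ordinary chat completion request; a minimal sketch of it (the payload is a generic OpenAI-style chat completion, host and port are taken from the log below):
# Plain chat completion request; host/port match the server log below.
$body = @{
    messages   = @(@{ role = "user"; content = "Hello" })
    max_tokens = 128
} | ConvertTo-Json -Depth 5

# With the regression present, the reply is a single letter repeated until the token limit is hit.
Invoke-RestMethod -Method Post `
    -Uri "http://steelph0enix.pc:51536/v1/chat/completions" `
    -ContentType "application/json" `
    -Body $body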
This is how I build llama.cpp:
Function llama-cpp-build-rocm {
    llm-venv-activate
    Write-Host "Building llama.cpp for ROCm..."
    Push-Location $env:LLAMA_CPP_PATH
    # Configure a Release build with clang and the HIP backend for gfx1100 (RX 7900 XT)
    cmake -S . -B build -G Ninja `
        -DCMAKE_BUILD_TYPE=Release `
        -DCMAKE_CXX_COMPILER=clang++ `
        -DCMAKE_C_COMPILER=clang `
        -DCMAKE_INSTALL_PREFIX="C:/Users/phoen/AppData/Local/llama-cpp" `
        -DLLAMA_BUILD_TESTS=OFF `
        -DLLAMA_BUILD_EXAMPLES=ON `
        -DLLAMA_BUILD_SERVER=ON `
        -DLLAMA_STANDALONE=ON `
        -DLLAMA_CURL=OFF `
        -DGGML_CCACHE=ON `
        -DGGML_NATIVE=ON `
        -DGGML_OPENMP=ON `
        -DGGML_AVX=ON `
        -DGGML_AVX2=ON `
        -DGGML_FMA=ON `
        -DGGML_HIP=ON `
        -DAMDGPU_TARGETS=gfx1100 `
        -DGGML_CUDA_FA_ALL_QUANTS=ON
    # Build and install the Release binaries
    cmake --build build --config Release --parallel 24
    cmake --install build --config Release
    Pop-Location
    Write-Host "llama.cpp build completed!"
}
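For completeness, this is how the helper is invoked (the checkout path shown here is illustrative; llm-venv-activate is my own helper):
# LLAMA_CPP_PATH points at my local llama.cpp checkout (illustrative path).
$env:LLAMA_CPP_PATH = "C:/Users/phoen/src/llama.cpp"
llama-cpp-build-rocm
# The binaries land under the install prefix configured above.
& "C:/Users/phoen/AppData/Local/llama-cpp/bin/llama-server.exe" --version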
First Bad Commit
I've pinpointed it to the b4548 release; the previous one works fine.
5f0db95
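For anyone wanting to reproduce the bisection between the two releases, a rough sketch (assumes the release tags are fetched locally; b4547 stands in for the last good release):
# Bisect between the last good and first bad release tags.
Push-Location $env:LLAMA_CPP_PATH
git bisect start
git bisect bad  b4548   # first release that shows the looping generation
git bisect good b4547   # last release that behaves correctly
# At each step: rebuild with llama-cpp-build-rocm, re-test generation, then run
#   git bisect good   or   git bisect bad
git bisect reset
Pop-Location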
Relevant log output
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 65536
llama_init_from_model: n_ctx_per_seq = 65536
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 256
llama_init_from_model: flash_attn = 1
llama_init_from_model: freq_base = 500000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (65536) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 65536, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init: ROCm0 KV buffer size = 7168.00 MiB
llama_init_from_model: KV self size = 7168.00 MiB, K (f16): 3584.00 MiB, V (f16): 3584.00 MiB
llama_init_from_model: ROCm_Host output buffer size = 0.49 MiB
llama_init_from_model: ROCm0 compute buffer size = 128.25 MiB
llama_init_from_model: ROCm_Host compute buffer size = 67.00 MiB
llama_init_from_model: graph nodes = 791
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 65536
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 65536
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://steelph0enix.pc:51536 - starting the main loop
srv update_slots: all slots are idle
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 22
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 22, n_tokens = 22, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 22, n_tokens = 22
srv cancel_tasks: cancel task, id_task = 0
request: POST /v1/chat/completions 192.168.0.150 200
srv cancel_tasks: cancel task, id_task = 597
request: POST /v1/chat/completions 192.168.0.150 200