### Name and Version

```
$ ./build/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
version: 6793 (38355c6)
built with cc (GCC) 15.2.1 20250813 for x86_64-pc-linux-gnu
```
Built with:

```sh
cmake -S . -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1100 \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_NATIVE=ON \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DGGML_HIP_GRAPHS=ON
```
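followed by the usual CMake build step (the exact invocation below is an assumption; any equivalent `cmake --build` call applies, and Release was already selected at configure time):

```sh
# Assumed build step following the configure command above;
# -j parallelizes the build across all available cores.
cmake --build build -j
```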
The issue also occurs with the HIP build on Windows using the same hardware.
### Operating systems
Linux (and Windows)
### GGML backends
HIP
### Hardware
Radeon RX 7900 XTX
### Models

Qwen3-30B-A3B-Thinking-2507, Q4_K_XL (Unsloth)
### Problem description & steps to reproduce
When I run

```sh
llama-server \
  --threads 12 \
  --gpu-layers 99 \
  --flash-attn auto \
  --jinja \
  --hf-repo unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF:Q4_K_XL \
  --ctx-size 40960 \
  --temp 0.6 \
  --top-k 20 \
  --top-p 0.95 \
  --min-p 0.0 \
  --ubatch-size 2048
```
on b6792, the output is as expected: responses are well thought out, detailed, and somewhat lengthy.

When I run the same command on b6793, I get shorter answers with less accurate and less detailed information, and the model is also less inclined to format its output with Markdown.
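The difference shows up in ordinary chat completions, so a single request is enough to compare the two builds side by side. A minimal sketch, assuming the default `127.0.0.1:8080` address of llama-server's OpenAI-compatible endpoint (the prompt is an arbitrary example):

```sh
# Send one chat completion to the running server and compare the
# response length, detail, and formatting between b6792 and b6793.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Compare B-trees and LSM trees."}]}'
```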
### First Bad Commit

Not yet bisected to a single commit; b6792 is the last known-good release build and b6793 (38355c6) is the first known-bad one.
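Since the regression is bracketed by two adjacent release tags, the offending commit could be pinned down with a bisect. A sketch, assuming the `b6792`/`b6793` tags are present in the local clone and that the degraded output reproduces reliably at each step:

```sh
# Bisect between the first known-bad and last known-good release tags.
git bisect start b6793 b6792
# At each step: rebuild, re-run the server command above, judge the
# response quality, then mark the commit before continuing.
cmake --build build -j
git bisect good   # or: git bisect bad
```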
### Relevant log output
N/A, logs look normal.