Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
version: 4906 (60c9029)
built with cc (GCC) 14.2.1 20250207 for x86_64-pc-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server
--port 9002
--metrics
--slots
-m /models/Qwen_QwQ-32B-IQ4_XS.gguf
-ngl 999
--ctx-size 32768
--no-context-shift
-fa
-ctv q8_0
-ctk q8_0
-md /models/Qwen2.5-0.5B-Instruct-IQ4_XS.gguf
-ngld 99
--draft-p-min 0.5
--draft-min 0
--draft-max 15
Problem description & steps to reproduce
When I run llama-server, the KV cache for the main model is initialized first in the output, as expected:
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.58 MiB
init: kv_size = 32768, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 64, can_shift = 1
init: ROCm0 KV buffer size = 4352.00 MiB
llama_context: KV self size = 4352.00 MiB, K (q8_0): 2176.00 MiB, V (q8_0): 2176.00 MiB
llama_context: ROCm0 compute buffer size = 325.08 MiB
llama_context: ROCm_Host compute buffer size = 74.01 MiB
llama_context: graph nodes = 1991
llama_context: graph splits = 2
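For what it's worth, the main model's numbers look right to me. Here is a minimal sanity check, assuming q8_0 packs 32 values into 34 bytes and that QwQ-32B has 8 KV heads with head dim 128 (those are my assumptions, not values from the log):

```python
# Back-of-the-envelope check of the main model's K buffer size (assumed shapes).
n_layer   = 64          # from the log
n_embd_kv = 8 * 128     # assumed: 8 KV heads * 128 head dim for QwQ-32B
kv_size   = 32768       # from the log
q8_0_bpe  = 34 / 32     # assumed: q8_0 stores 32 elements in 34 bytes
print(n_layer * n_embd_kv * kv_size * q8_0_bpe / 2**20)  # ~2176.0 MiB, matches K (q8_0)
```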
However, the KV cache for the draft model is then initialized, and the initialization appears to happen twice:
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: ROCm_Host output buffer size = 0.58 MiB
init: kv_size = 32768, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 24, can_shift = 1
init: ROCm0 KV buffer size = 204.00 MiB
llama_context: KV self size = 204.00 MiB, K (q8_0): 102.00 MiB, V (q8_0): 102.00 MiB
llama_context: ROCm0 compute buffer size = 300.26 MiB
llama_context: ROCm_Host compute buffer size = 65.76 MiB
llama_context: graph nodes = 751
llama_context: graph splits = 50
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch = 32768
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: ROCm_Host output buffer size = 0.58 MiB
init: kv_size = 32768, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 24, can_shift = 1
init: ROCm0 KV buffer size = 384.00 MiB
llama_context: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_context: ROCm0 compute buffer size = 300.25 MiB
llama_context: ROCm_Host compute buffer size = 65.76 MiB
llama_context: graph nodes = 751
llama_context: graph splits = 2
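Both sets of draft-model numbers are at least internally consistent with their cache types. The same kind of check, assuming Qwen2.5-0.5B has 2 KV heads with head dim 64 (again my assumption):

```python
# Same check for the draft model's K buffer under both cache types (assumed shapes).
n_layer   = 24          # from the log
n_embd_kv = 2 * 64      # assumed: 2 KV heads * 64 head dim for Qwen2.5-0.5B
kv_size   = 32768       # from the log
print(n_layer * n_embd_kv * kv_size * (34 / 32) / 2**20)  # ~102.0 MiB, matches K (q8_0), first init
print(n_layer * n_embd_kv * kv_size * 2 / 2**20)          # ~192.0 MiB, matches K (f16), second init
```

So each block is plausible on its own; the contradiction is only between the two blocks (q8_0 vs f16).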
Interestingly, the first draft initialization uses q8_0 quantization, the same as the main model, while the second one uses an FP16 KV cache. My questions are:
- Is the output correct, and is the KV cache for the draft model really initialized twice? Or is this just a display error?
- Is KV quantization being used for the draft model or not? The log output contradicts itself.
I am using QwQ-32B as the main model and Qwen2.5-0.5B-Instruct with QwQ's tokenizer (so it is compatible as a draft model). I am seeing a good ~1.5-2x speedup, so things seem to be working fine. But if the draft model's KV cache really is initialized twice, it may be wasting VRAM for no reason.
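To put a rough number on it, here is what the duplication would cost if both draft contexts stay allocated (figures taken from the log above; whether either one is actually freed is exactly what I am unsure about):

```python
# Draft-model device buffers reported by the two initializations (MiB, from the log).
first_init  = 204.00 + 300.26   # q8_0 KV cache + ROCm0 compute buffer
second_init = 384.00 + 300.25   # f16 KV cache + ROCm0 compute buffer
print(first_init, second_init)  # ~504 MiB and ~684 MiB
# If both sets stay live, one of them (~0.5-0.7 GiB) would be wasted VRAM.
```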
First Bad Commit
No response